pardata.dataset.Dataset

class pardata.dataset.Dataset(schema, data_dir, *, mode=InitializationMode.LAZY)

Bases: object

Models a particular dataset version, along with download and load functionality.

Parameters
  • schema (Dict[str, Any]) – Schema dict of a particular dataset version.

  • data_dir (Union[str, os.PathLike]) – Directory to which the dataset is downloaded and from which it is loaded. The path may be absolute or relative to the current working directory; it is converted to an absolute path immediately upon initialization.

  • mode (pardata._dataset.Dataset.InitializationMode) – Mode with which to treat a dataset. Available options are: Dataset.InitializationMode.LAZY, Dataset.InitializationMode.DOWNLOAD_ONLY, Dataset.InitializationMode.LOAD_ONLY, and Dataset.InitializationMode.DOWNLOAD_AND_LOAD.

Raises

ValueError – An invalid mode was specified for handling the dataset.

Return type

None
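
The data_dir conversion described under Parameters can be illustrated with a minimal standard-library sketch. The helper to_absolute is a hypothetical stand-in, not pardata's own code; it only shows why a relative path that is made absolute at initialization is unaffected by later changes of the working directory.

```python
import os
from pathlib import Path
from tempfile import TemporaryDirectory


def to_absolute(data_dir):
    """Hypothetical stand-in for the conversion done at initialization."""
    return Path(os.path.abspath(data_dir))


original_cwd = Path.cwd()
with TemporaryDirectory() as tmp:
    os.chdir(tmp)
    try:
        cwd_at_init = Path.cwd()
        # A relative path is anchored to the working directory at this moment.
        resolved = to_absolute('jfk_data')
    finally:
        # Restore the working directory before the temporary one is removed.
        os.chdir(original_cwd)

# Resolving the same relative path now gives a different location,
# while the value captured earlier is unchanged.
resolved_later = to_absolute('jfk_data')
```

Passing an absolute path for data_dir in the first place avoids any dependence on the working directory at the time the object is created.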

Example:

>>> from tempfile import TemporaryDirectory
>>> import pprint
>>> import pardata
>>> from pardata.dataset import Dataset
>>> dataset_schemata = pardata.schema.DatasetSchemaCollection('./tests/schemata/datasets.yaml')
>>> jfk_schema_dict = dataset_schemata.export_schema('datasets', 'noaa_jfk', '1.1.4')
>>> pprint.pprint(jfk_schema_dict)
{'description': ...
 'download_url': '...noaa-weather-data-jfk-airport.tar.gz',
 ...
 'subdatasets': {'jfk_weather_cleaned': {...
                                         'format': {'id': 'table/csv',
                                                    ...}},
                                         ...
                                         'path': 'noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv'}}}
>>> jfk_data_dir = TemporaryDirectory()
>>> jfk_dataset = Dataset(schema=jfk_schema_dict, data_dir=jfk_data_dir.name)
>>> jfk_dataset.download()
>>> data = jfk_dataset.load()
>>> data['jfk_weather_cleaned'].shape
(75119, 16)
>>> jfk_dataset.delete()  # The directory jfk_data_dir is deleted here
>>> jfk_dataset.is_downloaded()
False
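
The example above relies on the default LAZY mode and calls download() and load() explicitly. The sketch below illustrates how the four documented mode names could map to work done at initialization, and how an invalid mode could lead to the documented ValueError. Everything in it is an assumption: the InitializationMode enum is a local stand-in, steps_at_init is a hypothetical helper, and the semantics are inferred from the option names rather than taken from pardata's implementation.

```python
from enum import Enum


class InitializationMode(Enum):
    """Local stand-in mirroring the four documented option names."""
    LAZY = 0
    DOWNLOAD_ONLY = 1
    LOAD_ONLY = 2
    DOWNLOAD_AND_LOAD = 3


def steps_at_init(mode):
    """Return the steps a constructor might run for `mode` (assumed semantics)."""
    if not isinstance(mode, InitializationMode):
        # Mirrors the documented behavior of rejecting an invalid mode.
        raise ValueError(f'{mode!r} is not a valid initialization mode')
    steps = []
    if mode in (InitializationMode.DOWNLOAD_ONLY,
                InitializationMode.DOWNLOAD_AND_LOAD):
        steps.append('download')
    if mode in (InitializationMode.LOAD_ONLY,
                InitializationMode.DOWNLOAD_AND_LOAD):
        steps.append('load')
    return steps


lazy_steps = steps_at_init(InitializationMode.LAZY)
eager_steps = steps_at_init(InitializationMode.DOWNLOAD_AND_LOAD)
```

Under these assumptions, LAZY defers all work to explicit method calls, which is the pattern the example follows.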

Methods

delete(*[, force])

Clear the data directory.

download(*[, check, verify_checksum])

Download and extract the dataset archive, then remove the archive.

is_downloaded()

Check whether the dataset has been downloaded.

load([subdatasets, format_loader_map, check])

Load data files to RAM.

Attributes

data

Access loaded data objects.