Basic Usage
To start using ParData, first import the package:
>>> import pardata
This implicitly calls pardata.init() and initializes ParData so that it is ready to retrieve datasets from the default
repository. (Switching to a non-default repository is covered in Example: Create a Schema for Dakota’s Height Journey.)
To see the datasets available in this repository, along with their versions, run:
>>> pardata.list_all_datasets()
{'claim_sentences_search': ('1.0.2',), ..., 'wikitext103': ('1.0.1',)}
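The returned mapping can be processed like any other dict. As a minimal sketch, assuming each value is a tuple of dotted numeric version strings as in the output above, the hypothetical helper latest_version (not part of ParData) picks the newest version listed for a dataset. The sample mapping hard-codes only the two entries visible above, so the snippet runs without ParData installed.

```python
def version_key(version):
    """Turn a dotted version string such as '1.0.2' into a sortable tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def latest_version(versions):
    """Return the newest version from an iterable of dotted version strings.

    Hypothetical helper for illustration; it is not part of ParData.
    """
    return max(versions, key=version_key)


# Stand-in for the result of pardata.list_all_datasets(); only the two
# entries visible in the output above are reproduced here.
datasets = {
    'claim_sentences_search': ('1.0.2',),
    'wikitext103': ('1.0.1',),
}

for name in sorted(datasets):
    print(name, latest_version(datasets[name]))
```

Comparing the numeric components rather than the raw strings matters once a component reaches two digits: as plain strings, '1.0.9' would sort after '1.0.10'.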
To look up information about a particular dataset, use the function pardata.describe_dataset() as shown below:
>>> print(pardata.describe_dataset('wikipedia_category_stance'))
Description: Wikipedia categories and lists annotated for stance (Pro/Con) towards the concepts
Homepage: https://developer.ibm.com/exchanges/data/all/wikipedia-category-stance/
Size: 525K
Published date: 2019-08-01
License: Creative Commons Attribution 3.0 Unported
Available subdatasets: full
Use pardata.load_dataset() to load a dataset. It first downloads the specified version of the dataset
(by default, the latest version) if it has not already been downloaded, and then loads it into memory.
>>> wcs_data = pardata.load_dataset('wikipedia_category_stance')
Here, ParData downloads the IBM Debater® Wikipedia Category Stance dataset (latest version 1.0.2) if it
has not already been downloaded, and then loads it into the variable wcs_data.
wcs_data is a dict that stores the loaded content of the dataset. It has a single key, 'full', which is the
identifier of a subdataset, i.e., a subdivision of the whole dataset. wikipedia_category_stance is a small
dataset and therefore has only one subdataset, hence wcs_data has only one key. Because
wikipedia_category_stance is in CSV format, ParData automatically loads it into wcs_data['full'] as a
pandas.DataFrame object, which is a convenient way to manipulate CSV data in Python:
>>> type(wcs_data['full'])
<class 'pandas.core.frame.DataFrame'>
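Because the return value is a plain dict keyed by subdataset identifier, code that handles several datasets can loop over it without knowing the subdataset names in advance. The sketch below uses a hypothetical helper, describe_loaded, and a stand-in dict in place of a real pardata.load_dataset() result, so it runs without ParData or pandas. With real data, the entry for 'full' would instead report the DataFrame type shown above.

```python
def describe_loaded(loaded):
    """Map each subdataset identifier to the type name of its loaded object.

    Hypothetical helper for illustration; it is not part of ParData.
    """
    return {name: type(obj).__name__ for name, obj in loaded.items()}


# Stand-in for wcs_data: a single subdataset, 'full', holding a list of
# placeholder rows instead of a pandas.DataFrame.
stand_in_data = {'full': [('placeholder-row-1',), ('placeholder-row-2',)]}

for subdataset, type_name in describe_loaded(stand_in_data).items():
    print(subdataset, type_name)
```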
ParData also lets us customize how each file format is loaded and what type of object it is loaded into. How to do so is outside the scope of this tutorial.
By default, pardata.load_dataset() downloads to and loads from
~/.pardata/data/<dataset-name>/<dataset-version>/. To view the current data directory, run:
>>> pardata.get_config().DATADIR
PosixPath('/home/username/.pardata/data')
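The per-dataset location follows from the layout described above. The following sketch assumes the directory structure <DATADIR>/<dataset-name>/<dataset-version>/ stated in this section. dataset_dir is a hypothetical helper, not a ParData function, and the data directory is hard-coded here rather than read from pardata.get_config().

```python
from pathlib import PurePosixPath


def dataset_dir(datadir, name, version):
    """Build the directory a given dataset version is expected to live in.

    Hypothetical helper; it assumes the <DATADIR>/<name>/<version>/ layout
    described in this section and does not touch the filesystem.
    """
    return PurePosixPath(datadir) / name / version


# Example values taken from this section: the default data directory shown
# above and the dataset loaded earlier.
datadir = PurePosixPath('/home/username/.pardata/data')
print(dataset_dir(datadir, 'wikipedia_category_stance', '1.0.2'))
```

PurePosixPath is used so that the sketch behaves the same on every platform. Code that needs to check whether the directory actually exists would use pathlib.Path instead.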
To change this default data directory, use pardata.init():
>>> pardata.init(DATADIR='new/dir/to/download/load/from')