Schema File Format References

For an introduction of the schema file format, check out Schema File Format.

Top-Level Keys

A schema file is a yaml file that gathers multiple schemata. It can be seen as representing a dataset repository. These schemata may be gathered together because they are within the same community, share the same ownership, or for any other reasons. It is typically structured as:

api_name: com.ibm.pardata.v1
last_updated: <ISO date string of last update>
datasets:
  <dataset1_id>:
    <version1>:
      <the schema of dataset1_id version1>
    <version2>:
      <the schema of dataset1_id version2>
    ...
  <dataset2_id>:
    <version1>:
      <the schema of dataset2_id version1>
    <version2>:
      <the schema of dataset2_id version2>
    ...

api_name: Should always be com.ibm.pardata.v1.

last_updated: The date in which the file was last updated. This should be in ISO format, such as 2021-06-20.

datasets

A dictionary of datasets. The key of each item is a dataset identifier, and the value of each item is another dictionary. In this dictionary, each key is a version string (e.g., "1.2.3") representing a particular version of a dataset in the dataset series; each value is a schema for the dataset.

<dataset1_id>:
  <version1>:
    <the schema of dataset1_id version1>
  <version2>:
    <the schema of dataset1_id version2>
  ...

Schema

A schema is a tree-like data structure that describes a dataset or a format, or a license. It contains information such as the download URL, license, format descriptions of a dataset. We usually represent them in the yaml format, or as a Python dict in a Python program.

Each schema follows the following structure:

name: <dataset name>
published: <published date>
homepage: <homepage URL>
download_url: <download URL>
sha512sum: <sha512sum of the dataset file>
license: <SPDX license token or custom license symbol>
estimated_size: <estimated size>
description: <description>
subdatasets:
  <subdataset1_id>:
    <Subdataset Dict: Dictionary that describes subdataset1_id>
  <subdataset2_id>:
    <Subdataset Dict: Dictionary that describes subdataset2_id>
  ...

name: Human-readable name of the dataset.

published: Published date in ISO format.

homepage: Homepage URL.

download_url: Download URL.

sha512sum: sha512sum of the dataset file. It can be generated using sha512sum data-file.tar.gz)

license: SPDX license token or custom license symbol.

estimated_size: Estimated size of the dataset.

description: Description of the dataset.

subdatasets: A dictionary that divides the datasets into multiple subdatasets and describes them. The keys of the dictionary are subdataset identifiers and values are dictionaries that describe the subdataset.

Subdataset Dict

A subdataset dict describes a subdataset, which is a logical subdivision of the dataset.

name: Name of the subdataset.

description: Description of the subdataset.

format

A dictionary that describes the format of the subdataset.

id: Identifier of the format specified in FORMAT_SCHEMA_FILE_URL.

path

Path to the file of this subdataset. It can also be a dictionary to specify a regular expression. For example,

path:
  type: regex
  value: "TensorFlow-Speech-Commands/house/.*\\.wav"

options

A dictionary that specifies the options for a particular format. The specification varies by format.

audio/wav: No options.
image/jpeg: No options.
image/png: No options.
table/csv
- columns: A dictionary in which keys are the names of the columns and values are the type of the entries in the columns. If it is not specified, then pandas defaults are used.
- delimiter: The delimiter of the CSV files. Default: ,
- encoding: Encoding of the CSV files. Default: UTF-8
text/plain + encoding: Encoding of the text files. Default: UTF-8