Schema File Format
A schema is a tree-like data structure that describes a dataset or a format, or a license. It contains information such
as the download URL, license, format descriptions of a dataset. We usually represent them in the yaml format, or as a
Python dict
in a Python program.
name: <dataset name>
published: <published date>
homepage: <homepage URL>
download_url: <download URL>
sha512sum: <sha512sum of the dataset file>
license: <SPDX license token or custom license symbol>
estimated_size: <estimated size>
description: <description>
subdatasets:
<subdataset1_id>:
<Dictionary that describes the subdataset>
<subdataset2_id>:
<Dictionary that describes the subdataset>
...
Below is an example schema that describes the IBM Debater® Thematic Clustering of Sentences dataset:
name: IBM Debater® Thematic Clustering of Sentences
published: 2019-08-01
homepage: https://developer.ibm.com/exchanges/data/all/thematic-clustering-of-sentences/
download_url: https://dax-cdn.cdn.appdomain.cloud/dax-thematic-clustering-of-sentences/1.0.2/thematic-clustering-of-sentences.tar.gz
sha512sum: 08a3f1a9dc06083eb51874e90d7241f67b676af2cbc28fe6a312694051f53391fc95de70fdcdce404de3578fa389558220ea38d34f70265ed88220d0b14f1aba
license: cc_by_sa_30
estimated_size: 10.6M
description: "A benchmark of sentence-clustering based on the partition of Wikipedia articles into sections."
subdatasets:
full:
name: IBM Debater® Thematic Clustering of Sentences
description: IBM Debater® Thematic Clustering of Sentences complete dataset
format:
id: table/csv
options:
columns:
article_title: 'string'
sentence: 'string'
cluster_title: 'string'
article_link: 'string'
path: dataset.csv
A schema file is a yaml file that gathers multiple schemata. It can be seen as representing a dataset repository. These schemata may be gathered together because they are within the same community, share the same ownership, or for any other reasons. It is typically structured as
api_name: com.ibm.pardata.v1
name: <Identifier of the schema file>
last_updated: <ISO date string of last update>
datasets:
<dataset1_id>:
<version1>:
<the schema of dataset1_id version1>
<version2>:
<the schema of dataset1_id version2>
...
<dataset2_id>:
<version1>:
<the schema of dataset2_id version1>
<version2>:
<the schema of dataset2_id version2>
...
Check out Schema File Format References for a complete reference and default ParData repository for a live example.
For further details, check out Schema File Format References.