Schema File Format

A schema is a tree-like data structure that describes a dataset, a format, or a license. A dataset schema contains information such as the download URL, the license, and descriptions of the dataset's formats. Schemata are usually written in YAML, or held as a Python dict inside a Python program.

name: <dataset name>
published: <published date>
homepage: <homepage URL>
download_url: <download URL>
sha512sum: <sha512sum of the dataset file>
license: <SPDX license token or custom license symbol>
estimated_size: <estimated size>
description: <description>
    <Dictionary that describes the subdataset>
    <Dictionary that describes the subdataset>
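
The `sha512sum` field lets a consumer check that a downloaded dataset file is intact. A minimal Python sketch of such a check, using only the standard library, might look like the following. The function name `verify_sha512` and the chunk size are illustrative assumptions, not part of any schema tooling.

```python
import hashlib


def verify_sha512(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file at `path` has the SHA-512 digest `expected_hex`.

    `verify_sha512` is a hypothetical helper written for this example.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so that large dataset archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # Hex digests are case-insensitive, so normalize before comparing.
    return digest.hexdigest() == expected_hex.strip().lower()
```

The `expected_hex` argument would be the value stored under `sha512sum` in the schema.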

Below is an example schema that describes the IBM Debater® Thematic Clustering of Sentences dataset:

name: IBM Debater® Thematic Clustering of Sentences
published: 2019-08-01
sha512sum: 08a3f1a9dc06083eb51874e90d7241f67b676af2cbc28fe6a312694051f53391fc95de70fdcdce404de3578fa389558220ea38d34f70265ed88220d0b14f1aba
license: cc_by_sa_30
estimated_size: 10.6M
description: "A benchmark of sentence-clustering based on the partition of Wikipedia articles into sections."
    name: IBM Debater® Thematic Clustering of Sentences
    description: IBM Debater® Thematic Clustering of Sentences complete dataset
      id: table/csv
          article_title: 'string'
          sentence: 'string'
          cluster_title: 'string'
          article_link: 'string'
    path: dataset.csv

A schema file is a YAML file that gathers multiple schemata. It can be seen as representing a dataset repository. The schemata may be gathered together because they belong to the same community, share the same ownership, or for any other reason. A schema file is typically structured as follows:

name: <Identifier of the schema file>
last_updated: <ISO date string of last update>
      <the schema of dataset1_id version1>
      <the schema of dataset1_id version2>
      <the schema of dataset2_id version1>
      <the schema of dataset2_id version2>
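
Because a schema file groups schemata by dataset identifier and then by version, client code would need to walk two levels of nesting to reach an individual schema. The sketch below assumes the loaded file is a Python dict whose dataset entries sit under a key named `datasets`. That key name, the `iter_schemata` helper, and the toy data are all assumptions made for illustration, so consult the reference for the actual layout.

```python
def iter_schemata(schema_file, datasets_key="datasets"):
    """Yield (dataset_id, version, schema) for every schema in a schema file.

    `datasets_key` is an assumed name for the mapping that holds the datasets.
    """
    for dataset_id, versions in schema_file.get(datasets_key, {}).items():
        for version, schema in versions.items():
            yield dataset_id, version, schema


# Toy schema file mirroring the structure sketched above (made-up values).
example_file = {
    "name": "example-schema-file",
    "last_updated": "2021-01-01",
    "datasets": {
        "dataset1_id": {
            "version1": {"name": "Dataset 1"},
            "version2": {"name": "Dataset 1"},
        },
        "dataset2_id": {
            "version1": {"name": "Dataset 2"},
        },
    },
}

for dataset_id, version, schema in iter_schemata(example_file):
    print(dataset_id, version, schema["name"])
```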

See Schema File Format References for a complete reference, and the default ParData repository for a live example.
