Metalearning Database

D3M provides a metalearning database to support research on improving AutoML systems. This database contains metadata about datasets, problems, primitives, pipelines, and the results of executing pipelines on datasets (pipeline runs).

The metalearning database is powered by Elasticsearch. All of the data is publicly available. The data has been generated primarily during formal D3M system evaluations and by D3M participants, but anyone can contribute. It can be explored using the Marvin dashboard or any Elasticsearch client.

The metalearning database endpoint is hosted at

https://metalearning.datadrivendiscovery.org/es
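
The endpoint exposes a standard Elasticsearch HTTP interface, so any HTTP client works for a quick sanity check. For example (a minimal sketch; curl is assumed to be installed):

curl https://metalearning.datadrivendiscovery.org/es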

Database Structure

The metalearning database holds five types of normalized documents, each in its own Elasticsearch index: datasets, problems, primitives, pipelines, and pipeline runs. These documents contain only metadata, for example, a natural language description of a dataset or the dataset’s source URI; they do not contain the actual data instances themselves. Each document conforms to its respective metadata schema.
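
Assuming the standard Elasticsearch cat API is exposed at the endpoint, the indexes and their document counts can be listed with curl:

curl 'https://metalearning.datadrivendiscovery.org/es/_cat/indices?v'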

Datasets

Dataset metadata is stored in the datasets index. These documents contain, for example, a natural language description of the dataset and the dataset’s source URI. See the dataset schema for a complete description. The actual datasets curated by D3M can be found in the datasets repository.
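
As a sketch of how these documents can be queried, the following curl command uses the standard Elasticsearch search API to find datasets mentioning a keyword; the _source fields (id, name, description) are assumed from the dataset schema:

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/datasets/_search?size=5' \
    -d '{"query": {"query_string": {"query": "image"}}, "_source": ["id", "name", "description"]}'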

Problems

Problem metadata is stored in the problems index. A problem describes a machine learning task: it references a dataset, identifies the target column(s), gives the performance metrics to optimize, and lists task keywords (e.g. classification, image, remote sensing). See the problem schema for a complete description. Note that many problems can reference the same dataset, for example, by identifying different columns as the target.
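
For example, problems for a given task type can be found with a match query; this sketch assumes the problem.task_keywords field from the problem schema:

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/problems/_search?size=5' \
    -d '{"query": {"match": {"problem.task_keywords": "classification"}}}'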

Primitives

Primitive metadata is stored in the primitives index. A primitive is a high-level machine learning algorithm. The primitive metadata describes what kind of algorithm the primitive is, what hyperparameters and methods it has, how to install it, and who authored it. See the primitives documentation or the primitive schema for more details. An index of the latest versions of these documents can be found in the primitives repository. See also the source code of the common D3M primitives. Primitives can also be browsed and filtered by author, algorithm type, primitive family, and many other attributes in the Marvin dashboard.
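
Outside the dashboard, the same kind of filtering can be done with a query; this sketch assumes the primitive_family and python_path fields from the primitive schema:

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/primitives/_search?size=5' \
    -d '{"query": {"match": {"primitive_family": "CLASSIFICATION"}}, "_source": ["id", "python_path", "name"]}'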

Pipelines

Pipeline metadata is stored in the pipelines index. Pipelines describe precisely which primitives are used (by referencing primitive documents) and how they are composed together to build an end-to-end machine learning model. D3M provides a reference runtime for executing pipelines on a given dataset and problem. The execution of a pipeline on a dataset is carefully recorded in a pipeline run document. For more details, see the pipeline overview or the pipeline schema. For help on building a pipeline, see an example.
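
For example, pipelines that use a particular primitive can be found by matching on the primitive reference inside the pipeline steps; this sketch assumes the steps.primitive.python_path field from the pipeline schema, and the python path shown is only illustrative:

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/pipelines/_search?size=5' \
    -d '{"query": {"match": {"steps.primitive.python_path": "d3m.primitives.classification.random_forest.SKlearn"}}, "_source": ["id", "name"]}'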

Pipeline Runs

Pipeline run metadata is stored in the pipeline_runs index. Pipeline run documents contain an execution trace of running a particular pipeline on a particular dataset and problem. In addition to references to the pipeline, dataset, and problem, each document records how the dataset may have been split for evaluation, performance metrics (e.g. accuracy), predictions, primitive hyperparameters, execution start and end timestamps, primitive methods called, random seeds, logging output, the execution environment (e.g. CPUs and RAM available), and much more. See the pipeline run documentation or the pipeline run schema for more details.
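
Because these documents can be very large, it is worth excluding the heaviest fields when querying; this sketch assumes the status.state field and the predictions and method call field paths from the pipeline run schema:

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/pipeline_runs/_search?size=1' \
    -d '{"query": {"match": {"status.state": "SUCCESS"}}, "_source": {"excludes": ["run.results.predictions", "steps.*.method_calls"]}}'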

Pipeline runs contain all information necessary to reproduce the execution of the referenced pipeline on the referenced dataset and problem. Reproducing a pipeline run requires that the user has access to the same dataset, primitive, and runtime versions. The reference runtime provides basic functionality for reproducing pipeline runs.

Other Indexes (Beta)

Other indexes are being designed and populated to simplify usage of the metalearning database. The simplifications include removing large fields (especially predictions) and denormalizing references to other documents.

Downloading Data

The data in the metalearning database is publicly available. This data can be downloaded from the endpoint

https://metalearning.datadrivendiscovery.org/es

For downloading small amounts of data, use any Elasticsearch client. For bulk downloads, see the available pre-made dumps.
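
For instance, a small sample of an index can be saved to a file with curl alone (a sketch; adjust the index and query as needed):

curl -s -H 'Content-Type: application/json' \
    'https://metalearning.datadrivendiscovery.org/es/problems/_search?size=100' \
    -d '{"query": {"match_all": {}}}' \
    -o problems_sample.json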

Custom bulk downloads can be made using an Elasticsearch client such as elasticsearch-dump. Warning: the metalearning database is large and growing, so custom bulk downloads may take a long time to run. It is highly recommended that you refine your dump query as much as possible.

The following is an example of using elasticsearch-dump; it requires the Node package manager (npm). (Note that starting with elasticsearch-dump 6.32.0, Node.js 10.0.0 or higher is required.)

Install elasticsearch-dump

npm install elasticdump

Dump all documents within a specific ingest timestamp range, e.g. pipeline runs ingested in January 2020

npx elasticdump \
    --input=https://metalearning.datadrivendiscovery.org/es \
    --input-index=pipeline_runs \
    --output=pipeline_runs.json \
    --sourceOnly \
    --searchBody='{ "query": {"range": {"_ingest_timestamp": {"gte": "2020-01-01T00:00:00Z", "lt": "2020-02-01T00:00:00Z"}}}, "_source": {"excludes": ["run.results.predictions", "steps.*.method_calls"]}}'

Pipeline run documents can be very large, especially due to the predictions and method call fields. The above example shows how to exclude those fields. In general, a dump may be made using any Elasticsearch query.
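
For example, a complete dump of a smaller index such as primitives needs no search body at all:

npx elasticdump \
    --input=https://metalearning.datadrivendiscovery.org/es \
    --input-index=primitives \
    --output=primitives.json \
    --sourceOnly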

Uploading Data

Uploading new documents to the database can be done using the HTTP API. (In the future, the reference runtime will be able to automatically upload documents for you.)

Important: Requests to upload documents are validated before the documents are ingested. This validation includes checking that referenced documents have already been uploaded to the database. Thus, before uploading a new pipeline run document, for example, the referenced dataset, problem, pipeline, and primitive documents must already be in the database.
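
As a purely illustrative sketch of that ordering, the curl commands below upload a pipeline and then a pipeline run; the /upload path and the document form field are hypothetical, so consult the HTTP API source in the GitLab repository for the actual routes and parameters:

# Hypothetical route and field names; see the HTTP API documentation for the real ones.
curl -X POST 'https://metalearning.datadrivendiscovery.org/upload' -F 'document=@pipeline.json'
curl -X POST 'https://metalearning.datadrivendiscovery.org/upload' -F 'document=@pipeline_run.yml'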

Submitter Tokens

Optionally, you may request a submitter name and token. This allows other users of the metalearning database to find documents submitted by a particular person or organization. The submitter name is publicly visible and shows who authored and submitted the document. The submitter token acts as your password, authenticating your identity, and should be kept private.

Reporting Issues

To report issues with the Metalearning Database or coordinate development work, visit the GitLab repository. The source code for the HTTP API and document validation is available there too.