Welcome to the Data Driven Discovery of Models (D3M) program documentation.

First, some terminology:

D3M core package

The core package provides the interface for TA1 primitives, the data types of values which can be passed between them during execution, the pipeline language, and the metadata associated with values passed between primitives. It also provides a reference runtime and contains a lot of other useful code to write primitives, generate pipelines, and run them.

Documentation for it is available here. If you are just starting with D3M, this documentation is the best starting point. Development happens in this repository. The index of known primitives is available here.
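For illustration, here is a minimal sketch of using the core package to list installed primitives and load a pipeline description. The primitive path and file name are placeholders, and exact call signatures may differ between core package versions.

```python
# A minimal sketch (not authoritative) of using the core package to discover
# installed primitives and load a pipeline description.
from d3m import index
from d3m.metadata import pipeline as pipeline_module

# List Python paths of all primitives installed into the current environment.
for primitive_path in index.search():
    print(primitive_path)

# Load a primitive class by its Python path (the path here is illustrative).
primitive_class = index.get_primitive(
    'd3m.primitives.data_transformation.dataset_to_dataframe.Common',
)

# Parse a pipeline description from a JSON file (file name is illustrative).
with open('pipeline.json') as pipeline_file:
    pipeline = pipeline_module.Pipeline.from_json(pipeline_file)
```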

TA3-TA2 API

TA2 (AutoML) systems and TA3 systems have to communicate with each other. This API defines the protocol for doing so. If you are not writing a TA3 system you might think this API is not for you, but in fact it is used as a standard interface to interact with any D3M-compatible AutoML system. Example. Also see this simple TA3 system if you are starting your own TA3 system (an AutoML dashboard?) and need inspiration or an example.
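As a rough illustration, the sketch below shows a TA3-style client greeting a TA2 system over this gRPC API. The generated module names, message fields, and port are assumptions and may not match the API version you are using.

```python
# A hedged sketch of a client calling a TA2 system over the TA3-TA2 API.
# Module, message, and field names below are assumptions about the generated
# gRPC code; consult the ta3ta2-api repository for the authoritative protocol.
import grpc

import core_pb2
import core_pb2_grpc

# The port is illustrative; use whatever your TA2 system exposes.
channel = grpc.insecure_channel('localhost:45042')
stub = core_pb2_grpc.CoreStub(channel)

# Ask the TA2 system to describe itself before starting a solution search.
hello_response = stub.Hello(core_pb2.HelloRequest())
print(hello_response.user_agent, hello_response.version)
```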

Datasets

The D3M program provides many datasets in a uniform structure. Public datasets (those we can distribute) are available as a git repository. The structure of those datasets is described and standardized in this repository. The core package knows how to read those datasets into the standard Dataset object and convert any available metadata into standard D3M metadata (which is similar to, but not exactly the same as, the D3M dataset structure, as we want it to be more general).
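For example, loading such a dataset with the core package might look roughly like this; the dataset path and resource ID are illustrative, and details may differ between core package versions.

```python
# A minimal sketch of reading a D3M dataset into the standard Dataset object.
from d3m.container import dataset as dataset_module

# The URI points at the dataset description document; the path is illustrative.
dataset = dataset_module.Dataset.load(
    'file:///datasets/seed_datasets_current/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json',
)

# Resources (e.g., the main learning data table) are accessed by resource ID.
print(dataset['learningData'].head())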

Metadata

All values passed between TA1 primitives have additional metadata associated with them to help TA1 primitives make better sense of the data. Metadata also serves as a way to pass additional information to other primitives. Primitives themselves can be described with metadata as well. Pipelines, problem descriptions, and records of pipeline runs are also seen as metadata. All of that is standardized through JSON schemas. In addition, we use semantic types and maintain a list of semantic types commonly used in the program.
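The sketch below shows how metadata on a loaded dataset might be queried and extended with a semantic type; the resource ID, column index, and dataset path are illustrative, and method names are assumptions that may differ between core package versions.

```python
# A sketch of querying and updating D3M metadata on a loaded dataset.
from d3m import container
from d3m.metadata import base as metadata_base

dataset = container.Dataset.load('file:///path/to/datasetDoc.json')

# Query metadata for the first column of the 'learningData' resource.
column_metadata = dataset.metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 0))
print(column_metadata.get('name'), column_metadata.get('semantic_types'))

# Mark the column as an attribute by adding a semantic type.
# Metadata is immutable, so updating it returns a new metadata object.
dataset.metadata = dataset.metadata.add_semantic_type(
    ('learningData', metadata_base.ALL_ELEMENTS, 0),
    'https://metadata.datadrivendiscovery.org/types/Attribute',
)
```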

Metalearning database

Every pipeline which is run with the reference runtime produces a record of that run, called a pipeline run. Those pipeline runs (together with metadata about input datasets and the problem description) are stored in a centralized and shared metalearning database, building towards a large metalearning dataset. Ideally, all those pipeline runs are fully reproducible. We use the metalearning repository to coordinate work around metalearning.
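As a hedged sketch, producing such a pipeline run record with the reference runtime might look roughly like the following; class, attribute, and method names here are assumptions and may differ between core package versions.

```python
# A rough sketch of fitting a pipeline with the reference runtime and saving
# the resulting pipeline run record; names are assumptions, not authoritative.
from d3m import container, runtime as runtime_module
from d3m.metadata import base as metadata_base, pipeline as pipeline_module

dataset = container.Dataset.load('file:///path/to/TRAIN/dataset_TRAIN/datasetDoc.json')

with open('pipeline.json') as pipeline_file:
    pipeline = pipeline_module.Pipeline.from_json(pipeline_file)

runtime = runtime_module.Runtime(pipeline=pipeline, context=metadata_base.Context.TESTING)
fit_result = runtime.fit(inputs=[dataset])

# The record of this run is what gets submitted to the metalearning database.
with open('pipeline_run.yml', 'w') as run_file:
    fit_result.pipeline_run.to_yaml(run_file)
```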

Docker images

The D3M program has many moving pieces: many primitives, with many dependencies. Putting them all together so they work correctly can be tricky. This is why we provide Docker images with all primitives and dependencies installed, configured to work both with and without GPUs. Download a Docker image and some datasets, and you are ready to run some pipelines.