Advanced Pipelines

TODO: Document that custom/additional fields are allowed (which are part of digest). Document _prefix fields (which are not part of digest).

TODO: Document sub-pipeline step. Document how data references for sub-pipelines are done.

TODO: Document placeholder step.

TODO: Document resolving of pipelines (by filename based on ID in the pipeline search path).

Interaction with Problem Description

TODO: Passing true targets and LUPI through semantic types from the problem description.

Container types

All input and output (container) values passed between primitives should expose the Sequence protocol (a sequence of samples) and provide a metadata attribute holding their metadata.
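As a rough illustration of those two requirements, here is a minimal sketch in plain Python. The class MetadataList and the helper is_container_like are hypothetical names made up for this example; they are not part of d3m.container, and the plain dict stands in for the real metadata object.

```python
from collections.abc import Sequence


class MetadataList(list):
    """Hypothetical stand-in for a container value (not d3m.container.List)."""

    def __init__(self, iterable=(), metadata=None):
        super().__init__(iterable)
        # Assumption: a plain dict is enough to illustrate the attribute.
        self.metadata = dict(metadata or {})


def is_container_like(value):
    """Check for the Sequence protocol and a metadata attribute."""
    return isinstance(value, Sequence) and hasattr(value, "metadata")


# Each element of the sequence is one sample.
samples = MetadataList([[5.1, 3.5], [4.9, 3.0]], metadata={"name": "toy samples"})
```

A plain Python list satisfies the Sequence protocol but carries no metadata attribute, so it would fail this check.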

The d3m.container module exposes the following standard types:

List can be used to create a simple list container.

Using the DataFrame container type is strongly encouraged for primitives that have no strong reason to use something else (such as Dataset to operate on the initial pipeline input, ndarray for optimized high-dimensional packed data, or List to pass values to hyper-parameters). This makes it easier to operate on individual columns without type casting while the data is being transformed into a form that is useful for models.

When deciding which container type to use for the inputs and outputs of a primitive, also consider where your primitive is expected to sit in the pipeline. Generally, pipelines tend to have primitives operating on Dataset at the beginning, then use DataFrame, and then convert to ndarray.

Data types

Container types can contain values of the following types:

Placeholders

Placeholders can be used to define pipeline templates for use outside of the metalearning context. A placeholder is replaced with a pipeline step to form a pipeline. Restrictions may apply to the number of placeholders, their position, their allowed inputs and outputs, etc.
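The replacement described above can be sketched as follows. This is only an illustration under stated assumptions: the function fill_placeholders is hypothetical, and the dict layout (a "steps" list whose entries carry a "type" of "PRIMITIVE" or "PLACEHOLDER", plus the step names) is a simplified stand-in chosen for the example, not the full pipeline description format.

```python
import copy


def fill_placeholders(template, replacements):
    """Return a new pipeline with each placeholder step replaced, in order."""
    pipeline = copy.deepcopy(template)  # leave the template reusable
    remaining = iter(replacements)
    filled_steps = []
    for step in pipeline["steps"]:
        if step["type"] == "PLACEHOLDER":
            try:
                step = copy.deepcopy(next(remaining))
            except StopIteration:
                raise ValueError("not enough replacement steps") from None
        filled_steps.append(step)
    if next(remaining, None) is not None:
        raise ValueError("more replacement steps than placeholders")
    pipeline["steps"] = filled_steps
    return pipeline


# Hypothetical template: one fixed step followed by one placeholder.
template = {
    "steps": [
        {"type": "PRIMITIVE", "name": "parse_columns"},
        {"type": "PLACEHOLDER"},
    ]
}
pipeline = fill_placeholders(template, [{"type": "PRIMITIVE", "name": "random_forest"}])
```

The template is deep-copied so that the same template can be filled repeatedly with different steps. A real implementation would also need to enforce whatever restrictions apply to placeholder count, position, inputs and outputs.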