Write a Good Primitive
This is a list of dos, don'ts and things to consider when crafting a new primitive or updating an existing one. The list is not exhaustive, so please add new items as they are discovered! An example of a primitive that endeavors to adhere to all of the following guidance can be found here.
DO’s
* Do complete the documentation on the primitive, such as:
    * Primitive family, algorithm type.
    * Docstring of the primitive's Python class.
        * One-line summary first:
            * Primitive name should be close to this.
            * Primitive path should be close to this as well.
        * Longer documentation/description after, all in the main docstring of the class.
    * Provide pipeline examples together with the primitive annotation.
    * Docstrings in numpy style.
        * Please use reStructuredText instead of markdown or other formats.
    * Maintain a change-log of alterations to the primitive (somewhere in the primitive's repo; consider using a standard format).
    * Also add point-of-contact information and the git repository link to the primitive's metadata (the `source.name`, `source.contact` and `source.uris` metadata fields), e.g., as in the sketch below.
* Add your primitive name to the list of primitive names if it does not already exist. Chances are that your generic primitive name is already in that list and you should use that name for your primitive.
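For illustration, a minimal sketch of a primitive's metadata block with the `source` fields filled in. All IDs, names, and URIs below are hypothetical placeholders; substitute your own project's details:

```python
from d3m.metadata import base as metadata_base

# All values below are hypothetical placeholders.
metadata = metadata_base.PrimitiveMetadata({
    # Generate the 'id' once (e.g., with uuid.uuid4()) and never change it.
    'id': '3b4d7316-2a06-4ba0-9b6e-7a8f1d7e6a10',
    'version': '0.1.0',
    'name': 'Example imputer',
    'python_path': 'd3m.primitives.data_cleaning.imputer.Example',
    'source': {
        'name': 'Example Team',
        'contact': 'mailto:maintainer@example.com',
        'uris': ['https://gitlab.com/example-team/example-primitives'],
    },
    'algorithm_types': [metadata_base.PrimitiveAlgorithmType.IMPUTATION],
    'primitive_family': metadata_base.PrimitiveFamily.DATA_CLEANING,
})
```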
* Do annotate your primitive with Python types.
* Do make sure the output from your produce method is a d3m container type.
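A minimal sketch of both points for a transformer primitive (the class and module names are hypothetical; a real primitive also needs a `metadata` class attribute as sketched above):

```python
from d3m import container
from d3m.primitive_interfaces import base, transformer
from d3m.metadata import hyperparams

# Type annotations for inputs and outputs; both are d3m container types.
Inputs = container.DataFrame
Outputs = container.DataFrame

class Hyperparams(hyperparams.Hyperparams):
    pass  # hyper-parameters go here

class ExamplePrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        outputs = inputs.copy()  # a d3m container.DataFrame, not a plain pandas.DataFrame
        return base.CallResult(outputs)
```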
* If your primitive is operating on columns and rows:
    * Do include the `d3mIndex` column in produced output if the input has a `d3mIndex` column.
        * You can make this behavior controlled by the `add_index_columns` hyper-parameter.
    * If a primitive has a hyper-parameter to directly set which columns to operate on, do use column indices and not column names to identify those columns.
        * Consider using a pair of hyper-parameters, `use_columns` and `exclude_columns`, with standard logic (see the sketch below).
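One possible way to define these hyper-parameters, loosely following the convention used by the common primitives (treat the exact defaults and descriptions as assumptions):

```python
from d3m.metadata import hyperparams

class Hyperparams(hyperparams.Hyperparams):
    use_columns = hyperparams.Set(
        elements=hyperparams.Hyperparameter[int](-1),
        default=(),
        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'],
        description="A set of column indices to force the primitive to operate on. If any specified column cannot be used, it is skipped.",
    )
    exclude_columns = hyperparams.Set(
        elements=hyperparams.Hyperparameter[int](-1),
        default=(),
        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'],
        description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.",
    )
    add_index_columns = hyperparams.UniformBool(
        default=True,
        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'],
        description="Also include primary index columns if the input data has them.",
    )
```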
* When using semantic types to decide which columns to operate on, do use the `https://metadata.datadrivendiscovery.org/types/TrueTarget` and `https://metadata.datadrivendiscovery.org/types/RedactedPrivilegedData` semantic types and not `https://metadata.datadrivendiscovery.org/types/SuggestedTarget` and `https://metadata.datadrivendiscovery.org/types/SuggestedPrivilegedData`. The latter are semantic types which come from the dataset, the former are those which come from the problem description. While in practice they currently almost always match, primitives should respect only those coming from the problem description; the dataset carries them so that problem descriptions can be created on the fly if needed.
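A sketch of selecting columns by semantic type through the attached metadata, assuming `inputs` is a d3m `container.DataFrame` (the function name is illustrative):

```python
import typing

from d3m.metadata import base as metadata_base

def target_column_indices(inputs) -> typing.List[int]:
    # Inspect each column's metadata and keep those marked as true targets
    # by the problem description, not merely suggested by the dataset.
    indices = []
    num_columns = inputs.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']
    for column_index in range(num_columns):
        column_metadata = inputs.metadata.query_column(column_index)
        semantic_types = column_metadata.get('semantic_types', ())
        if 'https://metadata.datadrivendiscovery.org/types/TrueTarget' in semantic_types:
            indices.append(column_index)
    return indices
```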
* Be mindful that data being passed through a pipeline also has metadata:
    * If your primitive generates new data (e.g., new columns), add metadata suitable for those columns:
        * Name each column appropriately for human consumption by setting the column's `name` metadata.
        * Set semantic types appropriately.
        * If your primitive is producing target predictions, add `https://metadata.datadrivendiscovery.org/types/PredictedTarget` to each column containing those predictions.
        * Remember metadata encountered on target columns during fitting, and reuse that metadata as much as reasonable when producing target predictions.
    * If your primitive is transforming existing data (e.g., transforming columns), reuse as much metadata from the original data as reasonable, but do update metadata based on the new data.
        * If the structural type of a column changes, make sure you note this change in the metadata as well.
    * Also support non-standard metadata and try to pass it through as-is if possible.
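For example, naming a generated predictions column and marking it as a predicted target might look like this (a sketch; `outputs` is assumed to be a d3m `container.DataFrame` with the predictions in its last column):

```python
from d3m.metadata import base as metadata_base

column_index = outputs.shape[1] - 1

# Metadata is immutable; update methods return a new metadata object.
outputs.metadata = outputs.metadata.update_column(column_index, {
    'name': 'predictions',  # human-readable column name
})
outputs.metadata = outputs.metadata.add_semantic_type(
    (metadata_base.ALL_ELEMENTS, column_index),
    'https://metadata.datadrivendiscovery.org/types/PredictedTarget',
)
```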
* Do write unit tests for your primitives. This greatly aids porting to a new version of the core package.
    * Test pickling and unpickling of the primitive, both fitted and unfitted (see the sketch below).
    * Test both with and without the use of semantic types to select columns to operate on.
    * Test with all return types: `append`, `replace`, `new`.
    * Test all hyper-parameter values with their `sample` method.
    * Use/contribute to the tests data repository.
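A minimal sketch of a pickling test for an unfitted primitive. The primitive and module names are hypothetical; a fitted-primitive test would additionally call `set_training_data` and `fit` before pickling:

```python
import pickle
import unittest

from example_primitives.example import ExamplePrimitive  # hypothetical module

class PicklingTestCase(unittest.TestCase):
    def test_pickle_unfitted(self):
        # Obtain the Hyperparams class from the primitive's metadata.
        hyperparams_class = ExamplePrimitive.metadata.get_hyperparams()
        primitive = ExamplePrimitive(hyperparams=hyperparams_class.defaults())

        # Round-trip through pickle and check we get a usable primitive back.
        restored = pickle.loads(pickle.dumps(primitive))
        self.assertIsInstance(restored, ExamplePrimitive)

if __name__ == '__main__':
    unittest.main()
```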
* Do clearly define hyper-parameters (bounds, descriptions, semantic types); see the sketch below.
    * Suggest new classes of hyper-parameters if needed.
    * Consider whether `upper_inclusive` and `lower_inclusive` values should be included or not for every hyper-parameter.
    * Define reasonable hyper-parameters which can be automatically populated/searched by a TA2. A hyper-parameter such as `hyperparams.Hyperparameter[typing.Sequence[Any]]` is not useful in this case.
    * Ensure that your primitive can be run successfully with default settings for all hyper-parameters.
    * If there are combinations of hyper-parameter settings that are suboptimal, please note this in the documentation. For example: "If hyper-parameter A is set to True, hyper-parameter B must always be a positive integer."
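For instance, a tuning hyper-parameter with explicit bounds and inclusivity might be defined like this (a sketch; the name, bounds, and description are illustrative):

```python
from d3m.metadata import hyperparams

class Hyperparams(hyperparams.Hyperparams):
    learning_rate = hyperparams.Uniform(
        lower=0.0,
        upper=1.0,
        default=0.01,
        # Decide explicitly whether each bound is included in the range.
        lower_inclusive=False,
        upper_inclusive=True,
        semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'],
        description="Step size used during fitting.",
    )
```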
* Do bump the primitive's version when changing hyper-parameters, method signatures, or params. In short, bump on any API change of your primitive.
* If your primitive can use GPUs when available, set the `can_use_gpus` field in the primitive's metadata to true.
* If your primitive can use a varying number of CPUs/cores, expose a hyper-parameter with the semantic types `https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter` and `https://metadata.datadrivendiscovery.org/types/CPUResourcesUseParameter`, and allow the caller to control the number of CPUs/cores used through it (see the sketch below).
    * Make sure that the default value of such a hyper-parameter is 1.
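One way such a hyper-parameter might look (a sketch; the name `n_jobs` and the upper bound are arbitrary choices, not a required convention):

```python
from d3m.metadata import hyperparams

class Hyperparams(hyperparams.Hyperparams):
    n_jobs = hyperparams.UniformInt(
        lower=1,
        upper=128,  # arbitrary upper bound; exclusive by default
        default=1,  # default should be 1
        semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter',
            'https://metadata.datadrivendiscovery.org/types/CPUResourcesUseParameter',
        ],
        description="Number of CPU cores the primitive is allowed to use.",
    )
```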
DON’Ts
* Don't change the input DataFrame! Make a copy and make changes to the copy instead. The original input DataFrame is assumed never to change as it passes between primitives in the pipeline.
* Don't return DataFrames with a (non-default) Pandas index. An index can be used internally, but drop it before returning; on output, a default index should be provided. A sketch of this pattern follows the list.
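A sketch of a `produce` method honoring both points (the class name is a stand-in for your primitive; verify that the `.metadata` attribute survives `reset_index` in your core package version):

```python
from d3m.primitive_interfaces import base

class ExampleTransformer:  # stands in for your primitive class
    def produce(self, *, inputs, timeout=None, iterations=None):
        # Never mutate `inputs`: work on a copy.
        outputs = inputs.copy()
        # ... transform `outputs` here ...
        # Drop any internal (non-default) index before returning.
        outputs = outputs.reset_index(drop=True)
        return base.CallResult(outputs)
```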
PLEASE CONSIDER
* Consider using/supporting semantic types to select which columns to operate on, and use the `use_semantic_types` hyper-parameter to control this behavior.
* Consider allowing three output strategies when operating on columns, `new`/`append`/`replace`, controlled by the `return_result` hyper-parameter (see the sketch below).
* Consider picking the input and output format/structure of data to match other primitives of the same family/type. If necessary, convert data to the format you need inside your primitive. Pipelines tend to start with datasets, then move to DataFrames, and sometimes then to ndarrays, returning predictions as a DataFrame. Consider where your primitive generally sits in a pipeline and take that into account when deciding on your primitive's inputs and outputs.
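A sketch of the `return_result` hyper-parameter (the description text is illustrative):

```python
from d3m.metadata import hyperparams

class Hyperparams(hyperparams.Hyperparams):
    return_result = hyperparams.Enumeration(
        values=['append', 'replace', 'new'],
        default='new',
        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'],
        description="Should produced columns be appended to the input, replace the columns they were computed from, or be returned alone?",
    )
```

Recent versions of the core package provide helpers for applying this strategy (e.g., a `combine_columns` utility); check whether your core package version offers one before implementing the logic yourself.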
* Consider using a specific hyper-parameter class instead of the hyper-parameter base class, as the base class is not very useful for TA2s. For example, use `hyperparams.Set` instead of `hyperparams.Hyperparameter[typing.Sequence[Any]]`; the former is far more descriptive.
* Use a base class for your primitive which makes sense based on the semantics of the base class, and not necessarily on how a human would understand the primitive.
* Consider that your primitive will be chosen dynamically by a TA2 and will be expected to behave in predictable ways based on its primitive family and base class.