Load Datasets and Problems


This package also provides a Python class to load and represent datasets, available through the d3m.container.dataset module. The resulting container value can serve as the input to a whole pipeline and as the input to primitives that operate on a dataset as a whole. Multiple loaders can be registered to support different dataset formats. You pass a URI to a dataset and the right loader is picked automatically. By default, URIs are supported for D3M datasets, CSV files, OpenML datasets, and Sklearn datasets:

  • D3M datasets. Only the file:// URI scheme is supported, and the URI should point to the datasetDoc.json file. Example: file:///path/to/datasetDoc.json

  • CSV files. Many URI schemes are supported, including remote ones like http://. The URI should point to a file with a .csv extension. Example: http://example.com/iris.csv

  • OpenML datasets. You need to provide the URL of the dataset page. Example: https://www.openml.org/d/31

  • Some Sklearn datasets from sklearn.datasets. Example: sklearn://boston
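Because D3M datasets must be addressed with a file:// URI rather than a plain path, it can help to build that URI programmatically. The following is a minimal sketch using only the Python standard library; dataset_doc_uri is a hypothetical helper introduced here for illustration and is not part of d3m, and the directory name is a placeholder.

```python
from pathlib import Path


def dataset_doc_uri(dataset_dir: str) -> str:
    """Return a file:// URI for the datasetDoc.json inside dataset_dir.

    Hypothetical helper for illustration; it is not part of d3m.
    """
    doc_path = Path(dataset_dir) / "datasetDoc.json"
    # as_uri() only works on absolute paths, so resolve the path first.
    return doc_path.resolve().as_uri()


# "path/to/dataset" is a placeholder directory name.
uri = dataset_doc_uri("path/to/dataset")
print(uri)
```

The resulting string has the form expected for D3M datasets and can be passed as the URI when loading a dataset.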

To load a dataset, call the load method of the Dataset class, passing the URI as a parameter. Below, you can see how to load an OpenML dataset:

from d3m.container import Dataset

# URL of the OpenML dataset page.
dataset_uri = 'https://www.openml.org/d/62'
# The loader is picked automatically based on the URI.
dataset = Dataset.load(dataset_uri)

You can save the previously loaded dataset in D3M format using the save_container function. You only need to provide the path where the dataset will be saved:

from d3m.container.utils import save_container

# `dataset` is the Dataset object loaded in the previous example.
destination_path = 'path_to_save_the_dataset'
save_container(dataset, destination_path)

Together, load and save_container automatically convert non-D3M datasets (e.g. CSV files) and save them in D3M format. However, if you want to perform this conversion manually, here you can find more information.

TODO: How to write a dataset/problem loader.

TODO: Document OpenML Crawler.

Problem Descriptions

The d3m.metadata.problem module provides a parser that turns a problem description into a normalized Python object.

You can load a problem description and print the parsed object back out by running:

python3 -m d3m problem describe <path to problemDoc.json>