d3m.container.dataset

class d3m.container.dataset.ComputeDigest(value)[source]

Bases: d3m.utils.Enum

Enumeration of possible approaches to computing dataset digest.

ALWAYS = 'ALWAYS'[source]
NEVER = 'NEVER'[source]
ONLY_IF_MISSING = 'ONLY_IF_MISSING'[source]
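
Example (illustrative sketch, not d3m code): one way the three policies might be interpreted. The enum below is a stand-in built on the standard library's enum.Enum rather than d3m.utils.Enum. The helper should_compute_digest and its existing_digest argument are hypothetical names introduced here.

```python
import enum
from typing import Optional


class ComputeDigest(enum.Enum):
    # Stand-in mirroring the documented members; not the d3m class itself.
    ALWAYS = 'ALWAYS'
    NEVER = 'NEVER'
    ONLY_IF_MISSING = 'ONLY_IF_MISSING'


def should_compute_digest(policy: ComputeDigest, existing_digest: Optional[str]) -> bool:
    """Hypothetical helper: decide whether a digest needs to be computed."""
    if policy is ComputeDigest.ALWAYS:
        return True
    if policy is ComputeDigest.NEVER:
        return False
    # ONLY_IF_MISSING: compute only when no digest is available yet.
    return existing_digest is None
```

Under this reading, ONLY_IF_MISSING avoids redundant work when a digest is already recorded, while ALWAYS recomputes regardless.
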
class d3m.container.dataset.Dataset(resources, metadata=None, *, load_lazy=None, generate_metadata=False, check=True, source=None, timestamp=None)[source]

Bases: dict

A class representing a dataset.

Internally, it is a dictionary containing multiple resources (e.g., tables).

Parameters
  • resources (Mapping) – A map from resource IDs to resources.

  • metadata (d3m.metadata.base.DataMetadata) – Metadata associated with the data.

  • load_lazy (Optional[Callable[[Dataset], None]]) – If constructing a lazy dataset, calling this function will read all the data and convert the dataset to a non-lazy one.

  • generate_metadata (bool) – Automatically generate and update the metadata.

  • check (bool) – DEPRECATED: argument ignored.

  • source (Optional[Any]) – DEPRECATED: argument ignored.

  • timestamp (Optional[datetime]) – DEPRECATED: argument ignored.

copy()[source]
Return type

~D

get_column_references_by_column_index()[source]
Return type

Dict[str, Dict[ColumnReference, List[ColumnReference]]]

get_relations_graph()[source]

Builds the relations graph for the dataset.

Each key in the output corresponds to a resource/table. The value under a key is the list of edges this table has. Each edge is represented by a tuple of five elements. For example, if the edge is (resource_id, True, index_1, index_2, custom_state), it means that there is a foreign key that points to table resource_id. Specifically, the index_1 column in the current table points to the index_2 column in the table resource_id.

custom_state is an empty dict when returned from this method, but allows users of this graph to store custom state there.

Returns

The relations graph in adjacency representation.

Return type

Dict[str, List[Tuple[str, bool, int, int, Dict]]]
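
Example (illustrative sketch, not d3m code): walking a graph of the documented shape. The graph is written by hand instead of being produced by get_relations_graph(), and the resource IDs and column indices are made up. The reverse edge flagged False is an assumption about how the opposite direction might appear. The helper forward_foreign_keys is a hypothetical name.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, bool, int, int, Dict]

# Hand-written graph in the documented shape; resource IDs and column
# indices are invented for illustration.
graph: Dict[str, List[Edge]] = {
    'learningData': [('customers', True, 1, 0, {})],
    # Assumed reverse edge; the boolean is False here.
    'customers': [('learningData', False, 0, 1, {})],
}


def forward_foreign_keys(graph: Dict[str, List[Edge]]) -> List[Tuple[str, int, str, int]]:
    """List (table, column, target_table, target_column) for edges flagged True."""
    result = []
    for resource_id, edges in graph.items():
        for target_id, flag, index_1, index_2, custom_state in edges:
            if flag:
                result.append((resource_id, index_1, target_id, index_2))
    return result


# The per-edge dict is free for callers to use, e.g. to mark visited edges.
graph['learningData'][0][4]['visited'] = True
```

Here column 1 of learningData points to column 0 of customers, and the custom state dict is used to record that the edge was visited.
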

is_lazy()[source]

Return whether this dataset instance is lazy and not all data has been loaded.

Returns

True if this dataset instance is lazy.

Return type

bool

classmethod load(dataset_uri, *, dataset_id=None, dataset_version=None, dataset_name=None, lazy=False, compute_digest=<ComputeDigest.ONLY_IF_MISSING: 'ONLY_IF_MISSING'>, strict_digest=False, handle_score_split=True)[source]

Tries to load dataset from dataset_uri using all registered dataset loaders.

Parameters
  • dataset_uri (str) – A URI to load.

  • dataset_id (Optional[str]) – Override dataset ID determined by the loader.

  • dataset_version (Optional[str]) – Override dataset version determined by the loader.

  • dataset_name (Optional[str]) – Override dataset name determined by the loader.

  • lazy (bool) – If True, load only top-level metadata and not whole dataset.

  • compute_digest (ComputeDigest) – Compute a digest over the data?

  • strict_digest (bool) – If computed digest does not match the one provided in metadata, raise an exception?

  • handle_score_split (bool) – If a scoring dataset has target values in a separate file, merge them in?

Returns

A loaded dataset.

Return type

Dataset
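
Example (illustrative sketch, not d3m code): the kind of comparison strict_digest controls. SHA-256 over raw bytes from hashlib is only a stand-in for however the library computes its digest. check_digest and DigestMismatchError are hypothetical names.

```python
import hashlib


class DigestMismatchError(ValueError):
    """Hypothetical exception type used only in this sketch."""


def check_digest(data: bytes, expected_digest: str, *, strict_digest: bool = False) -> bool:
    """Compare a computed digest with the expected one.

    Returns True on a match. On a mismatch, raises if strict_digest is True
    and returns False otherwise.
    """
    computed = hashlib.sha256(data).hexdigest()
    matches = computed == expected_digest
    if not matches and strict_digest:
        raise DigestMismatchError(
            'digest mismatch: computed {}, expected {}'.format(computed, expected_digest)
        )
    return matches
```
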

load_lazy()[source]

Read all the data and convert the dataset to a non-lazy one.

Return type

None
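
Example (illustrative sketch, not d3m code): the lazy-loading pattern described by the load_lazy constructor argument, is_lazy() and load_lazy(). LazyContainer is a stand-in dict subclass and does not reproduce Dataset's implementation. The callback read_everything and its data are hypothetical.

```python
from typing import Callable, Optional


class LazyContainer(dict):
    """Stand-in dict subclass illustrating the pattern; not d3m's Dataset."""

    def __init__(self, resources, load_lazy: Optional[Callable[['LazyContainer'], None]] = None):
        super().__init__(resources)
        # A stored callback means the data has not been fully read yet.
        self._load_lazy = load_lazy

    def is_lazy(self) -> bool:
        return self._load_lazy is not None

    def load_lazy(self) -> None:
        if self._load_lazy is None:
            return  # Already non-lazy; nothing to do.
        # Clear the callback first so the container is non-lazy afterwards.
        loader, self._load_lazy = self._load_lazy, None
        loader(self)


def read_everything(container: LazyContainer) -> None:
    # Hypothetical callback that fills in the actual data.
    container['learningData'] = [[0, 'a'], [1, 'b']]


dataset = LazyContainer({}, load_lazy=read_everything)
```

Calling dataset.load_lazy() once populates the container. In this sketch later calls are no-ops.
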

classmethod register_loader(loader)[source]

Registers a new dataset loader.

Parameters

loader (Loader) – An instance of the loader class implementing a new loader.

Return type

None
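
Example (illustrative sketch, not d3m code): a loader registry that tries each registered loader in turn. The Loader interface is not described in this section, so the method names can_load and load, the TextLoader class and the Registry class are assumptions made for illustration.

```python
from typing import Any, List


class Loader:
    """Stand-in base class; can_load and load are assumed method names."""

    def can_load(self, dataset_uri: str) -> bool:
        raise NotImplementedError

    def load(self, dataset_uri: str) -> Any:
        raise NotImplementedError


class TextLoader(Loader):
    """Hypothetical loader that only accepts URIs ending in .txt."""

    def can_load(self, dataset_uri: str) -> bool:
        return dataset_uri.endswith('.txt')

    def load(self, dataset_uri: str) -> Any:
        return {'uri': dataset_uri, 'kind': 'text'}


class Registry:
    loaders: List[Loader] = []

    @classmethod
    def register_loader(cls, loader: Loader) -> None:
        cls.loaders.append(loader)

    @classmethod
    def load(cls, dataset_uri: str) -> Any:
        # Use the first registered loader that accepts the URI.
        for loader in cls.loaders:
            if loader.can_load(dataset_uri):
                return loader.load(dataset_uri)
        raise ValueError('no registered loader can load {!r}'.format(dataset_uri))


Registry.register_loader(TextLoader())
```
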

classmethod register_saver(saver)[source]

Registers a new dataset saver.

Parameters

saver (Saver) – An instance of the saver class implementing a new saver.

Return type

None

save(dataset_uri, *, compute_digest=<ComputeDigest.ALWAYS: 'ALWAYS'>, preserve_metadata=True)[source]

Tries to save dataset to dataset_uri using all registered dataset savers.

Parameters
  • dataset_uri (str) – A URI to save to.

  • compute_digest (ComputeDigest) – Compute digest over the data when saving?

  • preserve_metadata (bool) – When saving a dataset, store its metadata as well?

Return type

None

select_rows(row_indices_to_keep)[source]

Generate a new Dataset that keeps only the given rows of its DataFrame resources.

Parameters

row_indices_to_keep (Mapping[str, Sequence[int]]) – A mapping from resource ID to a sequence of row indices to keep. If a resource ID is missing, the whole related resource is kept.

Returns

Returns a new Dataset.

Return type

~D
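
Example (illustrative sketch, not d3m code): the documented semantics of row_indices_to_keep, using plain lists of rows as a stand-in for DataFrames. The standalone select_rows function and the sample resources are hypothetical.

```python
from typing import Dict, List, Mapping, Sequence

Table = List[dict]


def select_rows(resources: Mapping[str, Table],
                row_indices_to_keep: Mapping[str, Sequence[int]]) -> Dict[str, Table]:
    """Return new tables, keeping listed rows and copying unlisted resources whole."""
    selected: Dict[str, Table] = {}
    for resource_id, rows in resources.items():
        if resource_id in row_indices_to_keep:
            selected[resource_id] = [rows[i] for i in row_indices_to_keep[resource_id]]
        else:
            # Resource ID missing from the mapping: keep the whole resource.
            selected[resource_id] = list(rows)
    return selected


resources = {
    'learningData': [{'d3mIndex': 0}, {'d3mIndex': 1}, {'d3mIndex': 2}],
    'extra': [{'value': 'x'}, {'value': 'y'}],
}
subset = select_rows(resources, {'learningData': [0, 2]})
```

Only learningData is filtered. The extra resource is not named in the mapping, so it is kept in full, and the original resources are left unchanged.
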

to_json_structure(*, canonical=False)[source]

Returns only a top-level dataset description.

Return type

Dict

loaders: List[d3m.container.dataset.Loader] = [<d3m.container.dataset.D3MDatasetLoader object>, <d3m.container.dataset.CSVLoader object>, <d3m.container.dataset.SklearnExampleLoader object>, <d3m.container.dataset.OpenMLDatasetLoader object>][source]
metadata: d3m.metadata.base.DataMetadata[source]
savers: List[d3m.container.dataset.Saver] = [<d3m.container.dataset.D3MDatasetSaver object>][source]