d3m.container.dataset

class d3m.container.dataset.ComputeDigest(value)

Bases: d3m.utils.Enum

Enumeration of possible approaches to computing dataset digest.

ALWAYS = 'ALWAYS'

NEVER = 'NEVER'

ONLY_IF_MISSING = 'ONLY_IF_MISSING'
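
The three members can be read as a simple decision rule. The sketch below is a stand-in written with the standard library, not the d3m class itself; `should_compute_digest` is a hypothetical helper, and the reading of ONLY_IF_MISSING as "compute only when no digest is already present" is an assumption based on the member name.

```python
from enum import Enum
from typing import Optional


class ComputeDigest(Enum):
    # Stand-in mirroring the three documented members; not the d3m class.
    ALWAYS = 'ALWAYS'
    NEVER = 'NEVER'
    ONLY_IF_MISSING = 'ONLY_IF_MISSING'


def should_compute_digest(mode: ComputeDigest, existing_digest: Optional[str]) -> bool:
    """Hypothetical helper: decide whether a digest has to be computed."""
    if mode is ComputeDigest.ALWAYS:
        return True
    if mode is ComputeDigest.NEVER:
        return False
    # ONLY_IF_MISSING: assumed to mean "compute only if no digest exists yet".
    return existing_digest is None
```

Members can also be looked up by value, for example `ComputeDigest('NEVER')`, which is standard `Enum` behaviour.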

class d3m.container.dataset.Dataset(resources, metadata=None, *, load_lazy=None, generate_metadata=False, check=True, source=None, timestamp=None)

Bases: dict

A class representing a dataset.

Internally, it is a dictionary containing multiple resources (e.g., tables).

Parameters

resources (Mapping) – A map from resource IDs to resources.
metadata (d3m.metadata.base.DataMetadata) – Metadata associated with the data.
load_lazy (Optional[Callable[[Dataset], None]]) – If constructing a lazy dataset, calling this function will read all the data and convert the dataset to a non-lazy one.
generate_metadata (bool) – Automatically generate and update the metadata.
check (bool) – DEPRECATED: argument ignored.
source (Optional[Any]) – DEPRECATED: argument ignored.
timestamp (Optional[datetime]) – DEPRECATED: argument ignored.

copy()

Return type: ~D

get_column_references_by_column_index()

Return type: Dict[str, Dict[ColumnReference, List[ColumnReference]]]

get_relations_graph()

Builds the relations graph for the dataset.

Each key in the output corresponds to a resource/table. The value under a key is the list of edges this table has. Each edge is represented by a tuple of five elements. For example, if the edge is (resource_id, True, index_1, index_2, custom_state), it means that there is a foreign key that points to table resource_id. Specifically, the index_1 column in the current table points to the index_2 column in the table resource_id.

custom_state is an empty dict when returned from this method, but allows users of this graph to store custom state there.

Returns: The relations graph in adjacency-list representation.
Return type: Dict[str, List[Tuple[str, bool, int, int, Dict]]]
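
The edge layout described above can be sketched with plain Python data. The graph below is hand-written for illustration (the resource IDs and column indices are made up), and `describe_foreign_keys` is a hypothetical helper. Only the True form of the edge is documented here, so the sketch skips edges whose second element is False rather than guessing at their meaning.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, bool, int, int, Dict]

# Hand-written example: column 1 of 'learningData' points to column 0 of 'authors'.
graph: Dict[str, List[Edge]] = {
    'learningData': [('authors', True, 1, 0, {})],
}


def describe_foreign_keys(graph: Dict[str, List[Edge]], resource_id: str) -> List[str]:
    """Hypothetical helper: list the foreign keys leaving one table."""
    descriptions = []
    for target_id, flag, from_index, to_index, custom_state in graph[resource_id]:
        if not flag:
            # The False case is not described in this section, so it is skipped.
            continue
        descriptions.append(f"{resource_id}[{from_index}] -> {target_id}[{to_index}]")
        # custom_state starts empty and is free for callers to use.
        custom_state['visited'] = True
    return descriptions


foreign_keys = describe_foreign_keys(graph, 'learningData')
```

Because custom_state is an ordinary dict stored inside the edge tuple, the marker written by the helper stays attached to the graph for later passes.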

is_lazy()

Return whether this dataset instance is lazy and not all data has been loaded.

Returns: True if this dataset instance is lazy.
Return type: bool

classmethod load(dataset_uri, *, dataset_id=None, dataset_version=None, dataset_name=None, lazy=False, compute_digest=<ComputeDigest.ONLY_IF_MISSING: 'ONLY_IF_MISSING'>, strict_digest=False, handle_score_split=True)

Tries to load dataset from dataset_uri using all registered dataset loaders.

Parameters

dataset_uri (str) – A URI to load.
dataset_id (Optional[str]) – Override dataset ID determined by the loader.
dataset_version (Optional[str]) – Override dataset version determined by the loader.
dataset_name (Optional[str]) – Override dataset name determined by the loader.
lazy (bool) – If True, load only top-level metadata and not whole dataset.
compute_digest (ComputeDigest) – Compute a digest over the data?
strict_digest (bool) – If computed digest does not match the one provided in metadata, raise an exception?
handle_score_split (bool) – If a scoring dataset has target values in a separate file, merge them in?

Returns: A loaded dataset.
Return type: Dataset

load_lazy()

Read all the data and convert the dataset to a non-lazy one.

Return type: None
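
The relationship between the load_lazy constructor argument, is_lazy() and load_lazy() can be illustrated with a small dict subclass. This is a sketch of the general pattern under stated assumptions, not d3m's implementation: `LazyDict`, `read_everything` and the private `_load_lazy` attribute are hypothetical names.

```python
from typing import Callable, Optional


class LazyDict(dict):
    """Hypothetical stand-in for a lazily loaded container; not d3m's Dataset."""

    def __init__(self, resources, load_lazy: Optional[Callable[['LazyDict'], None]] = None) -> None:
        super().__init__(resources)
        # The callback is kept until the data has actually been read.
        self._load_lazy = load_lazy

    def is_lazy(self) -> bool:
        # Lazy means a pending callback still has to fill in the data.
        return self._load_lazy is not None

    def load_lazy(self) -> None:
        if self._load_lazy is None:
            return  # Already non-lazy; nothing to do.
        loader, self._load_lazy = self._load_lazy, None
        loader(self)


def read_everything(dataset: LazyDict) -> None:
    # Made-up data standing in for resources read from storage.
    dataset['learningData'] = [[0, 'a'], [1, 'b']]


dataset = LazyDict({}, load_lazy=read_everything)
was_lazy = dataset.is_lazy()
dataset.load_lazy()
```

In this sketch the callback is cleared before it runs, so a second call to load_lazy() does nothing.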

classmethod register_loader(loader)

Registers a new dataset loader.

Parameters: loader (Loader) – An instance of the loader class implementing a new loader.
Return type: None
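
Registering loaders and then trying each of them in turn is a common registry pattern, sketched below with the standard library only. The names `Registry` and `ExampleLoader`, the `can_load` method and the single-argument `load` are assumptions made for illustration; the real Loader interface is not described in this section.

```python
from typing import Any, List


class ExampleLoader:
    """Hypothetical loader that accepts URIs of a single scheme."""

    def __init__(self, scheme: str) -> None:
        self.scheme = scheme

    def can_load(self, dataset_uri: str) -> bool:
        return dataset_uri.startswith(self.scheme + '://')

    def load(self, dataset_uri: str) -> Any:
        # A real loader would read data; this one only records who handled it.
        return {'loaded_by': self.scheme, 'uri': dataset_uri}


class Registry:
    """Hypothetical stand-in for a class that keeps a list of loaders."""

    loaders: List[ExampleLoader] = []

    @classmethod
    def register_loader(cls, loader: ExampleLoader) -> None:
        cls.loaders.append(loader)

    @classmethod
    def load(cls, dataset_uri: str) -> Any:
        # Try loaders in registration order; the first that accepts the URI wins.
        for loader in cls.loaders:
            if loader.can_load(dataset_uri):
                return loader.load(dataset_uri)
        raise ValueError(f"No registered loader can load: {dataset_uri}")


Registry.register_loader(ExampleLoader('file'))
Registry.register_loader(ExampleLoader('https'))
result = Registry.load('https://example.com/datasetDoc.json')
```

Keeping the loaders in a class-level list means that registration order decides which loader is tried first.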

classmethod register_saver(saver)

Registers a new dataset saver.

Parameters: saver (Saver) – An instance of the saver class implementing a new saver.
Return type: None

save(dataset_uri, *, compute_digest=<ComputeDigest.ALWAYS: 'ALWAYS'>, preserve_metadata=True)

Tries to save dataset to dataset_uri using all registered dataset savers.

Parameters

dataset_uri (str) – A URI to save to.
compute_digest (ComputeDigest) – Compute digest over the data when saving?
preserve_metadata (bool) – When saving a dataset, store its metadata as well?

Return type: None

select_rows(row_indices_to_keep)

Generate a new Dataset that keeps only the given row indices of DataFrame resources.

Parameters: row_indices_to_keep (Mapping[str, Sequence[int]]) – A mapping from resource ID to a sequence of row indices to keep. If a resource ID is missing, the whole resource is kept.
Returns: A new Dataset.
Return type: ~D
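
The selection rule for row_indices_to_keep can be modelled on plain lists of rows. The function below is a hypothetical stand-in that only mirrors the documented rule (listed resources are filtered, unlisted ones are kept whole); it works on lists rather than DataFrames and is not the d3m method.

```python
from typing import Dict, List, Mapping, Sequence


def select_rows(
    resources: Mapping[str, List[list]],
    row_indices_to_keep: Mapping[str, Sequence[int]],
) -> Dict[str, List[list]]:
    """Hypothetical model of the rule: filter listed resources, keep the rest."""
    selected: Dict[str, List[list]] = {}
    for resource_id, rows in resources.items():
        if resource_id in row_indices_to_keep:
            selected[resource_id] = [rows[i] for i in row_indices_to_keep[resource_id]]
        else:
            # Resource ID missing from the mapping: keep every row.
            selected[resource_id] = list(rows)
    return selected


# Made-up resources for illustration.
resources = {
    'learningData': [[0, 'a'], [1, 'b'], [2, 'c']],
    'authors': [[10, 'x'], [11, 'y']],
}
subset = select_rows(resources, {'learningData': [0, 2]})
```

The input mapping is left unchanged; a new dict is returned, in line with the method returning a new Dataset.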

to_json_structure(*, canonical=False)

Returns only a top-level dataset description.

Return type: Dict

loaders: List[d3m.container.dataset.Loader] = [<d3m.container.dataset.D3MDatasetLoader object>, <d3m.container.dataset.CSVLoader object>, <d3m.container.dataset.SklearnExampleLoader object>, <d3m.container.dataset.OpenMLDatasetLoader object>]

metadata: d3m.metadata.base.DataMetadata

savers: List[d3m.container.dataset.Saver] = [<d3m.container.dataset.D3MDatasetSaver object>]
