d3m.contrib.openml.crawler

d3m.contrib.openml.crawler.crawl_openml(save_dir, task_types, *, data_pipeline, data_params=None, context, random_seed=0, volumes_dir=None, scratch_dir=None, runtime_environment=None, max_tasks=None, ignore_tasks=[], ignore_datasets=[], dataset_resolver=None, problem_resolver=None, compute_digest=<ComputeDigest.ONLY_IF_MISSING: 'ONLY_IF_MISSING'>, strict_digest=False)[source]

A function that crawls OpenML tasks and their corresponding datasets and converts them to D3M datasets and problems.

Parameters
  • save_dir (str) – A directory in which to save datasets and problems.

  • task_types (Sequence[OpenMLTaskType]) – Task types to crawl.

  • data_pipeline (Pipeline) – A data preparation pipeline used for splitting.

  • data_params (Optional[Dict[str, str]]) – A dictionary of hyper-parameters for the data preparation pipeline.

  • context (Context) – The context in which to run pipelines.

  • random_seed (int) – A random seed to use for every run. This controls all randomness during the run.

  • volumes_dir (Optional[str]) – Path to a directory with static files required by primitives.

  • scratch_dir (Optional[str]) – Path to a directory to store any temporary files needed during execution.

  • runtime_environment (Optional[RuntimeEnvironment]) – A description of the runtime environment.

  • max_tasks (Optional[int]) – Maximum number of tasks to crawl; no limit if None or 0.

  • ignore_tasks (Sequence[int]) – OpenML task IDs to ignore.

  • ignore_datasets (Sequence[int]) – OpenML dataset IDs to ignore.

  • dataset_resolver (Optional[Callable]) – A dataset resolver to use.

  • problem_resolver (Optional[Callable]) – A problem description resolver to use.

  • compute_digest (ComputeDigest) – Whether to compute a digest over the data.

  • strict_digest (bool) – Whether to raise an exception if a computed digest does not match the one provided in metadata.

Returns

True if an error occurred during the call, False otherwise.

Return type

bool
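
For illustration, a minimal usage sketch (not part of the library documentation). The pipeline path and save directory are hypothetical, and the OpenMLTaskType member name and the Context value used here are assumptions:

    from d3m.contrib.openml import crawler
    from d3m.metadata import base as metadata_base
    from d3m.metadata import pipeline as pipeline_module

    # Assumption: a data preparation (splitting) pipeline saved as JSON.
    with open('data_preparation_pipeline.json') as pipeline_file:
        data_pipeline = pipeline_module.Pipeline.from_json(pipeline_file)

    had_error = crawler.crawl_openml(
        save_dir='openml_d3m_datasets',
        # Assumption: SUPERVISED_CLASSIFICATION is a valid OpenMLTaskType member.
        task_types=[crawler.OpenMLTaskType.SUPERVISED_CLASSIFICATION],
        data_pipeline=data_pipeline,
        context=metadata_base.Context.TESTING,
        max_tasks=10,
    )
    if had_error:
        print('At least one task failed to crawl.')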

d3m.contrib.openml.crawler.crawl_openml_handler(arguments, *, pipeline_resolver=None, dataset_resolver=None, problem_resolver=None)[source]

A handler that reads crawling options from parsed command-line arguments, applies the given pipeline, dataset, and problem resolvers, and runs the OpenML crawler.

Return type

None

d3m.contrib.openml.crawler.crawl_openml_task(datasets, task_id, save_dir, *, data_pipeline, data_params=None, context, random_seed=0, volumes_dir=None, scratch_dir=None, runtime_environment=None, dataset_resolver=None, problem_resolver=None, compute_digest=<ComputeDigest.ONLY_IF_MISSING: 'ONLY_IF_MISSING'>, strict_digest=False)[source]

A function that crawls an OpenML task and its corresponding dataset, splits the dataset using a data preparation pipeline, and stores the splits as a D3M dataset and problem description.

Parameters
  • datasets (Dict[str, str]) – A mapping between known dataset IDs and their paths; it is updated in-place.

  • task_id (int) – An integer representing an OpenML task ID to crawl and convert.

  • save_dir (str) – A directory in which to save datasets and problems.

  • data_pipeline (Pipeline) – A data preparation pipeline used for splitting.

  • data_params (Optional[Dict[str, str]]) – A dictionary of hyper-parameters for the data preparation pipeline.

  • context (Context) – The context in which to run pipelines.

  • random_seed (int) – A random seed to use for every run. This controls all randomness during the run.

  • volumes_dir (Optional[str]) – Path to a directory with static files required by primitives.

  • scratch_dir (Optional[str]) – Path to a directory to store any temporary files needed during execution.

  • runtime_environment (Optional[RuntimeEnvironment]) – A description of the runtime environment.

  • dataset_resolver (Optional[Callable]) – A dataset resolver to use.

  • problem_resolver (Optional[Callable]) – A problem description resolver to use.

  • compute_digest (ComputeDigest) – Whether to compute a digest over the data.

  • strict_digest (bool) – Whether to raise an exception if a computed digest does not match the one provided in metadata.

Return type

None
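
For illustration, a minimal usage sketch (not part of the library documentation); the task ID, pipeline path, and save directory are hypothetical, and the Context value is an assumption:

    from d3m.contrib.openml import crawler
    from d3m.metadata import base as metadata_base
    from d3m.metadata import pipeline as pipeline_module

    # Assumption: a data preparation (splitting) pipeline saved as JSON.
    with open('data_preparation_pipeline.json') as pipeline_file:
        data_pipeline = pipeline_module.Pipeline.from_json(pipeline_file)

    # Mapping of already-crawled dataset IDs to their paths; crawl_openml_task
    # updates it in-place, so it can be shared across calls.
    datasets = {}

    crawler.crawl_openml_task(
        datasets=datasets,
        task_id=59,  # hypothetical OpenML task ID, for illustration only
        save_dir='openml_d3m_datasets',
        data_pipeline=data_pipeline,
        context=metadata_base.Context.TESTING,
    )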