OpenML Documentation
Get started
Table of contents
 * Concepts
 * Data
    * Discovery
    * Sharing
    * Automated analysis
    * Dataset ID and versions
    * Dataset status
    * Special attributes

 * Tasks
    * Collaborative benchmarks
    * Discover the best algorithms
    * Automating benchmarks

 * Flows
    * Analysing algorithm performance
    * Automated sharing
    * Reproducing algorithms and experiments

 * Runs
    * Automated reproducible evaluations
    * Online organization
    * Independent (server-side) evaluation

 * Collections and benchmarks
    * Benchmarking suites
    * Benchmark studies

 * Tagging
 * Openness and Authentication
 * Sharing (under construction)


AN OPEN, AUTOMATED, AND FRICTIONLESS MACHINE LEARNING ENVIRONMENT.

 * 1000s of data sets, uniformly formatted, easy to load, organized online
 * Models and pipelines automatically uploaded from machine learning libraries
 * Extensive APIs to integrate OpenML into your tools and scripts
 * Easily reproducible results (e.g. models, evaluations) for comparison and reuse
 * Stand on the shoulders of giants, and collaborate in real time
 * Make your work more visible and reusable
 * Built for automation: streamline your experiments and model building

CONCEPTS¶

OpenML operates on a number of core concepts which are important to understand:

Datasets
Datasets are pretty straightforward. Tabular datasets are self-contained,
consisting of a number of rows (instances) and columns (features), including
their data types. Other modalities (e.g. images) are included via paths to files
stored within the same folder.
Datasets are uniformly formatted (S3 buckets with Parquet tables, JSON metadata,
and media files), and are auto-converted and auto-loaded in your desired format
by the APIs (e.g. in Python) in a single line of code.
Example: The Iris dataset or the Plankton dataset

Tasks
A task consists of a dataset, together with a machine learning task to perform, such as classification or clustering, and an evaluation method. For supervised tasks, this also specifies the target column in the data.
Example: Classifying the different iris species from the other attributes, evaluated using 10-fold cross-validation.

Flows
A flow identifies a particular machine learning algorithm (a pipeline or
untrained model) from a particular library or framework, such as scikit-learn,
PyTorch, or mlr. It contains details about the structure of the model/pipeline,
dependencies (e.g. the library and its version) and a list of settable
hyperparameters. In short, it is a serialized description of the algorithm that
in many cases can also be deserialized to reinstantiate the exact same algorithm
in a particular library.
Example: scikit-learn's RandomForest or a simple TensorFlow model

Runs
A run is an experiment - it evaluates a particular flow (pipeline/model) with
particular hyperparameter settings, on a particular task. Depending on the task
it will include certain results, such as model evaluations (e.g. accuracies),
model predictions, and other output files (e.g. the trained model).
Example: Classifying Gamma rays with scikit-learn's RandomForest


DATA¶


DISCOVERY¶

OpenML allows fine-grained search over thousands of machine learning datasets. Via the website, you can filter by many dataset properties, such as size, type, and format. Via the APIs you have access to many more filters, can download a complete table with statistics of all datasets, and can load datasets directly into your preferred data structures such as numpy (example in Python). We are also working on better organization of all datasets by topic.
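
For example, with the Python API you can list all datasets as a table and load any dataset directly into numpy arrays. A minimal sketch (column and parameter names may vary slightly between library versions):

    import openml

    # List all datasets and their properties as a pandas DataFrame
    datasets_df = openml.datasets.list_datasets(output_format="dataframe")
    print(datasets_df[["did", "name", "NumberOfInstances"]].head())

    # Download the Iris dataset (ID 61) and load it as numpy arrays
    dataset = openml.datasets.get_dataset(61)
    X, y, categorical, attribute_names = dataset.get_data(
        dataset_format="array", target=dataset.default_target_attribute
    )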




SHARING¶

You can upload and download datasets through the website or through the APIs
(recommended). You can share data directly from common data science libraries,
e.g. from Python or R dataframes, in a few lines of code. The OpenML APIs will
automatically extract lots of meta-data and store all datasets in a uniform
format.



    import pandas as pd
    import openml as oml

    # Create an OpenML dataset from a pandas dataframe
    df = pd.DataFrame(data, columns=attribute_names)
    my_data = oml.datasets.functions.create_dataset(
        name="covertype", description="Predicting forest cover ...",
        licence="CC0", data=df
    )

    # Share the dataset on OpenML
    my_data.publish()


Every dataset gets a dedicated page on OpenML with all known information, and
can be edited further online.



Data hosted elsewhere can be referenced by URL. We are also working on interconnecting OpenML with other machine learning dataset repositories.

AUTOMATED ANALYSIS¶

OpenML will automatically analyze the data and compute a range of data quality
characteristics. These include simple statistics such as the number of examples
and features, but also potential quality issues (e.g. missing values) and more
advanced statistics (e.g. the mutual information in the features and benchmark
performances of simple models). These can be useful to find, filter and compare
datasets, or to automate data preprocessing. We are also working on simple metrics and automated dataset quality reports.
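
These computed qualities are also available programmatically. A small sketch with the Python API (the quality names shown are illustrative; the server returns the full list):

    import openml

    # Every dataset carries a dictionary of server-computed quality measures
    dataset = openml.datasets.get_dataset(61)  # Iris
    qualities = dataset.qualities

    print(qualities["NumberOfInstances"], qualities["NumberOfFeatures"])
    print(qualities.get("NumberOfMissingValues"))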

The Analysis tab (see image below, or try it live) also shows an automated and interactive analysis of all datasets. This runs on open-source Python code via Dash, and we welcome all contributions.



The third tab, 'Tasks', lists all tasks created on the dataset. More on that
below.


DATASET ID AND VERSIONS¶

A dataset can be uniquely identified by its dataset ID, which is shown on the website and returned by the API (it is 1596 in the covertype example above). Datasets can also be referenced by name and version: OpenML assigns incremental version numbers to uploads with the same name. You can also add a free-form version_label with every upload.
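
In the Python API, a dataset can be fetched either by ID or by name, optionally pinning a specific version. A sketch, assuming get_dataset accepts a version argument in your library version:

    import openml

    # Fetch by dataset ID (1596 is the covertype example above)
    d1 = openml.datasets.get_dataset(1596)

    # Fetch by name, explicitly requesting the same version
    d2 = openml.datasets.get_dataset("covertype", version=d1.version)

    print(d1.dataset_id, d1.name, d1.version)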


DATASET STATUS¶

When you upload a dataset, it will be marked in_preparation until it is
(automatically) verified. Once approved, the dataset will become active (or
verified). If a severe issue has been found with a dataset, it can become
deactivated (or deprecated), signaling that it should not be used. By default,
dataset search only returns verified datasets, but you can access and download
datasets with any status.
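
The status filter is also exposed through the APIs. A sketch in Python, assuming list_datasets accepts a status filter with values such as "active" and "all":

    import openml

    # By default only active (verified) datasets are listed;
    # other statuses can be requested explicitly
    active = openml.datasets.list_datasets(status="active", output_format="dataframe")
    everything = openml.datasets.list_datasets(status="all", output_format="dataframe")

    print(len(active), len(everything))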


SPECIAL ATTRIBUTES¶

Machine learning datasets often have special attributes that require special
handling in order to build useful models. OpenML marks these as special
attributes.

A target attribute is the column that is to be predicted, also known as the
dependent variable. Datasets can have a default target attribute set by the
author, but OpenML tasks can also overrule this. Example: The default target
variable for the MNIST dataset is to predict the class from pixel values, and
most supervised tasks will have the class as their target. However, one can also
create a task aimed at predicting the value of pixel257 given all the other
pixel values and the class column.

Row id attributes indicate externally defined row IDs (e.g. instance in dataset
164). Ignore attributes are other columns that should not be included in
training data (e.g. Player in dataset 185). OpenML will clearly mark these, and
will (by default) drop these columns when constructing training sets.
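
The special attributes are exposed by the APIs, so you can handle them explicitly when needed. A sketch in Python, using dataset 185 from the example above:

    import openml

    dataset = openml.datasets.get_dataset(185)

    # Column to predict (tasks may override this)
    print(dataset.default_target_attribute)

    # Externally defined row IDs and columns excluded from training data
    print(dataset.row_id_attribute)
    print(dataset.ignore_attribute)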


TASKS¶

Tasks describe what to do with the data. OpenML covers several task types, such
as classification and clustering. Tasks are containers including the data and
other information such as train/test splits, and define what needs to be
returned. They are machine-readable so that you can automate machine learning experiments, and easily compare algorithm evaluations (using the exact same train-test splits) against all other benchmarks shared by others on OpenML.
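
In the Python API, a task bundles the dataset, the target, and the evaluation splits. A minimal sketch, using the same task as in the benchmarking example further below:

    import openml

    # Download a classification task (this also fetches the underlying dataset)
    task = openml.tasks.get_task(3954)

    print(task.task_type)    # e.g. supervised classification
    print(task.target_name)  # target column defined by the task

    # Machine-readable train/test splits (repeat 0, fold 0)
    train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)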


COLLABORATIVE BENCHMARKS¶

Tasks are real-time, collaborative benchmarks (e.g. see MNIST below). In the
Analysis tab, you can view timelines and leaderboards, and learn from all prior
submissions to design even better algorithms.




DISCOVER THE BEST ALGORITHMS¶

All algorithms evaluated on the same task (with the same train-test splits) can
be directly compared to each other, so you can easily look up which algorithms
perform best overall, and download their exact configurations. Likewise, you can
look up the best algorithms for similar tasks to know what to try first.
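
These shared evaluations can also be queried programmatically, e.g. to rank flows on a task. A sketch with the Python API (the column names of the returned table are indicative):

    import openml

    # All predictive-accuracy results submitted for task 3954, as a DataFrame
    evals = openml.evaluations.list_evaluations(
        function="predictive_accuracy", tasks=[3954], output_format="dataframe"
    )

    # Best-performing flows first
    top = evals.sort_values("value", ascending=False)
    print(top[["flow_name", "value"]].head())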




AUTOMATING BENCHMARKS¶

You can search and download existing tasks, evaluate your algorithms, and
automatically share the results (which are stored in a run). Here's what this
looks like in the Python API. You can do the same across hundreds of tasks at
once.

    from sklearn import ensemble
    from openml import tasks, runs

    # Build any model you like
    clf = ensemble.RandomForestClassifier()

    # Download any OpenML task (includes the datasets)
    task = tasks.get_task(3954)

    # Automatically evaluate your model on the task
    run = runs.run_model_on_task(clf, task)

    # Share the results on OpenML.
    run.publish()
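
To run the same model across many tasks at once, simply loop over a list of task IDs, for example all tasks in a benchmarking suite (the OpenML-CC18 suite is used here purely as an illustration):

    from sklearn import ensemble
    import openml

    clf = ensemble.RandomForestClassifier()

    # Fetch a benchmarking suite and run the model on every task it contains
    suite = openml.study.get_suite("OpenML-CC18")
    for task_id in suite.tasks:
        task = openml.tasks.get_task(task_id)
        run = openml.runs.run_model_on_task(clf, task)
        run.publish()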


You can create new tasks via the website or via the APIs as well.


FLOWS¶

Flows are machine learning pipelines, models, or scripts. They are typically uploaded directly from machine learning libraries (e.g. scikit-learn, PyTorch, TensorFlow, mlr, WEKA, ...) via the corresponding APIs. Associated code (e.g., on GitHub) can be referenced by URL.
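
Flows can also be searched and downloaded programmatically, including their structure, dependencies, and hyperparameters. A sketch in Python (column names are indicative):

    import openml

    # List all registered flows as a DataFrame and pick one by ID
    flows_df = openml.flows.list_flows(output_format="dataframe")
    flow_id = int(flows_df.iloc[0]["id"])

    flow = openml.flows.get_flow(flow_id)
    print(flow.name)          # full pipeline/model description
    print(flow.dependencies)  # library and version requirements
    print(flow.parameters)    # settable hyperparameters and their defaults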


ANALYSING ALGORITHM PERFORMANCE¶

Every flow gets a dedicated page with all known information. The Analysis tab
shows an automated interactive analysis of all collected results. For instance,
below are the results of a scikit-learn pipeline including missing value
imputation, feature encoding, and a RandomForest model. It shows the results
across multiple tasks, and how the AUC score is affected by certain
hyperparameters.



This helps to better understand specific models, as well as their strengths and
weaknesses.


AUTOMATED SHARING¶

When you evaluate algorithms and share the results, OpenML will automatically
extract all the details of the algorithm (dependencies, structure, and all
hyperparameters), and upload them in the background.

    from sklearn import ensemble
    from openml import tasks, runs

    # Build any model you like.
    clf = ensemble.RandomForestClassifier()

    # Download a task (e.g. the same task as in the previous example)
    task = tasks.get_task(3954)

    # Evaluate the model on the task
    run = runs.run_model_on_task(clf, task)

    # Share the results, including the flow and all its details.
    run.publish()



REPRODUCING ALGORITHMS AND EXPERIMENTS¶

Given an OpenML run, the exact same algorithm or model, with exactly the same
hyperparameters, can be reconstructed within the same machine learning library
to easily reproduce earlier results.

    from openml import runs

    # Rebuild the (scikit-learn) pipeline from run 9864498
    model = runs.initialize_model_from_run(9864498)


Note

You may need the exact same library version to reconstruct flows. The API will always state the required version. We aim to add support for VMs so that flows can be easily (re)run in any environment.


RUNS¶


AUTOMATED REPRODUCIBLE EVALUATIONS¶

Runs are experiments (benchmarks) evaluating a specific flow on a specific task. As shown above, they are typically submitted automatically by machine learning libraries through the OpenML APIs, including lots of automatically extracted meta-data, to create reproducible experiments. With a few for-loops you can easily run (and share) millions of experiments.


ONLINE ORGANIZATION¶

OpenML organizes all runs online, linked to the underlying data, flows,
parameter settings, people, and other details. See the many examples above,
where every dot in the scatterplots is a single OpenML run.


INDEPENDENT (SERVER-SIDE) EVALUATION¶

OpenML runs include all information needed to independently evaluate models. For
most tasks, this includes all predictions, for all train-test splits, for all
instances in the dataset, including all class confidences. When a run is
uploaded, OpenML automatically evaluates every run using a wide array of
evaluation metrics. This makes them directly comparable with all other runs
shared on OpenML. For completeness, OpenML will also upload locally computed
evaluation metrics and runtimes.

New metrics can also be added to OpenML's evaluation engine, and computed for
all runs afterwards. Or, you can download OpenML runs and analyse the results
any way you like.
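
For example, a single run and its server-side evaluations can be fetched like this in Python (run 9864498 is the run referenced earlier):

    import openml

    run = openml.runs.get_run(9864498)

    # Links back to the underlying task and flow
    print(run.task_id, run.flow_id)

    # Server-side evaluation metrics computed by OpenML
    print(run.evaluations)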

Note

Please note that while OpenML tries to maximise reproducibility, exactly
reproducing all results may not always be possible because of changes in numeric
libraries, operating systems, and hardware.


COLLECTIONS AND BENCHMARKS¶

You can combine tasks and runs into collections, to run experiments across many
tasks at once and collect all results. Each collection gets its own page, which
can be linked to publications so that others can find all the details online.


BENCHMARKING SUITES¶

Collections of tasks can be published as benchmarking suites. Seamlessly integrated into the OpenML platform, benchmark suites standardize the setup, execution, analysis, and reporting of benchmarks. Moreover, they make benchmarking a whole lot easier:

 - all datasets are uniformly formatted in standardized data formats
 - they can be easily downloaded programmatically through APIs and client libraries
 - they come with machine-readable meta-information, such as the occurrence of missing values, to train algorithms correctly
 - standardized train-test splits are provided to ensure that results can be objectively compared
 - results can be shared in a reproducible way through the APIs
 - results from other users can be easily downloaded and reused

You can search for all existing benchmarking suites or create your own. For all
further details, see the benchmarking guide.
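
Suites can also be listed, downloaded, and created through the Python API. A sketch, assuming a create_benchmark_suite helper is available in your library version and using placeholder task IDs:

    import openml

    # Discover existing benchmarking suites
    suites = openml.study.list_suites(output_format="dataframe")
    print(suites[["id", "alias", "name"]].head())

    # Create and publish your own suite from a list of task IDs
    my_task_ids = [3954]  # placeholder: extend with the tasks you want to include
    suite = openml.study.create_benchmark_suite(
        name="My benchmark suite",
        description="Tasks used in my experiments",
        task_ids=my_task_ids,
    )
    suite.publish()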




BENCHMARK STUDIES¶

Collections of runs can be published as benchmark studies. They contain the results of all runs (possibly millions) executed on a specific benchmarking suite. OpenML allows you to easily download all such results at once via the APIs, and also visualizes them online in the Analysis tab (next to the complete list of included tasks and runs). Below is an example of a benchmark study for AutoML algorithms.
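
Programmatically, a study and the evaluations attached to it can be retrieved in a few calls. A sketch in Python, using a placeholder study ID and assuming list_evaluations accepts a runs filter:

    import openml

    study_id = 123  # placeholder: the ID (or alias) of the benchmark study

    # The study lists all contained runs, flows, and tasks
    study = openml.study.get_study(study_id)

    # Download the evaluation results of those runs as a DataFrame
    evals = openml.evaluations.list_evaluations(
        function="predictive_accuracy",
        runs=study.runs,
        output_format="dataframe",
    )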




TAGGING¶

Datasets, tasks, runs and flows can be assigned tags, either via the web
interface or the API. These tags can be used to search and annotate datasets, or
simply to better organize your own datasets and experiments.

For example, the tag OpenML-CC18 refers to all tasks included in the OpenML-CC18 benchmarking suite.
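
Tags can be queried and set through the Python API as well. A sketch, assuming push_tag and remove_tag helpers are available on the entity objects:

    import openml

    # Find everything carrying a given tag, e.g. all OpenML-CC18 tasks
    cc18_tasks = openml.tasks.list_tasks(tag="OpenML-CC18", output_format="dataframe")

    # Tag one of your own datasets to organize your work
    dataset = openml.datasets.get_dataset(61)
    dataset.push_tag("my-project")    # add a tag
    dataset.remove_tag("my-project")  # remove it again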


OPENNESS AND AUTHENTICATION¶

You can download and inspect all datasets, tasks, flows and runs through the
website or the API without creating an account. However, if you want to upload
datasets or experiments, you need to create an account, sign in, and find your
API key on your profile page.

This key can then be used with any of the OpenML APIs.
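
In the Python API, the key is set once via the config module (replace the placeholder with the key from your profile page):

    import openml

    # Needed for uploads; reading public content works without a key
    openml.config.apikey = "YOUR_API_KEY"  # placeholder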


SHARING (UNDER CONSTRUCTION)¶

Currently, anything on OpenML can be shared publicly or kept private to a single user. We are working on sharing features that allow you to share your materials with other users without making them entirely public. Watch this space.
