Remotely-Sourced Zoo Datasets

This page describes how to work with and create zoo datasets whose download/preparation methods are hosted via GitHub repositories or public URLs.

Note

To download from a private GitHub repository that you have access to, provide your GitHub personal access token by setting the GITHUB_TOKEN environment variable.

Note

Check out voxel51/coco-2017 and voxel51/caltech101 for examples of remotely-sourced datasets.

Working with remotely-sourced datasets

Working with remotely-sourced zoo datasets is just like working with built-in zoo datasets, as both varieties support the full zoo API.

When specifying remote sources, you can provide any of the following:

  • A GitHub repo URL like https://github.com/<user>/<repo>

  • A GitHub ref like https://github.com/<user>/<repo>/tree/<branch> or https://github.com/<user>/<repo>/commit/<commit>

  • A GitHub ref string like <user>/<repo>[/<ref>]

  • A publicly accessible URL of an archive (e.g., zip or tar) file
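
For illustration, the <user>/<repo>[/<ref>] form decomposes as shown below. This parser is only a sketch of the string format; FiftyOne performs its own parsing internally:

```python
def parse_ref_string(ref_str):
    """Split a ``<user>/<repo>[/<ref>]`` string into its components.

    Illustrative sketch of the format only; not FiftyOne's actual
    implementation.
    """
    user, repo, *rest = ref_str.split("/", 2)
    ref = rest[0] if rest else None  # e.g. a branch name or commit hash
    return user, repo, ref

# The ref is optional
parse_ref_string("voxel51/coco-2017")       # → ("voxel51", "coco-2017", None)
parse_ref_string("voxel51/coco-2017/main")  # → ("voxel51", "coco-2017", "main")
```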

Here’s the basic recipe for working with remotely-sourced zoo datasets:

Use load_zoo_dataset() to download and load a remotely-sourced zoo dataset into a FiftyOne dataset:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "https://github.com/voxel51/coco-2017",
    split="validation",
)

session = fo.launch_app(dataset)

Once you’ve downloaded all or part of a remotely-sourced zoo dataset, it will subsequently appear as an available zoo dataset under the name specified in the dataset’s fiftyone.yml when you call list_zoo_datasets():

available_datasets = foz.list_zoo_datasets()

print(available_datasets)
# [..., "voxel51/coco-2017", ...]

You can also download a remotely-sourced zoo dataset without (yet) loading it into a FiftyOne dataset by calling download_zoo_dataset():

dataset = foz.download_zoo_dataset(
    "https://github.com/voxel51/coco-2017",
    split="validation",
)

You can delete the local copy of a remotely-sourced zoo dataset (or individual split(s) of it) via delete_zoo_dataset() by providing either the dataset’s name or the remote source from which you downloaded it:

# These are equivalent
foz.delete_zoo_dataset("voxel51/coco-2017", split="validation")
foz.delete_zoo_dataset(
    "https://github.com/voxel51/coco-2017", split="validation"
)

# These are equivalent
foz.delete_zoo_dataset("voxel51/coco-2017")
foz.delete_zoo_dataset("https://github.com/voxel51/coco-2017")

Creating remotely-sourced datasets

A remotely-sourced dataset is defined by a directory with the following contents:

fiftyone.yml
__init__.py
    def download_and_prepare(dataset_dir, split=None, **kwargs):
        pass

    def load_dataset(dataset, dataset_dir, split=None, **kwargs):
        pass

Each component is described in detail below.

Note

By convention, datasets also contain an optional README.md file that provides additional information about the dataset and example syntaxes for downloading and working with it.

fiftyone.yml

The dataset’s fiftyone.yml or fiftyone.yaml file defines relevant metadata about the dataset:

| Field | Required? | Description |
| --- | --- | --- |
| name | yes | The name of the dataset. Once you’ve downloaded all or part of a remotely-sourced zoo dataset, it will subsequently appear as an available zoo dataset under this name when using the zoo API |
| type | | Declares that the directory defines a dataset. This can be omitted for backwards compatibility, but specifying it is recommended |
| author | | The author of the dataset |
| version | | The version of the dataset |
| url | | The source (e.g., GitHub repository) where the directory containing this file is hosted |
| source | | The original source of the dataset |
| license | | The license under which the dataset is distributed |
| description | | A brief description of the dataset |
| fiftyone.version | | A semver version specifier (or *) describing the required FiftyOne version for the dataset to load properly |
| supports_partial_downloads | | Whether parts of the dataset can be downloaded/loaded by providing kwargs to download_zoo_dataset() or load_zoo_dataset(), as described in Partial downloads below. If omitted, this is assumed to be false |
| tags | | A list of tags for the dataset. Useful in conjunction with list_zoo_datasets() |
| splits | | A list of the dataset’s supported splits. This should be omitted if the dataset does not contain splits |
| size_samples | | The total number of samples in the dataset, or a list of per-split sizes |

Here’s an example dataset YAML file:

name: voxel51/coco-2017
type: dataset
author: The COCO Consortium
version: 1.0.0
url: https://github.com/voxel51/coco-2017
source: http://cocodataset.org/#home
license: https://cocodataset.org/#termsofuse
description: The COCO-2017 dataset
fiftyone:
  version: "*"
supports_partial_downloads: true
tags:
  - image
  - detection
  - segmentation
splits:
  - train
  - validation
  - test
size_samples:
  - train: 118287
  - test: 40670
  - validation: 5000

Download and prepare

Every dataset’s __init__.py file must define a download_and_prepare() method with the signature below:

def download_and_prepare(dataset_dir, split=None, **kwargs):
    """Downloads the dataset and prepares it for loading into FiftyOne.

    Args:
        dataset_dir: the directory in which to construct the dataset
        split (None): a specific split to download, if the dataset supports
            splits. The supported split values are defined by the dataset's
            YAML file
        **kwargs: optional keyword arguments that your dataset can define to
            configure what/how the download is performed

    Returns:
        a tuple of

        -   ``dataset_type``: a ``fiftyone.types.Dataset`` type that the
            dataset is stored in locally, or None if the dataset provides
            its own ``load_dataset()`` method
        -   ``num_samples``: the total number of downloaded samples for the
            dataset or split
        -   ``classes``: a list of classes in the dataset, or None if not
            applicable
    """

    # Download files and organize them in `dataset_dir`
    ...

    # Define how the data is stored
    dataset_type = fo.types.ImageClassificationDirectoryTree
    dataset_type = None  # custom ``load_dataset()`` method

    # Indicate how many samples have been downloaded
    # May be less than the total size if partial downloads have been used
    num_samples = 10000

    # Optionally report what classes exist in the dataset
    classes = None
    classes = ["cat", "dog", ...]

    return dataset_type, num_samples, classes

This method is called under the hood when a user calls download_zoo_dataset() or load_zoo_dataset(), and its job is to download any relevant files from the web and organize/prepare them as necessary into a format that’s ready to be loaded into a FiftyOne dataset.

The dataset_type that download_and_prepare() returns defines how the dataset is ultimately loaded into FiftyOne:

  • Built-in importer: in many cases, FiftyOne already contains a built-in importer that can be leveraged to load data on disk into FiftyOne. Remotely-sourced datasets can take advantage of this by simply returning the appropriate dataset_type from download_and_prepare(), which is then used to load the data into FiftyOne as follows:

# If the dataset has splits, `dataset_dir` will be the split directory
dataset_importer_cls = dataset_type.get_dataset_importer_cls()
dataset_importer = dataset_importer_cls(dataset_dir=dataset_dir, **kwargs)

dataset.add_importer(dataset_importer, **kwargs)
  • Custom loader: if dataset_type=None is returned, then __init__.py must also contain a load_dataset() method as described below that handles loading the data into FiftyOne as follows:

load_dataset(dataset, dataset_dir, **kwargs)

Load dataset

Datasets that don’t use a built-in importer must also define a load_dataset() method in their __init__.py with the signature below:

def load_dataset(dataset, dataset_dir, split=None, **kwargs):
    """Loads the dataset into the given FiftyOne dataset.

    Args:
        dataset: a :class:`fiftyone.core.dataset.Dataset` to which to import
        dataset_dir: the directory to which the dataset was downloaded
        split (None): a split to load. The supported values are
            ``("train", "validation", "test")``
        **kwargs: optional keyword arguments that your dataset can define to
            configure what/how the load is performed
    """

    # Load data into samples
    samples = [...]

    # Add samples to the dataset
    dataset.add_samples(samples)

This method’s job is to load the filepaths and any relevant labels into Sample objects and then call add_samples() or a similar method to add them to the provided Dataset.
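
For example, a loader for a hypothetical classification dataset whose labels live in a filepath,label CSV might first parse that file into (filepath, label) pairs before wrapping each pair in a Sample. The parsing step can be sketched with the standard library; the CSV format and the parse_labels helper are illustrative assumptions, not part of any required interface:

```python
import csv
import io
import os

def parse_labels(labels_csv_text, dataset_dir):
    """Parse ``filepath,label`` rows into absolute (filepath, label) pairs.

    A load_dataset() implementation could then wrap each pair in a
    ``fo.Sample`` with a ``fo.Classification`` label and pass the list
    to ``dataset.add_samples()``.
    """
    reader = csv.reader(io.StringIO(labels_csv_text))
    return [
        (os.path.join(dataset_dir, row[0]), row[1])
        for row in reader
        if row  # skip blank lines
    ]

pairs = parse_labels("img1.jpg,cat\nimg2.jpg,dog\n", "/data/my-dataset")
```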

Partial downloads

Remotely-sourced datasets can support partial downloads, which is useful for a variety of reasons, including:

  • A dataset may contain labels for multiple task types but the user is only interested in a subset of them

  • The dataset may be very large and the user only wants to download a small subset of the samples to get familiar with the dataset

Datasets that support partial downloads should declare this in their fiftyone.yml:

supports_partial_downloads: true

The partial download behavior itself is defined via **kwargs in the dataset’s __init__.py methods:

def download_and_prepare(dataset_dir, split=None, **kwargs):
    pass

def load_dataset(dataset, dataset_dir, split=None, **kwargs):
    pass

When download_zoo_dataset(url, ..., **kwargs) is called, any kwargs declared by download_and_prepare() are passed through to it.

When load_zoo_dataset(name_or_url, ..., **kwargs) is called, any kwargs declared by download_and_prepare() and load_dataset() are passed through to them, respectively.
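
As a sketch of this pattern, a dataset might declare a max_samples kwarg to cap how much is downloaded. Both max_samples and the placeholder file listing below are illustrative assumptions, not a required interface:

```python
def download_and_prepare(dataset_dir, split=None, max_samples=None, **kwargs):
    """Illustrative sketch: ``max_samples`` caps how much is downloaded.

    ``max_samples`` is a hypothetical kwarg that a user could pass via
    ``foz.download_zoo_dataset(url, max_samples=100)`` or
    ``foz.load_zoo_dataset(url, max_samples=100)``.
    """
    # Placeholder listing of the dataset's remote files
    remote_files = ["img-%04d.jpg" % i for i in range(5000)]

    if max_samples is not None:
        remote_files = remote_files[:max_samples]

    # ... download `remote_files` into `dataset_dir` here ...

    dataset_type = None  # this sketch assumes a custom load_dataset()
    num_samples = len(remote_files)  # may be less than the full dataset size
    classes = None
    return dataset_type, num_samples, classes

_, num_samples, _ = download_and_prepare("/tmp/dataset", max_samples=100)
```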