Writing Custom Sample Parsers

This recipe demonstrates how to write a custom SampleParser and use it to add samples in your custom format to a FiftyOne Dataset.

Setup

If you haven’t already, install FiftyOne:

[ ]:
!pip install fiftyone

In this recipe we’ll use the TorchVision Datasets library to download the CIFAR-10 dataset, which will serve as sample data to feed our custom parser.

If they are not already installed, you can install the required packages as follows:

[ ]:
!pip install torch torchvision

Writing a SampleParser

FiftyOne provides a SampleParser interface that defines how it parses provided samples when methods such as Dataset.add_labeled_images() and Dataset.ingest_labeled_images() are used.

SampleParser itself is an abstract interface; the concrete interface that you should implement is determined by the type of samples that you are importing. See writing a custom SampleParser for full details.

In this recipe, we’ll write a custom LabeledImageSampleParser that can parse labeled images from a PyTorch Dataset.

Here’s the complete definition of the SampleParser:

[1]:
import fiftyone as fo
import fiftyone.utils.data as foud


class PyTorchClassificationDatasetSampleParser(foud.LabeledImageSampleParser):
    """Parser for image classification samples loaded from a PyTorch dataset.

    This parser can parse samples from a ``torch.utils.data.DataLoader`` that
    emits ``(img_tensor, target)`` tuples, where::

        - `img_tensor`: is a PyTorch Tensor containing the image
        - `target`: the integer index of the target class

    Args:
        classes: the list of class label strings
    """

    def __init__(self, classes):
        # Let the base parser initialize its own state first
        super().__init__()
        self.classes = classes

    @property
    def has_image_path(self):
        """Whether this parser produces paths to images on disk for samples
        that it parses.
        """
        return False

    @property
    def has_image_metadata(self):
        """Whether this parser produces
        :class:`fiftyone.core.metadata.ImageMetadata` instances for samples
        that it parses.
        """
        return False

    @property
    def label_cls(self):
        """The :class:`fiftyone.core.labels.Label` class(es) returned by this
        parser.

        This can be any of the following:

        -   a :class:`fiftyone.core.labels.Label` class. In this case, the
            parser is guaranteed to return labels of this type
        -   a list or tuple of :class:`fiftyone.core.labels.Label` classes. In
            this case, the parser can produce a single label field of any of
            these types
        -   a dict mapping keys to :class:`fiftyone.core.labels.Label` classes.
            In this case, the parser will return label dictionaries with keys
            and value-types specified by this dictionary. Not all keys need be
            present in the imported labels
        -   ``None``. In this case, the parser makes no guarantees about the
            labels that it may return
        """
        return fo.Classification

    def get_image(self):
        """Returns the image from the current sample.

        Returns:
            a numpy image
        """
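        img_tensor = self.current_sample[0]
        # Assumes a ``DataLoader`` with ``batch_size=1`` and ``ToTensor()``, which
        # emits a (1, C, H, W) float tensor in [0, 1]; convert it to an
        # (H, W, C) uint8 array
        img = img_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()
        return (255 * img).round().astype("uint8")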

    def get_label(self):
        """Returns the label for the current sample.

        Returns:
            a :class:`fiftyone.core.labels.Label` instance, or a dictionary
            mapping field names to :class:`fiftyone.core.labels.Label`
            instances, or ``None`` if the sample is unlabeled
        """
        target = self.current_sample[1]
        return fo.Classification(label=self.classes[int(target)])

Note that PyTorchClassificationDatasetSampleParser specifies has_image_path == False and has_image_metadata == False, because the PyTorch dataset directly provides the in-memory image, not its path on disk.
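
To make the role of current_sample more concrete, here is a minimal, dependency-free sketch of how a loader might drive a parser of this shape: it hands the parser one sample at a time and then asks it for the image and the label. MiniParser, MiniClassificationParser, and drive_parser() are hypothetical names written for illustration only; they are not FiftyOne's implementation, and nested lists and integers stand in for tensors.

```python
class MiniParser:
    """Hypothetical stand-in for a sample parser base class."""

    def __init__(self):
        self._current_sample = None

    @property
    def current_sample(self):
        """The sample most recently provided via with_sample()."""
        if self._current_sample is None:
            raise ValueError("No current sample; call with_sample() first")
        return self._current_sample

    def with_sample(self, sample):
        """Sets the sample that subsequent get_*() calls will parse."""
        self._current_sample = sample


class MiniClassificationParser(MiniParser):
    """Parses (image, target) tuples, mirroring the parser defined above."""

    # This parser only has in-memory images, not paths on disk
    has_image_path = False

    def __init__(self, classes):
        super().__init__()
        self.classes = classes

    def get_image(self):
        return self.current_sample[0]

    def get_label(self):
        # Map the integer target to its class label string
        return self.classes[int(self.current_sample[1])]


def drive_parser(samples, parser):
    """Feeds samples to the parser one at a time and collects the results."""
    parsed = []
    for sample in samples:
        parser.with_sample(sample)
        parsed.append((parser.get_image(), parser.get_label()))
    return parsed


# Two tiny fake "images" with integer class targets
samples = [([[0, 0], [0, 0]], 1), ([[255, 255], [255, 255]], 0)]
parser = MiniClassificationParser(["airplane", "automobile"])

for image, label in drive_parser(samples, parser):
    print(label, image)
```

The parser itself never iterates over the data; it only interprets whichever sample it was most recently given, which is why get_image() and get_label() take no arguments.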

Ingesting samples into a dataset

In order to use PyTorchClassificationDatasetSampleParser, we need a PyTorch Dataset from which to feed it samples.

Let’s use the CIFAR-10 dataset from the TorchVision Datasets library:

[2]:
import torch
import torchvision


# Downloads the test split of the CIFAR-10 dataset and prepares it for loading
# in a DataLoader
dataset = torchvision.datasets.CIFAR10(
    "/tmp/fiftyone/custom-parser/pytorch",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)
classes = dataset.classes
data_loader = torch.utils.data.DataLoader(dataset, batch_size=1)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /tmp/fiftyone/custom-parser/pytorch/cifar-10-python.tar.gz
Extracting /tmp/fiftyone/custom-parser/pytorch/cifar-10-python.tar.gz to /tmp/fiftyone/custom-parser/pytorch

Now we can load the samples into the dataset. Because our custom sample parser declares has_image_path == False, we must use the Dataset.ingest_labeled_images() method, which writes each image to disk as it is ingested so that FiftyOne can access it.

[3]:
dataset = fo.Dataset("cifar10-samples")

sample_parser = PyTorchClassificationDatasetSampleParser(classes)

# The directory to use to store the individual images on disk
dataset_dir = "/tmp/fiftyone/custom-parser/fiftyone"

# Ingest the samples from the data loader
dataset.ingest_labeled_images(data_loader, sample_parser, dataset_dir=dataset_dir)

print("Loaded %d samples" % len(dataset))
 100% |███| 10000/10000 [6.7s elapsed, 0s remaining, 1.5K samples/s]
Loaded 10000 samples

Let’s inspect the contents of the dataset to verify that the samples were loaded as expected:

[4]:
# Print summary information about the dataset
print(dataset)
Name:           cifar10-samples
Persistent:     False
Num samples:    10000
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.StringField

[5]:
# Print a few samples from the dataset
print(dataset.head())
<Sample: {
    'dataset_name': 'cifar10-samples',
    'id': '5f15aeab6d4e59654468a14e',
    'filepath': '/tmp/fiftyone/custom-parser/fiftyone/000001.jpg',
    'tags': BaseList([]),
    'metadata': None,
    'ground_truth': 'cat',
}>
<Sample: {
    'dataset_name': 'cifar10-samples',
    'id': '5f15aeab6d4e59654468a14f',
    'filepath': '/tmp/fiftyone/custom-parser/fiftyone/000002.jpg',
    'tags': BaseList([]),
    'metadata': None,
    'ground_truth': 'ship',
}>
<Sample: {
    'dataset_name': 'cifar10-samples',
    'id': '5f15aeab6d4e59654468a150',
    'filepath': '/tmp/fiftyone/custom-parser/fiftyone/000003.jpg',
    'tags': BaseList([]),
    'metadata': None,
    'ground_truth': 'ship',
}>

We can also verify that the ingested images were written to disk as expected:

[27]:
!ls -lah /tmp/fiftyone/custom-parser/fiftyone | head -n 10
total 0
drwxr-xr-x  10002 voxel51  wheel   313K Jul 20 10:34 .
drwxr-xr-x      4 voxel51  wheel   128B Jul 20 10:34 ..
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000001.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000002.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000003.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000004.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000005.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000006.jpg
-rw-r--r--      1 voxel51  wheel     0B Jul 20 10:34 000007.jpg
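
A directory listing only shows the first few entries. As a complementary check, the sketch below counts the image files in the output directory and reports any that are empty, using only the Python standard library. The helper name summarize_image_dir() and its ext parameter are hypothetical and introduced here for illustration; the directory is the dataset_dir used above.

```python
import os


def summarize_image_dir(image_dir, ext=".jpg"):
    """Counts the images in ``image_dir`` and lists any zero-byte files."""
    names = sorted(n for n in os.listdir(image_dir) if n.endswith(ext))
    empty = [
        n for n in names if os.path.getsize(os.path.join(image_dir, n)) == 0
    ]
    return len(names), empty


image_dir = "/tmp/fiftyone/custom-parser/fiftyone"

# Only inspect the directory if the ingestion step has been run
if os.path.isdir(image_dir):
    num_images, empty = summarize_image_dir(image_dir)
    print("Found %d images, %d of them empty" % (num_images, len(empty)))
```

If the number of images does not match len(dataset), or if any files are reported as empty, the images returned by the parser's get_image() method are a good place to start investigating.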

Adding samples to a dataset

If our LabeledImageSampleParser declared has_image_path == True, then we could use Dataset.add_labeled_images() to add samples to FiftyOne datasets without creating a copy of the source images on disk.

However, our sample parser does not provide image paths, so an informative error message is raised if we try to use it in an unsupported way:

[6]:
dataset = fo.Dataset()

sample_parser = PyTorchClassificationDatasetSampleParser(classes)

# Won't work because our SampleParser does not provide paths to its source images on disk
dataset.add_labeled_images(data_loader, sample_parser)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a3d739e371af> in <module>
      4
      5 # Won't work because our SampleParser does not provide paths to its source images on disk
----> 6 dataset.add_labeled_images(data_loader, sample_parser)

~/dev/fiftyone/fiftyone/core/dataset.py in add_labeled_images(self, samples, sample_parser, label_field, tags, expand_schema)
    729         if not sample_parser.has_image_path:
    730             raise ValueError(
--> 731                 "Sample parser must have `has_image_path == True` to add its "
    732                 "samples to the dataset"
    733             )

ValueError: Sample parser must have `has_image_path == True` to add its samples to the dataset
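
One way to avoid this error is to check has_image_path up front and pick the loading method accordingly. The helper below is a minimal sketch of that idea, not part of FiftyOne: load_labeled_images() is a hypothetical name, and it assumes only the two Dataset methods and the call signatures already used in this recipe.

```python
def load_labeled_images(dataset, samples, sample_parser, dataset_dir=None):
    """Adds or ingests samples depending on what the parser can provide.

    Hypothetical helper: returns "add" or "ingest" to indicate which
    method was used.
    """
    if sample_parser.has_image_path:
        # The images already exist on disk, so no copies are needed
        dataset.add_labeled_images(samples, sample_parser)
        return "add"

    if dataset_dir is None:
        raise ValueError(
            "A `dataset_dir` is required when the sample parser has "
            "`has_image_path == False`"
        )

    # The images are only in memory, so they must be written to disk
    dataset.ingest_labeled_images(
        samples, sample_parser, dataset_dir=dataset_dir
    )
    return "ingest"
```

With the objects defined earlier in this recipe, calling load_labeled_images(dataset, data_loader, sample_parser, dataset_dir="/tmp/fiftyone/custom-parser/fiftyone") would take the ingest branch, because the parser declares has_image_path == False.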

Cleanup

You can clean up the files generated by this recipe by running:

[7]:
!rm -rf /tmp/fiftyone