Merging Datasets¶
This recipe demonstrates a simple pattern for merging FiftyOne Datasets via Dataset.merge_samples().
Merging datasets is an easy way to:
Combine multiple datasets with information about the same underlying raw media (images and videos)
Add model predictions to a FiftyOne dataset, to compare with ground truth annotations and/or other models
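At a high level, the pattern looks like this (a minimal sketch with placeholder filepaths and labels; the rest of this recipe works through a real example):
[ ]:
import fiftyone as fo

# A dataset with ground truth labels
dataset = fo.Dataset()
dataset.add_sample(
    fo.Sample(
        filepath="/path/to/image.jpg",  # placeholder path
        ground_truth=fo.Classification(label="cat"),
    )
)

# A separate dataset containing model predictions for the same image
predictions = fo.Dataset()
predictions.add_sample(
    fo.Sample(
        filepath="/path/to/image.jpg",
        predictions=fo.Classification(label="dog"),
    )
)

# Merge the predictions into the main dataset; by default, samples
# with the same `filepath` are merged
dataset.merge_samples(predictions)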
Setup¶
If you haven’t already, install FiftyOne:
[ ]:
!pip install fiftyone
In this recipe, we’ll work with a dataset downloaded from the FiftyOne Dataset Zoo.
To access the dataset, install torch and torchvision, if necessary:
[ ]:
!pip install torch torchvision
Then download the test split of CIFAR-10:
[1]:
# Download the test split of CIFAR-10
!fiftyone zoo datasets download cifar10 --splits test
Split 'test' already downloaded
Merging model predictions¶
Load the test split of CIFAR-10 into FiftyOne:
[1]:
import random
import os
import fiftyone as fo
import fiftyone.zoo as foz
# Load test split of CIFAR-10
dataset = foz.load_zoo_dataset("cifar10", split="test", dataset_name="merge-example")
classes = dataset.info["classes"]
print(dataset)
Split 'test' already downloaded
Loading 'cifar10' split 'test'
100% |███| 10000/10000 [14.1s elapsed, 0s remaining, 718.2 samples/s]
Name: merge-example
Media type: image
Num samples: 10000
Persistent: False
Info: {'classes': ['airplane', 'automobile', 'bird', ...]}
Tags: ['test']
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
The dataset contains ground truth labels in its ground_truth field:
[2]:
# Print a sample from the dataset
print(dataset.first())
<Sample: {
'id': '5fee1a40f653ce52a9d077b1',
'media_type': 'image',
'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000001.jpg',
'tags': BaseList(['test']),
'metadata': None,
'ground_truth': <Classification: {
'id': '5fee1a40f653ce52a9d077b0',
'label': 'horse',
'confidence': None,
'logits': None,
}>,
}>
Suppose you would like to add model predictions to some samples from the dataset.
The usual way to do this is to iterate over the dataset and add your predictions directly to its samples:
[3]:
def run_inference(filepath):
    # Run inference on `filepath` here.
    # For simplicity, we'll just generate a random label
    label = random.choice(classes)
    return fo.Classification(label=label)
[4]:
# Choose 100 samples at random
random_samples = dataset.take(100)
# Add model predictions to dataset
for sample in random_samples:
    sample["predictions"] = run_inference(sample.filepath)
    sample.save()
print(dataset)
Name: merge-example
Media type: image
Num samples: 10000
Persistent: False
Info: {'classes': ['airplane', 'automobile', 'bird', ...]}
Tags: ['test']
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
However, suppose you store the predictions in a separate dataset:
[5]:
# Filepaths of images to process
filepaths = [s.filepath for s in dataset.take(100)]
# Run inference
predictions = fo.Dataset()
for filepath in filepaths:
    sample = fo.Sample(filepath=filepath)
    sample["predictions"] = run_inference(filepath)
    predictions.add_sample(sample)
print(predictions)
Name: 2020.12.31.12.37.09
Media type: image
Num samples: 100
Persistent: False
Info: {}
Tags: []
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
You can easily merge the predictions dataset into the main dataset via Dataset.merge_samples().
Let’s start by creating a fresh copy of CIFAR-10 that doesn’t have predictions:
[6]:
dataset2 = dataset.exclude_fields("predictions").clone(name="merge-example2")
print(dataset2)
Name: merge-example2
Media type: image
Num samples: 10000
Persistent: False
Info: {'classes': ['airplane', 'automobile', 'bird', ...]}
Tags: ['test']
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
Now let’s merge the predictions into the fresh dataset:
[7]:
# Merge predictions
dataset2.merge_samples(predictions)
# Verify that 100 samples in `dataset2` now have predictions
print(dataset2.exists("predictions"))
Dataset: merge-example2
Media type: image
Num samples: 100
Tags: []
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
View stages:
1. Exists(field='predictions', bool=True)
Let’s print a sample with predictions to verify that the merge happened as expected:
[8]:
# Print a sample with predictions
print(dataset2.exists("predictions").first())
<SampleView: {
'id': '5fee1a40f653ce52a9d07883',
'media_type': 'image',
'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000071.jpg',
'tags': BaseList([]),
'metadata': None,
'ground_truth': <Classification: {
'id': '5fee1a40f653ce52a9d07882',
'label': 'frog',
'confidence': None,
'logits': None,
}>,
'predictions': <Classification: {
'id': '5fee1a56f653ce52a9d0ee71',
'label': 'horse',
'confidence': None,
'logits': None,
}>,
}>
Customizing the merge key¶
By default, samples with the same absolute filepath are merged. However, you can customize this as desired via various keyword arguments of Dataset.merge_samples().
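For instance, if both datasets contain a shared unique identifier field, you can merge on it via the key_field argument. Here's a minimal sketch using a hypothetical uuid field (not present in the datasets in this recipe):
[ ]:
# Sketch: merge two datasets on a hypothetical `uuid` field
d1 = fo.Dataset()
d1.add_sample(fo.Sample(filepath="/a/img1.jpg", uuid="001"))

d2 = fo.Dataset()
d2.add_sample(
    fo.Sample(
        filepath="/b/img1.jpg",
        uuid="001",
        predictions=fo.Classification(label="cat"),
    )
)

# Samples with the same `uuid` value are merged
d1.merge_samples(d2, key_field="uuid")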
Alternatively, the key_fcn argument accepts a function that computes an arbitrary merge key from each sample. For example, the command below merges samples with the same base filename, ignoring the directory:
[9]:
# Create another fresh dataset to work with
dataset3 = dataset.exclude_fields("predictions").clone(name="merge-example3")
# Merge predictions, using the base filename of the samples to decide which samples to merge
# Since the base filenames match, the predictions are added to the corresponding samples
key_fcn = lambda sample: os.path.basename(sample.filepath)
dataset3.merge_samples(predictions, key_fcn=key_fcn)
Indexing dataset...
100% |███| 10000/10000 [3.6s elapsed, 0s remaining, 2.8K samples/s]
Merging samples...
100% |███████| 100/100 [348.5ms elapsed, 0s remaining, 287.0 samples/s]
Let’s print a sample with predictions to verify that the merge happened as expected:
[10]:
# Print a sample with predictions
print(dataset3.exists("predictions").first())
<SampleView: {
'id': '5fee1a40f653ce52a9d07883',
'media_type': 'image',
'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000071.jpg',
'tags': BaseList([]),
'metadata': None,
'ground_truth': <Classification: {
'id': '5fee1a40f653ce52a9d07882',
'label': 'frog',
'confidence': None,
'logits': None,
}>,
'predictions': <Classification: {
'id': '5fee1a56f653ce52a9d0ee71',
'label': 'horse',
'confidence': None,
'logits': None,
}>,
}>
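Dataset.merge_samples() also accepts keyword arguments that control which fields are merged and how unmatched samples are handled. For example, here is a sketch using the fields and insert_new arguments (the exact arguments available may vary with your FiftyOne version):
[ ]:
# Only merge the `predictions` field, and skip any samples in
# `predictions` whose merge key doesn't match an existing sample
dataset3.merge_samples(
    predictions,
    key_fcn=key_fcn,
    fields="predictions",
    insert_new=False,
)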