Exploring Image Uniqueness with FiftyOne¶
During model training, the best results are achieved when training on unique data samples. For example, finding and removing similar samples in your dataset can avoid accidental concept imbalance that can bias the learning of your model. And if duplicate or near-duplicate data is present in both training and validation/test splits, evaluation results may not be reliable. These are just a few of the reasons to care about uniqueness.
In this tutorial, we explore how FiftyOne’s image uniqueness tool can be used to analyze and extract insights from raw (unlabeled) datasets.
We’ll cover the following concepts:
Loading a dataset from the FiftyOne Dataset Zoo
Applying FiftyOne’s uniqueness method to your dataset
Launching the FiftyOne App and visualizing/exploring your data
Identifying duplicate and near-duplicate images in your dataset
Identifying the most unique/representative images in your dataset
So, what’s the takeaway?
This tutorial shows how FiftyOne can automatically find and remove near-duplicate images in your datasets and recommend the most unique samples in your data, enabling you to start your model training off right with a high-quality bootstrapped training set.
Setup¶
If you haven’t already, install FiftyOne:
[ ]:
!pip install fiftyone
This tutorial requires either Torchvision Datasets or TensorFlow Datasets to download the CIFAR-10 dataset used below.
You can, for example, install PyTorch as follows:
[ ]:
!pip install torch torchvision
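Alternatively, if you prefer TensorFlow, you can install it along with TensorFlow Datasets (the exact package requirements may vary with your environment):
[ ]:
!pip install tensorflow tensorflow-datasets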
Part 1: Finding duplicate and near-duplicate images¶
A common problem in dataset creation is duplicated data. Although exact duplicates can be found via file hashing—as in the image_deduplication walkthrough—that approach fails when the data has undergone even small manipulations. Even more critical for workflows involving model training is the need to get as much power out of each data sample as possible; near-duplicates, which are samples that are exceptionally similar to one another, are intrinsically less valuable for training. Let’s see if we can find such duplicates and near-duplicates in a common dataset: CIFAR-10.
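To see why file hashing only catches exact duplicates, here is a minimal sketch of that approach (plain Python, not part of FiftyOne’s API); because changing even a single pixel produces a completely different digest, near-duplicates slip through:
[ ]:
import hashlib

def file_hash(filepath):
    # Hash the raw bytes of the file; any change to the file,
    # however small, yields a completely different digest
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()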
Load the dataset¶
Open a Python shell to begin. We will use the CIFAR-10 dataset, which is available in the FiftyOne Dataset Zoo:
[1]:
import fiftyone as fo
import fiftyone.zoo as foz
# Load the CIFAR-10 test split
# Downloads the dataset from the web if necessary
dataset = foz.load_zoo_dataset("cifar10", split="test")
Split 'test' already downloaded
Loading 'cifar10' split 'test'
100% |█████████████| 10000/10000 [9.6s elapsed, 0s remaining, 1.0K samples/s]
Dataset 'cifar10-test' created
The dataset contains the ground truth labels in a ground_truth field:
[2]:
print(dataset)
Name: cifar10-test
Media type: image
Num samples: 10000
Persistent: False
Tags: ['test']
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
[3]:
print(dataset.first())
<Sample: {
'id': '6066448c7d373b861836bba8',
'media_type': 'image',
'filepath': '/home/ben/fiftyone/cifar10/test/data/000001.jpg',
'tags': BaseList(['test']),
'metadata': None,
'ground_truth': <Classification: {
'id': '6066448c7d373b861836bba7',
'tags': BaseList([]),
'label': 'cat',
'confidence': None,
'logits': None,
}>,
}>
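As a quick sanity check, you can also access the fields of a sample directly as attributes. For example, using the sample printed above:
[ ]:
sample = dataset.first()
print(sample.ground_truth.label)  # "cat"
print(sample.filepath)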
Let’s launch the FiftyOne App and use the GUI to explore the dataset visually before we go any further:
[4]:
session = fo.launch_app(dataset)
Compute uniqueness¶
Now we can process the entire dataset for uniqueness. This is a fairly expensive operation, but it should finish in a few minutes at most. The method embeds each sample in the dataset, builds a representation that relates the samples to each other, and then analyzes this representation to assign a uniqueness score to each sample.
[5]:
import fiftyone.brain as fob
fob.compute_uniqueness(dataset)
Generating embeddings...
100% |█████████████| 10000/10000 [1.2m elapsed, 0s remaining, 166.0 samples/s]
Computing uniqueness...
Uniqueness computation complete
The above method populates a uniqueness field on each sample that contains the sample’s uniqueness score. Let’s confirm this by printing some information about the dataset:
[6]:
# Now the samples have a "uniqueness" field on them
print(dataset)
Name: cifar10-test
Media type: image
Num samples: 10000
Persistent: False
Tags: ['test']
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
uniqueness: fiftyone.core.fields.FloatField
[7]:
print(dataset.first())
<Sample: {
'id': '6066448c7d373b861836bba8',
'media_type': 'image',
'filepath': '/home/ben/fiftyone/cifar10/test/data/000001.jpg',
'tags': BaseList(['test']),
'metadata': None,
'ground_truth': <Classification: {
'id': '6066448c7d373b861836bba7',
'tags': BaseList([]),
'label': 'cat',
'confidence': None,
'logits': None,
}>,
'uniqueness': 0.4978482190810026,
}>
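You can also summarize the new field numerically via FiftyOne’s aggregations. For example, the bounds() aggregation returns the (min, max) of the uniqueness scores, which lie in [0, 1], with higher values indicating more unique samples:
[ ]:
# Compute the (min, max) range of the uniqueness scores
print(dataset.bounds("uniqueness"))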
Visualize to find duplicate and near-duplicate images¶
Now, let’s visually inspect the least unique images in the dataset to see if our dataset has any issues:
[8]:
# Sort in increasing order of uniqueness (least unique first)
dups_view = dataset.sort_by("uniqueness")
# Open view in the App
session.view = dups_view
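From here, a typical workflow is to scroll through the lowest-uniqueness samples in the App, tag the ones you want to discard, and then remove them from the dataset. A minimal sketch, assuming you applied a tag named "dupe" (an arbitrary example name) to samples in the App:
[ ]:
# Collect the samples that were tagged "dupe" in the App
dupes_view = dataset.match_tags("dupe")
print("Found %d duplicates" % len(dupes_view))

# Permanently remove the tagged samples from the dataset
dataset.delete_samples(dupes_view)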