Image Deduplication with FiftyOne
This recipe demonstrates how to use FiftyOne to detect and remove duplicate images from your dataset.
Setup
If you haven't already, install FiftyOne:
[ ]:
!pip install fiftyone
This notebook also requires the tensorflow package:
[ ]:
!pip install tensorflow
Download the data
First, download the dataset to disk. The dataset is a 1,000-sample subset of CIFAR-100, a dataset of 32x32 pixel images, each with one of 100 classification labels such as apple, bicycle, and porcupine. You can use this helper script.
[1]:
from image_deduplication_helpers import download_dataset
download_dataset()
Downloading dataset of 1000 samples to:
/tmp/fiftyone/cifar100_with_duplicates
and corrupting the data (5% duplicates)
Download successful
The above script uses tensorflow.keras.datasets to download the dataset, so you must have TensorFlow installed.
The dataset is organized on disk as follows:
/tmp/fiftyone/
└── cifar100_with_duplicates/
    ├── <classA>/
    │   ├── <image1>.jpg
    │   ├── <image2>.jpg
    │   └── ...
    ├── <classB>/
    │   ├── <image1>.jpg
    │   ├── <image2>.jpg
    │   └── ...
    └── ...
As we will soon discover, some of these samples are duplicates, and we do not yet know which ones.
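To make the problem concrete, here is a minimal sketch, independent of FiftyOne, of one common way to spot exact duplicates: group files by a hash of their bytes. The function name find_exact_duplicates and the tiny temporary directory are hypothetical stand-ins for illustration, and the approach only catches byte-identical files, not resized or re-encoded copies.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_exact_duplicates(root_dir):
    """Group files under root_dir by the SHA-256 hash of their bytes.

    Returns only the groups that contain more than one file.
    """
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            groups[digest].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical stand-in for the real dataset: three files, two with identical bytes
with tempfile.TemporaryDirectory() as demo_dir:
    for label, name, content in [
        ("apple", "1.jpg", b"aaa"),
        ("apple", "2.jpg", b"bbb"),
        ("worm", "3.jpg", b"aaa"),
    ]:
        os.makedirs(os.path.join(demo_dir, label), exist_ok=True)
        with open(os.path.join(demo_dir, label, name), "wb") as f:
            f.write(content)

    duplicate_groups = find_exact_duplicates(demo_dir)
    # Store paths relative to the temporary directory so they stay readable
    duplicate_names = [
        [os.path.relpath(p, demo_dir) for p in group]
        for group in duplicate_groups
    ]

print(duplicate_names)
```

Pointing find_exact_duplicates at /tmp/fiftyone/cifar100_with_duplicates would list groups of files with identical contents, assuming the duplicates in that directory are exact copies.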
Create a dataset
Let's start by importing the FiftyOne library:
[2]:
import fiftyone as fo
Let's use a utility method provided by FiftyOne to load the image classification dataset from disk:
[3]:
import os
dataset_name = "cifar100_with_duplicates"
dataset_dir = os.path.join("/tmp/fiftyone", dataset_name)
dataset = fo.Dataset.from_dir(
    dataset_dir,
    fo.types.ImageClassificationDirectoryTree,
    name=dataset_name
)
 100% |████████████| 1000/1000 [1.2s elapsed, 0s remaining, 718.5 samples/s]
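The ImageClassificationDirectoryTree type tells FiftyOne to read each image's label from the name of the directory that contains it. The sketch below illustrates roughly that mapping using only the standard library; it is not FiftyOne's implementation, and the helper name list_labeled_images and the temporary directory are hypothetical.

```python
import os
import tempfile


def list_labeled_images(dataset_dir):
    """Return (filepath, label) pairs, taking each label from the parent directory name."""
    samples = []
    for label in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, label)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files at the top level
        for filename in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, filename), label))
    return samples


# Hypothetical miniature tree with one empty placeholder file per class
with tempfile.TemporaryDirectory() as demo_dir:
    for label, name in [("apple", "113.jpg"), ("mountain", "0.jpg"), ("worm", "905.jpg")]:
        os.makedirs(os.path.join(demo_dir, label), exist_ok=True)
        with open(os.path.join(demo_dir, label, name), "wb"):
            pass

    # Store paths relative to the temporary directory so they stay readable
    labeled = [
        (os.path.relpath(path, demo_dir), label)
        for path, label in list_labeled_images(demo_dir)
    ]

print(labeled)
```

Each pair corresponds to what the loaded dataset stores in a sample's filepath and ground_truth.label fields.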
Explore the dataset
We can poke around in the dataset:
[4]:
# Print summary information about the dataset
print(dataset)
Name: cifar100_with_duplicates
Media type: image
Num samples: 1000
Persistent: False
Info: {'classes': ['apple', 'aquarium_fish', 'baby', ...]}
Tags: []
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
[5]:
# Print a sample
print(dataset.first())
<Sample: {
'id': '5ff8dc665b5b9368e094de5a',
'media_type': 'image',
'filepath': '/tmp/fiftyone/cifar100_with_duplicates/apple/113.jpg',
'tags': BaseList([]),
'metadata': None,
'ground_truth': <Classification: {
'id': '5ff8dc665b5b9368e094de59',
'label': 'apple',
'confidence': None,
'logits': None,
}>,
}>
Create a view that contains only samples whose ground truth label is mountain:
[6]:
# Used to write view expressions that involve sample fields
from fiftyone import ViewField as F
view = dataset.match(F("ground_truth.label") == "mountain")
# Print summary information about the view
print(view)
Dataset: cifar100_with_duplicates
Media type: image
Num samples: 8
Tags: []
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
View stages:
1. Match(filter={'$expr': {'$eq': [...]}})
[7]:
# Print the first sample in the view
print(view.first())
<SampleView: {
'id': '5ff8dc675b5b9368e094e436',
'media_type': 'image',
'filepath': '/tmp/fiftyone/cifar100_with_duplicates/mountain/0.jpg',
'tags': BaseList([]),
'metadata': None,
'ground_truth': <Classification: {
'id': '5ff8dc675b5b9368e094e435',
'label': 'mountain',
'confidence': None,
'logits': None,
}>,
}>
Create a view with samples sorted by their ground truth labels in reverse alphabetical order:
[8]:
view = dataset.sort_by("ground_truth", reverse=True)
# Print summary information about the view
print(view)
Dataset: cifar100_with_duplicates
Media type: image
Num samples: 1000
Tags: []
Sample fields:
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
View stages:
1. SortBy(field_or_expr='ground_truth', reverse=True)
[9]:
# Print the first sample in the view
print(view.first())
<SampleView: {
'id': '5ff8dc685b5b9368e094ea0f',
'media_type': 'image',
'filepath': '/tmp/fiftyone/cifar100_with_duplicates/worm/905.jpg',
'tags': BaseList([]),
'metadata': None,
'ground_truth': <Classification: {
'id': '5ff8dc685b5b9368e094ea0e',
'label': 'worm',
'confidence': None,
'logits': None,
}>,
}>
Visualize the dataset
Start browsing the dataset:
[10]:
session = fo.launch_app(dataset)
Narrow your scope to 10 random samples:
[11]:
session.view = dataset.take(10)