Evaluating a Classifier with FiftyOne#

This notebook demonstrates an end-to-end example of fine-tuning a classification model with fastai on a Kaggle dataset, then using FiftyOne to evaluate it and understand the strengths and weaknesses of both the model and the underlying ground truth annotations.

Specifically, we’ll cover:

  • Loading a labeled classification dataset into FiftyOne

  • Indexing the dataset by visual uniqueness

  • Fine-tuning a pre-trained model with fastai and adding its predictions to the dataset

  • Evaluating the predictions with FiftyOne’s evaluation API, both in aggregate and per-sample

So, what’s the takeaway?

The loss function of your model training loop alone doesn’t give you the full picture of a model. In practice, the limiting factor on your model’s performance is often data quality issues that FiftyOne can help you address.

Running the workflow presented here on your ML projects will help you understand the current failure modes (edge cases) of your model and how to fix them, including:

  • Identifying scenarios that require additional training samples in order to boost your model’s performance

  • Deciding whether your ground truth annotations have errors/weaknesses that need to be corrected before any subsequent model training will be profitable

Setup#

If you haven’t already, install FiftyOne:

[ ]:
!pip install fiftyone

We’ll also need torch and torchvision installed:

[1]:
!pip install torch torchvision

Download dataset#

Let’s start by downloading the Malaria Cell Images Dataset from Kaggle using the Kaggle API:

[ ]:
!pip install --upgrade kaggle
[4]:
%%bash

# You can create an account for free and get an API token as follows:
# kaggle.com > account > API > Create new API token
export KAGGLE_USERNAME=XXXXXXXXXXXXXXXX
export KAGGLE_KEY=XXXXXXXXXXXXXXXX

kaggle datasets download -d iarunava/cell-images-for-detecting-malaria
Downloading cell-images-for-detecting-malaria.zip

100%|██████████| 675M/675M [00:23<00:00, 30.7MB/s]
[5]:
%%bash

unzip -q cell-images-for-detecting-malaria.zip

rm -rf cell_images/cell_images
rm cell_images/Parasitized/Thumbs.db
rm cell_images/Uninfected/Thumbs.db
rm cell-images-for-detecting-malaria.zip

The unzipped dataset consists of a cell_images/ folder with two subdirectories, Uninfected and Parasitized, that each contain 13,779 example images of the respective class of this binary classification task (the wc -l counts below read 13782 because they include the total, ., and .. lines of the ls -lah output):

[6]:
%%bash

ls -lah cell_images/Uninfected | head
ls -lah cell_images/Parasitized | head

printf "\nClass counts\n"
ls -lah cell_images/Uninfected | wc -l
ls -lah cell_images/Parasitized | wc -l
total 354848
drwxr-xr-x  13781 voxel51  staff   431K Feb 18 08:56 .
drwxr-xr-x      4 voxel51  staff   128B Feb 18 08:56 ..
-rw-r--r--      1 voxel51  staff    11K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_128.png
-rw-r--r--      1 voxel51  staff    11K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_131.png
-rw-r--r--      1 voxel51  staff   9.7K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_144.png
-rw-r--r--      1 voxel51  staff   5.8K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_21.png
-rw-r--r--      1 voxel51  staff   9.4K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_25.png
-rw-r--r--      1 voxel51  staff   7.5K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_34.png
-rw-r--r--      1 voxel51  staff    10K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_48.png
total 404008
drwxr-xr-x  13781 voxel51  staff   431K Feb 18 08:56 .
drwxr-xr-x      4 voxel51  staff   128B Feb 18 08:56 ..
-rw-r--r--      1 voxel51  staff    14K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_162.png
-rw-r--r--      1 voxel51  staff    18K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_163.png
-rw-r--r--      1 voxel51  staff    13K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_164.png
-rw-r--r--      1 voxel51  staff    13K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_165.png
-rw-r--r--      1 voxel51  staff    11K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_166.png
-rw-r--r--      1 voxel51  staff    14K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_167.png
-rw-r--r--      1 voxel51  staff    11K Oct 14  2019 C100P61ThinF_IMG_20150918_144104_cell_168.png

Class counts
   13782
   13782

Load dataset into FiftyOne#

Let’s load the dataset into FiftyOne and explore it!

[ ]:
import os
import fiftyone as fo

DATASET_DIR = os.path.join(os.getcwd(), "cell_images")

Create FiftyOne dataset#

FiftyOne provides builtin support for loading datasets in dozens of common formats with a single line of code:

[ ]:
# Create FiftyOne dataset
dataset = fo.Dataset.from_dir(
    dataset_dir=DATASET_DIR,
    dataset_type=fo.types.ImageClassificationDirectoryTree,
    name="malaria-cell-images",
)
dataset.persistent = True

print(dataset)
 100% |███| 27558/27558 [35.8s elapsed, 0s remaining, 765.8 samples/s]
Name:           malaria-cell-images
Media type:     image
Num samples:    27558
Persistent:     True
Info:           {'classes': ['Parasitized', 'Uninfected']}
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

(Future use) Load an existing FiftyOne dataset#

Now that the data is loaded into FiftyOne, you can easily work with the same dataset in a future session on the same machine by loading it by name:

[ ]:
# Load existing dataset
dataset = fo.load_dataset("malaria-cell-images")
print(dataset)
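
If you don’t recall the dataset’s name, you can list all FiftyOne datasets available on the machine (a quick sketch):

[ ]:
# List the names of all datasets on this machine
print(fo.list_datasets())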

Index the dataset by visual uniqueness#

Let’s start by indexing the dataset by visual uniqueness using FiftyOne’s image uniqueness method.

This method adds a scalar uniqueness field to each sample that measures its visual uniqueness relative to the other samples in the dataset.

[ ]:
import fiftyone.brain as fob

fob.compute_uniqueness(dataset)
Loading uniqueness model...
Downloading model from Google Drive ID '1SIO9XreK0w1ja4EuhBWcR10CnWxCOsom'...
 100% |████|  100.6Mb/100.6Mb [135.7ms elapsed, 0s remaining, 741.3Mb/s]
Preparing data...
Generating embeddings...
 100% |███| 27558/27558 [39.6s elapsed, 0s remaining, 618.6 samples/s]
Computing uniqueness...
Saving results...
 100% |███| 27558/27558 [42.9s elapsed, 0s remaining, 681.0 samples/s]
Uniqueness computation complete
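
Each sample now has a numeric uniqueness score. As a quick sanity check (a minimal sketch using FiftyOne’s aggregation methods), we can inspect the new field:

[ ]:
# Compute the (min, max) range of the `uniqueness` field
print(dataset.bounds("uniqueness"))

# Peek at the score of an individual sample
print(dataset.first()["uniqueness"])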

Visualize dataset in the App#

Now let’s launch the FiftyOne App and use it to interactively explore the dataset.

For example, try using the view bar to sort the samples so that the most visually unique samples in the dataset appear first:

[2]:
# Most of the MOST UNIQUE samples are parasitized
session = fo.launch_app(dataset)
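
If you prefer code to the view bar, the same sort can be applied programmatically (a sketch using the sort_by stage that also appears later in this notebook):

[ ]:
# Programmatic equivalent of sorting by uniqueness in the view bar
session.view = dataset.sort_by("uniqueness", reverse=True)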

Now let’s add a Limit(500) stage in the view bar and open the Labels tab to view some statistics about the 500 most unique samples in the dataset.

Notice that a vast majority of the most visually unique samples in the dataset are Parasitized, which makes sense because these are the infected, abnormal cells.

[6]:
session.show()

Conversely, if we use the view bar to show the 500 least visually unique samples, we find that 499 of them are Uninfected!

[7]:
# All of the LEAST UNIQUE samples are uninfected
session.show()
[8]:
session.show()
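
We can also verify both observations numerically (a minimal sketch combining the sort_by, limit, and count_values methods used elsewhere in this notebook):

[ ]:
# Label distribution among the 500 MOST visually unique samples
most_unique = dataset.sort_by("uniqueness", reverse=True).limit(500)
print(most_unique.count_values("ground_truth.label"))

# Label distribution among the 500 LEAST visually unique samples
least_unique = dataset.sort_by("uniqueness").limit(500)
print(least_unique.count_values("ground_truth.label"))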

Training a model#

Now that we have some basic intuition about the dataset, let’s train a model!

In this example, we’ll use fastai to fine-tune a pre-trained model on our dataset in just a few lines of code and a few minutes of GPU time.

[ ]:
!pip install --upgrade fastai
[ ]:
import numpy as np
from fastai.data.all import *
from fastai.vision.data import *
from fastai.vision.all import *

The code sample below loads the dataset into fastai data loaders:

[ ]:
# Load dataset into fastai

path = Path(DATASET_DIR)

splitter = RandomSplitter(valid_pct=0.2)

item_tfms = [Resize(224)]
batch_tfms = [
    *aug_transforms(flip_vert=True, max_zoom=1.2, max_warp=0),
    Normalize.from_stats(*imagenet_stats),
]

data_block = DataBlock(
    blocks=[ImageBlock, CategoryBlock],
    get_items=get_image_files,
    get_y=parent_label,
    splitter=splitter,
    item_tfms=item_tfms,
    batch_tfms=batch_tfms,
)

data = data_block.dataloaders(path, bs=64)
data.show_batch()
../_images/tutorials_evaluate_classifications_30_0.png

Now let’s load a pre-trained xresnet34 model:

[ ]:
# Load a pre-trained model
learner = cnn_learner(data, xresnet34, metrics=[accuracy]).to_fp16()

and fine-tune it for 15 epochs on our dataset:

[ ]:
# Fine-tune model on our dataset
learner.fine_tune(15)
epoch  train_loss  valid_loss  accuracy  time
0      0.346846    0.330612    0.878606  01:27

epoch  train_loss  valid_loss  accuracy  time
0      0.242244    0.199095    0.928325  01:43
1      0.215641    0.166363    0.943205  01:42
2      0.196613    0.149990    0.946834  01:43
3      0.185642    0.135028    0.952822  01:42
4      0.156264    0.128932    0.953366  01:43
5      0.157303    0.127865    0.955181  01:42
6      0.153651    0.117362    0.957177  01:42
7      0.150719    0.120508    0.956088  01:42
8      0.137772    0.114590    0.955181  01:42
9      0.131181    0.113628    0.956632  01:42
10     0.130191    0.107792    0.961894  01:42
11     0.132632    0.111199    0.959898  01:42
12     0.119349    0.106245    0.962257  01:43
13     0.125340    0.106004    0.961169  01:42
14     0.121119    0.106404    0.962257  01:42
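
To visualize the training curves, you can also plot the recorded losses (a quick sketch using fastai’s Recorder; requires a plotting backend such as matplotlib):

[ ]:
# Plot the train/validation loss curves recorded during fine-tuning
learner.recorder.plot_loss()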

In this case, we reached 96.2% validation accuracy in about 25 minutes!

Let’s preview some sample predictions using fastai:

[ ]:
learner.show_results()
../_images/tutorials_evaluate_classifications_36_1.png

Save model checkpoint#

Let’s save a checkpoint of our model so we can load it later.

[ ]:
# Save model checkpoint
learner.save("xresnet34-malaria")
Path('models/xresnet34-malaria.pth')

If you’re working in a Colab notebook and would like to download your model, you can do so as follows:

[ ]:
# (Colab only) Download model to your machine
from google.colab import files

files.download("models/xresnet34-malaria.pth")

(Future use) Load saved model#

Run this block if you would like to load a model that you previously trained and saved as a checkpoint.

For Colab users, run this first block to upload the checkpoint from your local machine:

[ ]:
# (Colab only) Upload model from your machine
from google.colab import files

uploaded = files.upload()
for filename in uploaded.keys():
    print("Uploaded '%s'" % filename)

fastai expects the model to be in a models/ directory, so let’s move it:

[ ]:
%%bash

mkdir -p models/
mv xresnet34-malaria.pth models/

Now we can load the saved model:

[ ]:
# Loads `models/xresnet34-malaria.pth` generated by `.save()`
learner = cnn_learner(data, xresnet34, metrics=[accuracy]).to_fp16()
learner.load("xresnet34-malaria")
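
To confirm that the restored weights behave as expected, you can re-run validation (a minimal sketch; fastai’s validate() returns the validation loss followed by the metrics passed to the learner):

[ ]:
# Recompute validation loss and accuracy with the restored weights
valid_loss, valid_accuracy = learner.validate()
print(valid_loss, valid_accuracy)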

Evaluating model with FiftyOne#

While 96% accuracy sounds great, aggregate evaluation metrics alone are not enough to fully understand a model’s performance or to identify what needs to be done to improve it.

Add predictions to FiftyOne dataset#

Let’s add our model’s predictions to our FiftyOne dataset so we can evaluate it in more detail:

[ ]:
from fiftyone import ViewField as F

def do_inference(learner, dl, dataset, classes, tag):
    # Perform inference (`dl.split_idx` is 0 for the train split, 1 for validation)
    preds, _ = learner.get_preds(ds_idx=dl.split_idx)
    preds = preds.numpy()

    # Save predictions to FiftyOne dataset
    with fo.ProgressBar() as pb:
        for filepath, scores in zip(pb(dl.items), preds):
            sample = dataset[str(filepath)]
            target = np.argmax(scores)
            sample.tags = [tag]
            sample["predictions"] = fo.Classification(
                label=classes[target],
                confidence=scores[target],
                logits=np.log(scores),  # log of the softmax scores
            )
            sample.save()

classes = list(data.vocab)

# Run inference on train split
do_inference(learner, data.train, dataset, classes, "train")

# Run inference on validation split
do_inference(learner, data.valid, dataset, classes, "validation")
 100% |███| 22047/22047 [1.1m elapsed, 0s remaining, 324.2 samples/s]

The predictions are stored in a predictions field of our dataset:

[ ]:
print(dataset)
Name:           malaria-cell-images
Media type:     image
Num samples:    27558
Persistent:     True
Info:           {'classes': ['Parasitized', 'Uninfected']}
Tags:           ['train', 'validation']
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    uniqueness:   fiftyone.core.fields.FloatField
    predictions:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

We’ve added predictions for both the train split:

[ ]:
print(dataset.match_tags("train").first())
<SampleView: {
    'id': '601acd101a0300d4addd48cd',
    'media_type': 'image',
    'filepath': '/content/cell_images/Parasitized/C100P61ThinF_IMG_20150918_144104_cell_162.png',
    'tags': BaseList(['train']),
    'metadata': None,
    'ground_truth': <Classification: {
        'id': '601acd101a0300d4addd48cc',
        'label': 'Parasitized',
        'confidence': None,
        'logits': None,
    }>,
    'uniqueness': 0.43538014682836707,
    'predictions': <Classification: {
        'id': '601ae8711a0300d4ade1dc03',
        'label': 'Parasitized',
        'confidence': 0.9984512329101562,
        'logits': array([-1.5499677e-03, -6.4702997e+00], dtype=float32),
    }>,
}>

and the validation split:

[ ]:
print(dataset.match_tags("validation").first())
<SampleView: {
    'id': '601acd101a0300d4addd48e5',
    'media_type': 'image',
    'filepath': '/content/cell_images/Parasitized/C100P61ThinF_IMG_20150918_144104_cell_170.png',
    'tags': BaseList(['validation']),
    'metadata': None,
    'ground_truth': <Classification: {
        'id': '601acd101a0300d4addd48e4',
        'label': 'Parasitized',
        'confidence': None,
        'logits': None,
    }>,
    'uniqueness': 0.31238555314371125,
    'predictions': <Classification: {
        'id': '601ae69b1a0300d4ade1901f',
        'label': 'Parasitized',
        'confidence': 0.9914804697036743,
        'logits': array([-0.00855603, -4.765392  ], dtype=float32),
    }>,
}>

Running the evaluation#

FiftyOne provides a powerful evaluation API for evaluating various types of models at both the aggregate and sample levels.

In this case, we’ll use the binary classification functionality to analyze our model:

[9]:
# Evaluate the predictions in the `predictions` field with respect to the
# labels in the `ground_truth` field
results = dataset.evaluate_classifications(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    method="binary",
    classes=["Uninfected", "Parasitized"],
)

The method returned a results object that provides a number of convenient methods for analyzing our predictions.
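
For example, assuming the metrics() helper available on classification results, you can pull the headline numbers directly (a quick sketch):

[ ]:
# Aggregate metrics (accuracy, precision, recall, F-score) as a dict
print(results.metrics())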

Viewing aggregate metrics#

Let’s start by printing a classification report:

[6]:
results.print_report()
              precision    recall  f1-score   support

  Uninfected       0.95      0.98      0.96     13779
 Parasitized       0.98      0.95      0.96     13779

    accuracy                           0.96     27558
   macro avg       0.96      0.96      0.96     27558
weighted avg       0.96      0.96      0.96     27558

Now, how about a confusion matrix:

[7]:
plot = results.plot_confusion_matrix()
plot.show()
../_images/tutorials_evaluate_classifications_60_0.png
[8]:
plot.freeze()  # replaces interactive plot with static image

and finally a precision-recall curve:

[9]:
plot = results.plot_pr_curve()
plot.show()
../_images/tutorials_evaluate_classifications_63_0.png
[10]:
plot.freeze()  # replaces interactive plot with static image

The evaluation method also populated a new eval field on our samples that records whether each prediction is a true positive (TP), false positive (FP), false negative (FN), or true negative (TN).

In a few minutes, we’ll use this field to interactively explore each type of prediction visually in the App. But for now, let’s check the distribution of these labels:

[10]:
print(dataset.count_values("eval"))
{'FN': 708, 'FP': 334, 'TN': 13445, 'TP': 13071}
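
These counts are consistent with the report above. As a quick sanity check, here’s a sketch that recomputes the positive-class (Parasitized) metrics from the counts and isolates the false positives for inspection:

[ ]:
# Recompute positive-class precision/recall from the TP/FP/FN counts
counts = dataset.count_values("eval")
tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
print("precision: %.4f" % (tp / (tp + fp)))  # ~0.9751
print("recall: %.4f" % (tp / (tp + fn)))     # ~0.9486

# Isolate the false positives for visual inspection in the App
fp_view = dataset.match(F("eval") == "FP")
print(len(fp_view))  # 334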

Visualizing the most unique predictions#

Now that we have a sense for the aggregate performance of our model, let’s dive into sample-level analysis. We’ll load a view in the App that shows the correctly predicted samples from the validation split, sorted in descending order by the visual uniqueness that we previously computed and stored in the uniqueness field of the dataset:

[21]:
# Show most unique CORRECT predictions on validation split
session.view = (
    dataset
    .match_tags("validation")
    .match(F("predictions.label") == F("ground_truth.label"))
    .sort_by("uniqueness", reverse=True)
)