Using Aggregations

Datasets are the core data structure in FiftyOne, allowing you to represent your raw data, labels, and associated metadata. When you query and manipulate a Dataset object using dataset views, a DatasetView object is returned, which represents a filtered view into a subset of the underlying dataset’s contents.

Complementary to this data model, one is often interested in computing aggregate statistics about datasets, such as label counts, distributions, and ranges, where each Sample is reduced to a single quantity in the aggregate results.

The fiftyone.core.aggregations module offers a declarative and highly efficient approach to computing summary statistics about your datasets and views.

Overview

All built-in aggregations are subclasses of the Aggregation class, each encapsulating the computation of a different statistic about your data.

Aggregations are conveniently exposed as methods on all Dataset and DatasetView objects:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# List available aggregations
print(dataset.list_aggregations())
# ['bounds', 'count', 'count_values', 'distinct', ..., 'sum']

Think of aggregations as more efficient, concise alternatives to writing explicit loops over your dataset to compute a statistic:

from collections import defaultdict

# Compute label histogram manually
manual_counts = defaultdict(int)
for sample in dataset:
    for detection in sample.ground_truth.detections:
        manual_counts[detection.label] += 1

# Compute via aggregation
counts = dataset.count_values("ground_truth.detections.label")
print(counts)  # same as `manual_counts` above

You can even aggregate on expressions that transform the data in arbitrarily complex ways:

from fiftyone import ViewField as F

# Expression that computes the number of predicted objects
num_objects = F("predictions.detections").length()

# The `(min, max)` number of predictions per sample
print(dataset.bounds(num_objects))

# The average number of predictions per sample
print(dataset.mean(num_objects))

The sections below discuss the available aggregations in more detail. You can also refer to the fiftyone.core.aggregations module documentation for detailed examples of using each aggregation.

Note

All aggregations can operate on embedded sample fields using the embedded.field.name syntax.

Aggregation fields can also include array fields. Most array fields are automatically unwound, but you can always manually unwind an array using the embedded.array[].field syntax. See Aggregating list fields for more details.

Compute bounds

You can use the bounds() aggregation to compute the [min, max] range of a numeric field of a dataset:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the bounds of the `uniqueness` field
bounds = dataset.bounds("uniqueness")
print(bounds)
# (0.15001302256126986, 1.0)

# Compute the bounds of the detection confidences in the `predictions` field
bounds = dataset.bounds("predictions.detections.confidence")
print(bounds)
# (0.05003104358911514, 0.9999035596847534)
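
As a rough mental model, `bounds()` behaves like taking the minimum and maximum over every non-None value reachable at the field path. The sketch below uses plain Python lists in place of a dataset. The `confidences` data and the `bounds_of` helper are hypothetical and only illustrate the semantics, not how FiftyOne executes the aggregation:

```python
# Hypothetical per-sample detection confidences; `None` marks a missing value
confidences = [[0.9, 0.2], [], [None, 0.55]]


def bounds_of(nested):
    """Return the (min, max) of all non-None values in a list of lists."""
    flat = [v for values in nested for v in values if v is not None]
    if not flat:
        # Assumption of this sketch: no values means no bounds
        return (None, None)
    return (min(flat), max(flat))


print(bounds_of(confidences))
# (0.2, 0.9)
```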

Count items

You can use the count() aggregation to compute the number of non-None field values in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the number of samples in the dataset
count = dataset.count()
print(count)
# 200

# Compute the number of samples with `predictions`
count = dataset.count("predictions")
print(count)
# 200

# Compute the number of detections in the `predictions` field
count = dataset.count("predictions.detections")
print(count)
# 5620
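
The three calls above count different things: samples, non-None field values, and list elements. A minimal plain-Python analogue, assuming hypothetical samples stored as dicts (this is an illustration of the semantics only, not FiftyOne code):

```python
# Hypothetical samples represented as plain dicts
samples = [
    {"predictions": {"detections": ["cat", "dog"]}},
    {"predictions": None},
    {"predictions": {"detections": ["bird"]}},
]

# Analogue of `count()`: every sample counts
num_samples = len(samples)

# Analogue of `count("predictions")`: only non-None field values count
num_with_predictions = sum(1 for s in samples if s["predictions"] is not None)

# Analogue of `count("predictions.detections")`: list elements are counted
num_detections = sum(
    len(s["predictions"]["detections"])
    for s in samples
    if s["predictions"] is not None
)

print(num_samples, num_with_predictions, num_detections)
# 3 2 3
```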

Count values

You can use the count_values() aggregation to compute the occurrences of field values in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the number of occurrences of each tag in the dataset
counts = dataset.count_values("tags")
print(counts)
# {'validation': 200}

# Compute a histogram of the predicted labels in the `predictions` field
counts = dataset.count_values("predictions.detections.label")
print(counts)
# {'bicycle': 13, 'hot dog': 8, ..., 'skis': 52}

Distinct values

You can use the distinct() aggregation to compute the distinct values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Get the distinct tags on the dataset
values = dataset.distinct("tags")
print(values)
# ['validation']

# Get the distinct labels in the `predictions` field
values = dataset.distinct("predictions.detections.label")
print(values)
# ['airplane', 'apple', 'backpack', ..., 'wine glass', 'zebra']

Histogram values

You can use the histogram_values() aggregation to compute the histograms of numeric fields of a collection:

import numpy as np
import matplotlib.pyplot as plt

import fiftyone.zoo as foz

def plot_hist(counts, edges):
    counts = np.asarray(counts)
    edges = np.asarray(edges)
    left_edges = edges[:-1]
    widths = edges[1:] - edges[:-1]
    plt.bar(left_edges, counts, width=widths, align="edge")

dataset = foz.load_zoo_dataset("quickstart")

#
# Compute a histogram of the `uniqueness` field
#

counts, edges, other = dataset.histogram_values("uniqueness", bins=50)

plot_hist(counts, edges)
plt.show(block=False)
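
To make the shape of the `(counts, edges, other)` result concrete, here is a small pure-Python sketch of equal-width binning. The `histogram` helper is hypothetical, and two details are assumptions of this sketch rather than documented FiftyOne behavior: the last bin is treated as right-inclusive, and `other` counts values that are None or fall outside the range:

```python
def histogram(values, bins, lo, hi):
    """Bin numeric values into `bins` equal-width bins on [lo, hi].

    Returns (counts, edges, other), where `other` counts values that are
    None or fall outside [lo, hi].
    """
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    other = 0
    for v in values:
        if v is None or v < lo or v > hi:
            other += 1
            continue
        # Clamp so that v == hi lands in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges, other


counts, edges, other = histogram(
    [0.1, 0.4, 0.45, 1.0, None], bins=2, lo=0.0, hi=1.0
)
print(counts, edges, other)
# [3, 1] [0.0, 0.5, 1.0] 1
```

Note that `edges` always has one more entry than `counts`, which is why the plotting helper above pairs `edges[:-1]` with `counts`.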

Schema

You can use the schema() aggregation to extract the names and types of the attributes of a specified embedded document field across all samples in a collection.

Schema aggregations are useful for detecting the presence and types of dynamic attributes of Label fields across a collection.

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Extract the names and types of all dynamic attributes on the
# `ground_truth` detections
print(dataset.schema("ground_truth.detections", dynamic_only=True))
{
    'area': <fiftyone.core.fields.FloatField object at 0x7fc94015fb50>,
    'iscrowd': <fiftyone.core.fields.FloatField object at 0x7fc964869fd0>,
}
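
Conceptually, a schema aggregation scans the documents and records which attribute names appear and what types their values have. The sketch below mimics that idea on plain dicts. The `detections` data, the `infer_schema` helper, and its `declared` parameter are hypothetical, and real schema aggregations return FiftyOne field objects rather than type names:

```python
# Hypothetical detections stored as plain dicts with extra custom attributes
detections = [
    {"label": "cat", "area": 1032.5, "iscrowd": 0.0},
    {"label": "dog", "area": 88.0},
]


def infer_schema(docs, declared=("label",)):
    """Map each undeclared attribute name to the set of value types observed."""
    schema = {}
    for doc in docs:
        for name, value in doc.items():
            if name not in declared:
                schema.setdefault(name, set()).add(type(value).__name__)
    return schema


print(infer_schema(detections))
# {'area': {'float'}, 'iscrowd': {'float'}}
```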

You can also use the list_schema() aggregation to extract the value type(s) in a list field across all samples in a collection:

from datetime import datetime
import fiftyone as fo

dataset = fo.Dataset()

sample1 = fo.Sample(
    filepath="image1.png",
    ground_truth=fo.Classification(
        label="cat",
        info=[
            fo.DynamicEmbeddedDocument(
                task="initial_annotation",
                author="Alice",
                timestamp=datetime(1970, 1, 1),
                notes=["foo", "bar"],
            ),
            fo.DynamicEmbeddedDocument(
                task="editing_pass",
                author="Bob",
                timestamp=datetime.utcnow(),
            ),
        ],
    ),
)

sample2 = fo.Sample(
    filepath="image2.png",
    ground_truth=fo.Classification(
        label="dog",
        info=[
            fo.DynamicEmbeddedDocument(
                task="initial_annotation",
                author="Bob",
                timestamp=datetime(2018, 10, 18),
                notes=["spam", "eggs"],
            ),
        ],
    ),
)

dataset.add_samples([sample1, sample2])

# Determine that `ground_truth.info` contains embedded documents
print(dataset.list_schema("ground_truth.info"))
# fo.EmbeddedDocumentField

# Determine the fields of the embedded documents in the list
print(dataset.schema("ground_truth.info[]"))
# {'task': StringField, ..., 'notes': ListField}

# Determine the type of the values in the nested `notes` list field
# Since `ground_truth.info` is not yet declared on the dataset's schema, we
# must manually include `[]` to unwind the info lists
print(dataset.list_schema("ground_truth.info[].notes"))
# fo.StringField

# Declare the `ground_truth.info` field
dataset.add_sample_field(
    "ground_truth.info",
    fo.ListField,
    subfield=fo.EmbeddedDocumentField,
    embedded_doc_type=fo.DynamicEmbeddedDocument,
)

# Now we can inspect the nested `notes` field without unwinding
print(dataset.list_schema("ground_truth.info.notes"))
# fo.StringField

Note

Schema aggregations are used internally by get_dynamic_field_schema() to impute the types of undeclared lists and embedded documents in a dataset.

Sum values

You can use the sum() aggregation to compute the sum of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute average confidence of detections in the `predictions` field
print(
    dataset.sum("predictions.detections.confidence") /
    dataset.count("predictions.detections.confidence")
)
# 0.34994137249820706

Min values

You can use the min() aggregation to compute the minimum of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute minimum confidence of detections in the `predictions` field
print(dataset.min("predictions.detections.confidence"))
# 0.05003104358911514

Max values

You can use the max() aggregation to compute the maximum of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute maximum confidence of detections in the `predictions` field
print(dataset.max("predictions.detections.confidence"))
# 0.9999035596847534

Mean values

You can use the mean() aggregation to compute the arithmetic mean of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute average confidence of detections in the `predictions` field
print(dataset.mean("predictions.detections.confidence"))
# 0.34994137249820706

Quantiles

You can use the quantiles() aggregation to compute the quantile(s) of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute quantiles of the `uniqueness` field
print(dataset.quantiles("uniqueness", [0.25, 0.5, 0.75, 0.9]))
# [0.22027, 0.33771, 0.62554, 0.69488]

# Compute quantiles of detection confidence in the `predictions` field
quantiles = dataset.quantiles(
    "predictions.detections.confidence",
    [0.25, 0.5, 0.75, 0.9],
)
print(quantiles)
# [0.09231, 0.20251, 0.56273, 0.94354]

Standard deviation

You can use the std() aggregation to compute the standard deviation of the (non-None) values of a field in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute standard deviation of the confidence of detections in the
# `predictions` field
print(dataset.std("predictions.detections.confidence"))
# 0.3184061813934825

Values

You can use the values() aggregation to extract a list containing the values of a field across all samples in a collection:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Extract the `uniqueness` values for all samples
uniqueness = dataset.values("uniqueness")
print(len(uniqueness))  # 200

# Extract the labels for all predictions
labels = dataset.values("predictions.detections.label")
print(len(labels))  # 200
print(labels[0])  # ['bird', ..., 'bear', 'sheep']

Note

Unlike other aggregations, values() does not automatically unwind list fields, which ensures that the returned values match the potentially-nested structure of the documents.

You can opt in to unwinding specific list fields using the [] syntax, or you can pass the optional unwind=True parameter to unwind all supported list fields. See Aggregating list fields for more information.
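
The difference between the nested and unwound shapes can be pictured with plain Python lists. The `labels_per_sample` data below is a hypothetical stand-in for the result of a nested `values()` call, and flattening it is only an analogy for what unwinding yields:

```python
from itertools import chain

# Hypothetical result shaped like `values("predictions.detections.label")`:
# one inner list per sample
labels_per_sample = [["bird", "bear"], ["sheep"], []]

# Flattened shape, analogous to what unwinding the list field yields
all_labels = list(chain.from_iterable(labels_per_sample))

print(len(labels_per_sample))  # 3
print(all_labels)  # ['bird', 'bear', 'sheep']
```

The nested form preserves which sample each value came from, while the flat form is convenient when only the pooled values matter.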

Advanced usage

Aggregating list fields

Aggregations that operate on scalar fields can also be applied to the elements of list fields by appending [] to the list component of the field path.

The example below demonstrates this capability:

import fiftyone as fo

dataset = fo.Dataset()
dataset.add_samples(
    [
        fo.Sample(
            filepath="/path/to/image1.png",
            keypoints=fo.Keypoint(points=[(0, 0), (1, 1)]),
            classes=fo.Classification(
                label="cat", confidence=0.9, friends=["dog", "squirrel"]
            ),
        ),
        fo.Sample(
            filepath="/path/to/image2.png",
            keypoints=fo.Keypoint(points=[(0, 0), (0.5, 0.5), (1, 1)]),
            classes=fo.Classification(
                label="dog", confidence=0.8, friends=["rabbit", "squirrel"],
            ),
        ),
    ]
)

#
# Count the number of keypoints in the dataset
#
# The `points` list attribute is declared on the `Keypoint` class, so it is
# automatically unwound
#
count = dataset.count("keypoints.points")
print(count)
# 5

#
# Count the values in the `friends` attribute of the `classes` field
#
# The `friends` list attribute is a dynamic custom attribute, so we must
# explicitly request that it be unwound
#
counts = dataset.count_values("classes.friends[]")
print(counts)
# {'dog': 1, 'squirrel': 2, 'rabbit': 1}

Note

FiftyOne will automatically unwind all array fields that are defined in the dataset’s schema without requiring you to explicitly specify this via the [] syntax. This includes the following cases:

Top-level list fields: When you write an aggregation that refers to a top-level list field of a dataset; i.e., list_field is automatically coerced to list_field[], if necessary.

Frame fields: When you write an aggregation that refers to a frame-level field of a video dataset; i.e., frames.classification.label is automatically coerced to frames[].classification.label if necessary.

Embedded list fields: When you write an aggregation that refers to a list attribute that is declared on a Sample, Frame, or Label class, such as the Classification.tags, Detections.detections, or Keypoint.points attributes; i.e., ground_truth.detections.label is automatically coerced to ground_truth.detections[].label, if necessary.
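
One way to picture what a `[]` component means is a path walker that flattens a list at exactly the marked component. The `resolve` helper below is a hypothetical plain-Python model operating on dicts. It is not how FiftyOne implements unwinding, and it does not perform the automatic coercion described above:

```python
def resolve(doc, path):
    """Collect the values at a dotted `path`, unwinding components ending in `[]`."""
    values = [doc]
    for part in path.split("."):
        unwind = part.endswith("[]")
        key = part[:-2] if unwind else part
        next_values = []
        for value in values:
            if value is None:
                continue
            child = value.get(key)
            if unwind and child is not None:
                # Flatten: each list element becomes its own value
                next_values.extend(child)
            else:
                next_values.append(child)
        values = next_values
    return values


sample = {
    "ground_truth": {
        "detections": [{"label": "cat"}, {"label": "dog"}],
    }
}

print(resolve(sample, "ground_truth.detections[].label"))
# ['cat', 'dog']
print(resolve(sample, "ground_truth.detections"))
# [[{'label': 'cat'}, {'label': 'dog'}]]
```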

Aggregating expressions

Aggregations also support more complex computations via the optional expr argument. Every aggregation accepts it, and it lets you specify a ViewExpression that defines an arbitrary transformation of the field you're operating on, applied prior to aggregating.

The code sample below demonstrates the power of aggregating with expressions by computing some statistics about the number of predicted objects in a dataset:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Expression that computes the number of predicted objects
num_objects = F("predictions.detections").length()

# The `(min, max)` number of predictions per sample
print(dataset.bounds(num_objects))

# The average number of predictions per sample
print(dataset.mean(num_objects))

# Two equivalent ways of computing the total number of predictions
print(dataset.sum(num_objects))
print(dataset.count("predictions.detections"))

Note

When aggregating expressions, field names may contain list fields, and such field paths are handled as explained above.

However, there is one important exception when expressions are involved: field paths that end in array fields are not automatically unwound. You must specify that they should be unwound by appending []. This change in default behavior allows for the possibility that the ViewExpression you provide is intended to operate on the array as a whole.

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Counts the number of predicted objects
# Here, `predictions.detections` is treated as `predictions.detections[]`
print(dataset.count("predictions.detections"))

# Counts the number of predicted objects with confidence > 0.9
# Here, `predictions.detections` is not automatically unwound
num_preds = F("predictions.detections").filter(F("confidence") > 0.9).length()
print(dataset.sum(num_preds))

# Computes the (min, max) ground truth bounding box area in normalized
# coordinates
# Here we must manually specify that we want to unwind the terminal list
# field `ground_truth.detections` by appending `[]`
bbox_area = F("bounding_box")[2] * F("bounding_box")[3]
print(dataset.bounds(F("ground_truth.detections[]").apply(bbox_area)))

Batching aggregations

Rather than computing a single aggregation by invoking methods on a Dataset or DatasetView object, you can also instantiate an Aggregation object directly. In this case, the aggregation is not tied to any dataset or view, only to the parameters, such as the field name, that define it.

import fiftyone as fo

# will count the number of samples in a dataset
sample_count = fo.Count()

# will count the labels in a `ground_truth` detections field
count_values = fo.CountValues("ground_truth.detections.label")

# will compute a histogram of the `uniqueness` field
histogram_values = fo.HistogramValues("uniqueness", bins=50)

Instantiating aggregations in this way allows you to execute multiple aggregations on a dataset or view efficiently in a batch via aggregate():

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

results = dataset.aggregate([sample_count, count_values, histogram_values])

print(results[0])
# 200

print(results[1])
# {'bowl': 15, 'scissors': 1, 'cup': 21, ..., 'vase': 1, 'sports ball': 3}

print(results[2][0])  # counts
# [0, 0, 0, ..., 15, 12, ..., 0, 0]

print(results[2][1])  # edges
# [0.0, 0.02, 0.04, ..., 0.98, 1.0]
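
Because the results come back as a list in the same order as the aggregations, positional indexing can get hard to read as a batch grows. One possible convenience is to pair each result with a name. The sketch below uses placeholder values shaped like the list above instead of a real `aggregate()` call, and the names are arbitrary:

```python
# Stand-in values shaped like the list returned by `aggregate()` in the
# example above; these numbers are placeholders, not real results
results = [200, {"cup": 21, "bowl": 15}, ([0, 3, 1], [0.0, 0.5, 1.0, 1.5], 0)]

# Names listed in the same order as the aggregations in the batch
names = ["sample_count", "count_values", "histogram_values"]
named = dict(zip(names, results))

counts, edges, other = named["histogram_values"]
print(named["sample_count"])  # 200
print(len(edges) - len(counts))  # 1
```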

Transforming data before aggregating

You can use view stages like map_values() and map_labels() in concert with aggregations to efficiently compute statistics on your datasets.

For example, suppose you would like to compute the histogram of the labels in a dataset with certain labels grouped into a single category. You can use map_values() + count_values() to succinctly express this:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Map `cat` and `dog` to `pet`
labels_map = {"cat": "pet", "dog": "pet"}

counts = (
    dataset
    .map_values("ground_truth.detections.label", labels_map)
    .count_values("ground_truth.detections.label")
)

print(counts)
# {'toothbrush': 2, 'train': 5, ..., 'pet': 31, ..., 'cow': 22}
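
In plain Python terms, this combination amounts to remapping each label and then counting. The sketch below uses a hypothetical flat list of labels and is only meant to show the expected effect of the mapping, in particular that unmapped labels pass through unchanged:

```python
from collections import Counter

# Hypothetical flat list of ground truth labels
labels = ["cat", "dog", "cow", "cat", "train"]

# Same mapping as above; labels without an entry keep their original value
labels_map = {"cat": "pet", "dog": "pet"}

counts = Counter(labels_map.get(label, label) for label in labels)
print(dict(counts))
# {'pet': 3, 'cow': 1, 'train': 1}
```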

Or, suppose you would like to compute the average confidence of a model’s predictions, ignoring any values less than 0.5. You can use filter_labels() + mean() to succinctly express this:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

avg_conf = (
    dataset
    .filter_labels("predictions", F("confidence") >= 0.5)
    .mean("predictions.detections.confidence")
)

print(avg_conf)
# 0.8170506501060617
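
The key point is that the filter is applied to individual detections before averaging, so the mean is taken over the surviving detections rather than over samples. A minimal plain-Python analogue with hypothetical confidences:

```python
from statistics import mean

# Hypothetical per-sample prediction confidences
confidences = [[0.9, 0.3], [0.6], [0.1, 0.75]]

# Keep only values >= 0.5, then average over the remaining detections
kept = [c for sample in confidences for c in sample if c >= 0.5]
print(kept)
# [0.9, 0.6, 0.75]
print(mean(kept))
# 0.75
```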

Aggregating frame labels

You can compute aggregations on the frame labels of a video dataset by adding the frames prefix to the relevant frame field name:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-video")

# Count the number of video frames
count = dataset.count("frames")
print(count)
# 1279

# Compute a histogram of per-frame object labels
counts = dataset.count_values("frames.detections.detections.label")
print(counts)
# {'person': 1108, 'vehicle': 7511, 'road sign': 2726}
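
Frame-level aggregations pool values across two levels of nesting: the frames of each video and the labels within each frame. The sketch below models that with hypothetical plain-Python data and is only an analogy for the semantics:

```python
from collections import Counter

# Hypothetical video samples: each has a list of frames, each with detections
videos = [
    {"frames": [{"detections": ["person", "vehicle"]}, {"detections": ["vehicle"]}]},
    {"frames": [{"detections": ["road sign"]}]},
]

# Analogue of `count("frames")`: frames are pooled across all videos
num_frames = sum(len(v["frames"]) for v in videos)

# Analogue of counting per-frame labels: flatten videos, frames, then labels
label_counts = Counter(
    label
    for v in videos
    for frame in v["frames"]
    for label in frame["detections"]
)

print(num_frames)  # 3
print(dict(label_counts))
# {'person': 1, 'vehicle': 2, 'road sign': 1}
```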