# Using Aggregations

Datasets are the core data structure in FiftyOne, allowing you to represent your raw data, labels, and associated metadata. When you query and manipulate a Dataset object using dataset views, a DatasetView object is returned, which represents a filtered view into a subset of the underlying dataset’s contents.

Complementary to this data model, one is often interested in computing aggregate statistics about datasets, such as label counts, distributions, and ranges, where each Sample is reduced to a single quantity in the aggregate results.

The fiftyone.core.aggregations module offers a declarative and highly-efficient approach to computing summary statistics about your datasets and views.

## Overview

All builtin aggregations are subclasses of the Aggregation class, each encapsulating the computation of a different statistic about your data.

Aggregations are conveniently exposed as methods on all Dataset and DatasetView objects:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# List available aggregations
print(dataset.list_aggregations())
# ['bounds', 'count', 'count_values', 'distinct', ..., 'sum']
```

Think of aggregations as more efficient, concise alternatives to writing explicit loops over your dataset to compute a statistic:

```python
from collections import defaultdict

# Compute label histogram manually
manual_counts = defaultdict(int)
for sample in dataset:
    for detection in sample.ground_truth.detections:
        manual_counts[detection.label] += 1

# Compute via aggregation
counts = dataset.count_values("ground_truth.detections.label")
print(counts)  # same as manual_counts above
```

You can even aggregate on expressions that transform the data in arbitrarily complex ways:

```python
from fiftyone import ViewField as F

# Expression that computes the number of predicted objects
num_objects = F("predictions.detections").length()

# The (min, max) number of predictions per sample
print(dataset.bounds(num_objects))

# The average number of predictions per sample
print(dataset.mean(num_objects))
```

The sections below discuss the available aggregations in more detail. You can also refer to the fiftyone.core.aggregations module documentation for detailed examples of using each aggregation.

Note

All aggregations can operate on embedded sample fields using the embedded.field.name syntax.

Aggregation fields can also include array fields. Most array fields are automatically unwound, but you can always manually unwind an array using the embedded.array[].field syntax. See this section for more details.

## Compute bounds

You can use the bounds() aggregation to compute the [min, max] range of a numeric field of a dataset:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the bounds of the uniqueness field
bounds = dataset.bounds("uniqueness")
print(bounds)
# (0.15001302256126986, 1.0)

# Compute the bounds of the detection confidences in the predictions field
bounds = dataset.bounds("predictions.detections.confidence")
print(bounds)
# (0.05003104358911514, 0.9999035596847534)
```

## Count items

You can use the count() aggregation to compute the number of non-None field values in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the number of samples in the dataset
count = dataset.count()
print(count)
# 200

# Compute the number of samples with predictions
count = dataset.count("predictions")
print(count)
# 200

# Compute the number of detections in the predictions field
count = dataset.count("predictions.detections")
print(count)
# 5620
```

## Count values

You can use the count_values() aggregation to compute the occurrences of field values in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the counts of the sample tags in the dataset
counts = dataset.count_values("tags")
print(counts)
# {'validation': 200}

# Compute a histogram of the predicted labels in the predictions field
counts = dataset.count_values("predictions.detections.label")
print(counts)
# {'bicycle': 13, 'hot dog': 8, ..., 'skis': 52}
```

## Distinct values

You can use the distinct() aggregation to compute the distinct values of a field in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Get the distinct tags on the dataset
values = dataset.distinct("tags")
print(values)
# ['validation']

# Get the distinct labels in the predictions field
values = dataset.distinct("predictions.detections.label")
print(values)
# ['airplane', 'apple', 'backpack', ..., 'wine glass', 'zebra']
```

## Histogram values

You can use the histogram_values() aggregation to compute the histograms of numeric fields of a collection:

```python
import numpy as np
import matplotlib.pyplot as plt

import fiftyone.zoo as foz


def plot_hist(counts, edges):
    counts = np.asarray(counts)
    edges = np.asarray(edges)

    left_edges = edges[:-1]
    widths = edges[1:] - edges[:-1]
    plt.bar(left_edges, counts, width=widths, align="edge")


dataset = foz.load_zoo_dataset("quickstart")

#
# Compute a histogram of the uniqueness field
#

counts, edges, other = dataset.histogram_values("uniqueness", bins=50)

plot_hist(counts, edges)
plt.show(block=False)
```

## Schema

You can use the schema() aggregation to extract the names and types of the attributes of a specified embedded document field across all samples in a collection.

Schema aggregations are useful for detecting the presence and types of dynamic attributes of Label fields across a collection.

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Extract the names and types of all dynamic attributes on the
# ground_truth detections
print(dataset.schema("ground_truth.detections", dynamic_only=True))
# {
#     'area': <fiftyone.core.fields.FloatField object at 0x7fc94015fb50>,
#     'iscrowd': <fiftyone.core.fields.FloatField object at 0x7fc964869fd0>,
# }
```


## Sum values

You can use the sum() aggregation to compute the sum of the (non-None) values of a field in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the average confidence of detections in the predictions field
print(
    dataset.sum("predictions.detections.confidence")
    / dataset.count("predictions.detections.confidence")
)
# 0.34994137249820706
```

## Mean values

You can use the mean() aggregation to compute the arithmetic mean of the (non-None) values of a field in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the average confidence of detections in the predictions field
print(dataset.mean("predictions.detections.confidence"))
# 0.34994137249820706
```

## Quantiles

You can use the quantiles() aggregation to compute the quantile(s) of the (non-None) values of a field in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute quantiles of the uniqueness field
print(dataset.quantiles("uniqueness", [0.25, 0.5, 0.75, 0.9]))
# [0.22027, 0.33771, 0.62554, 0.69488]

# Compute quantiles of the detection confidences in the predictions field
quantiles = dataset.quantiles(
    "predictions.detections.confidence",
    [0.25, 0.5, 0.75, 0.9],
)
print(quantiles)
# [0.09231, 0.20251, 0.56273, 0.94354]
```

## Standard deviation

You can use the std() aggregation to compute the standard deviation of the (non-None) values of a field in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Compute the standard deviation of the confidence of detections in the
# predictions field
print(dataset.std("predictions.detections.confidence"))
# 0.3184061813934825
```

## Values

You can use the values() aggregation to extract a list containing the values of a field across all samples in a collection:

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Extract the uniqueness values for all samples
uniqueness = dataset.values("uniqueness")
print(len(uniqueness))  # 200

# Extract the labels for all predictions
labels = dataset.values("predictions.detections.label")
print(len(labels))  # 200
print(labels[0])  # ['bird', ..., 'bear', 'sheep']
```

Note

Unlike other aggregations, values() does not automatically unwind list fields, which ensures that the returned values match the potentially-nested structure of the documents.

You can opt in to unwinding specific list fields using the [] syntax, or you can pass the optional unwind=True parameter to unwind all supported list fields. See Aggregating list fields for more information.

## Advanced usage

### Aggregating list fields

Aggregations that operate on scalar fields can also be applied to the elements of list fields by appending [] to the list component of the field path.

The example below demonstrates this capability:

```python
import fiftyone as fo

dataset = fo.Dataset()
dataset.add_samples(
    [
        fo.Sample(
            filepath="/path/to/image1.png",
            keypoints=fo.Keypoint(points=[(0, 0), (1, 1)]),
            classes=fo.Classification(
                label="cat", confidence=0.9, friends=["dog", "squirrel"]
            ),
        ),
        fo.Sample(
            filepath="/path/to/image2.png",
            keypoints=fo.Keypoint(points=[(0, 0), (0.5, 0.5), (1, 1)]),
            classes=fo.Classification(
                label="dog",
                confidence=0.8,
                friends=["rabbit", "squirrel"],
            ),
        ),
    ]
)

#
# Count the number of keypoints in the dataset
#
# The points list attribute is declared on the Keypoint class, so it is
# automatically unwound
#

count = dataset.count("keypoints.points")
print(count)
# 5

#
# Count the values in the friends field of the classes field
#
# The friends list attribute is a dynamic custom attribute, so we must
# explicitly request that it be unwound
#

counts = dataset.count_values("classes.friends[]")
print(counts)
# {'dog': 1, 'squirrel': 2, 'rabbit': 1}
```

Note

FiftyOne will automatically unwind all array fields that are defined in the dataset’s schema without requiring you to explicitly specify this via the [] syntax. This includes the following cases:

Top-level list fields: When you write an aggregation that refers to a top-level list field of a dataset; i.e., list_field is automatically coerced to list_field[], if necessary.

Frame fields: When you write an aggregation that refers to a frame-level field of a video dataset; i.e., frames.classification.label is automatically coerced to frames[].classification.label, if necessary.

Embedded list fields: When you write an aggregation that refers to a list attribute that is declared on a Sample, Frame, or Label class, such as the Classification.tags, Detections.detections, or Keypoint.points attributes; i.e., ground_truth.detections.label is automatically coerced to ground_truth.detections[].label, if necessary.

### Aggregating expressions

All aggregations support performing more complex computations on fields via the optional expr argument, which allows you to specify a ViewExpression defining an arbitrary transformation of the field you're operating on prior to aggregating.

The following examples demonstrate the power of aggregating with expressions:

The code sample below computes some statistics about the number of predicted objects in a dataset:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Expression that computes the number of predicted objects
num_objects = F("predictions.detections").length()

# The (min, max) number of predictions per sample
print(dataset.bounds(num_objects))

# The average number of predictions per sample
print(dataset.mean(num_objects))

# Two equivalent ways of computing the total number of predictions
print(dataset.sum(num_objects))
print(dataset.count("predictions.detections"))
```

The code sample below computes some statistics about predicted object labels after doing some normalization:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

ANIMALS = [
    "bear", "bird", "cat", "cow", "dog", "elephant", "giraffe",
    "horse", "sheep", "zebra",
]

# Expression that replaces all animal labels with "animal" and then
# capitalizes all labels
normed_labels = F("predictions.detections.label").map_values(
    {a: "animal" for a in ANIMALS}
).upper()

# A histogram of normalized predicted labels
print(dataset.count_values(normed_labels))
```

The code sample below computes some statistics about the sizes of the ground truth bounding boxes in a dataset, in pixels:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()

# Expression that computes the area of a bounding box, in pixels
# Bboxes are in [top-left-x, top-left-y, width, height] format
bbox_width = F("bounding_box")[2] * F("$metadata.width")
bbox_height = F("bounding_box")[3] * F("$metadata.height")
bbox_area = bbox_width * bbox_height

# Expression that computes the areas of the ground truth bboxes
gt_areas = F("ground_truth.detections[]").apply(bbox_area)

# Compute the (min, max) and mean areas of the ground truth bounding boxes
print(dataset.bounds(gt_areas))
print(dataset.mean(gt_areas))
```

Note

When aggregating expressions, field names may contain list fields, and such field paths are handled as explained above.

However, there is one important exception when expressions are involved: field paths that end in array fields are not automatically unwound; you must specify that they should be unwound by appending []. This change in default behavior allows for the possibility that the ViewExpression you provide is intended to operate on the array as a whole.

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Counts the number of predicted objects
# Here, predictions.detections is treated as predictions.detections[]
print(dataset.count("predictions.detections"))

# Counts the number of predicted objects with confidence > 0.9
# Here, predictions.detections is not automatically unwound
num_preds = F("predictions.detections").filter(F("confidence") > 0.9).length()
print(dataset.sum(num_preds))

# Computes the (min, max) bounding box area in normalized coordinates
# Here we must manually specify that we want to unwind the terminal list field
# ground_truth.detections by appending []
bbox_area = F("bounding_box")[2] * F("bounding_box")[3]
print(dataset.bounds(F("ground_truth.detections[]").apply(bbox_area)))
```


### Batching aggregations

Rather than computing a single aggregation by invoking methods on a Dataset or DatasetView object, you can also instantiate an Aggregation object directly. In this case, the aggregation is not tied to any dataset or view; it is defined only by its parameters, such as the name of the field on which to operate.

```python
import fiftyone as fo

# will count the number of samples in a dataset
sample_count = fo.Count()

# will count the labels in a ground_truth detections field
count_values = fo.CountValues("ground_truth.detections.label")

# will compute a histogram of the uniqueness field
histogram_values = fo.HistogramValues("uniqueness", bins=50)
```

Instantiating aggregations in this way allows you to execute multiple aggregations on a dataset or view efficiently in a batch via aggregate():

```python
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

results = dataset.aggregate([sample_count, count_values, histogram_values])

print(results[0])
# 200

print(results[1])
# {'bowl': 15, 'scissors': 1, 'cup': 21, ..., 'vase': 1, 'sports ball': 3}

print(results[2][0])  # counts
# [0, 0, 0, ..., 15, 12, ..., 0, 0]

print(results[2][1])  # edges
# [0.0, 0.02, 0.04, ..., 0.98, 1.0]
```

### Transforming data before aggregating

You can use view stages like map_labels() in concert with aggregations to efficiently compute statistics on your datasets.

For example, suppose you would like to compute the histogram of the labels in a dataset with certain labels grouped into a single category. You can use map_labels() + count_values() to succinctly express this:

```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Map cat and dog to pet
labels_map = {"cat": "pet", "dog": "pet"}

counts = (
    dataset
    .map_labels("ground_truth", labels_map)
    .count_values("ground_truth.detections.label")
)

print(counts)
# {'toothbrush': 2, 'train': 5, ..., 'pet': 31, ..., 'cow': 22}
```

Or, suppose you would like to compute the average confidence of a model's predictions, ignoring any values less than 0.5. You can use filter_labels() + mean() to succinctly express this:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

avg_conf = (
    dataset
    .filter_labels("predictions", F("confidence") >= 0.5)
    .mean("predictions.detections.confidence")
)

print(avg_conf)
# 0.8170506501060617
```

### Aggregating frame labels

You can compute aggregations on the frame labels of a video dataset by adding the frames prefix to the relevant frame field name:

```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-video")

# Count the number of video frames
count = dataset.count("frames")
print(count)
# 1279

# Compute a histogram of per-frame object labels
counts = dataset.count_values("frames.detections.detections.label")
print(counts)
# {'person': 1108, 'vehicle': 7511, 'road sign': 2726}
```