pandas-style queries in FiftyOne¶
Overview¶
pandas is a Python library for data analysis. The central object in pandas is the DataFrame, a two-dimensional labeled data structure that handles tabular data. pandas is optimized for storing, manipulating, and analyzing tabular data, making it useful for a wide variety of data science, data engineering, and machine learning tasks.
FiftyOne is an open-source Python library for building high-quality datasets and computer vision models. The central object in FiftyOne is the Dataset, which allows for efficient handling of datasets consisting of images, videos, geospatial, or 3D data, as well as the corresponding metadata and labels associated with the media (which are often more complex than what can be represented in a two-dimensional data structure).
While they apply to different types of data, the pandas DataFrame and FiftyOne Dataset classes share many similar functionalities. In this overview, we’ll present a side-by-side comparison of common operations in the two libraries.
If you’re already a pandas power user, then you’ll be a FiftyOne power user too after running through this tutorial!
Getting started¶
The first thing to do is to install FiftyOne:
[ ]:
!pip install fiftyone
Then we will import pandas and FiftyOne:
[2]:
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F # For handling expressions in matching and filtering
[3]:
import numpy as np
import pandas as pd
In this tutorial, we will download example data for illustrative purposes. Before doing so, we demonstrate how to create empty pd.DataFrame and fo.Dataset objects.
Create empty¶
Create empty pd.DataFrame¶
[4]:
empty_df = pd.DataFrame()
We can get basic information about the DataFrame using the info method:
[5]:
empty_df.info
[5]:
<bound method DataFrame.info of Empty DataFrame
Columns: []
Index: []>
We can also give the DataFrame object a name:
[6]:
empty_df.name = 'empty_df'
Create empty fo.Dataset¶
We can similarly create a Dataset object by calling the fo.Dataset() constructor without any arguments:
[7]:
empty_dataset = fo.Dataset()
We can get basic info about the Dataset object using print():
[8]:
print(empty_dataset)
Name: 2022.11.18.18.14.41
Media type: None
Num samples: 0
Persistent: False
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
We can see a few things:

1. Calling fo.Dataset() without an input name resulted in a name being autogenerated based on the time of creation.
2. Whereas the empty pandas DataFrame has a (trivial) Index, the initialized FiftyOne Dataset has empty Tags (accessible via dataset.tags), and each entry, called a Sample, has predefined fields, including id and filepath. These are necessary for properly accessing and addressing the samples, as the Dataset stores pointers to the media files, not the media objects themselves.
If we wanted to name an existing Dataset, we could do so in analogous fashion to pandas:
[9]:
empty_dataset.name = "empty-dataset"
[10]:
print(empty_dataset)
Name: empty-dataset
Media type: None
Num samples: 0
Persistent: False
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
Alternatively, if we want to initialize the dataset with a name, we can pass a name in:
[11]:
empty_dataset = fo.Dataset('empty-ds')
Example data¶
For the rest of this tutorial, we will use the following example data:
Iris Dataset¶
[12]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
[14]:
df.columns
[14]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
FiftyOne Quickstart Data¶
[ ]:
ds = foz.load_zoo_dataset("quickstart")
[16]:
print(ds)
Name: quickstart
Media type: image
Num samples: 200
Persistent: True
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
Basics¶
Head and tail¶
To start to get a feel for the data, we might want to inspect a few entries. For instance, we might want to look at the first few entries, or the last few entries. In both pandas and FiftyOne, these can be accomplished with the head() and tail() methods, which have identical syntax.
Head¶
[17]:
df.head(5)
[17]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[18]:
first_few_samples = ds.head()
Running DataFrame.head(n) returns the first \(n\) rows of the original DataFrame; running Dataset.head(5), for instance, returns the first five samples of the original Dataset.

In a pandas DataFrame, two-dimensional tabular data is represented in rows and columns. Analogously, a FiftyOne Dataset consists of samples and fields. More explicitly:
| pandas DataFrame | FiftyOne Dataset |
|---|---|
| Row | Sample |
| Column | Field |
In pandas, we expect that a fixed set of columns, each representing a different feature, suffices to represent the data. Some rows might not have values for each column, but each row has the same schema. This is ideal for dealing with a wide variety of data, from housing prices to time series predictions.
FiftyOne is built for dealing with the unstructured data often encountered in computer vision applications. As such, a FiftyOne Dataset does not assume such a uniform schema. As an example, consider the predictions field of ds. This field consists of a list of Detection objects, each of which has its own label, bounding box, and confidence score, representing a model’s predictions for detected objects in the image corresponding to the sample. Not all images are guaranteed to contain the same number of predicted objects, so it is preferable for samples to be more flexible than the rows in a DataFrame!
Tail¶
To get the last \(n\) entries (rows or samples), we can use the tail(n) method:
[19]:
df.tail(5)
[19]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
[20]:
last_few_samples = ds.tail()
First and last¶
If we only want the first sample in a Dataset, we can use the first() method, which is equivalent to ds.head()[0]:
[21]:
first_sample = ds.first()
Similarly, if we only want the last sample, we can use the last() method, which is equivalent to ds.tail()[-1]:
[22]:
last_sample = ds.last()
Get single element¶
In pandas, if we want to get the element at index \(j\) in a DataFrame, we can employ the loc[j] or iloc[j] functionality, depending on our usage. For instance,
[23]:
j = 10
[24]:
df.loc[j]
[24]:
sepal_length 5.4
sepal_width 3.7
petal_length 1.5
petal_width 0.2
species setosa
Name: 10, dtype: object
In FiftyOne, we can achieve the same functionality of picking out the \(j^{th}\) sample by running:
[25]:
sample = ds.skip(j).first()
However, in many cases, one is more interested in extracting samples based on their sample ID or filepath. In these cases, the syntactic sugar mirrors pandas: both sample = ds[id] and sample = ds[filepath] achieve the desired result.
[26]:
filepath = sample.filepath
print(ds[filepath].id == sample.id)
True
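The same pattern works with sample IDs. As a quick check, using the sample from above:
[ ]:
sample_id = sample.id
print(ds[sample_id].filepath == sample.filepath)  # True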
Number of rows/samples¶
We can get the number of samples in a fo.Dataset just the same as we would get the number of rows in a pd.DataFrame object: by passing it to Python’s len() function.
[27]:
len(df)
[27]:
150
[28]:
len(ds)
[28]:
200
There are \(150\) flowers in the Iris dataset, and \(200\) images in our FiftyOne Quickstart dataset.
Getting columns/field schema¶
In pandas, where all rows in a DataFrame share the same columns, we can get the names of the columns with the DataFrame.columns property.
[29]:
df.columns
[29]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
In FiftyOne, the core field schema is shared among samples, but the structure within these first-level fields can vary. We can get the field schema by calling the get_field_schema() method.
[30]:
ds.get_field_schema()
[30]:
OrderedDict([('id', <fiftyone.core.fields.ObjectIdField at 0x2a0a65a90>),
('filepath', <fiftyone.core.fields.StringField at 0x2a0a5b2b0>),
('tags', <fiftyone.core.fields.ListField at 0x2a0a8c460>),
('metadata',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c100>),
('ground_truth',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a651f0>),
('uniqueness', <fiftyone.core.fields.FloatField at 0x2a0a8cd90>),
('predictions',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c1f0>),
('eval_tp', <fiftyone.core.fields.IntField at 0x2a0a8cf40>),
('eval_fp', <fiftyone.core.fields.IntField at 0x2a0a8cf70>),
('eval_fn', <fiftyone.core.fields.IntField at 0x2a0a78550>),
('abstractness',
<fiftyone.core.fields.FloatField at 0x2a0a78580>),
('new_const_field',
<fiftyone.core.fields.IntField at 0x2a0a785b0>),
('computed_field',
<fiftyone.core.fields.IntField at 0x2a0a785e0>)])
In video tasks, get_field_schema is replaced by get_frame_field_schema().
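For example, a minimal sketch, assuming a video dataset named "my-videos" already exists on your machine:
[ ]:
video_ds = fo.load_dataset("my-videos")  # hypothetical video dataset
print(video_ds.get_frame_field_schema())  # per-frame fields, analogous to above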
Some of the field types, such as FloatField (float) and StringField (string), correspond in straightforward fashion to data types in pandas, or in Python more generally. As we will see below, the EmbeddedDocumentField, which does not have a perfect analog in pandas, is part of what gives the FiftyOne Dataset its powerful flexibility for tackling computer vision tasks.
If we just want the field names for all samples in the dataset, we can do the following:
[31]:
field_names = list(ds.get_field_schema().keys())
print(field_names)
['id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field']
All values in a column/field¶
In pandas, the entries in each column (a pd.Series object) must all share a single dtype, typically one of the numpy data types. Thus, when all of the values in a column are extracted, the resulting list has depth one:
[33]:
col = "sepal_length"
sepal_lengths = df[col].tolist()
print(sepal_lengths[:10])
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
FiftyOne supports this functionality as well. For instance, each image in our dataset has a uniqueness score, which is a measure of how unique a given image is in the context of the complete dataset. We can extract these values for each image using the values() method as follows:
[34]:
uniqueness = ds.values("uniqueness")
print(uniqueness[:10])
[0.8175834390151201, 0.6844698885072961, 0.725267119762334, 0.7164587220038886, 0.6874799405473135, 0.6773349111042449, 0.6948791555330056, 0.6157872732023304, 0.6692531238595459, 0.7257486965960712]
Some of the relevant information for computer vision tasks, however, is less structured. In our example dataset, this is the case for both the ground_truth and predictions fields, each of which contains a number of object detections in the embedded detections field. The values method also gives us access to these embedded fields.
Let’s see this in action by using the values method to pull out the confidence score for each predicted detection:
[35]:
pred_confs = ds.values("predictions.detections.confidence")
[36]:
print(type(pred_confs))
print(len(pred_confs))
print(type(pred_confs[0]))
<class 'list'>
200
<class 'list'>
As with values("uniqueness"), we get a list with one result per image. However, now we have a sublist for each image, rather than just a single value. We can peek inside one of these sublists at the confidence scores for each detection:
[37]:
print(pred_confs[0])
[0.9750854969024658, 0.759726881980896, 0.6569182276725769, 0.2359301745891571, 0.221974179148674, 0.1965726613998413, 0.18904592096805573, 0.11480894684791565, 0.11089690029621124, 0.0971052274107933, 0.08403241634368896, 0.07699568569660187, 0.058097004890441895, 0.0519101656973362]
Let’s get the lengths of these sublists and print the first few. In the section on expressions, we will see a more natural (and efficient) way of performing this operation; a quick preview follows below.
[38]:
pred_conf_lens = [len(p) for p in pred_confs]
print(pred_conf_lens[:10])
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
We can see that the number of confidence scores - and correspondingly the number of predictions - for each image is not fixed. This scenario is fairly typical in object detection tasks, where images can have varying numbers of objects!
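As that quick preview, the values() method also accepts a ViewExpression, so the same counts can be computed without an explicit Python loop:
[ ]:
# Counts the detections per sample, matching the sublist lengths above
pred_conf_lens = ds.values(F("predictions.detections").length())
print(pred_conf_lens[:10])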
View stages¶
Making a copy¶
Suppose we want to make a copy of the original data and modify the copy without the changes propagating back to the original.
In pandas, we can do this with the copy method:
[39]:
copy_df = df.copy()
copy_df['species'] = 'none'
df.head()
[39]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In FiftyOne, we can do this with the clone() method:
[40]:
copy_ds = ds.clone()
copy_ds.name = 'copy_ds'
print(ds.name)
quickstart
Slicing¶
In pandas, if we want to get a slice of a DataFrame, we can do so with the notation df[start:end].
[41]:
start = 10
end = 14
[42]:
df[start:end]
[42]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
In FiftyOne, a Dataset can be sliced using the same notation:
[43]:
ds[start:end]
[43]:
Dataset: quickstart
Media type: image
Num samples: 4
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Skip(skip=10)
2. Limit(limit=4)
However, as we can see from the output of the preceding command, this is merely syntactic sugar for the expression:
[44]:
ds.skip(start).limit(end - start)
[44]:
Dataset: quickstart
Media type: image
Num samples: 4
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Skip(skip=10)
2. Limit(limit=4)
Get random samples¶
When working with datasets, it is often the case that one might want to select a random set of samples. One typically wants either (a) a fixed number of random samples, or (b) to sample some fraction of the data randomly. We will show how to do both:
Select \(k\) random samples¶
[45]:
k = 20
In pandas, you can use the sample() method, passing in either a number, as in sample(n=k), or a fraction, as we show below.
[46]:
rand_samples_df = df.sample(n=k)
[47]:
rand_samples_df.head()
[47]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
101 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
79 | 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
In FiftyOne, we can use the take() method, to which we can pass a random seed; otherwise, the random number generator is seeded with the current time.
[48]:
rand_samples_ds = ds.take(k, seed=123)
[49]:
rand_samples_ds
[49]:
Dataset: quickstart
Media type: image
Num samples: 20
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Take(size=20, seed=123)
With the random utils in FiftyOne, you can also sample flexibly with user-input weighting schemes, but that is beyond the present scope.
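For instance, a minimal sketch using the fiftyone.utils.random module:
[ ]:
import fiftyone.utils.random as four

# Tag roughly 80% of the samples "train" and 20% "val"
four.random_split(ds, {"train": 0.8, "val": 0.2}, seed=51)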
Randomly select fraction \(p<1\) of samples¶
[50]:
p = 0.05
[51]:
df.sample(frac=p).head()
[51]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
58 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
90 | 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
[52]:
# We need to convert from fraction p to an integer k
k = int(len(ds) * p)
ds.take(k, seed=123)
[52]:
Dataset: quickstart
Media type: image
Num samples: 10
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Take(size=10, seed=123)
Shuffle data¶
In a similar vein to randomly selecting samples, one might want to create a new view in which the entire dataset is shuffled.
In pandas, we can accomplish this by randomly sampling all the rows (\(\mathrm{frac}=1\)) without replacement:
[53]:
shuffled_df_view = df.sample(frac=1)
In FiftyOne, we can just call the shuffle() method:
[54]:
shuffled_ds_view = ds.shuffle(seed=123)
Filtering¶
It is also quite natural to want to filter the data based on some condition. For the Iris data, for instance, let’s get all of the flowers that have a sepal length greater than seven:
[55]:
sepal_length_thresh = 7
large_sepal_len_view = df[df.sepal_length > sepal_length_thresh]
[56]:
print(len(large_sepal_len_view))
print(large_sepal_len_view.head())
12
sepal_length sepal_width petal_length petal_width species
102 7.1 3.0 5.9 2.1 virginica
105 7.6 3.0 6.6 2.1 virginica
107 7.3 2.9 6.3 1.8 virginica
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
In FiftyOne, we can perform an analogous filtering operation on the quickstart images, using the match() method and the ViewField to select all images that have a “uniqueness” score above some threshold:
[57]:
unique_thresh = 0.75
unique_view = ds.match(F("uniqueness") > unique_thresh)
print(unique_view)
print("values: ", unique_view.values("uniqueness"))
Dataset: quickstart
Media type: image
Num samples: 8
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$gt': [...]}})
values: [0.8175834390151201, 1.0, 0.922046961894074, 0.799848556973409, 0.7806850524560267, 0.7950646615140298, 0.7505336395700778, 0.7530639609974709]
However, in FiftyOne, given the potentially nested structure of the data in a Dataset, we can perform far more complex filtering operations using the same machinery, combined with the filter() method. Crucially, these matching and filtering operations apply equally well to embedded fields.
As an example, let’s say we want to filter for all images in our dataset that had at least one object prediction with very high confidence. In this case, the confidence score is an embedded field within the predicted detections for each image. Thus, we can create a filter on confidence scores, and then apply this filter to the embedded detections field within predictions:
[58]:
high_conf_filter = F("confidence") > 0.995
high_conf_view = ds.match(
F("predictions.detections").filter(high_conf_filter).length() > 0
)
[59]:
high_conf_view
[59]:
Dataset: quickstart
Media type: image
Num samples: 116
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$gt': [...]}})
For video tasks, the method match_frames() allows one to perform filtering on the frames of a video collection.
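As a sketch, assuming a video dataset whose frames have a Detections field named "detections":
[ ]:
video_ds = fo.load_dataset("my-videos")  # hypothetical video dataset
# Keep only the frames that contain at least one detection
frames_view = video_ds.match_frames(F("detections.detections").length() > 0)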
We explore this filtering and matching machinery a little more in the section on expressions, but a comprehensive discussion will be the subject of an upcoming tutorial.
Sorting¶
We might also want to sort by certain properties. Let’s see how that is done in pandas and FiftyOne.
In pandas, we use the sort_values method.
Suppose that we want to sort by petal length. We can do this as follows:
[60]:
petal_length_view = df.sort_values(by="petal_length", ascending=False)
[61]:
petal_length_view.head()
[61]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
105 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
In FiftyOne, we use the sort_by() method. Let’s sort the samples by the number of “ground truth” objects in the sample images:
[62]:
field = "ground_truth.detections"
view = ds.sort_by(F(field).length(), reverse=True)
[63]:
print(len(view.first().ground_truth.detections)) # 39
print(len(view.last().ground_truth.detections)) # 0
39
0
Now we can see that the most crowded image has \(39\) objects, while the least crowded image is actually empty!
Deleting¶
If we are resource-constrained, we can delete old DataFrame or Dataset objects so that they no longer occupy memory.
In pandas, we do this using the del statement and the garbage collector. To delete the petal_length_view view, we can do the following:
[64]:
import gc
del petal_length_view
gc.collect()
[64]:
16
In FiftyOne, we can use the built-in delete() method:
[65]:
copy_ds.delete()
It is also worth mentioning that in FiftyOne, a non-persistent Dataset only lasts for the current session: it is deleted after closing Python (this is true in both Python interpreters and notebooks). If you want to use the dataset in the future, you can avoid this end-of-session deletion by setting the persistent property to True:
[66]:
ds.persistent = True
Aggregations¶
Given a set of values for a column or field, it is often desired to compute aggregate quantities over all of these values. pandas DataFrame objects and FiftyOne Dataset objects both come with this functionality built in.
The general syntax is that in pandas, aggregations are methods of pd.Series objects, which represent the columns in a DataFrame. In FiftyOne, aggregations are methods of the Dataset or DatasetView object, which take as input the field to be aggregated over.
Count¶
In both pandas and FiftyOne, the count() method returns the total number of occurrences.
In pandas, this counts the number of non-null values in the column, which here (with no missing values) equals the number of rows in the DataFrame:
[67]:
print(df['species'].count())
print(len(df))
150
150
In FiftyOne, the count method returns the total number of occurrences of a given field, which is not necessarily the same as the number of samples.
[68]:
num_predictions = ds.count("predictions.detections.label")
print(len(ds))
print(num_predictions)
200
5620
Sum¶
Both pandas and FiftyOne have the sum() method:
[69]:
sum_sepal_lengths = df.sepal_length.sum()
print(sum_sepal_lengths)
876.5
[70]:
sum_pred_confs = ds.sum("predictions.detections.confidence")
print(sum_pred_confs)
1966.6705134399235
Unique¶
In pandas, the unique method returns an array of all unique values in the input pd.Series.
[71]:
df.species.unique()
[71]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In FiftyOne, the distinct() method reproduces this functionality.
[72]:
rand_samples_ds.distinct("predictions.detections.label")
[72]:
['banana',
'bed',
'bench',
'bicycle',
'bird',
'boat',
'book',
'bowl',
'broccoli',
'bus',
'cake',
'car',
'carrot',
'cat',
'cell phone',
'chair',
'clock',
'couch',
'cow',
'cup',
'dining table',
'dog',
'elephant',
'fire hydrant',
'fork',
'frisbee',
'giraffe',
'handbag',
'horse',
'keyboard',
'kite',
'knife',
'laptop',
'person',
'pizza',
'sandwich',
'scissors',
'sheep',
'skateboard',
'skis',
'snowboard',
'spoon',
'sports ball',
'stop sign',
'surfboard',
'tie',
'traffic light',
'train',
'truck',
'tv',
'umbrella']
Bounds¶
In pandas, you compute the minimum and maximum value of a pd.Series separately:
[73]:
min_sepal_len = df.sepal_length.min()
max_sepal_len = df.sepal_length.max()
print("min_sepal_len: {}, max_sepal_len: {}".format(min_sepal_len, max_sepal_len))
min_sepal_len: 4.3, max_sepal_len: 7.9
When working with a FiftyOne Dataset or DatasetView, the min and max are returned together in a tuple when the bounds() method is called on a field:
[74]:
(min_pred_conf, max_pred_conf) = ds.bounds("predictions.detections.confidence")
print("min_pred_conf: {}, max_pred_conf: {}".format(min_pred_conf, max_pred_conf))
min_pred_conf: 0.05003104358911514, max_pred_conf: 0.9999035596847534
Mean¶
Both pandas DataFrame objects and FiftyOne Dataset objects employ the mean() method:
[75]:
mean_sepal_len = df.sepal_length.mean()
print(mean_sepal_len)
5.843333333333334
[76]:
mean_pred_conf = ds.mean("predictions.detections.confidence")
print(mean_pred_conf)
0.34994137249820706
Standard deviation¶
Both pandas DataFrame objects and FiftyOne Dataset objects employ the std() method:
[77]:
std_sepal_len = df.sepal_length.std()
print(std_sepal_len)
0.828066127977863
[78]:
std_pred_conf = ds.std("predictions.detections.confidence")
print(std_pred_conf)
0.3184061813934825
Quantiles¶
If you don’t want just the mean, but instead want the value of a given column or field at arbitrary points in its distribution, you can compute quantiles: pandas provides the quantile() method and FiftyOne the quantiles() method, each of which accepts a list of quantiles in [0, 1].
[79]:
percentiles = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
[80]:
sepal_len_quantiles = df.sepal_length.quantile(percentiles)
print(sepal_len_quantiles)
0.0 4.30
0.2 5.00
0.4 5.60
0.6 6.10
0.8 6.52
1.0 7.90
Name: sepal_length, dtype: float64
[81]:
pred_conf_quantiles = ds.quantiles("predictions.detections.confidence", percentiles)
print(pred_conf_quantiles)
[0.05003104358911514, 0.08101843297481537, 0.14457139372825623, 0.2922309935092926, 0.6890143156051636, 0.9999035596847534]
Median and other aggregations¶
Some aggregations that are native to pandas, such as computing the median, are not native to FiftyOne. In these cases, the canonical way to compute the aggregation is to first extract the values from the Dataset field, and then use native numpy or scipy functionality.
Here we illustrate this procedure for computing the median. If we use the values method on the predictions.detections.confidence field with default arguments, we get a jagged array:
[82]:
pred_confs_jagged = ds.values("predictions.detections.confidence")
print([len(pc) for pc in pred_confs_jagged][:10])
print(sum([len(pc) for pc in pred_confs_jagged]))
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
5620
However, we can simplify our lives by flattening the result, passing in the argument unwind=True:
[83]:
pred_confs_flat = ds.values("predictions.detections.confidence", unwind=True)
print(len(pred_confs_flat))
5620
And from this we can easily compute the median:
[84]:
pred_confs_median = np.median(pred_confs_flat)
print(pred_confs_median)
0.20251326262950897
Structural change operations¶
Add new column/field¶
There are many scenarios in which one might want to add another column/field to a dataset. From a practical standpoint, these come in three distinct flavors:

1. Add a new column/field with a default (constant) value for each row/sample.
2. Add a new column/field defined with external or already computed data.
3. Create a new column/field programmatically from other columns/fields.

In this section, we show how to efficiently handle each of these cases in pandas and FiftyOne.
Add new column/field with default value¶
In pandas, the easiest way to create a new column const_col with constant value const_val is:
[85]:
df['const_col'] = 'const_val'
df.head()
[85]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col |
|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val |
which implicitly broadcasts the single value const_val to all rows in the DataFrame.
In FiftyOne, the canonical process for efficiently creating and populating a new field involves three steps:

1. Add a new field to the Dataset using the add_sample_field() method, as add_sample_field(field_name, ftype).
2. Populate the field, using either set_field() or set_values(), as we will illustrate below.
3. Save the changes to the Dataset or DatasetView using save().
There is one key distinction in usage between set_field and set_values. Whereas set_values sets the values on the Dataset directly, using set_field creates a new DatasetView, and this DatasetView is what must be saved!
Before illustrating these more efficient approaches, it is worth mentioning that you can also loop through the samples in a Dataset or DatasetView and add or set fields one at a time.
[86]:
for sample in ds.iter_samples(autosave=True):
sample["new_const_field"] = 51
sample["computed_field"] = len(sample.ground_truth.detections)
However, this is not an efficient approach. It is recommended to use set_field or set_values instead.
In the simplest scenario (analogous to the pandas example above), we can pass a single value into set_field along with the name of the field:
[87]:
ds.add_sample_field("const_field", fo.StringField)
view = ds.set_field("const_field", "const_val")
view.save()
print(ds.first().field_names)
print(ds.values("const_field")[:10])
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
['const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val']
As we will see shortly, however, set_field is far more flexible and powerful than this, as a result of FiftyOne’s robust matching and filtering capabilities.
Add new column/field from external data¶
Starting with pandas, suppose that our data team comes to us and tells us that now they also have the stem length for each flower, and they want us to incorporate that data into our models.
For instance, let’s say the stem lengths are:
[88]:
stem_lengths = np.random.uniform(5, 10, len(df))
We can add this into our dataset using a similar syntax as above. The only difference is that this time, the assignment takes in an array (here a numpy array) instead of a single value.
[89]:
df['stem_length'] = stem_lengths
[90]:
df.head()
[90]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 |
In FiftyOne, we can do something similar by passing an array of values into set_values.
As an example, let’s say we have an abstractness score between zero and one for each image.
[91]:
abstractness = np.random.uniform(0, 1, len(ds))
[92]:
ds.set_values("abstractness", abstractness)
print(ds.first().field_names)
print(ds.values("abstractness")[:10])
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
[0.18992196548662132, 0.4195423356383746, 0.9782249923275138, 0.3555547463728417, 0.9019379850096877, 0.3647814428112852, 0.3030278060870243, 0.241988161650587, 0.7872455674533378, 0.44774858997738953]
Note that when using set_values, we are modifying the Dataset directly. Thus, as opposed to set_field, we do not need to preface the method call with add_sample_field, and we do not need to explicitly save the Dataset with save() afterwards.
Add a new column/field from existing columns/fields¶
Finally, often either in the process of feature engineering or data analysis, you want to generate new columns or fields from existing ones.
In pandas, the canonical way of doing this is with the apply method. Suppose we want to create a new feature called “sepal volume”, derived by taking the product of sepal length and sepal width. With apply, we can map row-wise over the columns:
[93]:
df["sepal_volume"] = df.apply(lambda x: x["sepal_length"]*x["sepal_width"], axis=1)
[94]:
df.head()
[94]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 | 18.00 |
In FiftyOne, we can perform operations like this by combining set_field with the ViewField, here loaded as F.
To compute the number of predicted object detections for each sample in the Dataset, we can write:
[95]:
view = ds.set_field(
"predictions.num_predictions",
F("$predictions.detections").length(),
)
view.save()
print(ds.first().predictions.field_names)
print(ds.values("predictions.num_predictions")[:10])
('detections', 'num_predictions')
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
The above also highlights that all of the aforementioned operations work on embedded fields. Note, however, that since we are not changing the base field schema, we do not need to call add_sample_field!
Remove a column/field¶
Sometimes you want to look at a dataset without a certain column/field. More precisely, there are two related things one might want to do:

1. Create a new view of the dataset without a specific column/field, or
2. Delete a specific column/field from the original dataset.

Here, we show how to do both of these in pandas and FiftyOne.
In pandas, you can create a view without specific columns using the drop method:
[96]:
df.head()
[96]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 | 18.00 |
[97]:
no_const_view = df.drop(["const_col"], axis=1)
# equivalent to df.drop(columns=["const_col"])
no_const_view.head()
[97]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
If one wants to delete the column from the original DataFrame, one can simply reassign the original variable to the dropped result (or, equivalently, pass inplace=True to drop):
[98]:
df = df.drop(["const_col"], axis=1)
df.head()
[98]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
In FiftyOne, you can create a DatasetView without a particular field using the exclude_fields() method:
[99]:
no_predictions_view = ds.exclude_fields("predictions")
print(no_predictions_view.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
Alternatively, you can delete a field from the Dataset using delete_sample_field().
[100]:
ds.delete_sample_field("const_field")
print(ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Both the exclude_fields and delete_sample_field methods also work with embedded fields:
[101]:
ds.delete_sample_field("predictions.num_predictions")
print(ds.first().predictions.field_names)
('detections',)
To delete multiple fields at once, you can use the related delete_sample_fields() method.
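For instance, a quick sketch that removes the evaluation counts in one call (applied to a clone so that ds is untouched):
[ ]:
tmp_ds = ds.clone()
tmp_ds.delete_sample_fields(["eval_tp", "eval_fp", "eval_fn"])
print(tmp_ds.first().field_names)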
Keep only specified columns/fields¶
Alternatively, if you only want to create a view with a small subset of columns/fields, it might be easier to specify those directly. As with removing columns, this can be done in a way that creates a new view while preserving the original, or in a way that deletes the columns/fields from the original dataset. We show both approaches below.
In pandas, to create a new view with only the “sepal_length” and “sepal_width” columns, one could write:
[102]:
sepal_df = df[["sepal_length", "sepal_width"]]
sepal_df.head()
[102]:
| | sepal_length | sepal_width |
|---|---|---|
0 | 5.1 | 3.5 |
1 | 4.9 | 3.0 |
2 | 4.7 | 3.2 |
3 | 4.6 | 3.1 |
4 | 5.0 | 3.6 |
In contrast, reassigning the variable itself replaces the stored DataFrame with the narrower selection:
[103]:
sepal_df = sepal_df[["sepal_length"]]
sepal_df.head()
[103]:
| | sepal_length |
|---|---|
0 | 5.1 |
1 | 4.9 |
2 | 4.7 |
3 | 4.6 |
4 | 5.0 |
In FiftyOne, if we want to create a separate view with only specified fields kept, we can first clone the original dataset and then apply the select_fields() method. When we then apply the keep_fields() method, the changes propagate from the DatasetView back to the underlying Dataset.
Let’s create two clones of our base Dataset to showcase this distinction.
[104]:
ds_clone1 = ds.clone()
ds_clone2 = ds.clone()
For both of these clones, let’s create views which select only the ground_truth field:
[105]:
clone1_view = ds_clone1.select_fields("ground_truth")
clone2_view = ds_clone2.select_fields("ground_truth")
print(clone1_view.first().field_names)
print(clone2_view.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
The id, filepath, tags, and metadata fields are preserved by default, even when not passed in to select_fields. Aside from these and ground_truth, all other fields have been omitted from the view. Now let’s apply keep_fields only on the first clone, and see what changes propagate back.
[106]:
clone1_view.keep_fields()
[107]:
print(ds_clone1.first().field_names)
print(ds_clone2.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
As we can see, the changes only propagated back to the dataset (in this case ds_clone1) when we applied keep_fields.
Finally, we note that when dealing with video datasets, the exclude_fields and select_fields methods have frame analogs: exclude_frames() and select_frames().
Concatenation¶
Suppose we have two datasets we want to combine or concatenate.
In both pandas and FiftyOne, we can concatenate them using the concat method.
In pandas, we can combine two DataFrame objects:
[108]:
df1 = df[df.species == 'setosa']
df2 = df[df.species == 'virginica']
concat_df = pd.concat([df1, df2])
print(len(concat_df))
100
In FiftyOne, we can use the concat() method to combine views from the same dataset:
[109]:
view1 = ds.match(F("uniqueness") < 0.2)
view2 = ds.match(F("uniqueness") > 0.7)
[110]:
print(len(view1))
print(len(view2))
19
17
[111]:
concat_view = view1.concat(view2)
print(len(view1) + len(view2))
print(len(concat_view))
36
36
The slightly more complicated operation of concatenating Dataset objects ds1 and ds2 (as opposed to DatasetView objects) can be achieved using merge_samples(), i.e., ds1.merge_samples(ds2).
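A minimal sketch, assuming two hypothetical image directories on disk:
[ ]:
ds1 = fo.Dataset.from_images_dir("/path/to/images1")  # hypothetical paths
ds2 = fo.Dataset.from_images_dir("/path/to/images2")
ds1.merge_samples(ds2)  # ds1 now also contains the samples from ds2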
Adding a single row/sample¶
Oftentimes, we just want to enhance a dataset by adding one sample at a time.
In pandas, the fastest way to do this is to use the same concat method as above. If the row data is in a dictionary format, we convert it to its own DataFrame first, as the sketch after this example shows.
[112]:
len(df1)
[112]:
50
[113]:
single_row = df2.iloc[0]
df1_plus = pd.concat([df1, pd.DataFrame([single_row])])
print(len(df1_plus))
51
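If the row is instead supplied as a dict (the values below are made up), the same pattern applies:
[ ]:
row_dict = {
    "sepal_length": 6.0, "sepal_width": 3.0, "petal_length": 5.0,
    "petal_width": 1.8, "species": "virginica",
}
df1_plus = pd.concat([df1, pd.DataFrame([row_dict])])
print(len(df1_plus))  # 51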
In FiftyOne, we can use the add_sample() method. Notice that this is an in-place operation, and no assignment is needed. Also note that this does not work for views: a sample can only be added to a Dataset, not to a DatasetView. As such, we first clone the view to turn it into its own Dataset.
[114]:
single_sample = view2.first()
view1_plus = view1.clone()
print(len(view1_plus))
view1_plus.add_sample(single_sample)
print(len(view1_plus))
19
20
We can also add a collection of samples to a dataset using the add_samples() method, which takes as input a list of fo.Sample objects.
[115]:
print(len(view1_plus))
view1_plus.add_samples(view2.skip(1).head(3))
print(len(view1_plus))
20
100% |█████████████████████| 3/3 [35.6ms elapsed, 0s remaining, 84.2 samples/s]
23
Remove rows/samples¶
The same in-place vs. out-of-place considerations for pandas, and Dataset vs. DatasetView considerations for FiftyOne, apply to rows/samples just as they did to columns/fields.
In pandas, rows are removed by index using the drop method.
[116]:
### Randomly select a set of rows to remove
import random
rows_to_remove = random.sample(range(len(df)), 10)
To create a new view:
[117]:
sub_df = df.drop(rows_to_remove)
print(len(sub_df))
print(len(df))
140
150
To remove the rows from the original DataFrame (here applied to a copy, so that df is preserved):
[118]:
copy_df = df.copy()
copy_df = copy_df.drop(rows_to_remove)
print(len(copy_df))
140
In FiftyOne, exclude() creates a view without the specified samples:
[119]:
samples_to_remove = ds.take(10)
[120]:
sub_view = ds.exclude(samples_to_remove)
print(len(ds))
print(len(sub_view))
print(type(sub_view))
200
190
<class 'fiftyone.core.view.DatasetView'>
On the other hand, delete_samples() is an in-place operation which deletes the samples from the underlying Dataset:
[121]:
sub_ds = ds.clone()
sub_ds.delete_samples(samples_to_remove)
print(len(sub_ds))
190
Keep only specified rows/samples¶
As with columns/fields, one might want to pick out specific rows/samples. In the section on filtering and expressions, we’ll cover more advanced operations. Here we show how to select the data corresponding to a given list of rows/samples.
[122]:
rows_to_keep = list(random.sample(range(len(df)), 80))
[123]:
sub_df = df.iloc[rows_to_keep]
print(len(sub_df))
80
[124]:
sample_ids = ds.values("id")
ids_to_keep = [sample_ids[ind] for ind in rows_to_keep]
print(len(ids_to_keep))
print(len(ds.select(ids_to_keep)))
80
80
Rename column/field¶
In pandas, you can rename columns by passing a dictionary or mapping into the rename() method with the columns argument. This is not an in-place operation:
[125]:
renamed_df = df.rename(columns={"sepal_length": "sl", "sepal_width": "sw"})
renamed_df.head()
[125]:
| | sl | sw | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
In FiftyOne, you can rename fields using an analogous (but in-place) name mapping, passed in to the rename_sample_fields() method.
[126]:
renamed_ds = ds.clone()
renamed_ds.rename_sample_fields({"ground_truth": "gt", "predictions":"pred"})
print(renamed_ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'gt', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Alternatively, if you just want to rename a single field, you can also do so with the rename_sample_field() method as rename_sample_field(old_field_name, new_field_name):
[127]:
renamed_ds.rename_sample_field("gt", "gt_new")
print(renamed_ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'gt_new', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Both of these methods extend naturally to embedded fields:
[128]:
renamed_ds.first().pred.detections[0].eval_iou
[128]:
0.8575063187115628
[129]:
renamed_ds.rename_sample_field("pred.detections.eval_iou", "pred.detections.iou")
print(renamed_ds.first().pred.detections[0].field_names)
('id', 'attributes', 'tags', 'label', 'bounding_box', 'mask', 'confidence', 'index', 'eval', 'eval_id', 'iou')
Expressions¶
As introduced above, the filter and match methods, along with the ViewField, can be remarkably useful for selecting subsets of datasets that satisfy user-defined conditions. In this section, we demonstrate how to combine these components to perform pandas-style queries.
A common theme throughout this section is that while in pandas, expressions (over a given set of rows) can only be applied to the values in the columns, in FiftyOne, expressions can be applied to fields, including embedded fields, or directly to labels or tags! As such, FiftyOne provides match_labels() and match_tags() methods.
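For instance, sketches using the quickstart fields and tags from above (the 0.9 threshold is arbitrary):
[ ]:
# Samples with at least one highly confident predicted label
labels_view = ds.match_labels(fields="predictions", filter=F("confidence") > 0.9)

# Samples carrying the "validation" tag
tags_view = ds.match_tags("validation")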
Element comparison expressions¶
In both pandas and FiftyOne, the element comparison operators ==, >, <, !=, >=, and <= all conform to the same syntax. The following examples show this functionality.
Exact equality¶
[130]:
setosa_df = df[df.species == "setosa"]
print(len(setosa_df))
50
[131]:
ds.match(F("filepath") == '/root/fiftyone/quickstart/data/000880.jpg')
[131]:
Dataset: quickstart
Media type: image
Num samples: 0
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$eq': [...]}})
Less than or equal to¶
[132]:
short_sepal_cond = df.sepal_length <= 5
short_sepal_df = df[short_sepal_cond]
short_sepal_df.head()
[132]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | 5.066091 | 15.64 |
[133]:
non_unique_filter = F("uniqueness") <= 0.2
non_unique_view = ds.match(non_unique_filter)
non_unique_view
[133]:
Dataset: quickstart
Media type: image
Num samples: 19
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$lte': [...]}})
Logical expressions¶
Logical complement¶
If we have an expression and we want to find all rows/samples that do not satisfy this expression, we can use the complement operator ~. Let’s use this to get the complementary rows/samples to those picked out by the expression above:
[134]:
non_short_sepal_df = df[~short_sepal_cond]
non_short_sepal_df.head()
[134]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 9.171235 | 21.06 |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | 8.236024 | 19.98 |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa | 5.914960 | 23.20 |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa | 6.215238 | 25.08 |
[135]:
unique_view = ds.match(~non_unique_filter)
unique_view
[135]:
Dataset: quickstart
Media type: image
Num samples: 181
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$not': {...}}})
Logical AND¶
In pandas and FiftyOne, the logical AND of two conditions can be evaluated with the & operator:
[136]:
pd_cond1 = (df.sepal_volume < 20)
pd_cond2 = (df.species == "setosa")
print("{} rows satisfy condition1".format(len(df[pd_cond1])))
print("{} rows satisfy condition2".format(len(df[pd_cond2])))
print("{} rows satisfy condition1 AND condition2".format(len(df[pd_cond1 & pd_cond2])))
109 rows satisfy condition1
50 rows satisfy condition2
43 rows satisfy condition1 AND condition2
[137]:
fo_cond1 = F("uniqueness") > 0.4
fo_cond2 = F("uniqueness") < 0.55
print("{} samples satisfy condition1".format(len(ds.match(fo_cond1))))
print("{} samples satisfy condition2".format(len(ds.match(fo_cond2))))
print("{} samples satisfy condition1 AND condition2".format(len(ds.match(fo_cond1 & fo_cond2))))
100 samples satisfy condition1
109 samples satisfy condition2
9 samples satisfy condition1 AND condition2
Additionally, if we want to evaluate the logical AND of a list of conditions, in FiftyOne we can do so using all():
[138]:
fo_cond3 = F("predictions.detections").length() >= 10
print(ds.match(F.all([fo_cond1, fo_cond2, fo_cond3])))
Dataset: quickstart
Media type: image
Num samples: 5
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$and': [...]}})
Logical OR¶
In pandas and FiftyOne, the logical OR of two conditions can be evaluated with the | operator:
[139]:
print("{} rows satisfy condition1".format(len(df[pd_cond1])))
print("{} rows satisfy condition2".format(len(df[pd_cond2])))
print("{} rows satisfy condition1 OR condition2".format(len(df[pd_cond1 | pd_cond2])))
109 rows satisfy condition1
50 rows satisfy condition2
116 rows satisfy condition1 OR condition2
[140]:
print("{} samples satisfy condition1".format(len(ds.match(fo_cond1))))
print("{} samples satisfy condition3".format(len(ds.match(fo_cond3))))
print("{} samples satisfy condition1 OR condition3".format(len(ds.match(fo_cond1 | fo_cond3))))
100 samples satisfy condition1
134 samples satisfy condition3
166 samples satisfy condition1 OR condition3
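These counts obey inclusion-exclusion: the number of samples satisfying A OR B equals the counts for A and B summed, minus the count for A AND B. A quick sketch, assuming the conditions defined above:

# Inclusion-exclusion check: |A or B| == |A| + |B| - |A and B|
n_a = len(ds.match(fo_cond1))
n_b = len(ds.match(fo_cond3))
n_and = len(ds.match(fo_cond1 & fo_cond3))
n_or = len(ds.match(fo_cond1 | fo_cond3))

assert n_or == n_a + n_b - n_and  # 166 == 100 + 134 - 68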
Mirroring our usage of all, in FiftyOne we can use any() to evaluate the logical OR of a list of conditions:
[141]:
print(ds.match(F.any([fo_cond1, fo_cond3])))
Dataset: quickstart
Media type: image
Num samples: 166
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$or': [...]}})
We note that these all and any methods in FiftyOne are distinct from the pandas methods of the same names, which reduce boolean values along an axis of a Series or DataFrame rather than combining a list of query expressions. A pandas-side sketch for combining a list of conditions follows below.
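To combine a list of boolean conditions in pandas, one option is to reduce them with numpy (a sketch; pandas has no direct analogue of F.all()/F.any() for this purpose):

import numpy as np

pd_conds = [df.sepal_volume < 20, df.species == "setosa", df.petal_length < 1.5]

# Logical AND of a list of conditions, analogous to F.all() in FiftyOne
and_mask = np.logical_and.reduce(pd_conds)

# Logical OR of a list of conditions, analogous to F.any() in FiftyOne
or_mask = np.logical_or.reduce(pd_conds)

print(len(df[and_mask]), len(df[or_mask]))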
Subset-superset¶
Is in¶
In pandas, we can check whether the entries in a column are in a given list of values using the isin method:
[142]:
df.species.isin(['setosa', 'versicolor'])
[142]:
0 True
1 True
2 True
3 True
4 True
...
145 False
146 False
147 False
148 False
149 False
Name: species, Length: 150, dtype: bool
In FiftyOne, the analogous method is is_in(). For instance, we can filter our dataset's predictions down to just animal detections with the following:
[143]:
ANIMALS = [
"bear", "bird", "cat", "cow", "dog", "elephant", "giraffe",
"horse", "sheep", "zebra"
]
animal_view = ds.filter_labels("predictions", F("label").is_in(ANIMALS))
print(animal_view)
Dataset: quickstart
Media type: image
Num samples: 87
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. FilterLabels(field='predictions', filter={'$in': ['$$this.label', [...]]}, only_matches=True, trajectories=False)
Additionally, when FiftyOne fields contain lists, we might want to check whether these lists are subsets of other lists. We can do this with the is_subset() method:
[144]:
empty_dataset.add_samples(
[
fo.Sample(
filepath="image1.jpg",
tags=["a", "b", "a", "b"]
)
]
)
print(empty_dataset.values(F("tags").is_subset(["a", "b", "c"])))
100% |█████████████████████| 1/1 [6.3ms elapsed, 0s remaining, 177.5 samples/s]
[True]
Contains¶
We can also flip this operation on its head and ask whether the column/field entries contain something else. In pandas, DataFrame entries are typically scalar values rather than lists, so the most common form of containment is string containment, i.e., checking whether the strings in a column contain a substring:
[145]:
df.species.str.contains("set").sum()
[145]:
50
This has a parallel in FiftyOne: contains_str():
[146]:
ze_view = ds.filter_labels("predictions", F("label").contains_str("ze"))
print(ze_view)
Dataset: quickstart
Media type: image
Num samples: 5
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. FilterLabels(field='predictions', filter={'$regexMatch': {'input': '$$this.label', 'options': None, 'regex': 'ze'}}, only_matches=True, trajectories=False)
On a related note, FiftyOne has other useful string operations, including starts_with() and ends_with().
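For instance, a minimal sketch following the same pattern as contains_str() above:

# Predicted labels that start with "ze" (e.g., "zebra")
starts_view = ds.filter_labels("predictions", F("label").starts_with("ze"))

# Predicted labels that end with "bra"
ends_view = ds.filter_labels("predictions", F("label").ends_with("bra"))

print(starts_view.count(), ends_view.count())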
What’s more, in FiftyOne, where fields themselves can be lists, we can check containment in those lists using the contains() method.
If we want to create a view that contains either cats or dogs, we can do so with:
[147]:
# Only contains samples with "cat" or "dog" predictions
cats_or_dogs_view = ds.match(
F("predictions.detections.label").contains(["cat", "dog"])
)
print(cats_or_dogs_view.count())
39
If instead we want a view of all samples that contain both cats and dogs, we can pass in the all=True argument:
[148]:
# Only contains samples with "cat" and "dog" predictions
cats_and_dogs_view = ds.match(
F("predictions.detections.label").contains(["cat", "dog"], all=True)
)
print(cats_and_dogs_view.count())
10
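For comparison, although list-valued cells are unidiomatic in pandas, an object-dtype column of lists can be queried in a similar spirit with apply; here is a sketch using hypothetical toy data:

import pandas as pd

# Hypothetical object-dtype column holding lists of labels
toy_df = pd.DataFrame({"labels": [["cat"], ["dog", "cat"], ["horse"]]})

# Rows whose list contains "cat" OR "dog"
either = toy_df.labels.apply(lambda ls: bool({"cat", "dog"} & set(ls)))

# Rows whose list contains both "cat" AND "dog"
both = toy_df.labels.apply(lambda ls: {"cat", "dog"} <= set(ls))

print(either.sum(), both.sum())  # 2 1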
Checking data types¶
Numeric and string types¶
In recent versions of pandas, one can check whether the data type of a DataFrame column is numeric or string by importing the corresponding functions:
[149]:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
print(is_numeric_dtype(df.sepal_length))
print(is_string_dtype(df.sepal_length))
True
False
In FiftyOne, these checks are handled by the is_number() and is_string() methods:
[150]:
print(ds.match(F("uniqueness").is_number()).count())
print(ds.match(F("uniqueness").is_string()).count())
200
0
Null¶
In pandas, one checks whether data is null using the isna method:
[151]:
df.isna().any()
[151]:
sepal_length False
sepal_width False
petal_length False
petal_width False
species False
stem_length False
sepal_volume False
dtype: bool
In FiftyOne, the is_null() method does this:
[152]:
null_view = ds.set_field(
"uniqueness",
(F("uniqueness") >= 0.25).if_else(F("uniqueness"), None)
)
# Create view that only contains samples with uniqueness = None
not_unique_view = null_view.match(F("uniqueness").is_null())
print(len(not_unique_view))
92
Because a FiftyOne Dataset can consist of samples with inhomogeneous field schemas, FiftyOne also provides the related method exists(), and its converse is_missing(), which check sample-wise whether a field has a value.
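For instance, a minimal sketch using the null_view from above:

# exists() matches samples whose field is set to a non-None value
exists_view = null_view.match(F("uniqueness").exists())

# is_missing() matches samples from which the field is absent entirely
missing_view = null_view.match(F("uniqueness").is_missing())

print(len(exists_view), len(missing_view))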
Array¶
In FiftyOne, fields can also contain arrays. We can check for this with the is_array() method:
[153]:
ds.match(F("tags").is_array()).count()
[153]:
200
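Since pandas dtypes are declared per column rather than per entry, a rough pandas counterpart is an element-wise type check; a sketch with hypothetical toy data:

import numpy as np
import pandas as pd

# Element-wise check for array-like entries in an object-dtype column
toy_df = pd.DataFrame({"vals": [[1, 2], "a", np.array([3.0])]})
is_arr = toy_df.vals.apply(lambda v: isinstance(v, (list, np.ndarray)))
print(is_arr.tolist())  # [True, False, True]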
Conclusion¶
FiftyOne and pandas are both open source Python libraries that make dealing with your data easy. While they serve different purposes - pandas is built for tabular data, while FiftyOne helps users tackle the unstructured data prevalent in computer vision tasks - their syntax and functionality are closely aligned. Both pandas and FiftyOne are important components of many data science and machine learning workflows!