pandas-style queries in FiftyOne¶
Overview¶
pandas is a Python library for data analysis. The central object in pandas is the DataFrame, a two-dimensional labeled data structure that handles tabular data. pandas is optimized for storing, manipulating, and analyzing tabular data, making it useful for a wide variety of data science, data engineering, and machine learning tasks.
FiftyOne is an open-source Python library for building high-quality datasets and computer vision models. The central object in FiftyOne is the Dataset, which allows for efficient handling of datasets consisting of images, videos, geospatial, or 3D data, as well as the corresponding metadata and labels associated with the media (which are often more complex than what can be represented in a two-dimensional data structure).
While they apply to different types of data, the pandas DataFrame and FiftyOne Dataset classes share many similar functionalities. In this overview, we’ll present a side-by-side comparison of common operations in the two libraries.
If you’re already a pandas power user, then you’ll be a FiftyOne power user too after running through this tutorial!
Getting started¶
The first thing to do is to install FiftyOne:
[ ]:
!pip install fiftyone
Then we will import pandas and FiftyOne:
[2]:
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F # For handling expressions in matching and filtering
[3]:
import numpy as np
import pandas as pd
In this tutorial, we will download example data for illustrative purposes. Before doing so, we demonstrate how to create empty pd.DataFrame and fo.Dataset objects.
Create empty¶
Create empty pd.DataFrame¶
[4]:
empty_df = pd.DataFrame()
We can get basic information about the DataFrame using the info method:
[5]:
empty_df.info
[5]:
<bound method DataFrame.info of Empty DataFrame
Columns: []
Index: []>
We can also give the DataFrame object a name:
[6]:
empty_df.name = 'empty_df'
Create empty fo.Dataset¶
We can similarly create a Dataset object by calling the fo.Dataset() constructor without any arguments:
[7]:
empty_dataset = fo.Dataset()
We can get basic info about the Dataset object using print():
[8]:
print(empty_dataset)
Name: 2022.11.18.18.14.41
Media type: None
Num samples: 0
Persistent: False
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
We can see a few things:

1. Calling fo.Dataset() without an input name resulted in a name being autogenerated based on the time of creation.
2. Whereas the empty pandas DataFrame has a (trivial) Index, the initialized FiftyOne Dataset has empty Tags (accessible via dataset.tags), and each entry, called a Sample, has predefined fields, including id and filepath. These are necessary for properly accessing and addressing the samples, as the Dataset stores pointers to the media files, not the media objects themselves.
If we wanted to name an existing Dataset, we could do so in analogous fashion to pandas:
[9]:
empty_dataset.name = "empty-dataset"
[10]:
print(empty_dataset)
Name: empty-dataset
Media type: None
Num samples: 0
Persistent: False
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
Alternatively, if we want to initialize the dataset with a name, we can pass a name in:
[11]:
empty_dataset = fo.Dataset('empty-ds')
Example data¶
For the rest of this tutorial, we will use the following example data:
Iris Dataset¶
[12]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
[14]:
df.columns
[14]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
FiftyOne Quickstart Data¶
[ ]:
ds = foz.load_zoo_dataset("quickstart")
[16]:
print(ds)
Name: quickstart
Media type: image
Num samples: 200
Persistent: True
Tags: []
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
Basics¶
Head and tail¶
To start to get a feel for the data, we might want to inspect a few entries. For instance, we might want to look at the first few entries, or the last few entries. In both pandas and FiftyOne, these can be accomplished with the head() and tail() methods, which have identical syntax.
Head¶
[17]:
df.head(5)
[17]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[18]:
first_few_samples = ds.head()
Running DataFrame.head(n) returns the first \(n\) rows of the original DataFrame; running Dataset.head(5), for instance, returns the first five samples of the original Dataset.

In a pandas DataFrame, two-dimensional tabular data is represented in rows and columns. Analogously, a FiftyOne Dataset consists of samples and fields. More explicitly:
| pandas DataFrame | FiftyOne Dataset |
|---|---|
| Row | Sample |
| Column | Field |
In pandas, we expect that a fixed set of columns, each representing a different feature, suffices to represent the data. Some rows might not have values for each column, but each row has the same schema. This is ideal for dealing with a wide variety of data, from housing prices to time series predictions.
FiftyOne is built for dealing with the unstructured data often encountered in computer vision applications. As such, a FiftyOne Dataset does not assume such a uniform schema. As an example, consider the predictions field of ds. This field consists of a list of Detection objects, each of which has its own label, bounding box, and confidence score, representing a model’s predictions for detected objects in the image corresponding to the sample. Not all images are guaranteed to contain the same number of predicted objects, so it is preferable for samples to be more flexible than the rows in a DataFrame!
Tail¶
To get the last \(n\) entries (rows or samples), we can use the tail(n) method:
[19]:
df.tail(5)
[19]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
[20]:
last_few_samples = ds.tail()
First and last¶
If we only want the first sample in a Dataset, we can use the first() method, which is equivalent to ds.head()[0]:
[21]:
first_sample = ds.first()
Similarly, if we only want the last sample, we can use the last() method, which is equivalent to ds.tail()[-1]:
[22]:
last_sample = ds.last()
Get single element¶
In pandas, if we want to get the element at index \(j\) in a DataFrame, we can employ the loc[j] or iloc[j] functionality, depending on our usage. For instance,
[23]:
j = 10
[24]:
df.loc[j]
[24]:
sepal_length 5.4
sepal_width 3.7
petal_length 1.5
petal_width 0.2
species setosa
Name: 10, dtype: object
In FiftyOne, we can achieve the same functionality of picking out the \(j^{th}\) sample by running:
[25]:
sample = ds.skip(j).first()
However, in many cases, one is more interested in extracting samples based on their sample ID or filepath. In these cases, the syntactic sugar mirrors pandas: both sample = ds[id] and sample = ds[filepath] achieve the desired result.
[26]:
filepath = sample.filepath
print(ds[filepath].id == sample.id)
True
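The same pattern works with sample IDs. As a quick check, using the sample from above:
[ ]:
sample_id = sample.id
print(ds[sample_id].filepath == sample.filepath)  # True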
Number of rows/samples¶
We can get the number of samples in a fo.Dataset just the same as we would get the number of rows in a pd.DataFrame object: by passing it to Python’s len() function.
[27]:
len(df)
[27]:
150
[28]:
len(ds)
[28]:
200
There are \(150\) flowers in the Iris dataset, and \(200\) images in our FiftyOne Quickstart dataset.
Getting columns/field schema¶
In pandas, where all rows in a DataFrame share the same columns, we can get the names of the columns with the DataFrame.columns property.
[29]:
df.columns
[29]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
In FiftyOne, the core field schema is shared among samples, but the structure within these first-level fields can vary. We can get the field schema by calling the get_field_schema() method.
[30]:
ds.get_field_schema()
[30]:
OrderedDict([('id', <fiftyone.core.fields.ObjectIdField at 0x2a0a65a90>),
('filepath', <fiftyone.core.fields.StringField at 0x2a0a5b2b0>),
('tags', <fiftyone.core.fields.ListField at 0x2a0a8c460>),
('metadata',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c100>),
('ground_truth',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a651f0>),
('uniqueness', <fiftyone.core.fields.FloatField at 0x2a0a8cd90>),
('predictions',
<fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c1f0>),
('eval_tp', <fiftyone.core.fields.IntField at 0x2a0a8cf40>),
('eval_fp', <fiftyone.core.fields.IntField at 0x2a0a8cf70>),
('eval_fn', <fiftyone.core.fields.IntField at 0x2a0a78550>),
('abstractness',
<fiftyone.core.fields.FloatField at 0x2a0a78580>),
('new_const_field',
<fiftyone.core.fields.IntField at 0x2a0a785b0>),
('computed_field',
<fiftyone.core.fields.IntField at 0x2a0a785e0>)])
In video tasks, get_field_schema is replaced by get_frame_field_schema().
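For example, a minimal sketch, assuming a video dataset named "my-videos" already exists on your machine:
[ ]:
video_ds = fo.load_dataset("my-videos")  # hypothetical video dataset
print(video_ds.get_frame_field_schema())  # per-frame fields, analogous to above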
Some of the field types, such as FloatField (float) and StringField (string), correspond in straightforward fashion to data types in pandas, or in Python more generally. As we will see below, the EmbeddedDocumentField, which does not have a perfect analog in pandas, is part of what gives the FiftyOne Dataset its powerful flexibility for tackling computer vision tasks.
If we just want the field names for all samples in the dataset, we can do the following:
[31]:
field_names = list(ds.get_field_schema().keys())
print(field_names)
['id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field']
All values in a column/field¶
In pandas, the entries in each column (a pd.Series object) must all share a single dtype, typically one of the numpy data types. Thus, when all of the values in a column are extracted, the resulting list has depth one:
[33]:
col = "sepal_length"
sepal_lengths = df[col].tolist()
print(sepal_lengths[:10])
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
FiftyOne supports this functionality as well. For instance, each image in our dataset has a uniqueness score, which is a measure of how unique a given image is in the context of the complete dataset. We can extract these values for each image using the values() method as follows:
[34]:
uniqueness = ds.values("uniqueness")
print(uniqueness[:10])
[0.8175834390151201, 0.6844698885072961, 0.725267119762334, 0.7164587220038886, 0.6874799405473135, 0.6773349111042449, 0.6948791555330056, 0.6157872732023304, 0.6692531238595459, 0.7257486965960712]
Some of the relevant information for computer vision tasks, however, is less structured. In our example dataset, this is the case for both the ground_truth and predictions fields, each of which contains a number of object detections in the embedded detections field. The values method also gives us access to these embedded fields.
Let’s see this in action by using the values method to pull out the confidence score for each predicted detection:
[35]:
pred_confs = ds.values("predictions.detections.confidence")
[36]:
print(type(pred_confs))
print(len(pred_confs))
print(type(pred_confs[0]))
<class 'list'>
200
<class 'list'>
As with values("uniqueness"), we get a list with one result per image. However, now we have a sublist for each image, rather than just a single value. We can peek inside one of these sublists at the confidence scores for each detection:
[37]:
print(pred_confs[0])
[0.9750854969024658, 0.759726881980896, 0.6569182276725769, 0.2359301745891571, 0.221974179148674, 0.1965726613998413, 0.18904592096805573, 0.11480894684791565, 0.11089690029621124, 0.0971052274107933, 0.08403241634368896, 0.07699568569660187, 0.058097004890441895, 0.0519101656973362]
Let’s get the lengths of these sublists and print the first few. In the section on expressions, we will see a more natural (and efficient) way of performing this operation; a quick preview follows below.
[38]:
pred_conf_lens = [len(p) for p in pred_confs]
print(pred_conf_lens[:10])
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
We can see that the number of confidence scores - and correspondingly the number of predictions - for each image is not fixed. This scenario is fairly typical in object detection tasks, where images can have varying numbers of objects!
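As that quick preview, the values() method also accepts a ViewExpression, so the same counts can be computed without an explicit Python loop:
[ ]:
# Counts the detections per sample, matching the sublist lengths above
pred_conf_lens = ds.values(F("predictions.detections").length())
print(pred_conf_lens[:10])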
View stages¶
Making a copy¶
Suppose we want to make a copy of the original data and modify the copy without the changes propagating back to the original.
In pandas, we can do this with the copy method:
[39]:
copy_df = df.copy()
copy_df['species'] = 'none'
df.head()
[39]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In FiftyOne, we can do this with the clone() method:
[40]:
copy_ds = ds.clone()
copy_ds.name = 'copy_ds'
print(ds.name)
quickstart
Slicing¶
In pandas, if we want to get a slice of a DataFrame, we can do so with the notation df[start:end].
[41]:
start = 10
end = 14
[42]:
df[start:end]
[42]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
In FiftyOne, a Dataset can be sliced using the same notation:
[43]:
ds[start:end]
[43]:
Dataset: quickstart
Media type: image
Num samples: 4
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Skip(skip=10)
2. Limit(limit=4)
However, as we can see from the output of the preceding command, this is merely syntactic sugar for the expression:
[44]:
ds.skip(start).limit(end - start)
[44]:
Dataset: quickstart
Media type: image
Num samples: 4
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Skip(skip=10)
2. Limit(limit=4)
Get random samples¶
When working with datasets, it is often the case that one might want to select a random set of samples. One typically wants either (a) a fixed number of random samples, or (b) to sample some fraction of the data randomly. We will show how to do both:
Select \(k\) random samples¶
[45]:
k = 20
In pandas, you can use the sample() method, passing in either a number, as in sample(n=k), or a fraction, as we show below.
[46]:
rand_samples_df = df.sample(n=k)
[47]:
rand_samples_df.head()
[47]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
101 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
79 | 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
In FiftyOne, we can use the take() method, to which we can pass a random seed; otherwise, the random number generator is seeded with the current time.
[48]:
rand_samples_ds = ds.take(k, seed=123)
[49]:
rand_samples_ds
[49]:
Dataset: quickstart
Media type: image
Num samples: 20
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Take(size=20, seed=123)
With the random utils in FiftyOne, you can also sample flexibly with user-input weighting schemes, but that is beyond the present scope.
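For instance, a minimal sketch using the fiftyone.utils.random module:
[ ]:
import fiftyone.utils.random as four

# Tag roughly 80% of the samples "train" and 20% "val"
four.random_split(ds, {"train": 0.8, "val": 0.2}, seed=51)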
Randomly select fraction \(p<1\) of samples¶
[50]:
p = 0.05
[51]:
df.sample(frac=p).head()
[51]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
58 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
90 | 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
[52]:
# We need to convert from fraction p to an integer k
k = int(len(ds) * p)
ds.take(k, seed=123)
[52]:
Dataset: quickstart
Media type: image
Num samples: 10
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Take(size=10, seed=123)
Shuffle data¶
In a similar vein to randomly selecting samples, one might want to create a new view in which the entire dataset is shuffled.
In pandas, we can accomplish this by randomly sampling all the rows (\(\mathrm{frac}=1\)) without replacement:
[53]:
shuffled_df_view = df.sample(frac=1)
In FiftyOne, we can just call the shuffle() method:
[54]:
shuffled_ds_view = ds.shuffle(seed=123)
Filtering¶
It is also quite natural to want to filter the data based on some condition. For the Iris data, for instance, let’s get all of the flowers that have a sepal length greater than seven:
[55]:
sepal_length_thresh = 7
large_sepal_len_view = df[df.sepal_length > sepal_length_thresh]
[56]:
print(len(large_sepal_len_view))
print(large_sepal_len_view.head())
12
sepal_length sepal_width petal_length petal_width species
102 7.1 3.0 5.9 2.1 virginica
105 7.6 3.0 6.6 2.1 virginica
107 7.3 2.9 6.3 1.8 virginica
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
In FiftyOne, we can perform an analogous filtering operation on the quickstart images, using the match() method and the ViewField to select all images that have a “uniqueness” score above some threshold:
[57]:
unique_thresh = 0.75
unique_view = ds.match(F("uniqueness") > unique_thresh)
print(unique_view)
print("values: ", unique_view.values("uniqueness"))
Dataset: quickstart
Media type: image
Num samples: 8
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$gt': [...]}})
values: [0.8175834390151201, 1.0, 0.922046961894074, 0.799848556973409, 0.7806850524560267, 0.7950646615140298, 0.7505336395700778, 0.7530639609974709]
However, in FiftyOne, given the potentially nested structure of the data in a Dataset, we can perform far more complex filtering operations using the same machinery, combined with the filter() method. Crucially, these matching and filtering operations apply equally well to embedded fields.
As an example, let’s say we want to filter for all images in our dataset that had at least one object prediction with very high confidence. In this case, the confidence score is an embedded field within the predicted detections for each image. Thus, we can create a filter on confidence scores, and then apply this filter to the embedded detections field within predictions:
[58]:
high_conf_filter = F("confidence") > 0.995
high_conf_view = ds.match(
F("predictions.detections").filter(high_conf_filter).length() > 0
)
[59]:
high_conf_view
[59]:
Dataset: quickstart
Media type: image
Num samples: 116
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$gt': [...]}})
For video tasks, the method match_frames() allows one to perform filtering on the frames of a video collection.
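As a sketch, assuming a video dataset whose frames have a Detections field named "detections":
[ ]:
video_ds = fo.load_dataset("my-videos")  # hypothetical video dataset
# Keep only the frames that contain at least one detection
frames_view = video_ds.match_frames(F("detections.detections").length() > 0)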
We explore this filtering and matching machinery a little more in the section on expressions, but a comprehensive discussion will be the subject of an upcoming tutorial.
Sorting¶
We might also want to sort by certain properties. Let’s see how that is done in pandas and FiftyOne.
In pandas, we use the sort_values method.
Suppose that we want to sort by petal length. We can do this as follows:
[60]:
petal_length_view = df.sort_values(by="petal_length", ascending=False)
[61]:
petal_length_view.head()
[61]:
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
105 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
In FiftyOne, we use the sort_by() method. Let’s sort the samples by the number of “ground truth” objects in the sample images:
[62]:
field = "ground_truth.detections"
view = ds.sort_by(F(field).length(), reverse=True)
[63]:
print(len(view.first().ground_truth.detections)) # 39
print(len(view.last().ground_truth.detections)) # 0
39
0
Now we can see that the most crowded image has \(39\) objects, while the least crowded image is actually empty!
Deleting¶
If we are resource-constrained, we can delete old DataFrame or Dataset objects so that they no longer occupy memory.
In pandas, we do this using the del statement and the garbage collector. To delete the petal_length_view view, we can do the following:
[64]:
import gc
del petal_length_view
gc.collect()
[64]:
16
In FiftyOne, we can use the built-in delete() method:
[65]:
copy_ds.delete()
It is also worth mentioning that in FiftyOne, a non-persistent Dataset only lasts for the current session: it is deleted after closing Python (this is true in both Python interpreters and notebooks). If you want to use the dataset in the future, you can avoid this end-of-session deletion by setting the persistent property to True:
[66]:
ds.persistent = True
Aggregations¶
Given a set of values for a column or field, it is often desired to compute aggregate quantities over all of these values. pandas DataFrame objects and FiftyOne Dataset objects both come with this functionality built in.
The general syntax is that in pandas, aggregations are methods of pd.Series objects, which represent the columns in a DataFrame. In FiftyOne, aggregations are methods of the Dataset or DatasetView object, which take as input the field to be aggregated over.
Count¶
In both pandas and FiftyOne, the count() method returns the total number of occurrences.
In pandas, this counts the number of non-null values in the column, which here (with no missing values) equals the number of rows in the DataFrame:
[67]:
print(df['species'].count())
print(len(df))
150
150
In FiftyOne, the count method returns the total number of occurrences of a given field, which is not necessarily the same as the number of samples.
[68]:
num_predictions = ds.count("predictions.detections.label")
print(len(ds))
print(num_predictions)
200
5620
Sum¶
Both pandas and FiftyOne have the sum() method:
[69]:
sum_sepal_lengths = df.sepal_length.sum()
print(sum_sepal_lengths)
876.5
[70]:
sum_pred_confs = ds.sum("predictions.detections.confidence")
print(sum_pred_confs)
1966.6705134399235
Unique¶
In pandas, the unique method returns an array of all unique values in the input pd.Series.
[71]:
df.species.unique()
[71]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In FiftyOne, the distinct() method reproduces this functionality.
[72]:
rand_samples_ds.distinct("predictions.detections.label")
[72]:
['banana',
'bed',
'bench',
'bicycle',
'bird',
'boat',
'book',
'bowl',
'broccoli',
'bus',
'cake',
'car',
'carrot',
'cat',
'cell phone',
'chair',
'clock',
'couch',
'cow',
'cup',
'dining table',
'dog',
'elephant',
'fire hydrant',
'fork',
'frisbee',
'giraffe',
'handbag',
'horse',
'keyboard',
'kite',
'knife',
'laptop',
'person',
'pizza',
'sandwich',
'scissors',
'sheep',
'skateboard',
'skis',
'snowboard',
'spoon',
'sports ball',
'stop sign',
'surfboard',
'tie',
'traffic light',
'train',
'truck',
'tv',
'umbrella']
Bounds¶
In pandas, you compute the minimum and maximum value of a pd.Series separately:
[73]:
min_sepal_len = df.sepal_length.min()
max_sepal_len = df.sepal_length.max()
print("min_sepal_len: {}, max_sepal_len: {}".format(min_sepal_len, max_sepal_len))
min_sepal_len: 4.3, max_sepal_len: 7.9
When working with a FiftyOne Dataset or DatasetView, the min and max are returned together in a tuple when the bounds() method is called on a field:
[74]:
(min_pred_conf, max_pred_conf) = ds.bounds("predictions.detections.confidence")
print("min_pred_conf: {}, max_pred_conf: {}".format(min_pred_conf, max_pred_conf))
min_pred_conf: 0.05003104358911514, max_pred_conf: 0.9999035596847534
Mean¶
Both pandas DataFrame objects and FiftyOne Dataset objects employ the mean() method:
[75]:
mean_sepal_len = df.sepal_length.mean()
print(mean_sepal_len)
5.843333333333334
[76]:
mean_pred_conf = ds.mean("predictions.detections.confidence")
print(mean_pred_conf)
0.34994137249820706
Standard deviation¶
Both pandas DataFrame objects and FiftyOne Dataset objects employ the std() method:
[77]:
std_sepal_len = df.sepal_length.std()
print(std_sepal_len)
0.828066127977863
[78]:
std_pred_conf = ds.std("predictions.detections.confidence")
print(std_pred_conf)
0.3184061813934825
Quantiles¶
If you don’t want just the mean, but instead want the value of a given column or field at arbitrary points in its distribution, you can compute quantiles: pandas provides the quantile() method and FiftyOne the quantiles() method, each of which accepts a list of quantiles in [0, 1].
[79]:
percentiles = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
[80]:
sepal_len_quantiles = df.sepal_length.quantile(percentiles)
print(sepal_len_quantiles)
0.0 4.30
0.2 5.00
0.4 5.60
0.6 6.10
0.8 6.52
1.0 7.90
Name: sepal_length, dtype: float64
[81]:
pred_conf_quantiles = ds.quantiles("predictions.detections.confidence", percentiles)
print(pred_conf_quantiles)
[0.05003104358911514, 0.08101843297481537, 0.14457139372825623, 0.2922309935092926, 0.6890143156051636, 0.9999035596847534]
Median and other aggregations¶
Some aggregations that are native to pandas, such as computing the median, are not native to FiftyOne. In these cases, the canonical way to compute the aggregation is to first extract the values from the Dataset field, and then use native numpy or scipy functionality.
Here we illustrate this procedure for computing the median. If we use the values method on the predictions.detections.confidence field with default arguments, we get a jagged array:
[82]:
pred_confs_jagged = ds.values("predictions.detections.confidence")
print([len(pc) for pc in pred_confs_jagged][:10])
print(sum([len(pc) for pc in pred_confs_jagged]))
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
5620
However, we can simplify our lives by flattening the result, passing in the argument unwind=True:
[83]:
pred_confs_flat = ds.values("predictions.detections.confidence", unwind=True)
print(len(pred_confs_flat))
5620
And from this we can easily compute the median:
[84]:
pred_confs_median = np.median(pred_confs_flat)
print(pred_confs_median)
0.20251326262950897
Structural change operations¶
Add new column/field¶
There are many scenarios in which one might want to add another column/field to a dataset. From a practical standpoint, these come in three distinct flavors:

1. Add a new column/field with a default (constant) value for each row/sample.
2. Add a new column/field defined with external or already computed data.
3. Create a new column/field programmatically from other columns/fields.

In this section, we show how to efficiently handle each of these cases in pandas and FiftyOne.
Add new column/field with default value¶
In pandas, the easiest way to create a new column const_col with constant value const_val is:
[85]:
df['const_col'] = 'const_val'
df.head()
[85]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col |
|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val |
which implicitly broadcasts the single value const_val to all rows in the DataFrame.
In FiftyOne, the canonical process for efficiently creating and populating a new field involves three steps:

1. Add a new field to the Dataset using the add_sample_field() method, as add_sample_field(field_name, ftype).
2. Populate the field, using either set_field() or set_values(), as we will illustrate below.
3. Save the changes to the Dataset or DatasetView using save().
There is one key distinction in usage between set_field and set_values. Whereas set_values sets the values on the Dataset directly, using set_field creates a new DatasetView, and this DatasetView is what must be saved!
Before illustrating these more efficient approaches, it is worth mentioning that you can also loop through the samples in a Dataset or DatasetView and add or set fields one at a time.
[86]:
for sample in ds.iter_samples(autosave=True):
sample["new_const_field"] = 51
sample["computed_field"] = len(sample.ground_truth.detections)
However, this is not an efficient approach. It is recommended to use set_field or set_values instead.
In the simplest scenario (analogous to the pandas example above), we can pass a single value into set_field along with the name of the field:
[87]:
ds.add_sample_field("const_field", fo.StringField)
view = ds.set_field("const_field", "const_val")
view.save()
print(ds.first().field_names)
print(ds.values("const_field")[:10])
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
['const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val']
As we will see shortly, however, set_field is far more flexible and powerful than this, as a result of FiftyOne’s robust matching and filtering capabilities.
Add new column/field from external data¶
Starting with pandas, suppose that our data team comes to us and tells us that now they also have the stem length for each flower, and they want us to incorporate that data into our models.
For instance, let’s say the stem lengths are:
[88]:
stem_lengths = np.random.uniform(5, 10, len(df))
We can add this into our dataset using a similar syntax as above. The only difference is that this time, the assignment takes in an array (here a numpy array) instead of a single value.
[89]:
df['stem_length'] = stem_lengths
[90]:
df.head()
[90]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 |
In FiftyOne, we can do something similar by passing an array of values into set_values.
As an example, let’s say we have an abstractness score between zero and one for each image.
[91]:
abstractness = np.random.uniform(0, 1, len(ds))
[92]:
ds.set_values("abstractness", abstractness)
print(ds.first().field_names)
print(ds.values("abstractness")[:10])
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
[0.18992196548662132, 0.4195423356383746, 0.9782249923275138, 0.3555547463728417, 0.9019379850096877, 0.3647814428112852, 0.3030278060870243, 0.241988161650587, 0.7872455674533378, 0.44774858997738953]
Note that when using set_values, we are modifying the Dataset directly. Thus, as opposed to set_field, we do not need to preface the method call with add_sample_field, and we do not need to explicitly save the Dataset with save() afterwards.
Add a new column/field from existing columns/fields¶
Finally, often either in the process of feature engineering or data analysis, you want to generate new columns or fields from existing ones.
In pandas, the canonical way of doing this is with the apply method. Suppose we want to create a new feature called “sepal volume”, derived by taking the product of sepal length and sepal width. With apply, we can map row-wise over the columns:
[93]:
df["sepal_volume"] = df.apply(lambda x: x["sepal_length"]*x["sepal_width"], axis=1)
[94]:
df.head()
[94]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 | 18.00 |
In FiftyOne, we can perform operations like this by combining set_field with the ViewField, here loaded as F.
To compute the number of predicted object detections for each sample in the Dataset, we can write:
[95]:
view = ds.set_field(
"predictions.num_predictions",
F("$predictions.detections").length(),
)
view.save()
print(ds.first().predictions.field_names)
print(ds.values("predictions.num_predictions")[:10])
('detections', 'num_predictions')
[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]
The above also highlights that all of the aforementioned operations work on embedded fields. Note, however, that since we are not changing the base field schema, we do not need to call add_sample_field!
Remove a column/field¶
Sometimes you want to look at a dataset without a certain column/field. More precisely, there are two related things one might want to do:

1. Create a new view of the dataset without a specific column/field, or
2. Delete a specific column/field from the original dataset.

Here, we show how to do both of these in pandas and FiftyOne.
In pandas, you can create a view without specific columns using the drop method:
[96]:
df.head()
[96]:
| | sepal_length | sepal_width | petal_length | petal_width | species | const_col | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | const_val | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | const_val | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | const_val | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | const_val | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | const_val | 8.624046 | 18.00 |
[97]:
no_const_view = df.drop(["const_col"], axis=1)
# equivalent to df.drop(columns=["const_col"])
no_const_view.head()
[97]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
If one wants to delete the column from the original DataFrame, one can simply reassign the original variable to the dropped result (or, equivalently, pass inplace=True to drop):
[98]:
df = df.drop(["const_col"], axis=1)
df.head()
[98]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
In FiftyOne, you can create a DatasetView without a particular field using the exclude_fields() method:
[99]:
no_predictions_view = ds.exclude_fields("predictions")
print(no_predictions_view.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')
Alternatively, you can delete a field from the Dataset using delete_sample_field().
[100]:
ds.delete_sample_field("const_field")
print(ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Both the exclude_fields and delete_sample_field methods also work with embedded fields:
[101]:
ds.delete_sample_field("predictions.num_predictions")
print(ds.first().predictions.field_names)
('detections',)
To delete multiple fields at once, you can use the related delete_sample_fields() method.
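For instance, a quick sketch that removes the evaluation counts in one call (applied to a clone so that ds is untouched):
[ ]:
tmp_ds = ds.clone()
tmp_ds.delete_sample_fields(["eval_tp", "eval_fp", "eval_fn"])
print(tmp_ds.first().field_names)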
Keep only specified columns/fields¶
Alternatively, if you only want to create a view with a small subset of columns/fields, it might be easier to specify those directly. As with removing columns, this can be done in a way that creates a new view while preserving the original, or in a way that deletes the columns/fields from the original dataset. We show both approaches below.
In pandas, to create a new view with only the “sepal_length” and “sepal_width” columns, one could write:
[102]:
sepal_df = df[["sepal_length", "sepal_width"]]
sepal_df.head()
[102]:
| | sepal_length | sepal_width |
|---|---|---|
0 | 5.1 | 3.5 |
1 | 4.9 | 3.0 |
2 | 4.7 | 3.2 |
3 | 4.6 | 3.1 |
4 | 5.0 | 3.6 |
In contrast, reassigning the variable itself replaces the stored DataFrame with the narrower selection:
[103]:
sepal_df = sepal_df[["sepal_length"]]
sepal_df.head()
[103]:
| | sepal_length |
|---|---|
0 | 5.1 |
1 | 4.9 |
2 | 4.7 |
3 | 4.6 |
4 | 5.0 |
In FiftyOne, if we want to create a separate view with only specified fields kept, we can first clone the original dataset and then apply the select_fields() method. When we then apply the keep_fields() method, the changes propagate from the DatasetView back to the underlying Dataset.
Let’s create two clones of our base Dataset to showcase this distinction.
[104]:
ds_clone1 = ds.clone()
ds_clone2 = ds.clone()
For both of these clones, let’s create views which select only the ground_truth field:
[105]:
clone1_view = ds_clone1.select_fields("ground_truth")
clone2_view = ds_clone2.select_fields("ground_truth")
print(clone1_view.first().field_names)
print(clone2_view.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
The id, filepath, tags, and metadata fields are preserved by default, even when not passed in to select_fields. Aside from these and ground_truth, all other fields have been omitted from the view. Now let’s apply keep_fields only on the first clone, and see what changes propagate back.
[106]:
clone1_view.keep_fields()
[107]:
print(ds_clone1.first().field_names)
print(ds_clone2.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'ground_truth')
('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
As we can see, the changes only propagated back to the dataset (in this case ds_clone1) when we applied keep_fields.
Finally, we note that when dealing with video datasets, the exclude_fields and select_fields methods have frame analogs: exclude_frames() and select_frames().
Concatenation¶
Suppose we have two datasets we want to combine or concatenate.
In both pandas and FiftyOne, we can concatenate them using the concat method.
In pandas, we can combine two DataFrame objects:
[108]:
df1 = df[df.species == 'setosa']
df2 = df[df.species == 'virginica']
concat_df = pd.concat([df1, df2])
print(len(concat_df))
100
In FiftyOne, we can use the concat() method to combine views from the same dataset:
[109]:
view1 = ds.match(F("uniqueness") < 0.2)
view2 = ds.match(F("uniqueness") > 0.7)
[110]:
print(len(view1))
print(len(view2))
19
17
[111]:
concat_view = view1.concat(view2)
print(len(view1) + len(view2))
print(len(concat_view))
36
36
The slightly more complicated operation of concatenating Dataset objects ds1 and ds2 (as opposed to DatasetView objects) can be achieved using merge_samples(), i.e., ds1.merge_samples(ds2).
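A minimal sketch, assuming two hypothetical image directories on disk:
[ ]:
ds1 = fo.Dataset.from_images_dir("/path/to/images1")  # hypothetical paths
ds2 = fo.Dataset.from_images_dir("/path/to/images2")
ds1.merge_samples(ds2)  # ds1 now also contains the samples from ds2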
Adding a single row/sample¶
Oftentimes, we just want to enhance a dataset by adding one sample at a time.
In pandas, the fastest way to do this is to use the same concat method as above. If the row data is in a dictionary format, we convert it to its own DataFrame first, as the sketch after this example shows.
[112]:
len(df1)
[112]:
50
[113]:
single_row = df2.iloc[0]
df1_plus = pd.concat([df1, pd.DataFrame([single_row])])
print(len(df1_plus))
51
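If the row is instead supplied as a dict (the values below are made up), the same pattern applies:
[ ]:
row_dict = {
    "sepal_length": 6.0, "sepal_width": 3.0, "petal_length": 5.0,
    "petal_width": 1.8, "species": "virginica",
}
df1_plus = pd.concat([df1, pd.DataFrame([row_dict])])
print(len(df1_plus))  # 51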
In FiftyOne, we can use the add_sample() method. Notice that this is an in-place operation, and no assignment is needed. Also note that this does not work for views: a sample can only be added to a Dataset, not to a DatasetView. As such, we first clone the view to turn it into its own Dataset.
[114]:
single_sample = view2.first()
view1_plus = view1.clone()
print(len(view1_plus))
view1_plus.add_sample(single_sample)
print(len(view1_plus))
19
20
We can also add a collection of samples to a dataset using the add_samples() method, which takes as input a list of fo.Sample objects.
[115]:
print(len(view1_plus))
view1_plus.add_samples(view2.skip(1).head(3))
print(len(view1_plus))
20
100% |█████████████████████| 3/3 [35.6ms elapsed, 0s remaining, 84.2 samples/s]
23
Remove rows/samples¶
The same in-place vs. out-of-place considerations for pandas, and Dataset vs. DatasetView considerations for FiftyOne, apply to rows/samples just as they did to columns/fields.
In pandas, rows are removed by index using the drop method.
[116]:
### Randomly select a set of rows to remove
import random
rows_to_remove = random.sample(range(len(df)), 10)
To create a new view:
[117]:
sub_df = df.drop(rows_to_remove)
print(len(sub_df))
print(len(df))
140
150
To remove the rows from the original DataFrame (here applied to a copy, so that df is preserved):
[118]:
copy_df = df.copy()
copy_df = copy_df.drop(rows_to_remove)
print(len(copy_df))
140
In FiftyOne, exclude() creates a view without the specified samples:
[119]:
samples_to_remove = ds.take(10)
[120]:
sub_view = ds.exclude(samples_to_remove)
print(len(ds))
print(len(sub_view))
print(type(sub_view))
200
190
<class 'fiftyone.core.view.DatasetView'>
On the other hand, delete_samples() is an in-place operation which deletes the samples from the underlying Dataset:
[121]:
sub_ds = ds.clone()
sub_ds.delete_samples(samples_to_remove)
print(len(sub_ds))
190
Keep only specified rows/samples¶
As with columns/fields, one might want to pick out specific rows/samples. In the section on filtering and expressions, we’ll cover more advanced operations. Here we show how to select the data corresponding to a given list of rows/samples.
[122]:
rows_to_keep = list(random.sample(range(len(df)), 80))
[123]:
sub_df = df.iloc[rows_to_keep]
print(len(sub_df))
80
[124]:
sample_ids = ds.values("id")
ids_to_keep = [sample_ids[ind] for ind in rows_to_keep]
print(len(ids_to_keep))
print(len(ds.select(ids_to_keep)))
80
80
Rename column/field¶
In pandas, you can rename columns by passing a dictionary or mapping into the rename() method with the columns argument. This is not an in-place operation:
[125]:
renamed_df = df.rename(columns={"sepal_length": "sl", "sepal_width": "sw"})
renamed_df.head()
[125]:
| | sl | sw | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
In FiftyOne, you can rename fields using an analogous (but in-place) name mapping, passed in to the rename_sample_fields() method.
[126]:
renamed_ds = ds.clone()
renamed_ds.rename_sample_fields({"ground_truth": "gt", "predictions":"pred"})
print(renamed_ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'gt', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Alternatively, if you just want to rename a single field, you can also do so with the rename_sample_field() method as rename_sample_field(old_field_name, new_field_name):
[127]:
renamed_ds.rename_sample_field("gt", "gt_new")
print(renamed_ds.first().field_names)
('id', 'filepath', 'tags', 'metadata', 'gt_new', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')
Both of these methods extend naturally to embedded fields:
[128]:
renamed_ds.first().pred.detections[0].eval_iou
[128]:
0.8575063187115628
[129]:
renamed_ds.rename_sample_field("pred.detections.eval_iou", "pred.detections.iou")
print(renamed_ds.first().pred.detections[0].field_names)
('id', 'attributes', 'tags', 'label', 'bounding_box', 'mask', 'confidence', 'index', 'eval', 'eval_id', 'iou')
Expressions¶
As introduced above, the filter and match methods, along with the ViewField, can be remarkably useful for selecting subsets of datasets that satisfy user-defined conditions. In this section, we demonstrate how to combine these components to perform pandas-style queries.
A common theme throughout this section is that while in pandas, expressions (over a given set of rows) can only be applied to the values in the columns, in FiftyOne, expressions can be applied to fields, including embedded fields, or directly to labels or tags! As such, FiftyOne provides match_labels() and match_tags() methods.
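For instance, sketches using the quickstart fields and tags from above (the 0.9 threshold is arbitrary):
[ ]:
# Samples with at least one highly confident predicted label
labels_view = ds.match_labels(fields="predictions", filter=F("confidence") > 0.9)

# Samples carrying the "validation" tag
tags_view = ds.match_tags("validation")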
Element comparison expressions¶
In both pandas and FiftyOne, the element comparison operators ==, >, <, !=, >=, and <= all conform to the same syntax. The following examples show this functionality.
Exact equality¶
[130]:
setosa_df = df[df.species == "setosa"]
print(len(setosa_df))
50
[131]:
ds.match(F("filepath") == '/root/fiftyone/quickstart/data/000880.jpg')
[131]:
Dataset: quickstart
Media type: image
Num samples: 0
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$eq': [...]}})
Less than or equal to¶
[132]:
short_sepal_cond = df.sepal_length <= 5
short_sepal_df = df[short_sepal_cond]
short_sepal_df.head()
[132]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 9.230470 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 8.312255 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.762648 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 8.624046 | 18.00 |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | 5.066091 | 15.64 |
[133]:
non_unique_filter = F("uniqueness") <= 0.2
non_unique_view = ds.match(non_unique_filter)
non_unique_view
[133]:
Dataset: quickstart
Media type: image
Num samples: 19
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$lte': [...]}})
Logical expressions¶
Logical complement¶
If we have an expression and we want to find all rows/samples that do not satisfy this expression, we can use the complement operator ~. Let’s use this to get the complementary rows/samples to those picked out by the expression above:
[134]:
non_short_sepal_df = df[~short_sepal_cond]
non_short_sepal_df.head()
[134]:
| | sepal_length | sepal_width | petal_length | petal_width | species | stem_length | sepal_volume |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 9.519895 | 17.85 |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 9.171235 | 21.06 |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | 8.236024 | 19.98 |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa | 5.914960 | 23.20 |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa | 6.215238 | 25.08 |
[135]:
unique_view = ds.match(~non_unique_filter)
unique_view
[135]:
Dataset: quickstart
Media type: image
Num samples: 181
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$not': {...}}})
Logical AND¶
In pandas and FiftyOne, the logical AND of two conditions can be evaluated with the & operator:
[136]:
pd_cond1 = (df.sepal_volume < 20)
pd_cond2 = (df.species == "setosa")
print("{} rows satisfy condition1".format(len(df[pd_cond1])))
print("{} rows satisfy condition2".format(len(df[pd_cond2])))
print("{} rows satisfy condition1 AND condition2".format(len(df[pd_cond1 & pd_cond2])))
109 rows satisfy condition1
50 rows satisfy condition2
43 rows satisfy condition1 AND condition2
[137]:
fo_cond1 = F("uniqueness") > 0.4
fo_cond2 = F("uniqueness") < 0.55
print("{} samples satisfy condition1".format(len(ds.match(fo_cond1))))
print("{} samples satisfy condition2".format(len(ds.match(fo_cond2))))
print("{} samples satisfy condition1 AND condition2".format(len(ds.match(fo_cond1 & fo_cond2))))
100 samples satisfy condition1
109 samples satisfy condition2
9 samples satisfy condition1 AND condition2
Additionally, if we want to evaluate the logical AND of a list of conditions, in FiftyOne we can do so using all():
[138]:
fo_cond3 = F("predictions.detections").length() >= 10
print(ds.match(F.all([fo_cond1, fo_cond2, fo_cond3])))
Dataset: quickstart
Media type: image
Num samples: 5
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$and': [...]}})
Logical OR¶
In pandas and FiftyOne, the logical OR of two conditions can be evaluated with the | operator:
[139]:
print("{} rows satisfy condition1".format(len(df[pd_cond1])))
print("{} rows satisfy condition2".format(len(df[pd_cond2])))
print("{} rows satisfy condition1 OR condition2".format(len(df[pd_cond1 | pd_cond2])))
109 rows satisfy condition1
50 rows satisfy condition2
116 rows satisfy condition1 OR condition2
[140]:
print("{} samples satisfy condition1".format(len(ds.match(fo_cond1))))
print("{} samples satisfy condition3".format(len(ds.match(fo_cond3))))
print("{} samples satisfy condition1 OR condition3".format(len(ds.match(fo_cond1 | fo_cond3))))
100 samples satisfy condition1
134 samples satisfy condition3
166 samples satisfy condition1 OR condition3
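These counts obey inclusion-exclusion: the number of samples satisfying A OR B equals the counts for A and B summed, minus the count for A AND B. A quick sketch, assuming the conditions defined above:

# Inclusion-exclusion check: |A or B| == |A| + |B| - |A and B|
n_a = len(ds.match(fo_cond1))
n_b = len(ds.match(fo_cond3))
n_and = len(ds.match(fo_cond1 & fo_cond3))
n_or = len(ds.match(fo_cond1 | fo_cond3))

assert n_or == n_a + n_b - n_and  # 166 == 100 + 134 - 68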
Mirroring our usage of all, in FiftyOne we can use any() to evaluate the logical OR of a list of conditions:
[141]:
print(ds.match(F.any([fo_cond1, fo_cond3])))
Dataset: quickstart
Media type: image
Num samples: 166
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. Match(filter={'$expr': {'$or': [...]}})
We note that these all and any methods in FiftyOne are distinct from the pandas methods of the same names, which reduce boolean values along an axis of a Series or DataFrame rather than combining a list of query expressions. A pandas-side sketch for combining a list of conditions follows below.
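To combine a list of boolean conditions in pandas, one option is to reduce them with numpy (a sketch; pandas has no direct analogue of F.all()/F.any() for this purpose):

import numpy as np

pd_conds = [df.sepal_volume < 20, df.species == "setosa", df.petal_length < 1.5]

# Logical AND of a list of conditions, analogous to F.all() in FiftyOne
and_mask = np.logical_and.reduce(pd_conds)

# Logical OR of a list of conditions, analogous to F.any() in FiftyOne
or_mask = np.logical_or.reduce(pd_conds)

print(len(df[and_mask]), len(df[or_mask]))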
Subset-superset¶
Is in¶
In pandas, we can check whether the entries in a column are in a given list of values using the isin method:
[142]:
df.species.isin(['setosa', 'versicolor'])
[142]:
0 True
1 True
2 True
3 True
4 True
...
145 False
146 False
147 False
148 False
149 False
Name: species, Length: 150, dtype: bool
In FiftyOne, the analogous method is is_in(). For instance, we can filter our dataset's predictions down to just animal detections with the following:
[143]:
ANIMALS = [
"bear", "bird", "cat", "cow", "dog", "elephant", "giraffe",
"horse", "sheep", "zebra"
]
animal_view = ds.filter_labels("predictions", F("label").is_in(ANIMALS))
print(animal_view)
Dataset: quickstart
Media type: image
Num samples: 87
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. FilterLabels(field='predictions', filter={'$in': ['$$this.label', [...]]}, only_matches=True, trajectories=False)
Additionally, when FiftyOne fields contain lists, we might want to check whether these lists are subsets of other lists. We can do this with the is_subset() method:
[144]:
empty_dataset.add_samples(
[
fo.Sample(
filepath="image1.jpg",
tags=["a", "b", "a", "b"]
)
]
)
print(empty_dataset.values(F("tags").is_subset(["a", "b", "c"])))
100% |█████████████████████| 1/1 [6.3ms elapsed, 0s remaining, 177.5 samples/s]
[True]
Contains¶
We can also flip this operation on its head and ask whether the column/field entries contain something else. In pandas, DataFrame entries are typically scalar values rather than lists, so the most common form of containment is string containment, i.e., checking whether the strings in a column contain a substring:
[145]:
df.species.str.contains("set").sum()
[145]:
50
This has a parallel in FiftyOne: contains_str():
[146]:
ze_view = ds.filter_labels("predictions", F("label").contains_str("ze"))
print(ze_view)
Dataset: quickstart
Media type: image
Num samples: 5
Sample fields:
id: fiftyone.core.fields.ObjectIdField
filepath: fiftyone.core.fields.StringField
tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
uniqueness: fiftyone.core.fields.FloatField
predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
eval_tp: fiftyone.core.fields.IntField
eval_fp: fiftyone.core.fields.IntField
eval_fn: fiftyone.core.fields.IntField
abstractness: fiftyone.core.fields.FloatField
new_const_field: fiftyone.core.fields.IntField
computed_field: fiftyone.core.fields.IntField
View stages:
1. FilterLabels(field='predictions', filter={'$regexMatch': {'input': '$$this.label', 'options': None, 'regex': 'ze'}}, only_matches=True, trajectories=False)
On a related note, FiftyOne has other useful string operations, including starts_with() and ends_with().
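For instance, a minimal sketch following the same pattern as contains_str() above:

# Predicted labels that start with "ze" (e.g., "zebra")
starts_view = ds.filter_labels("predictions", F("label").starts_with("ze"))

# Predicted labels that end with "bra"
ends_view = ds.filter_labels("predictions", F("label").ends_with("bra"))

print(starts_view.count(), ends_view.count())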
What’s more, in FiftyOne, where fields themselves can be lists, we can check containment in those lists using the contains() method.
If we want to create a view that contains either cats or dogs, we can do so with:
[147]:
# Only contains samples with "cat" or "dog" predictions
cats_or_dogs_view = ds.match(
F("predictions.detections.label").contains(["cat", "dog"])
)
print(cats_or_dogs_view.count())
39
If instead we want a view of all samples that contain both cats and dogs, we can pass in the all=True argument:
[148]:
# Only contains samples with "cat" and "dog" predictions
cats_and_dogs_view = ds.match(
F("predictions.detections.label").contains(["cat", "dog"], all=True)
)
print(cats_and_dogs_view.count())
10
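For comparison, although list-valued cells are unidiomatic in pandas, an object-dtype column of lists can be queried in a similar spirit with apply; here is a sketch using hypothetical toy data:

import pandas as pd

# Hypothetical object-dtype column holding lists of labels
toy_df = pd.DataFrame({"labels": [["cat"], ["dog", "cat"], ["horse"]]})

# Rows whose list contains "cat" OR "dog"
either = toy_df.labels.apply(lambda ls: bool({"cat", "dog"} & set(ls)))

# Rows whose list contains both "cat" AND "dog"
both = toy_df.labels.apply(lambda ls: {"cat", "dog"} <= set(ls))

print(either.sum(), both.sum())  # 2 1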
Checking data types¶
Numeric and string types¶
In recent versions of pandas, one can check whether the data type of a DataFrame column is numeric or string by importing the corresponding functions:
[149]:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
print(is_numeric_dtype(df.sepal_length))
print(is_string_dtype(df.sepal_length))
True
False
In FiftyOne, these checks are handled by the is_number() and is_string() methods:
[150]:
print(ds.match(F("uniqueness").is_number()).count())
print(ds.match(F("uniqueness").is_string()).count())
200
0
Null¶
In pandas, one checks whether data is null using the isna method:
[151]:
df.isna().any()
[151]:
sepal_length False
sepal_width False
petal_length False
petal_width False
species False
stem_length False
sepal_volume False
dtype: bool
In FiftyOne, the is_null() method does this:
[152]:
null_view = ds.set_field(
"uniqueness",
(F("uniqueness") >= 0.25).if_else(F("uniqueness"), None)
)
# Create view that only contains samples with uniqueness = None
not_unique_view = null_view.match(F("uniqueness").is_null())
print(len(not_unique_view))
92
Because a FiftyOne Dataset can consist of samples with inhomogeneous field schemas, FiftyOne also provides the related method exists(), and its converse is_missing(), which check sample-wise whether a field has a value.
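For instance, a minimal sketch using the null_view from above:

# exists() matches samples whose field is set to a non-None value
exists_view = null_view.match(F("uniqueness").exists())

# is_missing() matches samples from which the field is absent entirely
missing_view = null_view.match(F("uniqueness").is_missing())

print(len(exists_view), len(missing_view))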
Array¶
In FiftyOne, fields can also contain arrays. We can check for this with the is_array() method:
[153]:
ds.match(F("tags").is_array()).count()
[153]:
200
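Since pandas dtypes are declared per column rather than per entry, a rough pandas counterpart is an element-wise type check; a sketch with hypothetical toy data:

import numpy as np
import pandas as pd

# Element-wise check for array-like entries in an object-dtype column
toy_df = pd.DataFrame({"vals": [[1, 2], "a", np.array([3.0])]})
is_arr = toy_df.vals.apply(lambda v: isinstance(v, (list, np.ndarray)))
print(is_arr.tolist())  # [True, False, True]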
Conclusion¶
FiftyOne and pandas are both open source Python libraries that make dealing with your data easy. While they serve different purposes - pandas is built for tabular data, while FiftyOne helps users tackle the unstructured data prevalent in computer vision tasks - their syntax and functionality are closely aligned. Both pandas and FiftyOne are important components of many data science and machine learning workflows!