
Dataset Card for EgoCart#


EgoCart is a large-scale benchmark dataset for egocentric, image-based indoor localization in a retail store. RGB images and depth maps were captured by cameras mounted on shopping carts moving through a real supermarket. Each frame is annotated with a 3-DOF camera pose (position + orientation) and a store-zone class label.

This card covers the egocart_videos FiftyOne dataset: 9 MP4 videos — one per recording sequence — with frame-level pose and zone annotations.

This is a FiftyOne dataset with 9 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/egocart_videos")

# Launch the App
session = fo.launch_app(dataset)

Filtering on frame-level fields#

FiftyOne lets you filter videos by properties of their frames using match_frames:

# Videos that contain at least one frame in zone 15
has_zone_15 = dataset.match_frames(fo.ViewField("zone_id") == 15)

# Videos where the cart was in the right half of the store
has_right_half = dataset.match_frames(fo.ViewField("location_x") > 0)

# Videos where the cart was facing roughly East (heading ≈ 0°)
has_east = dataset.match_frames(
    (fo.ViewField("heading_deg") > -30) & (fo.ViewField("heading_deg") < 30)
)

# Build a frame-level view to analyze pose across all sequences
frames_view = dataset.to_frames()
print(frames_view.count())  # 19,531 total frames

Dataset Details#

Dataset Description#

EgoCart was collected to study the problem of localizing shopping carts in a large retail store from egocentric images. It supports research into indoor localization (image retrieval, pose regression), egocentric video understanding, and analysis of customer movement patterns.

RGB-D cameras (1280 × 720) mounted on shopping carts captured footage at 50 fps during real shopping sessions in a single Italian retail store. Nine independent recording sequences were made; six form the training split and three the test split. The store floor plan spans roughly 40 m × 17 m.

  • Original dataset creators: E. Spera, A. Furnari, S. Battiato, G. M. Farinella — University of Catania, Italy

  • FiftyOne curation: Harpreet Sahota

  • License: Research use only (see original dataset page)

Dataset Sources#

Uses#

Direct Use#

  • Indoor localization / place recognition: Use frame-level pose annotations as ground truth for image retrieval and nearest-neighbor localization methods.

  • Pose regression from video: Train networks to predict (location_x, location_y, heading_deg) from RGB frames, exploiting temporal context across the video.

  • Store zone classification: Predict zone_id (1–16) from each video frame as a discrete localization proxy.

  • Trajectory analysis: Examine how shopping carts move through the store over time; the video format makes it easy to study motion continuity, revisitation patterns, and zone transitions.
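For the pose-regression use case above, the three annotated quantities can be packed into a single target vector. A minimal sketch; encoding the heading as (cos, sin) to avoid the ±180° wrap-around is a common convention, not something the dataset prescribes:

```python
import math

def pose_target(location_x, location_y, heading_deg):
    """Assemble a 3-DOF pose regression target from the frame fields.

    The heading angle is encoded as (cos, sin), a common choice (an
    assumption here, not part of the dataset) that avoids the
    discontinuity at +/-180 degrees.
    """
    h = math.radians(heading_deg)
    return [location_x, location_y, math.cos(h), math.sin(h)]

# Example: the first annotated frame of sequence 0
target = pose_target(-17.90, 4.77, -2.49)
```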

Out-of-Scope Use#

  • The dataset captures a single retail store at a single point in time. Models trained here will not generalize to different stores or layouts without re-collection.

  • The camera faces forward along aisles; there are no identifiable people in the footage, making it unsuitable for pedestrian detection or customer re-identification.

  • Videos encode only camera heading (yaw). Camera pitch and roll are not captured, so the dataset is unsuitable for full 6-DOF pose estimation.

Dataset Structure#

FiftyOne video dataset overview#

egocart_videos is a FiftyOne video dataset (media_type = "video") containing 9 samples. In FiftyOne’s data model, each sample corresponds to one video file; annotations that change over time are stored as frame-level fields accessible via sample.frames[frame_number] (1-indexed).

egocart_videos

 Sample 0   filepath = egocart_seq_0.mp4   tags = ["train"]   sequence_id = "0"
    Frame 1   location_x=-17.90  location_y=4.77  heading_deg=-2.49  zone_id=16
    Frame 2   location_x=-17.93  location_y=4.76  heading_deg=-4.57  zone_id=16
    ...  (1,838 frames total)

 Sample 1   filepath = egocart_seq_1.mp4   tags = ["train"]   sequence_id = "1"
    ...  (1,740 frames)

...

 Sample 8   filepath = egocart_seq_8.mp4   tags = ["train"]   sequence_id = "8"
     ...  (3,101 frames)

Sample-level fields#

| Field | Type | Description |
| --- | --- | --- |
| filepath | StringField | Absolute path to the MP4 file |
| tags | ListField(StringField) | ["train"] or ["test"] |
| sequence_id | StringField | Recording run identifier ("0"–"8") |
| metadata | VideoMetadata | Auto-populated: duration, fps, resolution, frame count |

Frame-level fields#

Each frame corresponds to one original RGB image from the dataset. Frames are 1-indexed in FiftyOne.

| Field | Type | Description |
| --- | --- | --- |
| frame_number | FrameNumberField | 1-indexed position within the video |
| location_x | FloatField | Cart X position in the store (metres, ≈ −20 to +20) |
| location_y | FloatField | Cart Y position in the store (metres, ≈ −9 to +8) |
| orientation_u | FloatField | Camera heading unit vector, X component |
| orientation_v | FloatField | Camera heading unit vector, Y component |
| heading_deg | FloatField | atan2(v, u) in degrees (see coordinate system below) |
| zone_id | IntField | Store location class (1–16) |

Train / test split#

| Split | Sequences | Frames | Duration at 15 fps |
| --- | --- | --- | --- |
| Train | 0, 1, 2, 4, 6, 8 | 13,360 | ≈ 15 min total |
| Test | 3, 5, 7 | 6,171 | ≈ 7 min total |

Sequences were recorded independently on different occasions. The split is at the sequence level, ensuring no temporal overlap between train and test.

Video encoding#

Videos are encoded with H.264 / yuv420p at 15 fps (original capture rate: 50 fps, so playback is ≈ 3× slower than real time). This makes the cart’s motion easy to inspect frame-by-frame in the FiftyOne App. Frame-level annotations are aligned to the 15 fps video — frame n in the video corresponds to row n in the sorted annotation file for that sequence.
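Given the 15 fps encoding and the original 50 fps capture rate, a 1-indexed frame number converts to a playback timestamp and a real (capture-time) timestamp as follows. A minimal sketch:

```python
PLAYBACK_FPS = 15  # frame rate of the encoded MP4s
CAPTURE_FPS = 50   # original camera capture rate

def playback_time_s(frame_number):
    """Seconds into the MP4 for a 1-indexed frame number."""
    return (frame_number - 1) / PLAYBACK_FPS

def capture_time_s(frame_number):
    """Seconds of real elapsed recording time for the same frame."""
    return (frame_number - 1) / CAPTURE_FPS

# Frame 151 sits 10 s into the video but only 3 s into the recording,
# i.e. playback runs 50/15 (about 3.3x) slower than real time
```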

Store coordinate system#

The coordinate origin is a fixed reference point inside the store. Axes are in metres:

| Axis | Range | Direction |
| --- | --- | --- |
| X | −20.2 to +19.8 m | Left → Right |
| Y | −9.5 to +8.0 m | Bottom → Top |

(orientation_u, orientation_v) is a 2D unit vector in the store’s XY floor plane (u² + v² ≈ 1.0). The derived field heading_deg = atan2(v, u) gives a compass-style angle: 0° = East (+X), 90° = North (+Y), ±180° = West (−X), −90° = South (−Y).
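The heading_deg field can be reproduced from the orientation components with atan2. A minimal sketch:

```python
import math

def heading_deg(u, v):
    """Compass-style heading from the (orientation_u, orientation_v)
    unit vector: 0 = East (+X), 90 = North (+Y), +/-180 = West (-X),
    -90 = South (-Y)."""
    return math.degrees(math.atan2(v, u))
```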

The store aisles run primarily North–South, so most frames show heading values near ±90°. The cart reverses direction between adjacent aisle lanes (serpentine route), which is clearly visible as alternating heading clusters in each sequence.

Store zone distribution#

Zones 1–9 are sequential single-aisle sections along the central spine from left to right. Zones 10–16 are larger structural zones (perimeter walkways, wide sections).

| Zone | Description | Train frames | Test frames |
| --- | --- | --- | --- |
| 1 | Left aisle zone 1 | 311 | 173 |
| 2 | Left aisle zone 2 | 244 | 131 |
| 3 | Left aisle zone 3 | 241 | 177 |
| 4 | Left aisle zone 4 | 278 | 117 |
| 5 | Left aisle zone 5 | 665 | 174 |
| 6 | Center aisle zone 6 | 403 | 217 |
| 7 | Center aisle zone 7 | 394 | 201 |
| 8 | Center aisle zone 8 | 266 | 190 |
| 9 | Right aisle zone 9 | 366 | 235 |
| 10 | Right half aisles | 1,722 | 751 |
| 11 | Back wall aisle (full width) | 1,610 | 658 |
| 12 | Right wall | 1,355 | 409 |
| 13 | Bottom right | 897 | 456 |
| 14 | Bottom left | 1,138 | 788 |
| 15 | Central bottom aisle (full width) | 2,668 | 1,130 |
| 16 | Left wall | 802 | 364 |

Zone distribution is imbalanced: zone 15 is the most frequent because the cart must traverse the full-width bottom aisle when moving between the two halves of the store. Zones 1–4 are the rarest.
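One common remedy for this imbalance when training a zone classifier is inverse-frequency class weighting. A sketch using the training counts from the table; the weighting scheme is a suggestion, not part of the dataset:

```python
# Training-frame counts per zone, taken from the table above
TRAIN_COUNTS = {
    1: 311, 2: 244, 3: 241, 4: 278, 5: 665, 6: 403, 7: 394, 8: 266,
    9: 366, 10: 1722, 11: 1610, 12: 1355, 13: 897, 14: 1138,
    15: 2668, 16: 802,
}

total = sum(TRAIN_COUNTS.values())  # 13,360 training frames

# weight_z = total / (num_classes * count_z): a perfectly balanced
# dataset would give every class a weight of 1.0
weights = {z: total / (len(TRAIN_COUNTS) * n) for z, n in TRAIN_COUNTS.items()}
```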

Dataset Creation#

Curation Rationale#

Indoor localization from images is a practically important but data-scarce problem. At the time of publication, existing datasets were too small, collected in corridor environments, or lacked ground-truth poses. EgoCart provides a large, real-world retail benchmark with accurate 3-DOF pose annotations to enable reproducible benchmarking of image retrieval and pose regression methods.

The video format presented here groups the original frame sequences into their natural temporal structure, making it straightforward to study motion context, temporal consistency of predictions, and zone transition behaviour.

Source Data#

Data Collection and Processing#

RGB-D cameras were mounted on shopping carts and driven through a real retail store in Italy. Cameras captured at 50 fps, consistent with the 20-unit (20 ms) increments in the numeric portion of each source filename. Ground-truth 3-DOF poses (x, y, u, v) were obtained through a separate localization system. Zone labels (1–16) were assigned based on which manually defined store region contains each frame's (x, y) position.

Who are the source data producers?#

The dataset was produced by researchers at the Image Processing Laboratory (IPLab), Department of Mathematics and Computer Science, University of Catania, Italy.

Annotations#

Annotation process#

Camera poses (x, y, u, v) were obtained via an automated localization pipeline. Zone labels correspond to manually defined rectangular floor regions; each frame receives the label of the zone containing its (x, y) position.
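The assignment rule described above amounts to a point-in-rectangle lookup. A sketch with hypothetical zone extents, since the true boundaries of the 16 regions are not published in this card:

```python
# HYPOTHETICAL zone rectangles as (x_min, x_max, y_min, y_max) in metres;
# the real extents of the 16 regions are not published in this card
ZONES = {
    16: (-20.2, -16.0, -9.5, 8.0),   # e.g. a left-wall strip
    15: (-16.0, 19.8, -9.5, -6.0),   # e.g. the full-width bottom aisle
}

def zone_for(x, y, zones=ZONES):
    """Return the zone_id whose rectangle contains (x, y), else None."""
    for zone_id, (x0, x1, y0, y1) in zones.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return zone_id
    return None
```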

Who are the annotators?#

Researchers at the University of Catania (Spera, Furnari, Battiato, Farinella).

Citation#

BibTeX:

@article{spera2021egocart,
  author  = {Spera, E. and Furnari, A. and Battiato, S. and Farinella, G. M.},
  title   = {{EgoCart}: A Benchmark Dataset for Large-Scale Indoor Image-Based Localization in Retail Stores},
  journal = {IEEE Transactions on Circuits and Systems for Video Technology},
  volume  = {31},
  number  = {4},
  pages   = {1253--1267},
  year    = {2021},
  month   = apr,
  doi     = {10.1109/TCSVT.2019.2941040}
}

APA:

Spera, E., Furnari, A., Battiato, S., & Farinella, G. M. (2021). EgoCart: A benchmark dataset for large-scale indoor image-based localization in retail stores. IEEE Transactions on Circuits and Systems for Video Technology, 31(4), 1253–1267. https://doi.org/10.1109/TCSVT.2019.2941040