Note
This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.
KITScenes Multimodal — FiftyOne Dataset#

A FiftyOne build of KITScenes Multimodal (KIT-MRT), a high-fidelity European urban autonomous-driving dataset. Each frame is a synchronized capture from a full robotaxi sensor suite — nine global-shutter cameras giving 360° coverage, seven long-range lidars, and three 4D imaging radars — paired with production-grade Lanelet2 HD-map labels, projected lidar depth, the future ego path, and image instance predictions.
This build packages those captures as a grouped FiftyOne dataset so every sensor for a given moment lives in one group, and the 3D lidar/radar point cloud sits alongside the camera images. The card below describes exactly what is in the dataset and how it is organized.
This is a grouped FiftyOne dataset with 680 groups (6,800 samples across all slices).
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/kitscenes-multimodal")
# Launch the App
session = fo.launch_app(dataset)
At a glance#
| Property | Value |
|---|---|
| Dataset name | |
| Media type | |
| Samples | 6,800 |
| Frames (groups) | 680 |
| Scenes | 4 (validation split) |
| Frames per scene | 100 / 100 / 200 / 280 |
| Group slices | 9 cameras + 1 fused 3D lidar slice |
| Capture rate | 10 Hz |
| Region | Frankfurt, Germany (European urban) |
| License | CC-BY-NC-4.0 |
A group corresponds to one timestamped frame and holds 10 samples: the 9 camera images plus the fused 3D point cloud. With 680 groups that gives 6,120 image samples + 680 3D samples = 6,800 total.
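The counts above can be checked with a few lines of arithmetic. This is only a worked example using the figures from the table; the scene durations assume frames are evenly spaced at the stated 10 Hz capture rate, which the card does not state explicitly.

```python
# Figures copied from the "At a glance" table.
FRAMES_PER_SCENE = [100, 100, 200, 280]
CAMERA_SLICES = 9
POINT_CLOUD_SLICES = 1
CAPTURE_RATE_HZ = 10

# One group per timestamped frame.
groups = sum(FRAMES_PER_SCENE)

# Each group holds one sample per slice.
image_samples = groups * CAMERA_SLICES
cloud_samples = groups * POINT_CLOUD_SLICES
total_samples = image_samples + cloud_samples

# Approximate scene length in seconds (assumption: evenly spaced frames).
scene_seconds = [n / CAPTURE_RATE_HZ for n in FRAMES_PER_SCENE]

print(groups, image_samples, cloud_samples, total_samples)
print(scene_seconds)
```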
Dataset sources#
Curated by: the KITScenes team at the Institute of Measurement and Control Systems (MRT), Karlsruhe Institute of Technology (KIT), and the FZI Research Center for Information Technology — Richard Schwarzkopf and Fabian Immel (joint first authors), Jan-Hendrik Pauls (project lead), Christoph Stiller, and collaborators. This FiftyOne build was prepared by Harpreet Sahota (Voxel51).
Language: English
License: CC-BY-NC-4.0
| Resource | Link |
|---|---|
| Original dataset (Hugging Face) | |
| Single-scene preview (Hugging Face) | |
| Python API / devkit (GitHub) | |
| Paper | The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset, arXiv:2606.02956 |
| Project page | |
| This FiftyOne build | |
The kitscenes Python package on GitHub (the devkit) is the official loader for the
sensor, calibration, and map data; this FiftyOne build uses it to decode and project
the geometry and labels.
Dataset structure#
Group slices#
The dataset is grouped on the group field. Each frame contains the following
slices (the slice name doubles as the sensor name in the sensor field). The
default slice shown in the App is camera_ring_front.
| Slice | Media | Role |
|---|---|---|
| camera_ring_front | image | Forward ring camera (default view) |
| | image | Ring camera, front-left |
| | image | Ring camera, front-right |
| | image | Rear ring camera |
| | image | Ring camera, rear-left |
| | image | Ring camera, rear-right |
| | image | High-resolution long-range front camera |
| | image | Rectified front stereo, left |
| | image | Rectified front stereo, right |
| | 3d | Fused point cloud: 7 lidars + 3 radars, in the ego frame |
The six camera_ring_* slices form the 360° surround view; the three
camera_base_* slices are the long-range and stereo cameras.
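Because the two camera families are distinguished by prefix, slice names can be partitioned with a small helper. The sketch below is illustrative only: `split_slices` is a hypothetical helper, and every name in `example_names` other than camera_ring_front is a placeholder rather than a real slice name. With the dataset loaded, the real names should be available from the `dataset.group_slices` property.

```python
from typing import Dict, Iterable, List

RING_PREFIX = "camera_ring_"   # 360-degree surround cameras
BASE_PREFIX = "camera_base_"   # long-range and stereo cameras


def split_slices(slice_names: Iterable[str]) -> Dict[str, List[str]]:
    """Partition group-slice names into ring cameras, base cameras, and others."""
    buckets: Dict[str, List[str]] = {"ring": [], "base": [], "other": []}
    for name in slice_names:
        if name.startswith(RING_PREFIX):
            buckets["ring"].append(name)
        elif name.startswith(BASE_PREFIX):
            buckets["base"].append(name)
        else:
            buckets["other"].append(name)
    return buckets


# Placeholder names for illustration; only camera_ring_front comes from the card.
example_names = [
    "camera_ring_front",
    "camera_ring_placeholder",
    "camera_base_placeholder",
    "pointcloud_placeholder",
]
buckets = split_slices(example_names)
print(buckets)
```

In practice one would call `split_slices(dataset.group_slices)` and pass the resulting names to `dataset.select_group_slices(...)` to work with one camera family at a time.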
Sample-level fields#
These fields are present on every sample (cameras and the 3D slice), giving each sample its scene context, timing, and ego pose.
| Field | Type | Description |
|---|---|---|
| | string | UUID of the source scene |
| | int | Frame index within the scene (0-based) |
| | float | Reference timestamp (seconds) |
| | string | Sensor / slice name |
| | list[float] | Ego position |
| | list[float] | Ego orientation |
| | float | Ego heading (degrees) |
| | | GNSS longitude/latitude |
| | float | GNSS altitude (meters) |
| | int | GNSS fix-status code |
| | float | Ego speed from GNSS twist (m/s) |
The per-frame ego pose plus GNSS together give the full car trajectory — the sequence of ego positions and headings over each scene.
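As a rough illustration of working with such a trajectory, the distance travelled in a scene can be approximated by summing straight-line steps between consecutive ego positions. This is a minimal sketch: `path_length` is a hypothetical helper, the sample coordinates are made up, and it assumes the positions are Cartesian coordinates in a single fixed frame, ordered by frame index within one scene and taken from a single slice.

```python
import math
from typing import Sequence


def path_length(positions: Sequence[Sequence[float]]) -> float:
    """Sum of straight-line distances between consecutive ego positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


# Made-up positions for illustration (not values from the dataset).
example_positions = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [3.0, 4.0, 12.0],
]
print(path_length(example_positions))
```

With the dataset loaded, the list of positions would come from the ego position field of one slice, for example via `dataset.values(...)` on a view filtered to a single scene and sorted by frame index.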
Camera slices additionally carry:
| Field | Type | Description |
|---|---|---|
| | dict | Pinhole intrinsics (focal length, principal point) |
| | dict | Image |
Label fields#
Labels are attached per camera slice; not every label exists on every camera. The table shows where each one is populated.
| Field | FiftyOne type | Where | What it is |
|---|---|---|---|
| lidar_depth | | all 9 cameras | Fused lidar depth projected into the image, encoded as an 8-bit depth heatmap (near→far) |
| hd_map | | 6 ring cameras | Lanelet2 HD-map elements reprojected into the image (lane markings, borders, road markings, poles, traffic signs, traffic lights) |
| ego_trajectory | | | The vehicle’s future path (ego waypoints) projected onto the road ahead, label |
| seamseg | | front and rear ring cameras | Instance predictions (boxes + masks) in the Mapillary-Vistas taxonomy |
hd_map polylines carry a top-level label (the coarse category) and a subtype
attribute holding the fine-grained Lanelet2 class (e.g. lane-marking style, or the
specific German traffic-sign code such as de206).
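The label/subtype pairing lends itself to a simple tally of map classes. The sketch below is not part of the dataset's tooling: `count_hd_map_classes` is a hypothetical helper, and the stand-in objects only mimic polylines that expose a `label` and an optional `subtype` attribute. It assumes the real polylines (for example from `sample.hd_map.polylines`) expose `subtype` through attribute access.

```python
from collections import Counter
from types import SimpleNamespace


def count_hd_map_classes(polylines):
    """Count (label, subtype) pairs over an iterable of polyline-like objects."""
    counts = Counter()
    for polyline in polylines:
        # A missing subtype is recorded as None rather than raising.
        subtype = getattr(polyline, "subtype", None)
        counts[(polyline.label, subtype)] += 1
    return counts


# Stand-in polylines for illustration; categories are taken from the card.
example_polylines = [
    SimpleNamespace(label="traffic_sign", subtype="de206"),
    SimpleNamespace(label="lane_marking", subtype="dashed"),
    SimpleNamespace(label="lane_marking", subtype="dashed"),
    SimpleNamespace(label="pole"),
]
counts = count_hd_map_classes(example_polylines)
print(counts.most_common())
```

For dataset-wide counts of the coarse category alone, FiftyOne's `count_values` aggregation on the polyline label path is likely the more direct route.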
The 3D lidar slice#
The lidar slice is a single .fo3d scene per frame that fuses seven lidars and
three radars into one ego-frame point cloud (lidar sweeps are motion-deskewed;
radar detections are ego-motion compensated). Points are shaded by intensity in the
App. The point clouds carry these per-point scalar fields:
Lidar points: intensity (reflectivity) and isground (per-point ground flag from ground segmentation).
Radar points: intensity (RCS) and range_rate (Doppler velocity).
Saved views#
Three dynamic grouped views ship with the dataset for browsing:
| View | What it shows |
|---|---|
| | The forward ring camera, grouped by |
| | The rear ring camera, grouped by |
| | The fused lidar slice grouped by |
Label taxonomies#
HD map (hd_map) categories: lane_marking, road_marking, road_border,
pole, traffic_sign, traffic_light. Each polyline’s subtype holds the
detailed Lanelet2 class — lane-marking styles (e.g. dashed, solid,
dashed_solid) and the fine-grained German traffic-sign codes (de…).
Instance predictions (seamseg) classes: Mapillary-Vistas “thing” classes,
including Car, Truck, Bus, Bicycle, Motorcycle, Trailer,
Other Vehicle, Person, Bicyclist, Motorcyclist, Other Rider,
Traffic Light, Traffic Sign (Front), Traffic Sign (Back),
Traffic Sign Frame, Pole, Utility Pole, Street Light, Bench,
Billboard, Banner, Bike Rack, Trash Can, Mailbox, Fire Hydrant,
Junction Box, Catch Basin, Manhole, Phone Booth, CCTV Camera, Bird,
Wheeled Slow, Crosswalk - Plain, Lane Marking - Crosswalk.
Uses#
This FiftyOne build is suited to:
Multimodal browsing and curation: inspect all 9 cameras and the fused point cloud for any frame, side by side.
HD-map perception: the hd_map polylines provide reprojection-accurate Lanelet2 map labels aligned to image pixels.
Long-range depth: lidar_depth heatmaps provide dense, long-range depth ground truth (the source lidar reaches beyond 400 m).
Trajectory / motion work: per-frame ego pose plus the projected ego_trajectory future path.
2D object analysis: the seamseg instance detections on the front and rear ring cameras.
Out-of-scope#
This is an early-release preview subset (4 validation scenes). It is meant for
exploration and pipeline development, not final benchmark reporting. The build also
does not include 3D bounding boxes, tracks, or instance segmentation for
dynamic agents (the source dataset omits these in the current release). The
seamseg detections are model predictions, not human annotations.
Source data#
KITScenes Multimodal was recorded across Karlsruhe, Frankfurt, and Sindelfingen by
the Institute of Measurement and Control Systems (MRT) at the Karlsruhe Institute of
Technology (KIT). The scenes here are from the validation split (Frankfurt). Camera
imagery is anonymized (faces and license plates). Geometry and label projections in
this build are produced with the official kitscenes Python API. See
Dataset sources above for the original dataset, devkit, paper,
and project-page links.
Citation#
@misc{schwarzkopf2026kitscenes,
title={The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset},
author={Richard Schwarzkopf and Fabian Immel and Alexander Blumberg and Jonas Merkert and Nils Rack and Kaiwen Wang and Fabian Konstantinidis and Julian Truetsch and Carlos Fernandez and Annika Bätz and Kevin Rösch and Marlon Steiner and Willi Poh and Yinzhe Shen and Royden Wagner and Felix Hauser and Dominik Strutz and Jaime Villa and Gleb Stepanov and Holger Caesar and Ömer Şahin Taş and Frank Bieder and Jan-Hendrik Pauls and Christoph Stiller},
year={2026},
eprint={2606.02956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.02956},
}
License#
Released under CC-BY-NC-4.0, matching the source dataset’s terms. Non-commercial use only; attribution required.