Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for RCS UTN Green Box (FiftyOne)#

rcs_utn_green_box is a grouped FiftyOne video dataset of a multi-view robot manipulation task — “pick the green box” — collected with the Robot Control Stack (RCS) ecosystem from the University of Technology Nuremberg. Each episode is a group with one synchronized video per camera, plus dense robot proprioception and action data on every frame.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("Voxel51/rcs_utn_green_box")
session = fo.launch_app(dataset)

Dataset Details#

Dataset Description#

Robot Control Stack (RCS) is a lean, modular ecosystem for robot learning at scale, with a unified interface for simulated and physical robots to facilitate sim-to-real transfer. This dataset captures a single cube-picking task recorded from five camera perspectives. Every frame carries joint states, end-effector poses, gripper state, actions, and the tracked cube pose. This is the kind of multi-view, multi-modal trajectory data used to train and evaluate Vision-Language-Action (VLA) policies.

This FiftyOne version is a grouped video dataset: each episode links the five camera streams so they can be scrubbed together in the App, with robot state and actions rendered as per-frame numeric fields.
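The per-frame fields can also be read back in Python. The following is a minimal sketch, assuming a video sample from this dataset whose frames carry the field names listed under Frame-level fields. `frame_series` is a hypothetical helper name, not part of FiftyOne.

```python
def frame_series(sample, field):
    """Collect one frame-level field as a list ordered by frame number.

    `sample` is assumed to be a FiftyOne video sample: its `frames`
    attribute maps 1-based frame numbers to frame documents.
    """
    series = []
    for frame_number in sorted(sample.frames.keys()):
        frame = sample.frames[frame_number]
        # Frame documents support dict-style field access
        series.append(frame[field])
    return series
```

For example, with `sample = dataset.first()` (a sample from the default `side_wide` slice), `frame_series(sample, "gripper")` would return the gripper state for each frame of that episode.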


FiftyOne Dataset Structure#

Dataset name: rcs_utn_green_box

Media type: group

Default group slice: side_wide

Summary#

| Property | Value |
| --- | --- |
| Groups (episodes) | 143 |
| Video samples (total) | 715 |
| Group slices | side_wide, wrist, side_right, bird_eye, side |
| Language instruction | pick the green box |
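These figures can be checked against a loaded copy of the dataset. The sketch below assumes `dataset` is the grouped dataset returned by `load_from_hub` in the Usage section. `summarize_groups` is a hypothetical helper name.

```python
def summarize_groups(dataset):
    """Return basic group statistics for a grouped FiftyOne dataset."""
    slices = list(dataset.group_slices)
    # Flatten every slice into one video view so all samples are counted
    all_videos = dataset.select_group_slices(slices)
    return {
        "default_slice": dataset.default_group_slice,
        "slices": slices,
        # A grouped dataset counts only the active slice, which gives one
        # sample per group (assuming every episode has that slice)
        "groups": len(dataset),
        "video_samples": all_videos.count(),
    }
```

If the summary above holds, this should report 143 groups and 715 video samples, that is, 143 episodes × 5 cameras.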

Groups and slices#

Each episode is one demonstration. The five linked slices are the camera perspectives recorded during that episode:

| Slice | Description |
| --- | --- |
| side_wide | Wide side view (default slice) |
| wrist | Wrist-mounted camera |
| side_right | Right-side view |
| bird_eye | Top-down bird’s-eye view |
| side | Side view |

Videos are encoded as H.264 / yuv420p (30 fps) from the source JPEG frames for in-App playback.
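To move from one camera view to the other views of the same episode, the group can be looked up by its ID. This is a rough sketch, assuming `dataset` is the grouped dataset from the Usage section and `sample` is any sample in it. `camera_files_for_episode` is a hypothetical helper name.

```python
def camera_files_for_episode(dataset, sample):
    """Map each camera slice of `sample`'s episode to its video file path."""
    # The group field name is looked up rather than hard-coded
    group_id = sample[dataset.group_field].id
    # get_group() returns a dict of slice name -> sample for that group
    group = dataset.get_group(group_id)
    return {slice_name: member.filepath for slice_name, member in group.items()}
```

Calling it with `dataset.first()` should yield one entry per slice, keyed by the slice names in the table above.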

Sample-level fields#

| Field | Type | Description |
| --- | --- | --- |
| episode_id | string | Episode identifier (from the source parquet shard) |
| camera | string | Camera/slice name for this sample |
| language_instruction | string | Natural-language task description |
| intrinsics | list | Camera intrinsics for this view |
| extrinsics | list | Camera extrinsics for this view |
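As an illustration of how the `intrinsics` field might be used, the sketch below projects a 3D point onto the image with a pinhole model. The layout of `intrinsics` is not documented here, so the code assumes a 3x3 camera matrix stored either as a nested list or as nine row-major values. It also assumes the point is already expressed in the camera frame and ignores skew and lens distortion. Both function names are hypothetical.

```python
def as_3x3(values):
    """Return `values` as three rows of three floats.

    Accepts a flat list of 9 numbers (row-major) or a nested 3x3 list.
    Both layouts are assumptions about how `intrinsics` is stored.
    """
    if len(values) == 9:
        flat = [float(v) for v in values]
        return [flat[0:3], flat[3:6], flat[6:9]]
    if len(values) == 3 and all(
        isinstance(row, (list, tuple)) and len(row) == 3 for row in values
    ):
        return [[float(v) for v in row] for row in values]
    raise ValueError("expected 9 values or a nested 3x3 list")


def project_point(intrinsics, point_cam):
    """Project a camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    K = as_3x3(intrinsics)
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    fx, fy = K[0][0], K[1][1]  # focal lengths in pixels
    cx, cy = K[0][2], K[1][2]  # principal point
    return (fx * x / z + cx, fy * y / z + cy)
```

A point given in world coordinates would first have to be transformed with `extrinsics`. Whether that field stores a world-to-camera or a camera-to-world transform is not stated in this card, so check it before relying on the result.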

Frame-level fields#

| Field | Type | Description |
| --- | --- | --- |
| step | int | Step index within the episode |
| timestamp | float | Frame timestamp |
| reward | float | Per-step reward |
| success | bool | Success flag |
| joints | list(float) | Robot joint positions |
| tquat | list(float) | End-effector pose (translation + quaternion) |
| xyzrpy | list(float) | End-effector pose (xyz + roll/pitch/yaw) |
| gripper | float | Gripper state |
| action_tquat | list(float) | Commanded end-effector action (translation + quaternion) |
| action_gripper | float | Commanded gripper action |
| cube_pos_tquat | list(float) | Tracked green-cube pose (translation + quaternion) |
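The `tquat` and `xyzrpy` fields describe the same kind of pose in two parameterizations. The sketch below shows how one could be converted to the other. It assumes `tquat` holds seven values ordered `[x, y, z, qx, qy, qz, qw]` and that roll/pitch/yaw follow the common ZYX (yaw, then pitch, then roll) convention in radians. Neither convention is confirmed by this card, so compare the output against the stored `xyzrpy` values before trusting it. `tquat_to_xyzrpy` is a hypothetical helper name.

```python
import math


def tquat_to_xyzrpy(tquat):
    """Convert [x, y, z, qx, qy, qz, qw] to [x, y, z, roll, pitch, yaw].

    Quaternion order (scalar last) and the ZYX Euler convention are
    assumptions, not documented properties of the dataset.
    """
    x, y, z, qx, qy, qz, qw = tquat
    norm = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    if norm == 0.0:
        raise ValueError("quaternion has zero norm")
    qx, qy, qz, qw = qx / norm, qy / norm, qz / norm, qw / norm

    roll = math.atan2(2.0 * (qw * qx + qy * qz), 1.0 - 2.0 * (qx * qx + qy * qy))
    # Clamp to guard against rounding slightly outside [-1, 1]
    sin_pitch = max(-1.0, min(1.0, 2.0 * (qw * qy - qz * qx)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (qw * qz + qx * qy), 1.0 - 2.0 * (qy * qy + qz * qz))
    return [x, y, z, roll, pitch, yaw]
```

If the converted angles do not match the stored `xyzrpy` field, the most likely causes are a scalar-first quaternion order or a different Euler convention.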

Citation#

@article{juelg2025rcs,
  title   = {Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale},
  author  = {J\"ulg, Tobias and Krack, Pierre and Bien, Seongjin and Blei, Yannik and Gamal, Khaled and Nakahara, Ken and Hechtl, Johannes and Calandra, Roberto and Burgard, Wolfram and Walter, Florian},
  journal = {arXiv preprint arXiv:2509.14932},
  year    = {2025}
}

License#

The source Robot Control Stack project is released under the Apache-2.0 License.