Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for SceneFun3D

SceneFun3D is a 3D scene-understanding dataset of high-resolution Faro laser-scan point clouds of indoor environments, densely annotated with fine-grained functional interactive elements (handles, knobs, buttons, switches, …), their affordances, motion parameters, and free-form task descriptions. Each scene is also captured by several iPad video sequences with RGB, depth, camera poses, and intrinsics.

This is the FiftyOne version of the dataset: a grouped multimodal dataset where each scene is a group containing the scene’s FO3D laser-scan point cloud (with 3D functional elements) plus one video slice per iPad recording (ipad_1, ipad_2, …). The video frames carry per-frame depth (as Heatmap labels), camera poses, and intrinsics, and the 3D functional elements are projected into the frames as 2D detections + keypoints, linked back to the 3D boxes via fo.Instance.

This dataset was created with FiftyOne and can be loaded and visualized in the FiftyOne App (3D viewer for the point cloud, video player for the iPad sequences).

Installation

pip install -U fiftyone

Usage

Download the dataset snapshot from the Hugging Face Hub, load it into FiftyOne, and launch the App:

import fiftyone as fo
from huggingface_hub import snapshot_download

# Download the dataset snapshot to the current working directory
snapshot_download(
    repo_id="Voxel51/SceneFun3D",
    local_dir=".",
    repo_type="dataset",
)

# Load the dataset from the current directory using FiftyOne's native format
dataset = fo.Dataset.from_dir(
    dataset_dir=".",  # The snapshot was downloaded here
    dataset_type=fo.types.FiftyOneDataset,  # FiftyOne's native dataset format
    name="SceneFun3D",  # Name under which the dataset is registered
)

# Launch the App
session = fo.launch_app(dataset)

Dataset Details

Dataset Description

SceneFun3D targets fine-grained functionality and affordance understanding in 3D scenes: beyond recognizing objects, it localizes the small interactive parts a person actually manipulates (a drawer handle, a light switch, a stove knob) and describes how to interact with them. The full dataset (per the paper) provides more than 14.8k (14,867) functional interactive element annotations across 710 high-resolution real-world indoor scenes, with 9 Gibsonian-inspired affordance categories, motion parameters for 14,279 elements (8,325 translational, 6,542 rotational), and natural-language task descriptions for 10,913 elements (17,133 including automated rephrasings). Each scene is a combined, 5mm-voxel-downsampled Faro laser scan (several million points); functional elements are annotated as point-index masks on that scan.

In this FiftyOne build, every scene becomes one FO3D point cloud, each functional element becomes a 3D Detection (axis-aligned box from the masked points) carrying its affordance and motion, and each scene’s iPad recordings are video slices with the elements projected into the frames (see Dataset Structure).
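The step from a point-index mask to an axis-aligned box can be sketched as follows. This is a minimal illustration and not the build script itself: the function name `aabb_from_mask` and the toy point list are assumptions, and only the center/extent computation described above is shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def aabb_from_mask(
    points: Sequence[Point], mask_indices: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Return (location, dimensions) of the axis-aligned box around the masked points.

    `location` is the box center and `dimensions` is the per-axis extent,
    both in the coordinate frame of `points`.
    """
    if not mask_indices:
        raise ValueError("mask_indices must not be empty")

    selected = [points[i] for i in mask_indices]
    mins = [min(p[axis] for p in selected) for axis in range(3)]
    maxs = [max(p[axis] for p in selected) for axis in range(3)]

    location = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    dimensions = [hi - lo for lo, hi in zip(mins, maxs)]
    return location, dimensions


# Toy scan (hypothetical values): indices 1-3 play the role of an element mask
scan = [
    (0.0, 0.0, 0.0),
    (1.0, 2.0, 0.5),
    (1.2, 2.1, 0.7),
    (1.1, 2.4, 0.6),
    (5.0, 5.0, 5.0),
]
location, dimensions = aabb_from_mask(scan, [1, 2, 3])
```

Because the box is axis-aligned, no orientation is estimated, which is consistent with the zero rotation stored on each detection.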

  • Curated by: Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann (ETH Zurich, Google, TU Munich, Microsoft). Built on top of ARKitScenes.

  • Funded by: A Career Seed Award from the ETH Zurich Foundation and an Innosuisse grant (48727.1 IP-ICT); AD supported by a HELLENiQ ENERGY scholarship.

  • Shared by: SceneFun3D authors (ETH Zurich CVG release mirror).

  • Language(s): English (task descriptions).

  • License: Non-commercial research use, inherited from ARKitScenes (CC BY-NC-SA 4.0).

Dataset Sources

  • Repository: https://github.com/SceneFun3D/scenefun3d

  • Paper: Delitzas et al. “SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes.” CVPR 2024 (Oral).

  • Demo: https://scenefun3d.github.io

Uses

Direct Use

  • Functional interactive element detection / segmentation in 3D point clouds.

  • Affordance grounding (predicting the affordance class of interactive parts).

  • Task-driven affordance grounding: localizing the 3D element that satisfies a natural-language instruction (“open the drawer next to the sink”).

  • Motion estimation for articulated/interactive parts (axis, direction, type).

  • Robotics and embodied-AI research on manipulation target selection.

Dataset Structure

This is a grouped dataset (media_type = "group") where the group is one scene (visit_id). Each group has:

  • laser_scan (3d/FO3D) - the scene’s Faro point cloud (RGB-shaded) carrying the 3D functional_elements, objects_3d, and tasks (one per scene).

  • ipad_1, ipad_2, … (video) - one slice per iPad recording of the scene (high-res RGB, 1920x1440, ~10 FPS, re-encoded to H.264 MP4), with per-frame depth, pose, intrinsics, and the 3D elements/objects projected into the frame. Scenes have ~2-3 recordings; positional slices are populated up to that count (a 2-recording scene leaves ipad_3 empty).

The default slice is ipad_1. This build samples 10 scenes from each of the train / val / test splits (30 scenes), and every sample is tagged with its split (train / val / test). Image/video/scene metadata is computed for all slices.

Note: the test split’s functional annotations are withheld by the benchmark, so test-split groups have the point cloud + video slices (and ARKit objects_3d where available) but no functional_elements / tasks / projected functional labels.
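As a rough illustration of how the group structure can be queried, the helper below counts affordance labels on the laser_scan slice, optionally restricted to one split tag. The helper name `affordance_counts` is hypothetical; it relies only on the FiftyOne collection methods `select_group_slices()`, `match_tags()`, and `count_values()`, and on the field names described in this card.

```python
from typing import Dict, Optional


def affordance_counts(dataset, split: Optional[str] = None) -> Dict[str, int]:
    """Count functional-element labels on the laser_scan slice.

    `dataset` is expected to be the grouped FiftyOne dataset described above;
    `split` is one of the sample tags ("train", "val", "test") or None for all.
    """
    # Flatten the grouped dataset to the samples of a single slice
    view = dataset.select_group_slices("laser_scan")

    # Splits are stored as sample tags in this build
    if split is not None:
        view = view.match_tags(split)

    # Detections fields store their labels under `<field>.detections`
    return view.count_values("functional_elements.detections.label")
```

With the dataset loaded as in the Usage section, `affordance_counts(dataset, split="train")` should return a mapping from affordance class to count. Since test-split annotations are withheld, the same call with `split="test"` is expected to find no functional-element labels.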

Sample fields (by slice)

Shared:

| Field | FiftyOne type | Description |
| --- | --- | --- |
| filepath | StringField | .mp4 video (ipad_N) or .fo3d scene (laser_scan). |
| group | Group | Group membership + slice name. |
| visit_id | StringField | 6-digit scene identifier (verbatim). |
| tags | ListField(StringField) | Source split of the sample (train / val / test). |
| metadata | SceneMetadata / VideoMetadata | Computed media metadata (size, and frame count / dimensions for videos). |

laser_scan slice:

| Field | FiftyOne type | Description |
| --- | --- | --- |
| functional_elements | Detections | 3D functional interactive elements (one Detection per annotation), each linked to its 2D projections via fo.Instance. |
| objects_3d | Detections | ARKit room-level object boxes (e.g. bed, cabinet, shelf, tv_monitor), aligned from the ARKit frame into the laser-scan frame; each linked to its 2D projection via fo.Instance. |
| tasks | ListField(StringField) | All natural-language task descriptions for the scene. |

ipad_N slices (one video sample per recording):

| Field | FiftyOne type | Description |
| --- | --- | --- |
| video_id | StringField | 8-digit iPad sequence identifier (verbatim) of this recording. |
| frames[n].timestamp | FloatField | Capture timestamp of the frame. |
| frames[n].depth | Heatmap | Per-frame depth map (map_path to the source depth PNG in mm, range in mm). |
| frames[n].intrinsics | DictField | Per-frame camera intrinsics {width, height, fx, fy, cx, cy}. |
| frames[n].camera_pose | ListField | 4x4 camera-to-world pose (COLMAP, laser-scan frame), nearest-timestamp matched. |
| frames[n].projected_elements | Detections | 2D boxes of the functional elements visible in the frame (only on frames where an element projects); instance links each back to its 3D box. |
| frames[n].projected_points | Keypoints | The projected (subsampled) mask points of each visible element; same instance linkage. |
| frames[n].projected_objects | Detections | 2D boxes of the ARKit room-level objects visible in the frame; instance links each back to its objects_3d box. |
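The per-frame pose and intrinsics are what make the 2D projections possible. The sketch below shows one plausible way to combine them: pick the pose nearest in time to a frame, invert the camera-to-world transform, and apply a pinhole projection. It assumes an undistorted pinhole model with the camera looking along +z (the usual COLMAP/OpenCV convention); the function names and all numeric values are illustrative and not taken from the build script or the dataset.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def nearest_index(timestamps: Sequence[float], query: float) -> int:
    """Index of the timestamp closest to `query` (nearest-timestamp matching)."""
    return min(range(len(timestamps)), key=lambda i: abs(timestamps[i] - query))


def world_to_camera(pose_c2w: Matrix, point_world: Sequence[float]) -> List[float]:
    """Invert a rigid 4x4 camera-to-world pose: p_cam = R^T (p_world - t)."""
    rot = [row[:3] for row in pose_c2w[:3]]
    trans = [row[3] for row in pose_c2w[:3]]
    diff = [point_world[i] - trans[i] for i in range(3)]
    # Multiplying by R^T means summing over the rows of R for each column
    return [sum(rot[r][c] * diff[r] for r in range(3)) for c in range(3)]


def project_to_pixel(
    point_cam: Sequence[float], intrinsics: Dict[str, float]
) -> Optional[Tuple[float, float]]:
    """Pinhole projection; returns None if the point is behind the camera or off-image."""
    x, y, z = point_cam
    if z <= 0:
        return None
    u = intrinsics["fx"] * x / z + intrinsics["cx"]
    v = intrinsics["fy"] * y / z + intrinsics["cy"]
    if 0 <= u < intrinsics["width"] and 0 <= v < intrinsics["height"]:
        return (u, v)
    return None


def translation_pose(tx: float, ty: float, tz: float) -> Matrix:
    """Helper for the example: identity rotation plus a translation."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Illustrative values only
pose_timestamps = [0.0, 0.1, 0.2]
poses = [translation_pose(0.0, 0.0, 0.0), translation_pose(0.5, 0.0, 0.0), translation_pose(1.0, 0.0, 0.0)]
intrinsics = {"width": 1920, "height": 1440, "fx": 1000.0, "fy": 1000.0, "cx": 960.0, "cy": 720.0}

matched = nearest_index(pose_timestamps, 0.16)
point_cam = world_to_camera(poses[matched], [1.5, 0.25, 2.0])
pixel = project_to_pixel(point_cam, intrinsics)
```

FiftyOne stores 2D boxes and keypoints in relative coordinates in [0, 1], so pixel coordinates like these would be divided by the frame width and height before being written to a label.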

functional_elements detection attributes

Each Detection in functional_elements carries:

| Attribute | Type | Description |
| --- | --- | --- |
| label | str | Affordance class of the element (e.g. rotate, key_press, tip_push, hook_turn, pinch_pull, plug_in, unplug). |
| location | [x, y, z] | Center of the axis-aligned 3D box, in the Faro laser-scan coordinate frame. |
| dimensions | [dx, dy, dz] | Box size, derived from the extent of the masked points. |
| rotation | [0, 0, 0] | Axis-aligned boxes (no orientation estimated from the mask). |
| annot_id | str | Source annotation UUID. |
| num_points | int | Number of laser-scan points in the element’s index mask. |
| descriptions | list[str] | Task instructions that reference this element. |
| motion_type | str | trans (translation) or rot (rotation). |
| motion_dir | [x, y, z] | Motion direction vector. |
| motion_origin | [x, y, z] | Motion origin point (laser-scan coordinate of motion_origin_idx). |
| motion_viz_orient | str | inwards / outwards orientation hint for visualizing the motion. |
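To make the motion attributes concrete, the sketch below moves a single point according to motion_type, motion_dir, and motion_origin: a shift along the direction for trans, and a rotation about the axis through the origin (Rodrigues' formula) for rot. This is a geometric illustration under assumptions: the function name `apply_motion` and the `amount` parameter (a distance for trans, an angle in radians for rot) are hypothetical, and the dataset's sign conventions and the role of motion_viz_orient are not modeled.

```python
import math
from typing import List, Sequence


def apply_motion(
    point: Sequence[float],
    motion_type: str,
    motion_dir: Sequence[float],
    motion_origin: Sequence[float],
    amount: float,
) -> List[float]:
    """Move `point` by `amount` along (trans) or around (rot) the motion axis."""
    norm = math.sqrt(sum(c * c for c in motion_dir))
    if norm == 0:
        raise ValueError("motion_dir must be non-zero")
    k = [c / norm for c in motion_dir]  # unit axis / direction

    if motion_type == "trans":
        # Translation does not depend on the origin
        return [p + amount * a for p, a in zip(point, k)]

    if motion_type == "rot":
        # Rodrigues' rotation of (point - origin) about the unit axis k
        v = [p - o for p, o in zip(point, motion_origin)]
        cos_a, sin_a = math.cos(amount), math.sin(amount)
        k_cross_v = [
            k[1] * v[2] - k[2] * v[1],
            k[2] * v[0] - k[0] * v[2],
            k[0] * v[1] - k[1] * v[0],
        ]
        k_dot_v = sum(a * b for a, b in zip(k, v))
        rotated = [
            v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1.0 - cos_a)
            for i in range(3)
        ]
        return [r + o for r, o in zip(rotated, motion_origin)]

    raise ValueError(f"unknown motion_type: {motion_type!r}")


# Illustrative values: slide 10 cm along z, and swing 90 degrees about a vertical hinge
slid = apply_motion([0.0, 0.0, 0.0], "trans", [0.0, 0.0, 2.0], [0.0, 0.0, 0.0], 0.1)
swung = apply_motion([2.0, 1.0, 0.0], "rot", [0.0, 0.0, 1.0], [1.0, 1.0, 0.0], math.pi / 2)
```

Applying the same function to every masked point of an element would give a simple preview of a drawer sliding out or a door swinging open.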

The label is one of the 9 Gibsonian-inspired affordance categories (paper Tab. 1):

  • rotate - adjusted by a rotary switch/knob (e.g. thermostat)

  • key_press - surfaces of keys that can be pressed (e.g. remote, keyboard)

  • tip_push - triggered by the tip of a finger (e.g. light switch)

  • hook_pull - pulled by hooking up fingers (e.g. fridge handle)

  • pinch_pull - pulled with a pinch movement (e.g. drawer knob)

  • hook_turn - turned by hooking up fingers (e.g. door handle)

  • foot_push - pushed by foot (e.g. trash-can pedal)

  • plug_in - electrical power sources

  • unplug - removing a plug from a socket

(The source also has an exclude category for elements whose geometry is poorly captured, e.g. reflective materials; it is a don’t-care mask, not an affordance, and is dropped here.)
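The label handling described above can be sketched as a small filter that keeps the nine affordance classes and sets aside the exclude category. The annotation layout used here (a dict with a "label" key) and the function name are hypothetical; only the label names come from this card.

```python
from typing import Dict, Iterable, List, Tuple

# The 9 affordance categories listed above
AFFORDANCE_LABELS = (
    "rotate",
    "key_press",
    "tip_push",
    "hook_pull",
    "pinch_pull",
    "hook_turn",
    "foot_push",
    "plug_in",
    "unplug",
)

# Don't-care category in the source annotations; not an affordance
DONT_CARE_LABEL = "exclude"


def split_annotations(
    annotations: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Separate affordance annotations from don't-care ones; reject unknown labels."""
    kept, dropped = [], []
    for ann in annotations:
        label = ann["label"]
        if label in AFFORDANCE_LABELS:
            kept.append(ann)
        elif label == DONT_CARE_LABEL:
            dropped.append(ann)
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return kept, dropped


# Hypothetical annotations for illustration
kept, dropped = split_annotations(
    [{"label": "rotate"}, {"label": "exclude"}, {"label": "hook_pull"}]
)
```

Raising on unknown labels, rather than silently dropping them, is a design choice of this sketch that makes typos in label names visible early.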

What is not ingested

  • Low-res iPad stream (lowres_wide / lowres_depth, 256x192 @ 60 FPS) is not imported; the hires stream is used as the single RGB video slice.

  • Remaining ARKit-legacy assets (arkit_mesh reconstruction, vga_wide, ultrawide camera streams) are available from the source but not imported here. (The ARKit 3dod_annotation objects and the Faro<->ARKit transform are ingested; see objects_3d.)

Dataset Creation

Curation Rationale

Most 3D scene datasets label whole objects or object parts, which is only an intermediate step toward agents that must actually interact with the functional elements (knobs, handles, buttons) to accomplish tasks. Commodity RGB-D reconstructions (ScanNet, Matterport) often fail to capture these small details, so SceneFun3D leverages high-resolution Faro laser scans. It is also the first dataset to link Gibsonian affordances (what an element affords, e.g. “press”) with telic affordances (the element’s purpose in scene context, e.g. “turn on the ceiling light”) via natural-language task descriptions, plus motion parameters describing how to interact.

Source Data

Data Collection and Processing

Scenes are built on ARKitScenes captures. For each scene, multiple Faro Focus S70 laser scans (four on average) are combined under a common coordinate frame and downsampled with a 5mm voxel size to preserve small functional parts while remaining tractable; extraneous points from transparent surfaces (e.g. windows) are removed with DBSCAN and flagged by a binary crop mask. Each scene is also accompanied by iPad Pro (2020) video sequences (three on average) with RGB, on-device LiDAR depth, and camera trajectory. Because the iPad data and the laser scan are in different coordinate frames, the authors register them (proxy high-resolution RGB-D reconstruction + Predator + multi-scale ICP) and provide per-frame camera poses via rigid-body motion interpolation in SO(3) x R^3. Each scene’s hires RGB-D recordings, poses, and intrinsics are ingested as the ipad_N video slices of its group.
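For intuition about the 5mm voxel downsampling step, here is a minimal sketch that buckets points into a regular grid and keeps one centroid per occupied voxel. It is not the authors' implementation: coordinates are assumed to be in meters, and representing each voxel by the centroid of its points is an assumption of this example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def voxel_downsample(points: Sequence[Point], voxel_size: float = 0.005) -> List[Point]:
    """Keep one point (the centroid) per occupied voxel of edge length `voxel_size`."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    buckets: Dict[Tuple[int, int, int], List[Point]] = defaultdict(list)
    for p in points:
        # Integer grid cell containing the point
        key = (
            math.floor(p[0] / voxel_size),
            math.floor(p[1] / voxel_size),
            math.floor(p[2] / voxel_size),
        )
        buckets[key].append(p)

    downsampled = []
    for pts in buckets.values():
        n = len(pts)
        downsampled.append(
            (
                sum(p[0] for p in pts) / n,
                sum(p[1] for p in pts) / n,
                sum(p[2] for p in pts) / n,
            )
        )
    return downsampled


# Two points fall in the same 5 mm voxel, the third lands in another one
cloud = [(0.001, 0.001, 0.001), (0.003, 0.003, 0.003), (0.012, 0.0, 0.0)]
reduced = voxel_downsample(cloud)
```

A smaller voxel size preserves more of the fine geometry of handles and knobs at the cost of more points per scene, which is the trade-off the paragraph above describes.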

The dataset’s official splits are 545 train / 80 val / 85 test scenes (710 total; ARKitScenes’ validation set is used as the test set since its test set is private). This FiftyOne build samples 10 scenes from each split as listed in the toolkit’s benchmark scene lists.

Who are the source data producers?

The underlying RGB-D captures and Faro laser scans come from ARKitScenes (Apple), recorded with a 2020 iPad Pro and a Faro Focus S70 laser scanner. The functional, motion, and language annotations were produced by the SceneFun3D authors and their annotation team.

Annotations

Annotation process

Annotations were collected with a custom lightweight web-based tool that supports point-accurate selection on dense high-resolution point clouds (accelerated by a Bounding Volume Hierarchy ray-caster, no GPU required), with the scene videos available to annotators for reference. For each functional interactive element, annotators (1) select a Gibsonian affordance label, (2) annotate the instance mask at single-point accuracy, (3) select the motion type (translational or rotational) with a motion-axis origin point and direction vector, and (4) provide free-form natural-language task descriptions that uniquely involve that element. Collected descriptions are additionally rephrased for diversity using OpenAI’s gpt-3.5-turbo-instruct and verified. Elements whose geometry (or whose parent object) is poorly captured (e.g. reflective materials) are labeled exclude and omitted from the benchmark evaluation.

Who are the annotators?

Human annotators organized by the SceneFun3D authors, using the custom web-based annotation tool. Task-description rephrasings are machine-generated (gpt-3.5-turbo-instruct) and human-verified.

Citation

BibTeX:

@inproceedings{delitzas2024scenefun3d,
  title={{SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes}},
  author={Delitzas, Alexandros and Takmaz, Ayca and Tombari, Federico and Sumner, Robert and Pollefeys, Marc and Engelmann, Francis},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

APA:

Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., & Engelmann, F. (2024). SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

More Information

Built on ARKitScenes (https://github.com/apple/ARKitScenes). Toolkit and documentation: https://scenefun3d.github.io. This FiftyOne build downloads, per scene, the visit-level assets (laser scan, crop mask, annotations, descriptions, motions) and, per recording, the hires RGB / depth / intrinsics / poses from the SceneFun3D release mirror plus the ARKit 3dod_annotation and Faro<->ARKit transform (for objects_3d).