Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for Syn4D RGBD#


Syn4D is a large-scale, fully-synthetic multiview dataset of dynamic scenes designed to advance research in 4D reconstruction, depth estimation, 3D point tracking, novel-view synthesis, and human pose estimation. It provides dense, complete, and accurate geometric annotations — including per-pixel depth maps, multi-view camera trajectories, dense long-range 3D point tracks, and parametric SMPL-X human body annotations — across a diverse collection of indoor and outdoor environments rendered with Unreal Engine 5.

This FiftyOne dataset is a curated subset of the full Syn4D release, covering 3 scenes (45 sequences total) from the Hugging Face repository. Each sequence is available as 8 synchronised multi-camera video clips, paired with per-frame depth heatmaps, instance segmentation masks, and camera pose metadata. A fused coloured 3D point cloud reconstructed from all 8 views is also included for each sequence.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from huggingface_hub import snapshot_download

# Download the dataset snapshot to the current working directory
snapshot_download(
    repo_id="Voxel51/Syn4D_RGBD",
    local_dir=".",
    repo_type="dataset",
)

# Load the dataset from the current directory using FiftyOne's native format
dataset = fo.Dataset.from_dir(
    dataset_dir=".",  # the current directory contains the dataset files
    dataset_type=fo.types.FiftyOneDataset,  # FiftyOne's native dataset format
    name="Syn4D_RGBD",  # name under which the dataset is registered
)

# Launch the App
session = fo.launch_app(dataset)

Dataset Details#

Curated by: Visual Geometry Group (VGG), University of Oxford; Nanyang Technological University; Naver Labs Europe
Authors: Zeren Jiang*, Yushi Lan*, Yihang Luo, Yufan Deng, Zihang Lai, Edgar Sucar, Christian Rupprecht, Iro Laina, Diane Larlus, Chuanxia Zheng, Andrea Vedaldi (*equal contribution)
Funded by: Clarendon Scholarship, NTU SUG-NAP, NRF-NRFF17-2025-0009, ERC 101001212-UNION, EPSRC EP/Z001811/1 SYN3D
License: Licensed for AI training use (all 3D assets sourced from Unreal Fab store and Objaverse under AI-training-compatible licences)
Language: English (captions)
Paper: arXiv 2605.05207
Project page: https://jzr99.github.io/Syn4D/
Raw data repository: Syn4D/Syn4D_RGBD on Hugging Face

Scenes Included#

| Scene | Type | Sequences | Cameras | Approx. frames/clip |
|---|---|---|---|---|
| `bigoffice_v1` | Indoor office | 5 | 8 | 117–477 |
| `warehouse_group_static` | Indoor warehouse | 20 | 8 | 131–477 |
| `hospital` | Indoor hospital | 20 | 8 | 120–477 |
| Total | | 45 | 8 | |

Each sequence is rendered at 1280×720, 30 FPS with Unreal Engine 5’s Lumen global illumination.
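As a quick sanity check on the numbers above, the following sketch recomputes the sequence total, the number of video clips, and the approximate clip durations at 30 FPS. The dictionary simply restates the scene table; the variable and function names are illustrative and not part of the dataset.

```python
FPS = 30      # frame rate stated in this card
CAMERAS = 8   # camera views per sequence

# Restates the scene table: sequence count and approximate frame range per clip
scenes = {
    "bigoffice_v1": {"sequences": 5, "frames": (117, 477)},
    "warehouse_group_static": {"sequences": 20, "frames": (131, 477)},
    "hospital": {"sequences": 20, "frames": (120, 477)},
}


def clip_seconds(frames, fps=FPS):
    """Duration of a clip in seconds for a given frame count."""
    return frames / fps


total_sequences = sum(s["sequences"] for s in scenes.values())
total_videos = total_sequences * CAMERAS

print(total_sequences)  # 45
print(total_videos)     # 360
for name, info in scenes.items():
    shortest, longest = info["frames"]
    print(f"{name}: {clip_seconds(shortest):.1f} to {clip_seconds(longest):.1f} s")
```

Under these figures, clips run from roughly 4 to 16 seconds.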

FiftyOne Dataset Structure#

Group structure#

The dataset is a grouped video dataset. Each group represents one scene sequence and contains 9 slices:

| Slice | Media type | Description |
|---|---|---|
| `cam_0` … `cam_7` | `video` | 8 synchronised camera views (MP4, 1280×720, 30 fps) |
| `pointcloud` | `3d` | Fused coloured point cloud from all 8 views (fo3d + PLY) |
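A minimal sketch of working with the slices listed above. The helper `select_camera` is a hypothetical convenience wrapper, not part of FiftyOne or this dataset. It relies only on FiftyOne's `select_group_slices()` method and assumes `dataset` is the grouped dataset loaded in the Usage section.

```python
# Slice names as listed in the table above
CAMERA_SLICES = [f"cam_{i}" for i in range(8)]
ALL_SLICES = CAMERA_SLICES + ["pointcloud"]


def select_camera(dataset, camera_index):
    """Return a flat view containing only the videos from one camera slice.

    `dataset` is expected to be the grouped FiftyOne dataset; `camera_index`
    must be in the range 0-7.
    """
    if not 0 <= camera_index < len(CAMERA_SLICES):
        raise ValueError(f"camera_index must be in 0..7, got {camera_index}")
    return dataset.select_group_slices(CAMERA_SLICES[camera_index])
```

For example, `select_camera(dataset, 3)` would yield one video sample per sequence, all taken from `cam_3`.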

Sample-level fields#

| Field | Type | Description |
|---|---|---|
| `scene_name` | `str` | Scene identifier (`bigoffice_v1`, `warehouse_group_static`, `hospital`) |
| `sequence_id` | `str` | Sequence identifier within the scene (e.g. `seq_000000`) |
| `camera_index` | `int` | Camera index 0–7 (video slices only) |
| `global_caption` | `str` | One natural-language description of the entire clip, generated by Tarsier2-7B |
| `local_captions` | `TemporalDetections` | Per-window captions covering 81-frame segments; each `TemporalDetection` has `label` (the caption text), `support` (1-indexed frame range), and `chunk` (original range string e.g. `"0-80"`) |
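The relationship between `chunk` and `support` can be illustrated with a small helper. This is a sketch that assumes the `chunk` string is a 0-indexed, inclusive frame range (which matches `"0-80"` covering 81 frames); the function name `chunk_to_support` is hypothetical.

```python
def chunk_to_support(chunk):
    """Convert a 0-indexed inclusive range string such as "0-80" into a
    1-indexed [first, last] frame support.

    Assumes the chunk format is "<start>-<end>" with non-negative integers.
    """
    first, last = (int(part) for part in chunk.split("-"))
    if first < 0 or last < first:
        raise ValueError(f"invalid chunk range: {chunk!r}")
    return [first + 1, last + 1]


print(chunk_to_support("0-80"))  # [1, 81]
```

The shift by one reflects that FiftyOne video frame numbers start at 1, whereas the original range strings start at 0.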

Frame-level fields (video slices)#

| Field | Type | Description |
|---|---|---|
| `depth` | `Heatmap` | Per-frame depth map (uint16 PNG on disk, `range=[0, 30000]` cm). Lighter = farther. |
| `segmentation` | `Segmentation` | Per-frame integer segmentation mask (uint8 PNG on disk). See mask targets below. |
| `cam_x`, `cam_y`, `cam_z` | `float` | Camera world position in Unreal Engine units (centimetres) |
| `cam_yaw`, `cam_pitch`, `cam_roll` | `float` | Camera rotation in degrees (Unreal Engine convention) |
| `focal_length` | `float` | Focal length in millimetres |
| `hfov` | `float` | Horizontal field of view in degrees |
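The `hfov` and `depth` fields are enough to back-project a pixel into camera space. The sketch below uses the standard pinhole relation between horizontal field of view and focal length in pixels. It makes several assumptions that are not confirmed by this card: square pixels, a principal point at the image centre, stored depth values equal to centimetres, and depth measured along the optical axis rather than along the viewing ray. The function names are illustrative.

```python
import math

WIDTH, HEIGHT = 1280, 720  # render resolution stated in this card


def intrinsics_from_hfov(hfov_deg, width=WIDTH, height=HEIGHT):
    """Pinhole intrinsics (fx, fy, cx, cy) in pixels.

    Assumes square pixels and a principal point at the image centre.
    """
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    return fx, fx, width / 2.0, height / 2.0


def unproject(u, v, depth_cm, hfov_deg):
    """Back-project pixel (u, v) to camera-space metres.

    Axes: x right, y down, z forward. Assumes `depth_cm` is planar depth
    along the optical axis, in centimetres.
    """
    fx, fy, cx, cy = intrinsics_from_hfov(hfov_deg)
    z = depth_cm / 100.0  # centimetres to metres
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```

With a 90-degree horizontal field of view, `fx` works out to about 640 pixels, so the image-centre pixel at a depth of 250 cm maps to a point 2.5 m straight ahead of the camera. Converting the result to world space would additionally require the per-frame camera position and rotation, whose axis conventions should be checked against Unreal Engine's.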

Pointcloud slice#

Each pointcloud sample references a .fo3d scene file backed by a binary PLY point cloud:

| Field | Type | Description |
|---|---|---|
| `scene_name` | `str` | Scene identifier |
| `sequence_id` | `str` | Sequence identifier |
| `source_frame` | `int` | The video frame index used to generate this point cloud |

The PLY files contain coloured 3D points in metric units (metres) with gamma-corrected RGB colours. They have been voxel-downsampled (3 cm voxels), with statistical outlier removal applied.
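For readers unfamiliar with voxel downsampling, here is a minimal pure-Python sketch of the idea: points are bucketed into 3 cm cubes and each occupied cube is replaced by the centroid of its points and the mean of their colours. This is an illustration only. The tooling actually used to produce the PLY files is not stated in this card, and the statistical outlier removal step is not shown.

```python
import math
from collections import defaultdict


def voxel_downsample(points, colours, voxel_size=0.03):
    """Replace all points that fall in the same voxel by their centroid.

    points:  list of (x, y, z) tuples in metres
    colours: list of (r, g, b) tuples, one per point
    Returns (points, colours) with one entry per occupied voxel,
    ordered by voxel index.
    """
    buckets = defaultdict(list)
    for point, colour in zip(points, colours):
        # Integer voxel index along each axis
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append((point, colour))

    out_points, out_colours = [], []
    for key in sorted(buckets):
        members = buckets[key]
        n = len(members)
        out_points.append(tuple(sum(m[0][i] for m in members) / n for i in range(3)))
        out_colours.append(tuple(sum(m[1][i] for m in members) / n for i in range(3)))
    return out_points, out_colours
```

A larger `voxel_size` produces a sparser cloud; 0.03 corresponds to the 3 cm voxels mentioned above.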

Dataset Creation#

Rendering pipeline#

Scenes were procedurally composed in Unreal Engine 5 using:

30 large-scale 3D environments purchased from the Unreal Fab store
1,674 animated 3D objects from Objaverse / ObjaverseXL (robots, animals, monsters, humanoid characters)
585 simulated humans from BEDLAM2 with SMPL-X body pose and shape

Each clip is rendered at 1280×720, 30 FPS with physically based lighting (Lumen real-time global illumination + baked lightmaps). Camera trajectories are procedurally generated as orbit shots, dolly shots, or paired orbit/static combinations to maximise coverage.

Annotations#

All annotations are derived automatically from the Unreal Engine render pipeline — no manual labelling was required:

Depth maps — rendered as float32 EXR via Unreal’s depth pass
Segmentation masks — rendered as separate binary passes per object layer (body, clothing, environment, objects 1–3)
Camera parameters — extrinsics (world position + Euler angles) and intrinsics (focal length, sensor size, HFOV) recorded per frame
Captions — generated by Tarsier2-7B applied to 16-frame samples of each clip (global) and 32-frame samples of 81-frame windows (local)

The full Syn4D dataset additionally provides dense 3D tracking annotations via pixel-aligned barycentric coordinate maps paired with animated mesh sequences (not included in this subset). These enable recovery of the 3D position of any pixel at any time in any camera.

Uses#

Intended use#

Training and evaluation of monocular/multi-view depth estimation models
Benchmarking video segmentation and instance tracking algorithms
Training camera pose estimation and multi-view 3D reconstruction models
Fine-tuning human pose estimation models (SMPL / SMPL-X based)
Research into geometry-aware novel-view synthesis
Visual exploration and analysis of synthetic dynamic scene data

Out-of-scope use#

This dataset contains only synthetic, fully computer-generated content — it is not suitable for tasks requiring photorealistic real-world appearance fidelity
The dense tracking annotations (barycentric maps + meshes) required for full 4D reconstruction are not included in this subset
Human characters are BEDLAM2 parametric body models — they do not represent real individuals and should not be used for facial recognition or identity-related tasks

Citation#

BibTeX:

@misc{jiang2026syn4d,
      title={Syn4D: A Multiview Synthetic 4D Dataset},
      author={Zeren Jiang and Yushi Lan and Yihang Luo and Yufan Deng and Zihang Lai
              and Edgar Sucar and Christian Rupprecht and Iro Laina and Diane Larlus
              and Chuanxia Zheng and Andrea Vedaldi},
      year={2026},
      eprint={2605.05207},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.05207},
}

APA: Jiang, Z., Lan, Y., Luo, Y., Deng, Y., Lai, Z., Sucar, E., Rupprecht, C., Laina, I., Larlus, D., Zheng, C., & Vedaldi, A. (2026). Syn4D: A Multiview Synthetic 4D Dataset. arXiv:2605.05207.