Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for Syn4D RGBD#

Syn4D is a large-scale, fully synthetic multiview dataset of dynamic scenes designed to advance research in 4D reconstruction, depth estimation, 3D point tracking, novel-view synthesis, and human pose estimation. It provides dense, complete, and accurate geometric annotations (per-pixel depth maps, multi-view camera trajectories, dense long-range 3D point tracks, and parametric SMPL-X human body annotations) across a diverse collection of indoor and outdoor environments rendered with Unreal Engine 5.

This FiftyOne dataset is a curated subset of the full Syn4D release, covering 3 scenes (45 sequences total) from the Hugging Face repository. Each sequence is available as 8 synchronised multi-camera video clips, paired with per-frame depth heatmaps, instance segmentation masks, and camera pose metadata. A fused coloured 3D point cloud reconstructed from all 8 views is also included for each sequence.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/Syn4D_RGBD")

# Launch the App
session = fo.launch_app(dataset)

Dataset Details#

  • Curated by: Visual Geometry Group (VGG), University of Oxford; Nanyang Technological University; Naver Labs Europe

  • Authors: Zeren Jiang*, Yushi Lan*, Yihang Luo, Yufan Deng, Zihang Lai, Edgar Sucar, Christian Rupprecht, Iro Laina, Diane Larlus, Chuanxia Zheng, Andrea Vedaldi (*equal contribution)

  • Funded by: Clarendon Scholarship, NTU SUG-NAP, NRF-NRFF17-2025-0009, ERC 101001212-UNION, EPSRC EP/Z001811/1 SYN3D

  • License: Licensed for AI training use (all 3D assets sourced from Unreal Fab store and Objaverse under AI-training-compatible licences)

  • Language: English (captions)

  • Paper: arXiv 2605.05207

  • Project page: https://jzr99.github.io/Syn4D/

  • Raw data repository: Syn4D/Syn4D_RGBD on Hugging Face


Scenes Included#

| Scene | Type | Sequences | Cameras | Approx. frames/clip |
|---|---|---|---|---|
| bigoffice_v1 | Indoor office | 5 | 8 | 117–477 |
| warehouse_group_static | Indoor warehouse | 20 | 8 | 131–477 |
| hospital | Indoor hospital | 20 | 8 | 120–477 |
| Total | | 45 | 8 | |

Each sequence is rendered at 1280×720, 30 FPS with Unreal Engine 5’s Lumen global illumination.


FiftyOne Dataset Structure#

Group structure#

The dataset is a grouped video dataset. Each group represents one scene sequence and contains 9 slices:

| Slice | Media type | Description |
|---|---|---|
| cam_0 … cam_7 | video | 8 synchronised camera views (MP4, 1280×720, 30 fps) |
| pointcloud | 3d | Fused coloured point cloud from all 8 views (fo3d + PLY) |
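As a rough illustration of how the nine slices might be accessed, the sketch below builds the slice names listed above and flattens one slice at a time with FiftyOne's `select_group_slices`. The helper names `camera_views` and `pointcloud_view` are hypothetical, and `dataset` is assumed to be the grouped dataset returned by `load_from_hub`.

```python
# Slice names as listed above: eight camera slices plus one point cloud slice.
CAMERA_SLICES = [f"cam_{i}" for i in range(8)]
ALL_SLICES = CAMERA_SLICES + ["pointcloud"]


def camera_views(dataset):
    """Yield (slice_name, view) pairs, one flattened video view per camera.

    `dataset` is assumed to be the grouped FiftyOne dataset loaded from the
    Hub; `select_group_slices` returns a non-grouped view of the given slice.
    """
    for name in CAMERA_SLICES:
        yield name, dataset.select_group_slices(name)


def pointcloud_view(dataset):
    """Return a view holding only the fused point cloud samples."""
    return dataset.select_group_slices("pointcloud")
```

To change which slice is shown by default instead of flattening, setting `dataset.group_slice = "cam_3"` should switch the active slice.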

Sample-level fields#

| Field | Type | Description |
|---|---|---|
| scene_name | str | Scene identifier (bigoffice_v1, warehouse_group_static, hospital) |
| sequence_id | str | Sequence identifier within the scene (e.g. seq_000000) |
| camera_index | int | Camera index 0–7 (video slices only) |
| global_caption | str | One natural-language description of the entire clip, generated by Tarsier2-7B |
| local_captions | TemporalDetections | Per-window captions covering 81-frame segments; each TemporalDetection has label (the caption text), support (1-indexed frame range), and chunk (original range string, e.g. "0-80") |
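The relationship between `chunk` and `support` can be sketched as below. This assumes that `chunk` is a 0-indexed inclusive range (so "0-80" spans 81 frames) and that `support` is the same range shifted to 1-indexed frame numbers; the card implies this but does not state it outright. The helper names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def chunk_to_support(chunk: str) -> Tuple[int, int]:
    """Convert a 0-indexed inclusive range string such as "0-80" into a
    1-indexed inclusive (first, last) frame support."""
    start_text, end_text = chunk.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid chunk range: {chunk!r}")
    return start + 1, end + 1


def caption_at_frame(
    captions: Iterable[Tuple[str, Tuple[int, int]]], frame_number: int
) -> Optional[str]:
    """Return the first caption whose support covers `frame_number`
    (1-indexed), or None if no window covers it."""
    for label, (first, last) in captions:
        if first <= frame_number <= last:
            return label
    return None
```

With real data, the `(label, support)` pairs would come from the `label` and `support` attributes of each TemporalDetection in `local_captions`.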

Frame-level fields (video slices)#

| Field | Type | Description |
|---|---|---|
| depth | Heatmap | Per-frame depth map (uint16 PNG on disk, range=[0, 30000] cm). Lighter = farther. |
| segmentation | Segmentation | Per-frame integer segmentation mask (uint8 PNG on disk). See mask targets below. |
| cam_x, cam_y, cam_z | float | Camera world position in Unreal Engine units (centimetres) |
| cam_yaw, cam_pitch, cam_roll | float | Camera rotation in degrees (Unreal Engine convention) |
| focal_length | float | Focal length in millimetres |
| hfov | float | Horizontal field of view in degrees |
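The `hfov`, `focal_length`, and `depth` fields are enough to sketch a pinhole back-projection. The example below is a minimal sketch under assumptions the card does not confirm: stored depth values are centimetres of planar distance along the optical axis, pixels are square, and the principal point sits at the image centre. All function names are hypothetical.

```python
import math

WIDTH, HEIGHT = 1280, 720  # render resolution stated in this card


def focal_length_pixels(hfov_deg: float, width: int = WIDTH) -> float:
    """Pinhole focal length in pixels from the horizontal field of view."""
    return (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def sensor_width_mm(focal_length_mm: float, hfov_deg: float) -> float:
    """Sensor width implied by a focal length (mm) and horizontal FOV."""
    return 2.0 * focal_length_mm * math.tan(math.radians(hfov_deg) / 2.0)


def unproject(u, v, depth_cm, hfov_deg, width=WIDTH, height=HEIGHT):
    """Back-project pixel (u, v) to camera-space (x, y, z) in metres.

    Assumed axes: x right, y down, z forward (a common computer-vision
    convention, not necessarily the one used by Unreal Engine).
    """
    f = focal_length_pixels(hfov_deg, width)
    z = depth_cm / 100.0  # centimetres to metres
    x = (u - width / 2.0) * z / f
    y = (v - height / 2.0) * z / f
    return x, y, z
```

If the depth pass actually stores distance along each viewing ray rather than planar depth, the result would need rescaling by the ray's z-component, so it is worth checking against the fused point cloud before relying on it.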

Pointcloud slice#

Each pointcloud sample references a .fo3d scene file backed by a binary PLY point cloud:

| Field | Type | Description |
|---|---|---|
| scene_name | str | Scene identifier |
| sequence_id | str | Sequence identifier |
| source_frame | int | The video frame index used to generate this point cloud |

The PLY files contain coloured 3D points in metric units (metres) with gamma-corrected RGB colours. They have been voxel-downsampled (3 cm voxels) and cleaned with statistical outlier removal.
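For readers unfamiliar with voxel downsampling, the following is a minimal pure-Python sketch of the general technique, not the pipeline used to build these PLY files. It groups points by the 3 cm voxel they fall into and keeps one centroid per voxel; colours and the statistical outlier removal step are omitted for brevity.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size=0.03):
    """Average all points that fall into the same cubic voxel.

    `points` is an iterable of (x, y, z) tuples in metres; `voxel_size`
    defaults to the 3 cm stated above. Returns one centroid per occupied
    voxel, in order of first occurrence.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer voxel index along each axis; floor handles negatives.
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    centroids = []
    for members in buckets.values():
        n = len(members)
        centroids.append(tuple(sum(axis) / n for axis in zip(*members)))
    return centroids
```

In practice a point cloud library would be a better fit for clouds of this size; the sketch only shows what the 3 cm voxel size means for point density.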


Dataset Creation#

Rendering pipeline#

Scenes were procedurally composed in Unreal Engine 5 using:

  • 30 large-scale 3D environments purchased from the Unreal Fab store

  • 1,674 animated 3D objects from Objaverse / ObjaverseXL (robots, animals, monsters, humanoid characters)

  • 585 simulated humans from BEDLAM2 with SMPL-X body pose and shape

Each clip is rendered at 1280×720, 30 FPS with physically based lighting (Lumen real-time global illumination + baked lightmaps). Camera trajectories are procedurally generated as orbit shots, dolly shots, or paired orbit/static combinations to maximise coverage.

Annotations#

All annotations are derived automatically from the Unreal Engine render pipeline; no manual labelling was required:

  • Depth maps: rendered as float32 EXR via Unreal’s depth pass

  • Segmentation masks: rendered as separate binary passes per object layer (body, clothing, environment, objects 1–3)

  • Camera parameters: extrinsics (world position + Euler angles) and intrinsics (focal length, sensor size, HFOV) recorded per frame

  • Captions: generated by Tarsier2-7B applied to 16-frame samples of each clip (global) and 32-frame samples of 81-frame windows (local)
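The card does not say how the 16 or 32 caption frames are chosen. Assuming evenly spaced sampling, which is only a guess, the frame selection could be sketched as follows; `sample_indices` is a hypothetical helper.

```python
def sample_indices(start, end, count):
    """Return `count` evenly spaced 0-indexed frame indices in [start, end].

    Even spacing is an assumption; the actual sampling strategy used for
    captioning is not documented in this card.
    """
    if count < 1 or end < start:
        raise ValueError("need count >= 1 and end >= start")
    if count == 1:
        return [start]
    step = (end - start) / (count - 1)
    return [start + round(i * step) for i in range(count)]


# 32 frames from one 81-frame window (local caption input)
local_frames = sample_indices(0, 80, 32)

# 16 frames from a 477-frame clip (global caption input)
global_frames = sample_indices(0, 476, 16)
```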

The full Syn4D dataset additionally provides dense 3D tracking annotations via pixel-aligned barycentric coordinate maps paired with animated mesh sequences (not included in this subset). These enable recovery of the 3D position of any pixel at any time in any camera.


Uses#

Intended use#

  • Training and evaluation of monocular/multi-view depth estimation models

  • Benchmarking video segmentation and instance tracking algorithms

  • Training camera pose estimation and multi-view 3D reconstruction models

  • Fine-tuning human pose estimation models (SMPL / SMPL-X based)

  • Research into geometry-aware novel-view synthesis

  • Visual exploration and analysis of synthetic dynamic scene data

Out-of-scope use#

  • This dataset contains only synthetic, fully computer-generated content, so it is not suitable for tasks requiring photorealistic real-world appearance fidelity

  • The dense tracking annotations (barycentric maps + meshes) required for full 4D reconstruction are not included in this subset

  • Human characters are BEDLAM2 parametric body models; they do not represent real individuals and should not be used for facial recognition or identity-related tasks


Citation#

BibTeX:

@misc{jiang2026syn4d,
      title={Syn4D: A Multiview Synthetic 4D Dataset},
      author={Zeren Jiang and Yushi Lan and Yihang Luo and Yufan Deng and Zihang Lai
              and Edgar Sucar and Christian Rupprecht and Iro Laina and Diane Larlus
              and Chuanxia Zheng and Andrea Vedaldi},
      year={2026},
      eprint={2605.05207},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.05207},
}

APA: Jiang, Z., Lan, Y., Luo, Y., Deng, Y., Lai, Z., Sucar, E., Rupprecht, C., Laina, I., Larlus, D., Zheng, C., & Vedaldi, A. (2026). Syn4D: A Multiview Synthetic 4D Dataset. arXiv:2605.05207.