Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for edit3d-bench#

image/png

This is a FiftyOne dataset with 300 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from huggingface_hub import snapshot_download


# Download the dataset snapshot to the current working directory

snapshot_download(
    repo_id="Voxel51/edit3d-bench", 
    local_dir=".", 
    repo_type="dataset"
    )



# Load dataset from current directory using FiftyOne's native format
dataset = fo.Dataset.from_dir(
    dataset_dir=".",  # Current directory contains the dataset files
    dataset_type=fo.types.FiftyOneDataset,  # Specify FiftyOne dataset format
    name="edit3d-bench"  # Assign a name to the dataset for identification
)

# Launch the App
session = fo.launch_app(dataset)

# Launch the App
session = fo.launch_app(dataset)

Dataset Description#

Edit3D-Bench comprises 100 high-quality 3D models — 50 from Google Scanned Objects (GSO) and 50 from PartObjaverse-Tiny. For each model, the authors provide 3 distinct editing prompts covering a range of modifications (object replacement, accessory addition, material changes, etc.), yielding 300 total editing tasks.

Each editing task includes:

Source 3D model (model.glb) with multi-view renders, RGB/normal/mask videos
3D edit region (3d_edit_region.glb) — a human-annotated mesh specifying which part of the source model to edit
2D edit mask (2d_mask.png) — the edit region projected to a canonical camera view
2D edited reference image (2d_edit.png) — generated by FLUX.1 Fill, showing the intended edit result
2D visualization (2d_visual.png) — the source model rendered with the edit region removed
Multi-view edit region renders — rotating video (visual3d.mp4) and 16 static views showing the edit region overlaid on the source model

The benchmark is designed to evaluate three aspects of 3D editing methods: (1) preservation of unedited regions (Chamfer Distance, masked PSNR/SSIM/LPIPS), (2) overall 3D quality (FID, FVD), and (3) alignment with editing conditions (DINO-I, CLIP-T).

Curated by: Lin Li (Renmin University of China), Zehuan Huang (Beihang University), Haoran Feng (Tsinghua University), Gengxiong Zhuang (Beihang University), Rui Chen (Beihang University), Chunchao Guo (Tencent Hunyuan), Lu Sheng (Beihang University)
Language(s): en
License: MIT

Dataset Sources#

Repository: huanngzh/Edit3D-Bench
Paper: VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space (arXiv:2508.19247)
Project Page: huanngzh.github.io/VoxHammer-Page
Code: github.com/Nelipot-Lee/VoxHammer

Uses#

Direct Use#

Benchmarking 3D local editing methods on preservation of unedited regions, overall quality, and condition alignment
Evaluating 3D editing pipelines that take a source model + edit region + text prompt as input
Studying the relationship between 2D editing guidance and 3D consistency

Out-of-Scope Use#

This dataset provides editing specifications (source model, edit region, reference image, prompt), not edited 3D outputs. It cannot be used directly as paired training data for 3D editing without first running an editing method.
The 2D edited reference images are generated by FLUX.1 Fill and may contain artifacts or inconsistencies not representative of ground-truth 3D edits.

FiftyOne Dataset Structure#

Design Rationale#

The raw dataset is organized around 100 objects with 3 prompts each, spanning heterogeneous media types (video, images, 3D meshes) and multiple label relationships (masks as segmentations, normals as heatmaps, edit regions as 3D overlays). A FiftyOne grouped dataset is the natural fit because:

Each group = one editing task (object + prompt pair), which is the fundamental unit of the benchmark
Slices enable multi-modal browsing — toggle between video, image, and 3D views of the same editing task in the App without duplicating metadata
Native label types map directly to the dataset’s annotation types — fo.Segmentation for masks, fo.Heatmap for normal maps, fo.GltfMesh for 3D scenes

Slices (6 per group, 300 groups, 1800 total samples)#

Slice	Media Type	Source File	Labels	Purpose
`edit_region_video` (default)	video	`prompt_N/render/visual3d.mp4`	—	Rotating view of the source model with the edit region mesh overlaid semi-transparently. Shows where the edit happens in 3D context.
`source_video`	video	`source_model/video_rgb.mp4`	`segmentation` (frame-level `fo.Segmentation` from `video_mask.mp4`), `normal_heatmap` (frame-level `fo.Heatmap` from `video_normal.mp4`)	Rotating RGB video of the source 3D object with per-frame object silhouette mask and surface normal overlays.
`source_render`	image	`prompt_N/2d_render.png`	`edit_mask` (`fo.Segmentation` from `2d_mask.png`)	Static render of the source model from the canonical editing camera, with the edit region highlighted as a segmentation overlay.
`edit`	image	`prompt_N/2d_edit.png`	—	The FLUX.1 Fill reference image showing the intended edit result.
`visual`	image	`prompt_N/2d_visual.png`	—	Source model rendered with the edit region blacked out, showing the “hole” where the edit goes.
`scene_3d`	3D	`prompt_N/scene.fo3d`	—	Interactive 3D scene containing both the `source_model` and `edit_region` GLB meshes. Toggle mesh visibility to inspect the edit region in 3D.

Sample Fields (shared across all slices)#

Field	Type	Description
`dataset_source`	string	Source dataset: `"GSO"` or `"PartObjaverse-Tiny"`
`object_name`	string	Object identifier (human-readable name for GSO, UUID for PartObjaverse-Tiny)
`prompt_index`	int	Editing prompt number: 1, 2, or 3
`source_prompt`	string	Text description of the original source model
`edit_prompt`	string	Text description of the desired edit

Preprocessing#

The source_video slice requires extracted video frames for its frame-level labels. The loader (load_into_fiftyone.py) automatically decodes video_mask.mp4 and video_normal.mp4 into per-frame PNGs under source_model/frames/:

Mask frames (mask_0001.png … mask_0120.png): Binarized at threshold 128 to remove MP4 compression artifacts. Referenced via fo.Segmentation(mask_path=...).
Normal frames (normal_0001.png … normal_0120.png): Full RGB surface normals preserved as-is. Referenced via fo.Heatmap(map_path=...).

This extraction is idempotent — frames are only written if they don’t already exist on disk.

Dataset Creation#

Curation Rationale#

Existing 3D editing benchmarks lack labeled editing regions, making it difficult to objectively evaluate how well methods preserve unedited parts of a model. Edit3D-Bench was constructed specifically to address this gap by providing human-annotated 3D editing regions for each editing task.

Source Data#

Data Collection and Processing#

3D models: 50 models from Google Scanned Objects (GSO), a collection of high-quality 3D scanned household items, and 50 from PartObjaverse-Tiny, a subset of Objaverse with part-level annotations
Editing prompts: 3 prompts per model, covering modifications such as object replacement, accessory addition, and material/texture changes
3D edit regions: Human-annotated 3D meshes specifying the precise spatial extent of each edit
2D reference edits: Generated by rendering the source model from a canonical viewpoint, then inpainting the edit region using FLUX.1 Fill conditioned on the edit prompt
Multi-view renders: Each source model rendered from a 16-camera rig (2 elevation rings of 8 azimuth angles at 1024x1024 resolution), plus 120-frame rotating videos for RGB, normals, and object mask

Who are the source data producers?#

The 3D models originate from GSO (Google) and PartObjaverse-Tiny (community-contributed Objaverse subset). The editing annotations (regions, prompts, reference edits) were produced by the VoxHammer paper authors.

Annotations#

Annotation process#

The 3D edit regions were manually annotated by the paper authors as 3D meshes (3d_edit_region.glb) that define the spatial volume to be edited. The corresponding 2D masks (2d_mask.png) are projections of these 3D regions onto the canonical camera view. The 2D edited reference images were generated automatically using FLUX.1 Fill.

Personal and Sensitive Information#

The dataset contains only 3D models of everyday objects (toys, furniture, animals, vehicles, etc.). It does not contain personal, sensitive, or private information.

Bias, Risks, and Limitations#

The 2D edited reference images are generated by FLUX.1 Fill, not manually created. They may contain artifacts, hallucinations, or inconsistencies inherent to the image inpainting model.
The dataset covers a limited range of object categories (household items, toys, furniture) and editing types. Results may not generalize to all 3D editing scenarios.
Edit prompts are in English only.

Citation#

BibTeX:

@article{li2025voxhammer,
  title = {VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space},
  author = {Li, Lin and Huang, Zehuan and Feng, Haoran and Zhuang, Gengxiong and Chen, Rui and Guo, Chunchao and Sheng, Lu},
  journal = {arXiv preprint arXiv:2508.19247},
  year = {2025},
  url = {https://huggingface.co/papers/2508.19247}
}