Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for IndustryShapes#

image/png

IndustryShapes is an RGB-D benchmark dataset for 6D object pose estimation of industrial assembly tools and components. It provides high-quality annotated data of five challenging industrial objects—characterized by weak texture, reflective surfaces, symmetries, and thin structures—captured in realistic industrial assembly environments. The dataset is designed to support both instance-level and novel-object pose estimation approaches, and is the first dataset to include RGB-D static onboarding sequences for model-free methods.

This is a FiftyOne dataset with 13012 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/IndustryShapes")

# Launch the App
session = fo.launch_app(dataset)

Dataset Sources#

Curated by: Panagiotis Sapoutzoglou, Orestis Vaggelis, Athina Zacharia, Evangelos Sartinas, Maria Pateraki (National Technical University of Athens, Greece)
Funded by: European HEU programmes SOPRANO (GA No 101120990) and PANDORA (GA No 101135775); data collection supported by Stellantis — Centro Ricerche FIAT (CRF)
License: MIT
Paper: arXiv:2602.05555
Repository: POSE-Lab/IndustryShapes on Hugging Face
Paper: Sapoutzoglou et al., IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools, arXiv 2602.05555, 2026
Project page: https://pose-lab.github.io/IndustryShapes

Uses#

Direct Use#

6D object pose estimation — both instance-level (known CAD model) and novel-object (model-based and model-free)
Object detection and instance segmentation in industrial scenes
Depth estimation research and RGB-D sensor evaluation in industrial settings
Robotic manipulation — grasping and assembly tasks requiring precise 6D pose

Out-of-Scope Use#

Not suitable for general consumer or household object recognition
Not designed for ego-centric or hand-held camera scenarios
The five objects are industrial-specific; generalization to other domains should be validated

Dataset Structure#

FiftyOne Sample Fields#

Each sample corresponds to one image. The dataset_subset field identifies which part of the dataset the image belongs to.

Field	Type	Description
`filepath`	`string`	Absolute path to the 640×480 RGB image
`split`	`string`	`"train"` or `"test"`
`dataset_subset`	`string`	`"classic"`, `"extended_onboarding"`, or `"extended_office"`
`scene_id`	`string`	Zero-padded scene identifier (e.g. `"000001"`)
`image_id`	`string`	Frame index within the scene
`depth_scale`	`float`	Multiply raw depth values by this to get millimetres (1.0 for classic, 0.1 for extended)
`camera_intrinsics`	`list[list[float]]`	3×3 camera intrinsic matrix `[[fx,0,cx],[0,fy,cy],[0,0,1]]`
`depth`	`fo.Heatmap`	Depth map in millimetres; pixel values encode distance in the App colormap
`ground_truth`	`fo.Detections`	Per-instance annotations (see below)
`axes`	`fo.Polylines`	Projected coordinate frame (X/Y/Z) at the object origin, for pose visualization
`bbox3d`	`fo.Polylines`	12-edge 3D bounding box wireframe projected into image space

Detection Fields (`ground_truth`)#

Each fo.Detection inside ground_truth represents one annotated object instance.

Field	Type	Description
`label`	`string`	Object class: `object_01` … `object_05`
`bounding_box`	`[x, y, w, h]`	Normalised 2D bounding box (top-left origin, values in [0, 1])
`mask`	`bool array`	Binary instance segmentation mask, cropped to the bounding box
`rotation_matrix`	`list[list[float]]`	3×3 rotation matrix R (BOP convention: maps object → camera frame)
`translation_mm`	`[tx, ty, tz]`	Translation vector in millimetres (object origin in camera frame)
`obj_id`	`int`	Numeric object ID (1–5)
`visibility`	`float`	Fraction of object surface visible in the image [0, 1]

Labels#

object_01, object_02, object_03, object_04, object_05

Depth Heatmap#

The depth field is a fo.Heatmap backed by a 16-bit PNG on disk. Each pixel stores the depth in millimetres after applying depth_scale. Invalid pixels (no sensor return, or beyond sensor range) are encoded as 0. The range stored on the heatmap is the 2nd–98th percentile of valid pixel values for that frame, used to set the colormap bounds in the FiftyOne App.

Pose Convention#

Poses follow the BOP convention:

x_camera = R @ x_object + t

where R is the 3×3 rotation matrix, t is the translation in millimetres, and x_object is a point in the object’s coordinate frame.

Dataset Creation#

Curation Rationale#

Most existing 6D pose estimation datasets focus on household or consumer objects in controlled lab environments, which do not reflect the challenges of real-world industrial deployment. IndustryShapes was created to fill this gap by providing industrial objects with challenging physical properties (textureless surfaces, symmetries, thin and reflective parts) in realistic assembly environments, and by explicitly supporting modern model-free methods through the first publicly released RGB-D static onboarding sequences.

Source Data#

Data Collection#

RGB-D data were captured at 640×480 resolution using:

Intel RealSense D455 — classic set (industrial and lab scenes)
Intel RealSense D405 — extended set (onboarding and office sequences; closer range)

Synthetic training images for Object 3 were generated with an OpenGL-based renderer using photorealistic CAD model textures. All data are formatted according to the BOP specification.

Who are the source data producers?#

The POSE Lab at the National Technical University of Athens (NTUA), Greece. Data were collected at a realistic industrial assembly facility with support from Stellantis — Centro Ricerche FIAT (CRF).

Annotations#

Annotation Process#

Three annotation approaches were used depending on the scene type:

Marker-based (lab/turn-table scenes) — ArUco markers provided precise camera poses.
SfM-based semi-automatic (industrial scenes) — Structure-from-Motion reconstruction combined with manually defined anchor points on the CAD model established 2D–3D correspondences; the PnP problem was then solved per frame.
Synthetic — Ground-truth poses are exact by construction.

Annotation accuracy was validated by comparing captured depth against rendered depth at the annotated poses. The mean absolute depth error is < 12 mm for the classic set and ≈ 5 mm for the extended set — a relative error under 5% of the mean object diameter (254 mm).

Recommendations#

Benchmark results should be interpreted per-object rather than only overall, as difficulty varies considerably (e.g. Object 5 consistently achieves lower AR than Object 1 across all methods). Users training instance-level methods should note the domain gap between the single-object training scenes and the cluttered multi-object test scenes.

Citation#

BibTeX:

@article{sapoutzoglou2026industryshapes,
  title   = {IndustryShapes: An RGB-D Benchmark dataset for 6D object pose
             estimation of industrial assembly components and tools},
  author  = {Sapoutzoglou, Panagiotis and Vaggelis, Orestis and Zacharia, Athina
             and Sartinas, Evangelos and Pateraki, Maria},
  journal = {arXiv preprint arXiv:2602.05555},
  year    = {2026}
}

APA:

Sapoutzoglou, P., Vaggelis, O., Zacharia, A., Sartinas, E., & Pateraki, M. (2026). IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools. arXiv:2602.05555.