Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for KubriCount (subset)#

KubriCount is a large-scale synthetic benchmark for multi-grained visual counting, introduced in the paper Count Anything at Any Granularity (Liu, Wu & Xie, SJTU 2026). It reframes open-world counting as a prompt-following problem across five explicit semantic granularity levels, supported by what the authors describe as the most comprehensively annotated counting dataset published to date.

This is a FiftyOne dataset with 6736 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/KubriCount")

# Launch the App
session = fo.launch_app(dataset)

Dataset Description#

Most counting datasets treat “what to count” as a single category-level matching problem. KubriCount exposes this limitation by requiring models to follow fine-grained prompts that specify which semantic level the user intends — from counting a specific object identity all the way up to an abstract concept — while excluding controlled distractors that differ by exactly one semantic factor.

Each scene is a 1024×1024 synthetic image produced by a four-stage automatic pipeline: 3D asset curation, controllable 3D rendering via Kubric + Blender, mask-conditioned image editing (Nano-Banana-Pro) to reduce the sim-to-real gap, and VLM-based quality filtering (Gemini-3-Pro) to guarantee annotation fidelity.

Curated by: Chang Liu, Haoning Wu, Weidi Xie — School of Artificial Intelligence, Shanghai Jiao Tong University
License: Apache-2.0
Paper: arXiv:2605.10887

Dataset Sources#

Repository: Verg-Avesta/KubriCount
HuggingFace Dataset: liuchang666/KubriCount
Project Page: verg-avesta.github.io/KubriCount

Counting Granularity Levels#

KubriCount defines five levels of counting granularity. Each level specifies a target set and, for levels 2–5, a controlled distractor set that differs by exactly one semantic factor:

Level	Granularity	Prompt example	Distractor
L1	Identity	“Count all the dogs.”	None
L2 (size)	Attribute	“Count large cherries.”	Small cherries
L2 (color)	Attribute	“Count mustard sofas.”	Dark gray sofas
L3	Category	“Count the cans.”	Bags
L4	Instance type	“Count backpack A.”	Backpack B
L5	Concept	“Count the lobsters.”	Octopuses

Levels 2–5 generate two annotation queries per scene by swapping the target and distractor roles, which is why the total query count (198,702) exceeds the scene count (110,507).
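
The role swap can be sketched as follows. This is a minimal illustration, not the authors' generation code: `make_queries` is a hypothetical helper, and the dictionary keys simply mirror the `category` / `negative_category` fields used in the FiftyOne export.

```python
def make_queries(level, target, distractor=None):
    """Return the counting queries for one scene (hypothetical helper)."""
    if level == 1:
        # Level 1 scenes have no distractor, so they yield a single query.
        return [{"level": 1, "category": target, "negative_category": ""}]
    if distractor is None:
        raise ValueError("levels 2-5 need a distractor")
    # Levels 2-5: one query per role assignment (target and distractor swapped).
    return [
        {"level": level, "category": target, "negative_category": distractor},
        {"level": level, "category": distractor, "negative_category": target},
    ]


for query in make_queries(5, "lobsters", "octopuses"):
    print(query)
```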

Dataset Statistics#

Split	Scenes	Queries	Purpose
train	99,639	179,140	Seen categories (normal + dense configurations, ~4:1 ratio)
testA	5,462	9,837	Unseen assets from training categories
testB	5,406	9,725	Entirely unseen categories
Total	110,507	198,702
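
The numbers in this table are enough to back out how many scenes carry one query and how many carry two. The sketch below assumes that every scene yields either exactly one query (L1) or exactly two (L2–L5); the function name is illustrative and not part of any released tooling.

```python
# (scenes, queries) per split, copied from the table above.
SPLITS = {
    "train": (99_639, 179_140),
    "testA": (5_462, 9_837),
    "testB": (5_406, 9_725),
}


def single_and_dual(scenes, queries):
    """Solve single + dual = scenes and single + 2 * dual = queries."""
    dual = queries - scenes
    single = scenes - dual
    return single, dual


total_scenes = sum(scenes for scenes, _ in SPLITS.values())
total_queries = sum(queries for _, queries in SPLITS.values())

for name, (scenes, queries) in SPLITS.items():
    single, dual = single_and_dual(scenes, queries)
    print(f"{name}: {single} single-query scenes, {dual} dual-query scenes")

print("total:", single_and_dual(total_scenes, total_queries))
```

Under that assumption, the totals correspond to 22,312 single-query scenes and 88,195 dual-query scenes.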

Categories: 157 across 16 super-categories
Total annotated objects: ~7.3 million
Objects per image: 1–250 (capped at 250 by Kubric’s 256-instance limit)
Image resolution: 1024 × 1024 px

FiftyOne Dataset Structure#

The dataset is loaded into FiftyOne as a flat image dataset — one sample per counting query. Scenes with two queries (L2–L5) produce two samples pointing to the same filepath.
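
Because the layout is one sample per query, recovering per-scene groups means grouping by `filepath`. The sketch below uses stand-in objects with made-up paths so that it runs without the dataset; `group_by_scene` is a hypothetical helper that only reads the `filepath` attribute, so it should also accept samples obtained by iterating a loaded dataset.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_scene(samples):
    """Map each image path to the list of query samples that reference it."""
    scenes = defaultdict(list)
    for sample in samples:
        scenes[sample.filepath].append(sample)
    return dict(scenes)


# Stand-in samples; the paths are placeholders, not real dataset paths.
samples = [
    SimpleNamespace(filepath="/data/scene_a/edited_00000.png", level=1, category="dogs"),
    SimpleNamespace(filepath="/data/scene_b/edited_00000.png", level=3, category="cans"),
    SimpleNamespace(filepath="/data/scene_b/edited_00000.png", level=3, category="bags"),
]

scenes = group_by_scene(samples)
for path, queries in scenes.items():
    print(path, [q.category for q in queries])
```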

Sample Fields#

Field	FiftyOne Type	Description
`filepath`	`StringField`	Path to `edited_00000.png` — the final benchmark image
`image_id`	`StringField`	Relative path key matching the HuggingFace annotation files
`split`	`StringField`	`"train"`, `"testA"`, or `"testB"`
`level`	`IntField`	Counting granularity level: 1–5
`category`	`StringField`	Text label for the target objects to count
`count`	`IntField`	Ground truth object count
`target_points`	`fo.Keypoints`	One `fo.Keypoint` per target object, each with a single normalized center point `(x/W, y/H)`
`example_boxes`	`fo.Detections`	2–8 few-shot exemplar bounding boxes in `[x, y, w, h]` relative coords
`segmentation`	`fo.Segmentation`	`mask_path` pointing to `segmentation_00000.png` on disk — the instance segmentation map
`negative_category`	`StringField`	Distractor label (empty string for L1)
`negative_count`	`IntField`	Ground truth distractor count (0 for L1)
`negative_points`	`fo.Keypoints`	One `fo.Keypoint` per distractor object (None for L1)
`negative_example_boxes`	`fo.Detections`	Few-shot exemplar boxes for the distractor class (None for L1)
`tags`	`ListField`	e.g. `["testA", "level5"]`
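
Points and boxes are stored in relative coordinates, so they need to be scaled by the image size before drawing or cropping. The sketch below is a minimal conversion that assumes the 1024 × 1024 resolution stated for this dataset and assumes boxes follow FiftyOne's usual convention, where `(x, y)` is the top-left corner; the helper names are illustrative.

```python
IMAGE_SIZE = 1024  # images in this dataset are 1024 x 1024 px


def point_to_pixels(point, width=IMAGE_SIZE, height=IMAGE_SIZE):
    """Convert a normalized (x/W, y/H) point to pixel coordinates."""
    x, y = point
    return (x * width, y * height)


def box_to_pixels(box, width=IMAGE_SIZE, height=IMAGE_SIZE):
    """Convert a relative [x, y, w, h] box to pixel units."""
    x, y, w, h = box
    return (x * width, y * height, w * width, h * height)


print(point_to_pixels((0.5, 0.25)))
print(box_to_pixels([0.25, 0.5, 0.125, 0.0625]))
```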

Design Notes#

`target_points` as a counting sanity check: for any sample, `len(sample.target_points.keypoints) == sample.count`. This invariant holds by construction and can be used to verify import correctness.
`example_boxes` are not exhaustive: these are 2–8 exemplar boxes used as few-shot visual prompts, not full ground-truth box coverage of all objects.
`segmentation` is an instance map: pixel values encode per-instance IDs as rendered by Kubric. It is not a semantic segmentation map.
Dual queries per scene (L2–L5): two FiftyOne samples share the same `filepath` but have swapped `category` / `negative_category` fields, representing the two valid counting queries for that scene.
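
The counting invariant from the first note can be checked with a short loop. This is a sketch using stand-in objects so that it runs without the dataset; `find_count_mismatches` and `fake_sample` are hypothetical helpers, and the checker only relies on the `image_id`, `count`, and `target_points.keypoints` attributes listed in the field table.

```python
from types import SimpleNamespace


def find_count_mismatches(samples):
    """Return (image_id, count, n_points) for samples that break the invariant."""
    mismatches = []
    for sample in samples:
        n_points = len(sample.target_points.keypoints)
        if n_points != sample.count:
            mismatches.append((sample.image_id, sample.count, n_points))
    return mismatches


def fake_sample(image_id, count, n_points):
    """Build a stand-in sample with n_points single-point keypoints."""
    keypoints = [SimpleNamespace(points=[(0.5, 0.5)]) for _ in range(n_points)]
    return SimpleNamespace(
        image_id=image_id,
        count=count,
        target_points=SimpleNamespace(keypoints=keypoints),
    )


samples = [fake_sample("scene_a", 3, 3), fake_sample("scene_b", 5, 4)]
print(find_count_mismatches(samples))
```

An empty result on the real dataset would indicate that the import preserved the point annotations.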

Dataset Creation#

Generation Pipeline#

KubriCount is constructed in four automatic stages:

3D asset curation — ~58K assets across 157 categories sourced from ShapeNetCore-v2 and controllable 3D generation (TRELLIS family). ~5K HDRI environment maps sourced from Poly Haven and Text2Light.
Prototype synthesis — Kubric + PyBullet + Blender renders scenes with exact instance metadata (RGB, instance masks, 2D/3D boxes, center points). Level-specific composition rules control target/distractor selection.
Consistent image editing — Nano-Banana-Pro refines textures and harmonizes lighting while preserving topology (no instances added, removed, merged, or split). Level-aware constraints prevent edits that would corrupt the counting criterion.
Automatic data filtering — Gemini-3-Pro inspects each edited image against the prototype and masks, issuing PASS/FAIL. ~20% are rejected on the first pass; iterative re-editing reduces the final rejection rate to ~5%.
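
The edit-then-filter loop in the last two stages can be sketched as a bounded retry. This is an illustration only: `edit_fn` and `check_fn` are hypothetical callables standing in for the image editor and the VLM judge, and `max_attempts` is an assumed parameter, since the number of re-editing rounds is not stated here.

```python
def edit_until_pass(prototype, edit_fn, check_fn, max_attempts=3):
    """Re-edit a prototype until the checker passes it or attempts run out.

    Returns (edited, attempts_used); edited is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        edited = edit_fn(prototype, attempt)
        if check_fn(prototype, edited):  # PASS
            return edited, attempt
    return None, max_attempts  # FAIL on every attempt: the scene is rejected


# Toy stand-ins: the "editor" tags the attempt number, and the "judge"
# only accepts the second version.
def toy_edit(prototype, attempt):
    return f"{prototype}-v{attempt}"


def toy_check(prototype, edited):
    return edited.endswith("v2")


print(edit_until_pass("scene", toy_edit, toy_check))
print(edit_until_pass("scene", toy_edit, toy_check, max_attempts=1))
```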

Splits#

Dataset splits are enforced at the 3D asset level before synthesis:

Train: seen categories, full asset pool
TestA: unseen assets within training categories (~10% holdout per category)
TestB: unseen categories (~10% of total assets)

Both test splits use only unseen HDRI backgrounds and evaluate on the normal (non-dense) scene configuration.
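
An asset-level split of this kind can be sketched as below. This is not the authors' procedure: `split_assets`, the seed, and the way held-out categories are passed in are all assumptions made for illustration, and the HDRI background split is not modeled.

```python
import random


def split_assets(assets_by_category, testb_categories, holdout_frac=0.1, seed=0):
    """Assign each 3D asset to train, testA, or testB before any rendering."""
    rng = random.Random(seed)
    splits = {"train": [], "testA": [], "testB": []}
    for category, assets in sorted(assets_by_category.items()):
        if category in testb_categories:
            # Unseen category: every asset goes to testB.
            splits["testB"].extend(assets)
            continue
        shuffled = sorted(assets)
        rng.shuffle(shuffled)
        # Hold out a fraction of each training category's assets for testA.
        n_holdout = max(1, round(len(shuffled) * holdout_frac))
        splits["testA"].extend(shuffled[:n_holdout])
        splits["train"].extend(shuffled[n_holdout:])
    return splits


# Toy asset pool with made-up asset IDs.
pool = {
    "can": [f"can_{i}" for i in range(20)],
    "bag": [f"bag_{i}" for i in range(20)],
    "lobster": [f"lobster_{i}" for i in range(10)],
}
splits = split_assets(pool, testb_categories={"lobster"})
print({name: len(assets) for name, assets in splits.items()})
```

Splitting before synthesis means that no held-out asset can appear in a training render.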

Annotations#

All annotations are derived automatically from the Kubric rendering engine — there are no human annotators. The engine produces pixel-perfect instance masks, 2D/3D bounding boxes, and center points as part of the rendering process. VLM-based filtering (not annotation) is applied post-hoc to ensure label fidelity.

Citation#

@article{liu2026count,
  title={Count Anything at Any Granularity},
  author={Liu, Chang and Wu, Haoning and Xie, Weidi},
  journal={arXiv preprint arXiv:2605.10887},
  year={2026}
}

APA:

Liu, C., Wu, H., & Xie, W. (2026). Count Anything at Any Granularity. arXiv preprint arXiv:2605.10887.