Note
This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.
Dataset Card for KubriCount (subset)#

KubriCount is a large-scale synthetic benchmark for multi-grained visual counting, introduced in the paper Count Anything at Any Granularity (Liu, Wu & Xie, SJTU 2026). It reframes open-world counting as a prompt-following problem across five explicit semantic granularity levels, supported by the most comprehensively annotated counting dataset published to date.
This is a FiftyOne dataset with 6736 samples.
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/KubriCount")
# Launch the App
session = fo.launch_app(dataset)
Dataset Description#
Most counting datasets treat “what to count” as a single category-level matching problem. KubriCount exposes this limitation by requiring models to follow fine-grained prompts that specify which semantic level the user intends — from counting a specific object identity all the way up to an abstract concept — while excluding controlled distractors that differ by exactly one semantic factor.
Each scene is a 1024×1024 synthetic image produced by a four-stage automatic pipeline: 3D asset curation, controllable 3D rendering via Kubric + Blender, mask-conditioned image editing (Nano-Banana-Pro) to reduce the sim-to-real gap, and VLM-based quality filtering (Gemini-3-Pro) to ensure annotation fidelity.
Curated by: Chang Liu, Haoning Wu, Weidi Xie — School of Artificial Intelligence, Shanghai Jiao Tong University
License: Apache-2.0
Paper: arXiv:2605.10887
Dataset Sources#
Repository: Verg-Avesta/KubriCount
HuggingFace Dataset: liuchang666/KubriCount
Project Page: verg-avesta.github.io/KubriCount
Counting Granularity Levels#
KubriCount defines five levels of counting granularity. Each level specifies a target set and, for levels 2–5, a controlled distractor set that differs by exactly one semantic factor:
| Level | Granularity | Prompt example | Distractor |
|---|---|---|---|
| L1 | Identity | “Count all the dogs.” | None |
| L2 (size) | Attribute | “Count large cherries.” | Small cherries |
| L2 (color) | Attribute | “Count mustard sofas.” | Dark gray sofas |
| L3 | Category | “Count the cans.” | Bags |
| L4 | Instance type | “Count backpack A.” | Backpack B |
| L5 | Concept | “Count the lobsters.” | Octopuses |
Levels 2–5 generate two annotation queries per scene by swapping the target and distractor roles, which is why the total query count (198,702) exceeds the scene count (110,507).
Dataset Statistics#
| Split | Scenes | Queries | Purpose |
|---|---|---|---|
| train | 99,639 | 179,140 | Seen categories (normal + dense configurations, ~4:1 ratio) |
| testA | 5,462 | 9,837 | Unseen assets from training categories |
| testB | 5,406 | 9,725 | Entirely unseen categories |
| Total | 110,507 | 198,702 | |
Categories: 157 across 16 super-categories
Total annotated objects: ~7.3 million
Objects per image: 1–250 (capped at 250 by Kubric’s 256-instance limit)
Image resolution: 1024 × 1024 px
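The gap between scene and query counts can be checked with a little arithmetic. The sketch below assumes that every scene contributes either exactly one query (L1) or exactly two (L2–L5), as described above; under that assumption the number of dual-query scenes is simply queries minus scenes. The helper name `split_scene_types` is made up for illustration, and the per-level breakdown it produces is inferred, not a figure reported by the dataset authors.

```python
# Scene and query counts per split, copied from the statistics table.
SPLITS = {
    "train": (99_639, 179_140),
    "testA": (5_462, 9_837),
    "testB": (5_406, 9_725),
}


def split_scene_types(scenes: int, queries: int) -> tuple[int, int]:
    """Return (single_query_scenes, dual_query_scenes).

    Assumes single + dual == scenes and single + 2 * dual == queries,
    i.e. each scene yields exactly one or exactly two queries.
    """
    dual = queries - scenes
    single = scenes - dual
    return single, dual


# The per-split rows should add up to the "Total" row of the table.
total_scenes = sum(scenes for scenes, _ in SPLITS.values())
total_queries = sum(queries for _, queries in SPLITS.values())
assert (total_scenes, total_queries) == (110_507, 198_702)

for name, (scenes, queries) in SPLITS.items():
    single, dual = split_scene_types(scenes, queries)
    print(f"{name}: {single} single-query scenes, {dual} dual-query scenes")
```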
FiftyOne Dataset Structure#
The dataset is loaded into FiftyOne as a flat image dataset — one sample per counting query. Scenes with two queries (L2–L5) produce two samples pointing to the same filepath.
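Because two samples can point at the same image, it is sometimes useful to regroup queries by scene. The following is a minimal sketch that relies only on each sample exposing a `filepath` attribute; the helper name `group_queries_by_scene` and the toy paths are made up for illustration, and lightweight stand-in objects are used so the snippet runs without downloading anything.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_queries_by_scene(samples):
    """Map each image path to the list of samples (queries) that reference it."""
    scenes = defaultdict(list)
    for sample in samples:
        scenes[sample.filepath].append(sample)
    return dict(scenes)


# Toy stand-ins for FiftyOne samples; the paths are invented.
toy = [
    SimpleNamespace(filepath="/data/scene_0001.png"),
    SimpleNamespace(filepath="/data/scene_0002.png"),
    SimpleNamespace(filepath="/data/scene_0002.png"),
]

scenes = group_queries_by_scene(toy)
queries_per_scene = {path: len(group) for path, group in scenes.items()}
print(queries_per_scene)
```

Iterating over a loaded FiftyOne dataset yields samples with a `filepath` attribute, so passing the dataset itself to the helper should work in the same way. If only the per-image counts are needed, FiftyOne's `dataset.count_values("filepath")` aggregation is likely a faster route on large datasets.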
Sample Fields#
| Field | FiftyOne Type | Description |
|---|---|---|
|  |  | Path to |
|  |  | Relative path key matching the HuggingFace annotation files |
|  |  |  |
|  |  | Counting granularity level: 1–5 |
|  |  | Text label for the target objects to count |
|  |  | Ground truth object count |
|  |  | One |
|  |  | 2–8 few-shot exemplar bounding boxes in |
|  |  |  |
|  |  | Distractor label (empty string for L1) |
|  |  | Ground truth distractor count (0 for L1) |
|  |  | One |
|  |  | Few-shot exemplar boxes for the distractor class (None for L1) |
|  |  | e.g. |
Design Notes#
target_points as a counting sanity check: for any sample, len(sample.target_points.keypoints) == sample.count. This invariant holds by construction and can be used to verify import correctness.
example_boxes are not exhaustive: these are 2–8 manually selected exemplar crops used as few-shot visual prompts, not full ground-truth box coverage of all objects.
segmentation is an instance map: pixel values encode per-instance IDs as rendered by Kubric. It is not a semantic segmentation map.
Dual queries per scene (L2–L5): two FiftyOne samples share the same filepath but have swapped category/negative_category fields, representing the two valid counting queries for that scene.
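The two checks described in these notes can be sketched in a few lines. The example below uses the field names mentioned above (`target_points`, `count`, `filepath`, `category`, `negative_category`) and assumes `target_points` is always populated; the helper names and the toy samples are hypothetical, and simple stand-in objects replace real FiftyOne samples so the snippet runs on its own.

```python
from types import SimpleNamespace


def find_count_mismatches(samples):
    """Return (filepath, count, n_points) for samples that break the invariant
    len(sample.target_points.keypoints) == sample.count."""
    mismatches = []
    for sample in samples:
        n_points = len(sample.target_points.keypoints)
        if n_points != sample.count:
            mismatches.append((sample.filepath, sample.count, n_points))
    return mismatches


def is_swapped_pair(a, b):
    """True if two samples share an image and have target/distractor swapped."""
    return (
        a.filepath == b.filepath
        and a.category == b.negative_category
        and a.negative_category == b.category
    )


def make_sample(path, category, negative_category, count, n_points):
    """Build a toy stand-in for a FiftyOne sample (illustration only)."""
    points = SimpleNamespace(keypoints=[object()] * n_points)
    return SimpleNamespace(
        filepath=path,
        category=category,
        negative_category=negative_category,
        count=count,
        target_points=points,
    )


toy = [
    make_sample("a.png", "large cherries", "small cherries", 3, 3),
    make_sample("a.png", "small cherries", "large cherries", 5, 4),  # one point missing
]

bad = find_count_mismatches(toy)
paired = is_swapped_pair(toy[0], toy[1])
print(bad, paired)
```

An empty list from `find_count_mismatches` on the full dataset would suggest the import preserved the point annotations.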
Dataset Creation#
Generation Pipeline#
KubriCount is constructed in four automatic stages:
3D asset curation — ~58K assets across 157 categories sourced from ShapeNetCore-v2 and controllable 3D generation (TRELLIS family). ~5K HDRI environment maps sourced from Poly Haven and Text2Light.
Prototype synthesis — Kubric + PyBullet + Blender renders scenes with exact instance metadata (RGB, instance masks, 2D/3D boxes, center points). Level-specific composition rules control target/distractor selection.
Consistent image editing — Nano-Banana-Pro refines textures and harmonizes lighting while preserving topology (no instances added, removed, merged, or split). Level-aware constraints prevent edits that would corrupt the counting criterion.
Automatic data filtering — Gemini-3-Pro inspects each edited image against the prototype and masks, issuing PASS/FAIL. ~20% are rejected on the first pass; iterative re-editing reduces the final rejection rate to ~5%.
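The edit-then-filter loop in the last two stages can be pictured as a bounded retry. This is only a schematic sketch: the source does not say how many re-editing rounds are used or what the interfaces look like, so `edit`, `judge`, and `max_attempts` are hypothetical placeholders standing in for the image editor, the VLM PASS/FAIL check, and an assumed retry budget.

```python
def edit_until_pass(prototype, edit, judge, max_attempts=3):
    """Re-edit a prototype until the judge accepts the result.

    edit(prototype) returns a candidate edited image.
    judge(prototype, candidate) returns True for PASS and False for FAIL.
    Returns the first accepted candidate, or None if every attempt fails.
    """
    for _ in range(max_attempts):
        candidate = edit(prototype)
        if judge(prototype, candidate):
            return candidate
    return None


# Toy stand-ins: the first edit is rejected, the second is accepted.
outcomes = iter(["bad", "good"])
result = edit_until_pass(
    "prototype",
    edit=lambda proto: next(outcomes),
    judge=lambda proto, candidate: candidate == "good",
)

# A scene whose edits never pass is dropped (represented here by None).
dropped = edit_until_pass(
    "prototype",
    edit=lambda proto: "bad",
    judge=lambda proto, candidate: False,
    max_attempts=2,
)
print(result, dropped)
```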
Splits#
Dataset splits are enforced at the 3D asset level before synthesis:
Train: seen categories, full asset pool
TestA: unseen assets within training categories (~10% holdout per category)
TestB: unseen categories (~10% of total assets)
Both test splits use only unseen HDRI backgrounds and evaluate on the normal (non-dense) scene configuration.
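Splitting at the asset level means the partition is decided before any image is rendered, so no 3D asset can appear in more than one split. The sketch below illustrates that idea under stated assumptions: the held-out categories are passed in explicitly, the per-category holdout is drawn at random with a fixed seed, and the function name, parameters, and toy assets are all invented for illustration. It is not the authors' actual procedure.

```python
import random
from collections import defaultdict


def split_assets(assets, held_out_categories, holdout_frac=0.1, seed=0):
    """Partition (asset_id, category) pairs into train / testA / testB.

    testB receives every asset of a held-out category.
    testA receives roughly holdout_frac of the assets of each remaining category.
    train receives the rest.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for asset_id, category in assets:
        by_category[category].append(asset_id)

    splits = {"train": [], "testA": [], "testB": []}
    for category, ids in sorted(by_category.items()):
        if category in held_out_categories:
            splits["testB"].extend(ids)
            continue
        ids = sorted(ids)
        rng.shuffle(ids)
        # Hold out at least one asset when the category has more than one.
        n_holdout = round(len(ids) * holdout_frac)
        if len(ids) > 1:
            n_holdout = max(1, n_holdout)
        splits["testA"].extend(ids[:n_holdout])
        splits["train"].extend(ids[n_holdout:])
    return splits


# Toy asset pool: three categories with ten assets each.
assets = [(f"{cat}_{i}", cat) for cat in ("dog", "can", "lobster") for i in range(10)]
splits = split_assets(assets, held_out_categories={"lobster"})
print({name: len(ids) for name, ids in splits.items()})
```

Scenes would then be synthesized separately from each pool, which is what keeps TestA assets and TestB categories out of the training images.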
Annotations#
All annotations are derived automatically from the Kubric rendering engine — there are no human annotators. The engine produces pixel-perfect instance masks, 2D/3D bounding boxes, and center points as part of the rendering process. VLM-based filtering (not annotation) is applied post-hoc to ensure label fidelity.
Citation#
@article{liu2026count,
title={Count Anything at Any Granularity},
author={Liu, Chang and Wu, Haoning and Xie, Weidi},
journal={arXiv preprint arXiv:2605.10887},
year={2026}
}
APA:
Liu, C., Wu, H., & Xie, W. (2026). Count Anything at Any Granularity. arXiv preprint arXiv:2605.10887.