Note
This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.
Dataset Card for VisualOverload#

This is a FiftyOne dataset with 2,720 samples. It is a FiftyOne-format conversion of the original paulgavrikov/visualoverload dataset (CVPR 2026). All credit for the data, annotations, and benchmark design belongs to the original authors — please see Citation and Dataset Sources.
Installation#
If you haven’t already, install FiftyOne:
```bash
pip install -U fiftyone
```
Usage#
```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/VisualOverload")

# Launch the App
session = fo.launch_app(dataset)
```
Dataset Details#
Dataset Description#
Is basic visual understanding really solved in state-of-the-art VLMs? VisualOverload is a visual question answering (VQA) benchmark comprising 2,720 question–answer pairs with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. The dataset consists of 150 high-resolution scans of public-domain paintings populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. The images were manually annotated with questions across six task categories to probe a thorough understanding of the scene.
The authors hypothesize that current benchmarks overestimate the performance of VLMs, and that encoding and reasoning over details remains challenging, especially in densely populated scenes. Indeed, of the 37 models tested, even the best (o3) reaches only 19.6% accuracy on the hardest split and 69.5% overall. The accompanying error analysis reveals failure modes including weak counting, OCR failures, and logical inconsistencies under complex tasks.
Curated by: Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, and Hilde Kuehne
Shared by: Voxel51 (FiftyOne-format conversion)
Language(s) (NLP): en
License: CC BY-SA 4.0 (the underlying images are royalty-free public-domain artwork, CC0)
Dataset Sources#
Original dataset (please cite this): https://huggingface.co/datasets/paulgavrikov/visualoverload
Repository: https://github.com/paulgavrikov/visualoverload
Paper: VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes (arXiv:2509.25339)
Project page: https://paulgavrikov.github.io/visualoverload/
Leaderboard / online evaluator: https://huggingface.co/spaces/paulgavrikov/visualoverload-submit
Uses#
Direct Use#
Benchmark the fine-grained visual understanding of vision-language models (VLMs) in dense, detail-heavy scenes.
Slice and analyze results by question type, difficulty, and category using the prefixed sample tags (see Dataset Structure).
Run a VLM per question: each sample carries a single question (and a ready-to-use default_prompt), so a model can read the prompt from the sample field and write one prediction per sample, then submit question_id + predicted answer to the official evaluation server.
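The per-question loop described above could be sketched roughly as follows. This is a minimal illustration, not the official submission client: `build_predictions`, `dummy_vlm`, the toy records, and the `"response"` key are all hypothetical, and the exact payload expected by the evaluation server should be checked against the leaderboard instructions.

```python
import json
from typing import Callable, Iterable, List, Tuple


def build_predictions(
    records: Iterable[Tuple[str, str, str]],
    predict: Callable[[str, str], str],
) -> List[dict]:
    """Run `predict` once per (question_id, image_path, prompt) record."""
    predictions = []
    for question_id, image_path, prompt in records:
        answer = predict(image_path, prompt).strip()
        # "response" is a hypothetical key name, not the evaluator's confirmed schema.
        predictions.append({"question_id": question_id, "response": answer})
    return predictions


def dummy_vlm(image_path: str, prompt: str) -> str:
    # Stand-in for a real model call; always returns the same answer.
    return " A "


# Toy records with made-up ids, paths, and prompts.
toy_records = [
    ("q-0001", "/data/painting_a.jpg", "How many dogs are visible? Answer with a number."),
    ("q-0002", "/data/painting_a.jpg", "Is the door open? Options: A. yes B. no"),
]
predictions = build_predictions(toy_records, dummy_vlm)
print(json.dumps(predictions, indent=2))
```

With the dataset loaded through FiftyOne, `zip(*dataset.values(["question_id", "filepath", "default_prompt"]))` should yield records of this shape, assuming the field names match those described in this card.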
Out-of-Scope Use#
Training / fine-tuning. This is an evaluation benchmark; ground-truth answers are held privately and are intentionally not distributed.
Drawing conclusions about general image understanding outside the dense-scene, painting-domain setting the benchmark was designed for.
Dataset Structure#
The benchmark is modeled as one sample per question: 2,720 samples over 150 paintings
(each image is shared by the ~18 questions that reference it). Ground-truth answers are not
included; models are scored via the official evaluation server using each question’s
question_id. All samples belong to the single test split.
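As a quick sanity check on the "~18 questions per image" figure, the arithmetic and a per-image count can be sketched like this. The image ids in the toy list are invented for illustration; only the totals (2,720 and 150) come from this card.

```python
from collections import Counter

NUM_QUESTIONS = 2720
NUM_IMAGES = 150

# Average number of question samples that share one painting.
mean_questions_per_image = NUM_QUESTIONS / NUM_IMAGES


def questions_per_image(image_ids):
    """Count how many question samples reference each image id."""
    return Counter(image_ids)


# Toy example: three samples spread over two hypothetical paintings.
toy_ids = ["painting_a", "painting_a", "painting_b"]
counts = questions_per_image(toy_ids)
print(round(mean_questions_per_image, 2), dict(counts))
```

On the loaded dataset, `dataset.count_values("filepath")` should give the equivalent per-image breakdown, since samples that share a painting share its path.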
Fields
| Field | Type | Description |
|---|---|---|
| | image | Path to the painting (shared across its questions) |
| question_id | | Unique id; the key used for leaderboard submissions |
| question | | The question about the image |
| | | Answer options for |
| default_prompt | | Ready-to-use prompt (question + options + output-format constraint) |
| | | Painting id (filename stem); groups an image’s questions |
| | | Per-image model win-rate from the benchmark (a difficulty signal) |
| | | Image width/height (most images are ~4K, e.g. 3840×2160) |
Sample tags — question_type, difficulty, and category are stored as prefixed
sample tags (filter via the App sidebar or dataset.match_tags(...)). They are prefixed
because question_type and category share the values counting and ocr.
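| Tag prefix | Values |
|---|---|
| question_type: | choice, counting, ocr |
| difficulty: | easy, medium, hard |
| category: | activity, attributes, counting, ocr, reasoning, scene |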
Every sample is also tagged test.
```python
# Example: all hard OCR questions
from fiftyone import ViewField as F

hard_ocr = dataset.match_tags(["difficulty:hard", "question_type:ocr"], all=True)
```
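For slicing results outside the App, the prefixed tags can be split back into separate attributes. The helper below is a hypothetical sketch that assumes every structured tag has the `prefix:value` form used in the example above (the `category:` prefix is inferred by analogy), and that unprefixed tags such as test can be ignored.

```python
def parse_prefixed_tags(tags):
    """Split 'prefix:value' tags into a dict; unprefixed tags are skipped."""
    parsed = {}
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep:  # only keep tags that actually contain a colon
            parsed[prefix] = value
    return parsed


# Toy tag list for one sample; values follow the categories named in this card.
sample_tags = ["test", "difficulty:hard", "question_type:ocr", "category:ocr"]
info = parse_prefixed_tags(sample_tags)
print(info)
```

The prefix is what keeps the two ocr tags apart here: without it, a question whose type is ocr and one whose category is ocr would be indistinguishable.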
Dataset Creation#
Curation Rationale#
Existing VQA benchmarks largely probe near-global image understanding and may overestimate VLM capability. VisualOverload deliberately targets simple, knowledge-free perception (reading, counting, attribute and activity recognition, scene/relationship reasoning) in overloaded scenes that contain many figures, actions, and subplots, to expose the gap in encoding and reasoning over fine detail.
Source Data#
Data Collection and Processing#
The images are high-resolution scans of public-domain paintings (CC0). Most match a 4K pixel budget (≈ 3840×2160) across varying aspect ratios.
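One way to read "a 4K pixel budget across varying aspect ratios" is that the total pixel count stays near 3840 × 2160 while width and height trade off. The sketch below works through that arithmetic under this assumption; it is an interpretation for illustration, not the authors' documented resizing procedure, and `dims_for_aspect_ratio` is a hypothetical helper.

```python
import math

PIXEL_BUDGET = 3840 * 2160  # 8,294,400 pixels


def dims_for_aspect_ratio(aspect_ratio, budget=PIXEL_BUDGET):
    """Return (width, height) whose product is ~budget and width/height is ~aspect_ratio."""
    width = math.sqrt(budget * aspect_ratio)
    height = math.sqrt(budget / aspect_ratio)
    return round(width), round(height)


# A 16:9 painting and a square painting under the same budget.
print(dims_for_aspect_ratio(16 / 9))
print(dims_for_aspect_ratio(1.0))
```

Under this reading, a square painting would be about 2880 pixels on each side rather than 3840 wide.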
Annotations#
The images were manually annotated with questions spanning six task categories
(activity, attributes, counting, ocr, reasoning, scene), three difficulty levels
(easy, medium, hard), and three answer/question types (choice with 2 or 4 options,
freeform counting, and freeform ocr). Ground-truth answers are withheld to prevent
contamination and are only accessible through the evaluation server.
Bias, Risks, and Limitations#
Evaluation-only: ground truth is private; scoring requires the official server, so this copy cannot be used for supervised training or offline scoring.
Domain: the imagery is limited to scanned public-domain paintings; performance here may not transfer to photographs or other domains.
Scale: 150 source images / 2,720 questions — small relative to large-scale VQA corpora.
Recommendations#
Use VisualOverload as a targeted probe of fine-grained perception in dense scenes rather than a general VQA score. Report results by difficulty and category (the sample tags make this easy) and submit predictions to the official evaluator for comparable, leak-free numbers.
Citation#
If you use this dataset, please cite the original work:
BibTeX:
@InProceedings{Gavrikov_2026_visualoverload,
author = {Paul Gavrikov and Wei Lin and M. Jehanzeb Mirza and Soumya Jahagirdar and Muhammad Huzaifa and Sivan Doveh and Serena Yeung-Levy and James Glass and Hilde Kuehne},
title = {{VisualOverload}: Probing Visual Understanding of VLMs in Really Dense Scenes},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026}
}
APA:
Gavrikov, P., Lin, W., Mirza, M. J., Jahagirdar, S., Huzaifa, M., Doveh, S., Yeung-Levy, S., Glass, J., & Kuehne, H. (2026). VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).