Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for VisualOverload#

This is a FiftyOne dataset with 2,720 samples. It is a FiftyOne-format conversion of the original paulgavrikov/visualoverload dataset (CVPR 2026). All credit for the data, annotations, and benchmark design belongs to the original authors; see Citation and Dataset Sources.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/VisualOverload")

# Launch the App
session = fo.launch_app(dataset)

Dataset Details#

Dataset Description#

Is basic visual understanding really solved in state-of-the-art VLMs? VisualOverload is a visual question answering (VQA) benchmark comprising 2,720 question–answer pairs with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. The dataset consists of 150 high-resolution scans of public-domain paintings populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. The images were manually annotated with questions across six task categories to probe a thorough understanding of the scene.

The authors hypothesize that current benchmarks overestimate the performance of VLMs, and that encoding and reasoning over details remain challenging, especially in densely populated scenes. Indeed, the best of the 37 models tested (o3) reaches only 19.6% accuracy on the hardest split and 69.5% overall. The accompanying error analysis reveals failure modes including weak counting, OCR failures, and logical inconsistencies under complex tasks.

  • Curated by: Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, and Hilde Kuehne

  • Shared by: Voxel51 (FiftyOne-format conversion)

  • Language(s) (NLP): en

  • License: CC BY-SA 4.0 (the underlying images are royalty-free public-domain artwork, CC0)

Dataset Sources#

Uses#

Direct Use#

  • Benchmark the fine-grained visual understanding of vision-language models (VLMs) in dense, detail-heavy scenes.

  • Slice and analyze results by question type, difficulty, and category using the prefixed sample tags (see Dataset Structure).

  • Run a VLM per question. Each sample carries a single question and a ready-to-use default_prompt, so a model can read the prompt from the sample field and write one prediction per sample. Submit each question_id with its predicted answer to the official evaluation server.
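
The per-question workflow in the last bullet could be sketched as follows. This is a minimal sketch, not the official submission client: `predict` is a hypothetical stand-in for your own VLM call, and the record layout (`question_id`, `answer`) and JSON output are assumptions. Check the original dataset's instructions for the format the evaluation server actually expects. The sketch works on plain dictionaries so it runs without FiftyOne.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_predictions(
    samples: Iterable[Dict[str, str]],
    predict: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Run `predict` once per sample and collect one answer per question_id.

    Each sample is a dict using the card's field names: `question_id`,
    `filepath`, and `default_prompt`. `predict(filepath, prompt)` is a
    hypothetical wrapper around whatever VLM is being evaluated.
    """
    predictions = []
    seen = set()
    for sample in samples:
        qid = sample["question_id"]
        if qid in seen:
            raise ValueError(f"duplicate question_id: {qid}")
        seen.add(qid)
        answer = predict(sample["filepath"], sample["default_prompt"])
        # Assumed record layout; the real server may expect different keys.
        predictions.append({"question_id": qid, "answer": answer.strip()})
    return predictions


def dummy_predict(filepath: str, prompt: str) -> str:
    """Placeholder model: always answers "yes" (with stray whitespace)."""
    return "yes "


if __name__ == "__main__":
    toy_samples = [
        {"question_id": "q1", "filepath": "a.jpg", "default_prompt": "Is it raining?"},
        {"question_id": "q2", "filepath": "a.jpg", "default_prompt": "Is it night?"},
    ]
    print(json.dumps(build_predictions(toy_samples, dummy_predict), indent=2))
```

With FiftyOne, the input dictionaries could be built by iterating over the dataset and reading `sample["question_id"]`, `sample.filepath`, and `sample["default_prompt"]` from each sample.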

Out-of-Scope Use#

  • Training / fine-tuning. This is an evaluation benchmark; ground-truth answers are held privately and are intentionally not distributed.

  • Drawing conclusions about general image understanding outside the dense-scene, painting-domain setting the benchmark was designed for.

Dataset Structure#

The benchmark is modeled as one sample per question: 2,720 samples over 150 paintings, so each image is shared by the roughly 18 questions that reference it. Ground-truth answers are not included; models are scored via the official evaluation server using each question’s question_id. All samples belong to the single test split.
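
To make the one-sample-per-question layout concrete, the following sketch groups questions by `image_id` using plain Python lists. The two lists stand in for what `dataset.values("image_id")` and `dataset.values("question_id")` would return; the toy ids and the helper name `group_questions_by_image` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_questions_by_image(
    image_ids: Sequence[str], question_ids: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each painting id to the question ids that reference it."""
    if len(image_ids) != len(question_ids):
        raise ValueError("image_ids and question_ids must be the same length")
    groups: Dict[str, List[str]] = defaultdict(list)
    for image_id, question_id in zip(image_ids, question_ids):
        groups[image_id].append(question_id)
    return dict(groups)


# Toy stand-ins for dataset.values("image_id") / dataset.values("question_id")
image_ids = ["painting_a", "painting_a", "painting_b"]
question_ids = ["q1", "q2", "q3"]
groups = group_questions_by_image(image_ids, question_ids)
print({image_id: len(qids) for image_id, qids in groups.items()})

# The card's totals give the average number of questions per painting.
print(round(2720 / 150, 1))
```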

Fields

| Field | Type | Description |
| --- | --- | --- |
| filepath | image | Path to the painting (shared across its questions) |
| question_id | StringField | Unique id; the key used for leaderboard submissions |
| question | StringField | The question about the image |
| response_options | ListField(StringField) | Answer options for choice questions (e.g. ["yes", "no"]); empty otherwise. Listed as options in the source dataset. |
| default_prompt | StringField | Ready-to-use prompt (question + options + output-format constraint) |
| image_id | StringField | Painting id (filename stem); groups an image’s questions |
| win_rate | FloatField | Per-image model win-rate from the benchmark (a difficulty signal) |
| metadata | ImageMetadata | Image width/height (most images are ~4K, e.g. 3840×2160) |

Sample tags — question_type, difficulty, and category are stored as prefixed sample tags (filter via the App sidebar or dataset.match_tags(...)). They are prefixed because question_type and category share the values counting and ocr.

| Tag prefix | Values (counts) |
| --- | --- |
| question_type: | choice (2043), counting (559), ocr (118) |
| difficulty: | easy (986), medium (1304), hard (430) |
| category: | activity (150), attributes (149), counting (559), ocr (118), reasoning (356), scene (1388) |

Every sample is also tagged test.

# Example: all hard OCR questions
# all=True keeps only samples that carry both tags
hard_ocr = dataset.match_tags(["difficulty:hard", "question_type:ocr"], all=True)
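
Because the tags are plain `prefix:value` strings, they can also be tallied outside the App. The sketch below splits each tag on the first colon and counts values per prefix. It uses hard-coded toy tag lists in place of `dataset.values("tags")`, and the helper name `tally_prefixed_tags` is illustrative rather than part of FiftyOne.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def tally_prefixed_tags(tag_lists: Iterable[List[str]]) -> Dict[str, Counter]:
    """Count tag values per prefix, e.g. {"difficulty": Counter({"hard": 1})}.

    Tags without a colon (such as the split tag "test") are skipped.
    """
    tallies: Dict[str, Counter] = defaultdict(Counter)
    for tags in tag_lists:
        for tag in tags:
            prefix, sep, value = tag.partition(":")
            if sep:  # only prefixed tags carry a "prefix:value" pair
                tallies[prefix][value] += 1
    return dict(tallies)


# Toy stand-in for dataset.values("tags"): one list of tags per sample
toy_tags = [
    ["test", "question_type:ocr", "difficulty:hard", "category:ocr"],
    ["test", "question_type:choice", "difficulty:easy", "category:scene"],
    ["test", "question_type:counting", "difficulty:hard", "category:counting"],
]
tallies = tally_prefixed_tags(toy_tags)
print(tallies["difficulty"])
```

Keying on the prefix keeps `question_type:ocr` and `category:ocr` in separate counters, which is the collision the prefixes are meant to avoid. If only totals per full tag string are needed, FiftyOne's `dataset.count_sample_tags()` returns them directly.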

Dataset Creation#

Curation Rationale#

Existing VQA benchmarks largely probe near-global image understanding and may overestimate VLM capability. VisualOverload deliberately targets simple, knowledge-free perception (reading, counting, attribute and activity recognition, scene/relationship reasoning) in overloaded scenes that contain many figures, actions, and subplots, to expose the gap in encoding and reasoning over fine detail.

Source Data#

Data Collection and Processing#

The images are high-resolution scans of public-domain paintings (CC0). Most match a 4K pixel budget (≈ 3840×2160) across varying aspect ratios.

Annotations#

The images were manually annotated with questions spanning six task categories (activity, attributes, counting, ocr, reasoning, scene), three difficulty levels (easy, medium, hard), and three answer/question types (choice with 2 or 4 options, freeform counting, and freeform ocr). Ground-truth answers are withheld to prevent contamination and are only accessible through the evaluation server.

Bias, Risks, and Limitations#

  • Evaluation-only: ground truth is private; scoring requires the official server, so this copy cannot be used for supervised training or offline scoring.

  • Domain: the imagery is limited to scanned public-domain paintings; performance here may not transfer to photographs or other domains.

  • Scale: 150 source images / 2,720 questions — small relative to large-scale VQA corpora.

Recommendations#

Use VisualOverload as a targeted probe of fine-grained perception in dense scenes rather than a general VQA score. Report results by difficulty and category (the sample tags make this easy) and submit predictions to the official evaluator for comparable, leak-free numbers.

Citation#

If you use this dataset, please cite the original work:

BibTeX:

@InProceedings{Gavrikov_2026_visualoverload,
  author    = {Paul Gavrikov and Wei Lin and M. Jehanzeb Mirza and Soumya Jahagirdar and Muhammad Huzaifa and Sivan Doveh and Serena Yeung-Levy and James Glass and Hilde Kuehne},
  title     = {{VisualOverload}: Probing Visual Understanding of VLMs in Really Dense Scenes},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026}
}

APA:

Gavrikov, P., Lin, W., Mirza, M. J., Jahagirdar, S., Huzaifa, M., Doveh, S., Yeung-Levy, S., Glass, J., & Kuehne, H. (2026). VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Dataset Card Authors#

FiftyOne-format conversion shared by Voxel51. The dataset, annotations, and benchmark were created by Paul Gavrikov et al.; see the original dataset at paulgavrikov/visualoverload.