Note
This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.
Dataset Card for GQA-35k#

The GQA (Visual Reasoning in the Real World) dataset is a large-scale visual question answering dataset that includes scene graph annotations for each image.
This is a FiftyOne dataset with 35000 samples.
Note: This is a 35,000 sample subset which does not contain questions, only the scene graph annotations as detection-level attributes.
You can find the recipe notebook for creating the dataset here
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
import fiftyone.utils.huggingface as fouh
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = fouh.load_from_hub("Voxel51/GQA-Scene-Graph")
# Launch the App
session = fo.launch_app(dataset)
Dataset Details#
Dataset Description#
Scene Graph Annotations#
Each of the 113K images in GQA is associated with a detailed scene graph describing the objects, attributes and relations present.
The scene graphs are based on a cleaner version of the Visual Genome scene graphs.
For each image, the scene graph is provided as a dictionary (sceneGraph) containing:
Image metadata like width, height, location, weather
A dictionary (objects) mapping each object ID to its name, bounding box coordinates, attributes, and relations[6]
Relations are represented as triples specifying the predicate (e.g. “holding”, “on”, “left of”) and the target object ID[6]
Curated by: Drew Hudson & Christopher Manning
Shared by: Harpreet Sahota, Hacker-in-Residence at Voxel51
Language(s) (NLP): en
License:
GQA annotations (scene graphs, questions, programs) licensed under CC BY 4.0
Images sourced from Visual Genome may have different licensing terms
Dataset Sources#
Repository: https://cs.stanford.edu/people/dorarad/gqa/
Paper : https://arxiv.org/pdf/1902.09506
Demo: https://cs.stanford.edu/people/dorarad/gqa/vis.html
Dataset Structure#
Here’s the information presented as a markdown table:
Field |
Type |
Description |
|---|---|---|
location |
str |
Optional. The location of the image, e.g. kitchen, beach. |
weather |
str |
Optional. The weather in the image, e.g. sunny, cloudy. |
objects |
dict |
A dictionary from objectId to its object. |
object |
dict |
A visual element in the image (node). |
name |
str |
The name of the object, e.g. person, apple or sky. |
x |
int |
Horizontal position of the object bounding box (top left). |
y |
int |
Vertical position of the object bounding box (top left). |
w |
int |
The object bounding box width in pixels. |
h |
int |
The object bounding box height in pixels. |
attributes |
[str] |
A list of all the attributes of the object, e.g. blue, small, running. |
relations |
[dict] |
A list of all outgoing relations (edges) from the object (source). |
relation |
dict |
A triple representing the relation between source and target objects. |
Note: I’ve used non-breaking spaces ( ) to indent the nested fields in the ‘Field’ column to represent the hierarchy. This helps to visually distinguish the nested structure within the table.
Citation#
BibTeX:
@article{Hudson_2019,
title={GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering},
ISBN={9781728132938},
url={http://dx.doi.org/10.1109/CVPR.2019.00686},
DOI={10.1109/cvpr.2019.00686},
journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
publisher={IEEE},
author={Hudson, Drew A. and Manning, Christopher D.},
year={2019},
month={Jun}
}