Note
This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.
Dataset Card for Multimodal Mind2Web “Cross-Task” Test Split#
Note: This dataset is the test split of the Cross-Task dataset introduced in the paper.

This is a FiftyOne dataset with 1,338 samples.
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/mind2web_multimodal_test_task")
# Launch the App
session = fo.launch_app(dataset)
Dataset Description#
Curated by: The Ohio State University NLP Group (OSU-NLP-Group)
Shared by: OSU-NLP-Group on Hugging Face
Language(s) (NLP): en
License: OPEN-RAIL License
Dataset Source#
Repository: https://github.com/OSU-NLP-Group/SeeAct and https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web
Paper: “GPT-4V(ision) is a Generalist Web Agent, if Grounded” by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Demo: https://osu-nlp-group.github.io/SeeAct
Uses#
Direct Use#
Evaluating web agents’ ability to generalize to new tasks on familiar websites
Benchmarking LMMs and LLMs on web navigation tasks
Training and fine-tuning models for web navigation
Testing model performance on tasks that require following multi-step instructions
Out-of-Scope Use#
Developing web agents for harmful purposes (as stated in the paper’s impact statement)
Automating actions that could violate website terms of service
Creating agents that access users’ personal profiles or perform sensitive operations without consent
Dataset Structure#
Contains 177 tasks across 17 domains and 64 websites
Tasks average 7.6 actions each
Average 4,172 visual tokens per task
Average 607 HTML elements per task
Average 123,274 HTML tokens per task
Each example includes task descriptions, HTML structure, operations (CLICK, TYPE, SELECT), target elements with attributes, and action histories
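The sample count and the per-task average quoted in this card can be cross-checked with simple arithmetic. The sketch below assumes that every sample in the FiftyOne dataset corresponds to exactly one action step of one task. That assumption is inferred from the field descriptions, not stated by the authors.

```python
# Figures taken from this card
num_samples = 1338  # samples (screenshots) in the FiftyOne dataset
num_tasks = 177     # tasks in the Cross-Task test split

# Assumption: each sample is one action step belonging to one task
avg_actions_per_task = num_samples / num_tasks

print(f"Average actions per task: {avg_actions_per_task:.2f}")
```

Rounded to one decimal place, the result agrees with the reported average of 7.6 actions per task.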
FiftyOne Dataset Structure#
Basic Info: 1,338 web UI screenshots with task-based annotations
Core Fields:
action_uid: StringField - Unique action identifier
annotation_id: StringField - Annotation identifier
target_action_index: IntField - Index of target action in sequence
ground_truth: EmbeddedDocumentField(Detection) - Element to interact with:
    label: Action type (TYPE, CLICK)
    bounding_box: a list of relative bounding box coordinates in [0, 1] in the following format: [<top-left-x>, <top-left-y>, <width>, <height>]
    target_action_reprs: String representation of target action
website: EmbeddedDocumentField(Classification) - Website name
domain: EmbeddedDocumentField(Classification) - Website domain category
subdomain: EmbeddedDocumentField(Classification) - Website subdomain category
task_description: StringField - Natural language description of the task
full_sequence: ListField(StringField) - Complete sequence of actions for the task
previous_actions: ListField - Actions already performed in the sequence
current_action: StringField - Action to be performed
alternative_candidates: EmbeddedDocumentField(Detections) - Other possible elements
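Because bounding_box values are relative to the image size, they need to be scaled before they can be compared with pixel coordinates, for example a click point predicted by a model. The following is a minimal sketch of that conversion. The helper names (to_pixel_box, contains_point), the example box, and the 1280x720 image size are hypothetical and are not part of FiftyOne or the dataset.

```python
def to_pixel_box(rel_box, img_width, img_height):
    """Convert a relative [x, y, width, height] box to pixel (left, top, right, bottom)."""
    x, y, w, h = rel_box
    left = x * img_width
    top = y * img_height
    right = left + w * img_width
    bottom = top + h * img_height
    return (left, top, right, bottom)


def contains_point(rel_box, px, py, img_width, img_height):
    """Return True if the pixel point (px, py) falls inside the relative box."""
    left, top, right, bottom = to_pixel_box(rel_box, img_width, img_height)
    return left <= px <= right and top <= py <= bottom


# Hypothetical box and image size, for illustration only
box = [0.25, 0.5, 0.5, 0.25]
pixel_box = to_pixel_box(box, 1280, 720)
hit = contains_point(box, 640, 400, 1280, 720)
miss = contains_point(box, 100, 100, 1280, 720)
```

In practice the image dimensions would come from the sample itself. In FiftyOne they are typically available as sample.metadata.width and sample.metadata.height once dataset.compute_metadata() has been run.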
Dataset Creation#
Curation Rationale#
The Cross-Task split was specifically designed to evaluate an agent’s ability to generalize to new tasks on websites and domains it has already encountered during training.
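The property that defines a cross-task split (new tasks, previously seen websites) can be expressed as two set checks. The sketch below runs them on hypothetical toy records. The function name check_cross_task and the record layout are assumptions for illustration, and this is not the procedure the authors used to build the split.

```python
def check_cross_task(train, test):
    """Return (websites unseen in training, task descriptions repeated from training)."""
    train_sites = {ex["website"] for ex in train}
    train_tasks = {ex["task_description"] for ex in train}
    unseen_sites = {ex["website"] for ex in test} - train_sites
    repeated_tasks = {ex["task_description"] for ex in test} & train_tasks
    return unseen_sites, repeated_tasks


# Hypothetical toy records, not taken from the dataset
train = [
    {"website": "example-airline", "task_description": "Book a one-way flight"},
    {"website": "example-shop", "task_description": "Add a laptop to the cart"},
]
test = [
    {"website": "example-airline", "task_description": "Check the status of a flight"},
    {"website": "example-shop", "task_description": "Find the store return policy"},
]

unseen_sites, repeated_tasks = check_cross_task(train, test)
# For a cross-task split, both sets are expected to be empty
print(unseen_sites, repeated_tasks)
```

A Cross-Website or Cross-Domain split would instead be expected to produce a non-empty set of unseen websites.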
Source Data#
Data Collection and Processing#
Based on the original MIND2WEB dataset
Each HTML document is aligned with its corresponding webpage screenshot image
Underwent human verification to confirm element visibility and correct rendering for action prediction
Who are the source data producers?#
Web screenshots and HTML were collected from 64 websites across 17 domains that were also represented in the training data.
Annotations#
Annotation process#
Each task includes annotated action sequences showing the correct steps to complete the task. These were likely captured through a tool that records user actions on websites.
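Since each sample holds a single step, a task's full action sequence can be recovered by grouping steps and ordering them by index. The sketch below assumes that annotation_id identifies the task a step belongs to and that target_action_index gives the step's position within it. Both are inferred from the field descriptions in this card. The records and the helper name group_into_tasks are hypothetical.

```python
from collections import defaultdict


def group_into_tasks(steps):
    """Group per-step records by annotation_id and order each group by step index."""
    tasks = defaultdict(list)
    for step in steps:
        tasks[step["annotation_id"]].append(step)
    return {
        task_id: sorted(group, key=lambda s: s["target_action_index"])
        for task_id, group in tasks.items()
    }


# Hypothetical step records, deliberately out of order
steps = [
    {"annotation_id": "task-a", "target_action_index": 1, "current_action": "TYPE"},
    {"annotation_id": "task-b", "target_action_index": 0, "current_action": "CLICK"},
    {"annotation_id": "task-a", "target_action_index": 0, "current_action": "CLICK"},
]

tasks = group_into_tasks(steps)
ordered_actions = [s["current_action"] for s in tasks["task-a"]]
print(ordered_actions)
```

With the real dataset, the same grouping could be applied to values read from each sample, for example via dataset.values("annotation_id") and dataset.values("target_action_index").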
Who are the annotators?#
The annotators were likely researchers from The Ohio State University NLP Group or hired annotators; the paper does not give specific details.
Personal and Sensitive Information#
The dataset focuses on non-login tasks to comply with user agreements and avoid privacy issues.
Bias, Risks, and Limitations#
Performance on this split is generally better than on the Cross-Website and Cross-Domain splits, as models can leverage knowledge of website structures seen during training
Supervised fine-tuning methods show advantages on this split compared to in-context learning
The dataset may contain biases present in the original websites
Website layouts and functionality may change over time, affecting the validity of the dataset
Citation#
BibTeX:#
@inproceedings{zheng2024seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=piecKJ2DlB},
}
@inproceedings{deng2023mindweb,
title={Mind2Web: Towards a Generalist Agent for the Web},
author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=kiYqbO3wqw}
}
APA:#
Zheng, B., Gou, B., Kil, J., Sun, H., & Su, Y. (2024). GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv preprint arXiv:2401.01614.
Dataset Card Contact#
GitHub: https://github.com/OSU-NLP-Group/SeeAct