Note
This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.
Dataset Card for “Cross-Domain” Test Split in Multimodal Mind2Web#
Note: This dataset is the test split of the Cross-Domain dataset introduced in the paper.

This is a FiftyOne dataset with 4050 samples.
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/mind2web_multimodal_test_domain")
# Launch the App
session = fo.launch_app(dataset)
Dataset Description#
Curated by: The Ohio State University NLP Group (OSU-NLP-Group)
Shared by: OSU-NLP-Group on Hugging Face
Language(s) (NLP): en
License: OPEN-RAIL License
Dataset Sources#
Repository: https://github.com/OSU-NLP-Group/SeeAct and https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web
Paper: “GPT-4V(ision) is a Generalist Web Agent, if Grounded” by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Demo: https://osu-nlp-group.github.io/SeeAct
Uses#
Direct Use#
Evaluating web agents’ ability to generalize to entirely new domains
Testing zero-shot domain transfer capabilities of models
Benchmarking the true generalist capabilities of web agents
Assessing model performance in unseen web environments
Out-of-Scope Use#
Developing web agents for harmful purposes (as stated in the paper’s impact statement)
Automating actions that could violate website terms of service
Creating agents that access users’ personal profiles or perform sensitive operations without consent
Dataset Structure#
Contains 694 tasks across 13 domains and 53 websites
Tasks average 5.9 actions each
Average 4,314 visual tokens per task
Average 494 HTML elements per task
Average 91,163 HTML tokens per task
Each example includes task descriptions, HTML structure, operations (CLICK, TYPE, SELECT), target elements with attributes, and action histories
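The per-task averages above are task-level numbers, while the samples appear to be individual action steps. A minimal sketch of how steps might be regrouped into tasks follows. It assumes that all steps of one task share an `annotation_id` and that `target_action_index` gives their order (both fields are listed under FiftyOne Dataset Structure). The records and the `operation` key are hypothetical placeholders, not real dataset rows.

```python
from collections import defaultdict

# Hypothetical action-level records; values are placeholders, not real rows.
records = [
    {"annotation_id": "task-a", "target_action_index": 1, "operation": "TYPE"},
    {"annotation_id": "task-a", "target_action_index": 0, "operation": "CLICK"},
    {"annotation_id": "task-b", "target_action_index": 0, "operation": "CLICK"},
    {"annotation_id": "task-a", "target_action_index": 2, "operation": "SELECT"},
]


def group_into_tasks(rows):
    """Group action steps by task and order each task's steps by index."""
    tasks = defaultdict(list)
    for row in rows:
        tasks[row["annotation_id"]].append(row)
    for steps in tasks.values():
        steps.sort(key=lambda r: r["target_action_index"])
    return dict(tasks)


tasks = group_into_tasks(records)

# Average number of actions per task over the toy records
avg_actions = sum(len(steps) for steps in tasks.values()) / len(tasks)
print(avg_actions)
```

The same grouping could be applied to values exported from the loaded dataset, provided the assumption about `annotation_id` holds for this split.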
FiftyOne Dataset Structure#
Basic Info: 1,338 web UI screenshots with task-based annotations
Core Fields:
action_uid: StringField - Unique action identifier
annotation_id: StringField - Annotation identifier
target_action_index: IntField - Index of target action in sequence
ground_truth: EmbeddedDocumentField(Detection) - Element to interact with:
    label: Action type (TYPE, CLICK)
    bounding_box: a list of relative bounding box coordinates in [0, 1] in the following format: [<top-left-x>, <top-left-y>, <width>, <height>]
    target_action_reprs: String representation of target action
website: EmbeddedDocumentField(Classification) - Website name
domain: EmbeddedDocumentField(Classification) - Website domain category
subdomain: EmbeddedDocumentField(Classification) - Website subdomain category
task_description: StringField - Natural language description of the task
full_sequence: ListField(StringField) - Complete sequence of actions for the task
previous_actions: ListField - Actions already performed in the sequence
current_action: StringField - Action to be performed
alternative_candidates: EmbeddedDocumentField(Detections) - Other possible elements
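Since `bounding_box` stores relative coordinates, drawing a box or checking a predicted click usually means converting to pixels first. The sketch below shows one way to do that in plain Python. The helper names, the example box, and the 1280x720 screenshot size are illustrative assumptions, not values taken from the dataset.

```python
def to_pixel_box(rel_box, image_width, image_height):
    """Convert relative [x, y, w, h] to pixel (left, top, right, bottom)."""
    x, y, w, h = rel_box
    left = x * image_width
    top = y * image_height
    right = (x + w) * image_width
    bottom = (y + h) * image_height
    return (left, top, right, bottom)


def contains_point(rel_box, px, py):
    """Return True if the relative point (px, py) lies inside the box."""
    x, y, w, h = rel_box
    return x <= px <= x + w and y <= py <= y + h


# Hypothetical box and screenshot size, for illustration only
box = [0.25, 0.5, 0.5, 0.25]
pixel_box = to_pixel_box(box, image_width=1280, image_height=720)
print(pixel_box)

# A predicted click at the relative position (0.5, 0.6) falls inside the box
print(contains_point(box, 0.5, 0.6))
```

Keeping the containment check in relative coordinates avoids needing the image size at all when only a hit or miss is required.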
Dataset Creation#
Curation Rationale#
The Cross-Domain split was specifically designed to evaluate an agent’s ability to generalize to entirely new domains it hasn’t encountered during training, representing the most challenging generalization scenario.
Source Data#
Data Collection and Processing#
Based on the original MIND2WEB dataset
Each HTML document is aligned with its corresponding webpage screenshot image
Underwent human verification to confirm element visibility and correct rendering for action prediction
Includes only websites from top-level domain categories (in the dataset's taxonomy, not the DNS sense) that were held out from the training data
Who are the source data producers?#
Web screenshots and HTML were collected from 53 websites across 13 domains that were not represented in the training data.
Annotations#
Annotation process#
Each task includes annotated action sequences showing the correct steps to complete the task. These were likely captured through a tool that records user actions on websites.
Who are the annotators?#
Likely researchers from The Ohio State University NLP Group or hired annotators; the paper does not provide specific details.
Personal and Sensitive Information#
The dataset focuses on non-login tasks to comply with user agreements and avoid privacy issues.
Bias, Risks, and Limitations#
This split presents the most challenging generalization scenario as it tests performance on entirely unfamiliar domains
In-context learning methods with large models show better performance than supervised fine-tuning on this split
The gap between SeeAct_Oracle and the other methods is largest in this split (a 23.2% difference in step success rate)
Website layouts and functionality may change over time, affecting the validity of the dataset
Limited to the specific domains included; may not fully represent all possible web domains
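For readers unfamiliar with the step success rate mentioned above, here is a minimal sketch of how such a metric could be computed. It assumes the Mind2Web-style definition in which a step counts as successful only when both the selected element and the predicted operation match the ground truth. The dictionary keys and the toy steps are hypothetical and do not reproduce any reported result.

```python
def step_success_rate(steps):
    """Fraction of steps where both the element and the operation are correct."""
    if not steps:
        return 0.0
    hits = sum(
        1
        for s in steps
        if s["pred_element"] == s["gold_element"]
        and s["pred_operation"] == s["gold_operation"]
    )
    return hits / len(steps)


# Toy predictions (hypothetical): two of the four steps are fully correct
toy_steps = [
    {"pred_element": "e1", "gold_element": "e1",
     "pred_operation": "CLICK", "gold_operation": "CLICK"},
    {"pred_element": "e2", "gold_element": "e2",
     "pred_operation": "TYPE", "gold_operation": "SELECT"},  # wrong operation
    {"pred_element": "e9", "gold_element": "e3",
     "pred_operation": "CLICK", "gold_operation": "CLICK"},  # wrong element
    {"pred_element": "e4", "gold_element": "e4",
     "pred_operation": "TYPE", "gold_operation": "TYPE"},
]

rate = step_success_rate(toy_steps)
print(rate)
```

A gap between two methods would then be the difference of their rates on the same set of steps.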
Citation#
BibTeX:#
@inproceedings{zheng2024seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=piecKJ2DlB},
}
@inproceedings{deng2023mindweb,
title={Mind2Web: Towards a Generalist Agent for the Web},
author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=kiYqbO3wqw}
}
APA:#
Zheng, B., Gou, B., Kil, J., Sun, H., & Su, Y. (2024). GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv preprint arXiv:2401.01614.
Dataset Card Contact#
GitHub: https://github.com/OSU-NLP-Group/SeeAct