Dataset Card for “Cross-Domain” Test Split in Multimodal Mind2Web#

Note: This dataset is the test split of the Cross-Domain dataset introduced in the paper.

This is a FiftyOne dataset with 4050 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/mind2web_multimodal_test_domain")

# Launch the App
session = fo.launch_app(dataset)
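
For a quick look at the data, it can help to load only a handful of samples and slice them by field. The sketch below relies on the `max_samples` argument mentioned above and on the `ground_truth` field listed in this card. The helper name `load_preview` and the default sample count are illustrative assumptions, not part of the dataset's tooling.

```python
def load_preview(max_samples=25):
    """Load a small preview of the split and tally its action types.

    Hypothetical helper: calling it requires FiftyOne and network access.
    """
    # Imported lazily so that defining the helper does not require FiftyOne
    from fiftyone import ViewField as F
    from fiftyone.utils.huggingface import load_from_hub

    dataset = load_from_hub(
        "Voxel51/mind2web_multimodal_test_domain",
        max_samples=max_samples,
    )

    # Count samples per action type stored on the ground-truth detection
    label_counts = dataset.count_values("ground_truth.label")

    # View restricted to the CLICK steps
    clicks = dataset.match(F("ground_truth.label") == "CLICK")

    return dataset, label_counts, clicks
```

The returned view can be inspected in the App in the same way as the full dataset, for example with `fo.launch_app(view=clicks)`.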

Dataset Description#

Curated by: The Ohio State University NLP Group (OSU-NLP-Group)
Shared by: OSU-NLP-Group on Hugging Face
Language(s) (NLP): en
License: OpenRAIL

Dataset Sources#

Repository: https://github.com/OSU-NLP-Group/SeeAct and https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web
Paper: “GPT-4V(ision) is a Generalist Web Agent, if Grounded” by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Demo: https://osu-nlp-group.github.io/SeeAct

Uses#

Direct Use#

Evaluating web agents’ ability to generalize to entirely new domains
Testing zero-shot domain transfer capabilities of models
Benchmarking the true generalist capabilities of web agents
Assessing model performance in unseen web environments

Out-of-Scope Use#

Developing web agents for harmful purposes (as stated in the paper’s impact statement)
Automating actions that could violate website terms of service
Creating agents that access users’ personal profiles or perform sensitive operations without consent

Dataset Structure#

Contains 694 tasks across 13 domains and 53 websites
Tasks average 5.9 actions each
Average 4,314 visual tokens per task
Average 494 HTML elements per task
Average 91,163 HTML tokens per task
Each example includes task descriptions, HTML structure, operations (CLICK, TYPE, SELECT), target elements with attributes, and action histories
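
Per-task statistics such as the average number of actions can be recomputed by grouping per-action records by task. The sketch below is a minimal illustration on made-up records, assuming that `annotation_id` identifies the task each action belongs to. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def actions_per_task(records):
    """Return {annotation_id: number of action records} for the given records."""
    counts = defaultdict(int)
    for record in records:
        counts[record["annotation_id"]] += 1
    return dict(counts)


def mean_actions_per_task(records):
    """Average number of actions per task; 0.0 when there are no records."""
    counts = actions_per_task(records)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Toy records, invented for illustration (not taken from the dataset)
toy_records = [
    {"annotation_id": "task-a", "action_uid": "a1"},
    {"annotation_id": "task-a", "action_uid": "a2"},
    {"annotation_id": "task-a", "action_uid": "a3"},
    {"annotation_id": "task-b", "action_uid": "b1"},
]

per_task = actions_per_task(toy_records)
average = mean_actions_per_task(toy_records)
```

With the FiftyOne dataset loaded, `dataset.count_values("annotation_id")` should give a comparable per-task tally, although the result may differ from the figures above if some action steps are not present as samples.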

FiftyOne Dataset Structure#

Basic Info: 4,050 web UI screenshots with task-based annotations

Core Fields:

action_uid: StringField - Unique action identifier
annotation_id: StringField - Annotation identifier
target_action_index: IntField - Index of target action in sequence
ground_truth: EmbeddedDocumentField(Detection) - Element to interact with:
- label: Action type (TYPE, CLICK)
- bounding_box: a list of relative bounding box coordinates in [0, 1], in the format [<top-left-x>, <top-left-y>, <width>, <height>]
- target_action_reprs: String representation of target action
website: EmbeddedDocumentField(Classification) - Website name
domain: EmbeddedDocumentField(Classification) - Website domain category
subdomain: EmbeddedDocumentField(Classification) - Website subdomain category
task_description: StringField - Natural language description of the task
full_sequence: ListField(StringField) - Complete sequence of actions for the task
previous_actions: ListField - Actions already performed in the sequence
current_action: StringField - Action to be performed
alternative_candidates: EmbeddedDocumentField(Detections) - Other possible elements
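
Because `bounding_box` values are relative to the image size, drawing or cropping a target element requires converting them to pixels. The sketch below shows that conversion. The function name `to_pixel_box` and the example image size are assumptions for illustration, and the actual screenshot dimensions would have to be read from each sample.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, width, height] box to pixel corners.

    Returns (left, top, right, bottom) rounded to whole pixels.
    """
    x, y, w, h = bounding_box
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + w) * image_width)
    bottom = round((y + h) * image_height)
    return left, top, right, bottom


# Made-up box on a hypothetical 1280x720 screenshot
pixel_box = to_pixel_box([0.25, 0.5, 0.5, 0.25], image_width=1280, image_height=720)
print(pixel_box)  # (320, 360, 960, 540)
```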

Dataset Creation#

Curation Rationale#

The Cross-Domain split was specifically designed to evaluate an agent’s ability to generalize to entirely new domains it hasn’t encountered during training, representing the most challenging generalization scenario.

Source Data#

Data Collection and Processing#

Based on the original Mind2Web dataset
Each HTML document is aligned with its corresponding webpage screenshot image
Underwent human verification to confirm element visibility and correct rendering for action prediction
Includes only websites from top-level domain categories (website categories, not DNS top-level domains) that were held out from the training data

Who are the source data producers?#

Web screenshots and HTML were collected from 53 websites across 13 domains that were not represented in the training data.

Annotations#

Annotation process#

Each task includes annotated action sequences showing the correct steps to complete the task. These were likely captured through a tool that records user actions on websites.

Who are the annotators?#

The annotators were either researchers from The Ohio State University NLP Group or hired annotators; the paper does not provide specific details.

Personal and Sensitive Information#

The dataset focuses on non-login tasks to comply with user agreements and avoid privacy issues.

Bias, Risks, and Limitations#

This split presents the most challenging generalization scenario as it tests performance on entirely unfamiliar domains
In-context learning methods with large models show better performance than supervised fine-tuning on this split
The gap between SeeAct_Oracle (SeeAct with oracle grounding) and other methods is largest in this split (23.2% step success rate difference)
Website layouts and functionality may change over time, affecting the validity of the dataset
Limited to the specific domains included; may not fully represent all possible web domains
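
The step success rate mentioned above follows the Mind2Web evaluation, where a step counts as successful only when both the selected element and the predicted operation are correct. The sketch below is a simplified illustration: it treats operation correctness as an exact string match, whereas the benchmark scores operations with a token-level F1. The record keys `element` and `operation` and the toy data are hypothetical.

```python
def evaluate_steps(predictions, references):
    """Compute element accuracy and step success rate over aligned steps.

    Simplification: an operation is correct only on exact string match.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not references:
        return {"element_accuracy": 0.0, "step_success_rate": 0.0}

    element_hits = 0
    step_hits = 0
    for pred, ref in zip(predictions, references):
        element_ok = pred["element"] == ref["element"]
        operation_ok = pred["operation"] == ref["operation"]
        element_hits += element_ok
        # A step succeeds only if element and operation are both correct
        step_hits += element_ok and operation_ok

    total = len(references)
    return {
        "element_accuracy": element_hits / total,
        "step_success_rate": step_hits / total,
    }


# Toy predictions and references, invented for illustration
references = [
    {"element": "e1", "operation": "CLICK"},
    {"element": "e2", "operation": "TYPE: boston"},
    {"element": "e3", "operation": "SELECT: economy"},
    {"element": "e4", "operation": "CLICK"},
]
predictions = [
    {"element": "e1", "operation": "CLICK"},         # element and operation correct
    {"element": "e2", "operation": "TYPE: austin"},  # element correct, operation wrong
    {"element": "e3", "operation": "SELECT: economy"},
    {"element": "e9", "operation": "CLICK"},         # wrong element
]

scores = evaluate_steps(predictions, references)
```

On this toy data the element accuracy is 0.75 and the step success rate is 0.5, which shows why the step-level metric is the stricter of the two.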

Citation#

BibTeX:#

@inproceedings{zheng2024seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=piecKJ2DlB},
}

@inproceedings{deng2023mindweb,
  title={Mind2Web: Towards a Generalist Agent for the Web},
  author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=kiYqbO3wqw}
}

APA:#

Zheng, B., Gou, B., Kil, J., Sun, H., & Su, Y. (2024). GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv preprint arXiv:2401.01614.

Dataset Card Contact#

GitHub: https://github.com/OSU-NLP-Group/SeeAct