Note

This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.


Dataset Card for “Cross-Domain” Test Split in Multimodal Mind2Web#

Note: This dataset is the Cross-Domain test split of Multimodal Mind2Web, introduced in the paper “GPT-4V(ision) is a Generalist Web Agent, if Grounded”.


This is a FiftyOne dataset with 4050 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/mind2web_multimodal_test_domain")

# Launch the App
session = fo.launch_app(dataset)
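
Once loaded, the dataset can be explored with standard FiftyOne aggregations and views. The following is a minimal sketch rather than an official recipe: it assumes the field names described in this card (`domain` as a Classification, `ground_truth` as a Detection whose label is the action type), and the helper names `top_k` and `explore_subset`, as well as the local dataset name, are hypothetical.

```python
def top_k(counts, k=5):
    """Return the k most frequent (value, count) pairs from a counts dict."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]


def explore_subset(max_samples=25):
    """Download a small subset from the Hub and print a few summaries."""
    # Imported lazily so that top_k() works without FiftyOne installed
    from fiftyone import ViewField as F
    from fiftyone.utils.huggingface import load_from_hub

    dataset = load_from_hub(
        "Voxel51/mind2web_multimodal_test_domain",
        max_samples=max_samples,
        name="mind2web-domain-subset",  # hypothetical local name
        overwrite=True,  # replace an earlier copy with the same name
    )

    # Assumes `domain` is a Classification field, as described in this card
    domain_counts = dataset.count_values("domain.label")
    print(top_k(domain_counts))

    # Assumes `ground_truth.label` holds the action type (e.g. CLICK, TYPE)
    clicks = dataset.match(F("ground_truth.label") == "CLICK")
    print(f"{len(clicks)} of {len(dataset)} samples are CLICK actions")

    return dataset
```

Calling `explore_subset()` needs network access and a working FiftyOne installation; `top_k` is plain Python and can be reused on any dict of counts.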

Dataset Description#

Curated by: The Ohio State University NLP Group (OSU-NLP-Group)
Shared by: OSU-NLP-Group on Hugging Face
Language(s) (NLP): en
License: OPEN-RAIL License

Dataset Sources#

Repository: https://github.com/OSU-NLP-Group/SeeAct and https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web
Paper: “GPT-4V(ision) is a Generalist Web Agent, if Grounded” by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Demo: https://osu-nlp-group.github.io/SeeAct

Uses#

Direct Use#

  • Evaluating web agents’ ability to generalize to entirely new domains

  • Testing zero-shot domain transfer capabilities of models

  • Benchmarking the true generalist capabilities of web agents

  • Assessing model performance in unseen web environments

Out-of-Scope Use#

  • Developing web agents for harmful purposes (as stated in the paper’s impact statement)

  • Automating actions that could violate website terms of service

  • Creating agents that access users’ personal profiles or perform sensitive operations without consent

Dataset Structure#

  • Contains 694 tasks across 13 domains and 53 websites

  • Tasks average 5.9 actions each

  • Average 4,314 visual tokens per task

  • Average 494 HTML elements per task

  • Average 91,163 HTML tokens per task

  • Each example includes task descriptions, HTML structure, operations (CLICK, TYPE, SELECT), target elements with attributes, and action histories

FiftyOne Dataset Structure#

Basic Info: 4,050 web UI screenshots with task-based annotations (one sample per action step)

Core Fields:

  • action_uid: StringField - Unique action identifier

  • annotation_id: StringField - Annotation identifier

  • target_action_index: IntField - Index of target action in sequence

  • ground_truth: EmbeddedDocumentField(Detection) - Element to interact with:

    • label: Action type (CLICK, TYPE, SELECT)

    • bounding_box: a list of relative bounding box coordinates in [0, 1], in the format [<top-left-x>, <top-left-y>, <width>, <height>]

    • target_action_reprs: String representation of target action

  • website: EmbeddedDocumentField(Classification) - Website name

  • domain: EmbeddedDocumentField(Classification) - Website domain category

  • subdomain: EmbeddedDocumentField(Classification) - Website subdomain category

  • task_description: StringField - Natural language description of the task

  • full_sequence: ListField(StringField) - Complete sequence of actions for the task

  • previous_actions: ListField - Actions already performed in the sequence

  • current_action: StringField - Action to be performed

  • alternative_candidates: EmbeddedDocumentField(Detections) - Other possible elements
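
Because `bounding_box` values are relative to the image size, they need to be scaled before they can be drawn on a screenshot or compared with pixel-level click coordinates. A minimal sketch of that conversion follows; the function name `to_pixel_box` and the example image size are illustrative assumptions, not part of the dataset or of FiftyOne.

```python
def to_pixel_box(rel_box, img_width, img_height):
    """Convert a relative [x, y, w, h] box to absolute (x1, y1, x2, y2) pixels."""
    x, y, w, h = rel_box
    x1 = round(x * img_width)
    y1 = round(y * img_height)
    x2 = round((x + w) * img_width)
    y2 = round((y + h) * img_height)
    return x1, y1, x2, y2


# Hypothetical box on a hypothetical 1280x720 screenshot
box = to_pixel_box([0.25, 0.5, 0.5, 0.25], img_width=1280, img_height=720)
print(box)  # (320, 360, 960, 540)
```

The output uses corner coordinates instead of width and height, which is the form most drawing and IoU utilities expect.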

Dataset Creation#

Curation Rationale#

The Cross-Domain split was specifically designed to evaluate an agent’s ability to generalize to entirely new domains it hasn’t encountered during training, representing the most challenging generalization scenario.

Source Data#

Data Collection and Processing#

  • Based on the original Mind2Web dataset

  • Each HTML document is aligned with its corresponding webpage screenshot image

  • Underwent human verification to confirm element visibility and correct rendering for action prediction

  • Specifically includes websites from domains (topical categories of websites, not DNS top-level domains) that were held out from the training data

Who are the source data producers?#

Web screenshots and HTML were collected from 53 websites across 13 domains that were not represented in the training data.

Annotations#

Annotation process#

Each task includes annotated action sequences showing the correct steps to complete the task. These were likely captured through a tool that records user actions on websites.
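
Since each sample corresponds to a single action step, a task's full sequence can be rebuilt by grouping steps and ordering them. The sketch below assumes that `annotation_id` identifies the task a step belongs to and that `target_action_index` gives its position in the sequence; the function name and the toy records are made up for illustration.

```python
from collections import defaultdict


def group_steps_by_task(records):
    """Group per-step records by annotation_id, ordered by target_action_index."""
    tasks = defaultdict(list)
    for rec in records:
        tasks[rec["annotation_id"]].append(rec)

    return {
        task_id: [
            step["current_action"]
            for step in sorted(steps, key=lambda s: s["target_action_index"])
        ]
        for task_id, steps in tasks.items()
    }


# Hypothetical toy records, not taken from the dataset
records = [
    {"annotation_id": "t1", "target_action_index": 1, "current_action": "TYPE: boston"},
    {"annotation_id": "t2", "target_action_index": 0, "current_action": "CLICK: menu"},
    {"annotation_id": "t1", "target_action_index": 0, "current_action": "CLICK: search box"},
]

sequences = group_steps_by_task(records)
print(sequences["t1"])  # ['CLICK: search box', 'TYPE: boston']
```

With the FiftyOne dataset loaded, records of this shape could be assembled from the `annotation_id`, `target_action_index`, and `current_action` fields of each sample.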

Who are the annotators?#

Researchers from The Ohio State University NLP Group or hired annotators, though specific details aren’t provided in the paper.

Personal and Sensitive Information#

The dataset focuses on non-login tasks to comply with user agreements and avoid privacy issues.

Bias, Risks, and Limitations#

  • This split presents the most challenging generalization scenario as it tests performance on entirely unfamiliar domains

  • In-context learning methods with large models show better performance than supervised fine-tuning on this split

  • The gap between SeeAct_Oracle (SeeAct with oracle grounding) and other methods is largest in this split (23.2% step success rate difference)

  • Website layouts and functionality may change over time, affecting the validity of the dataset

  • Limited to the specific domains included; may not fully represent all possible web domains
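
For readers unfamiliar with the step success rate mentioned above: in the Mind2Web evaluation protocol, a step counts as successful only when both the selected element and the predicted operation are correct. The following is a simplified sketch of that idea, not the official evaluation script. The record keys (`element_id`, `operation`) are hypothetical, and it takes a plain average over steps, whereas the papers' scripts may aggregate differently (for example, per task).

```python
def step_success_rate(predictions, references):
    """Fraction of steps where both the element and the operation match."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    hits = sum(
        1
        for pred, ref in zip(predictions, references)
        if pred["element_id"] == ref["element_id"]
        and pred["operation"] == ref["operation"]
    )
    return hits / len(references)


# Hypothetical toy data: 4 steps, 2 fully correct
references = [
    {"element_id": "e1", "operation": "CLICK"},
    {"element_id": "e2", "operation": "TYPE: boston"},
    {"element_id": "e3", "operation": "SELECT: economy"},
    {"element_id": "e4", "operation": "CLICK"},
]
predictions = [
    {"element_id": "e1", "operation": "CLICK"},           # correct
    {"element_id": "e2", "operation": "TYPE: chicago"},   # wrong value
    {"element_id": "e9", "operation": "SELECT: economy"}, # wrong element
    {"element_id": "e4", "operation": "CLICK"},           # correct
]

print(step_success_rate(predictions, references))  # 0.5
```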

Citation#

BibTeX:#

@inproceedings{zheng2024seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=piecKJ2DlB},
}

@inproceedings{deng2023mindweb,
  title={Mind2Web: Towards a Generalist Agent for the Web},
  author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=kiYqbO3wqw}
}

APA:#

Zheng, B., Gou, B., Kil, J., Sun, H., & Su, Y. (2024). GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv preprint arXiv:2401.01614.

Dataset Card Contact#

GitHub: https://github.com/OSU-NLP-Group/SeeAct