Note

This is a Hugging Face dataset. Learn how to load datasets from the Hub in the Hugging Face integration docs.

Hugging Face

Dataset Card for Multimodal Mind2Web “Cross-Task” Test Split#

Note: This dataset is the test split of the Cross-Task dataset introduced in the paper.

image/png

This is a FiftyOne dataset with 1338 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/mind2web_multimodal_test_task")

# Launch the App
session = fo.launch_app(dataset)

Dataset Description#

Curated by: The Ohio State University NLP Group (OSU-NLP-Group)
Shared by: OSU-NLP-Group on Hugging Face
Language(s) (NLP): en
License: OPEN-RAIL License

Dataset Source#

Repository: https://github.com/OSU-NLP-Group/SeeAct and https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web
Paper: “GPT-4V(ision) is a Generalist Web Agent, if Grounded” by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Demo: https://osu-nlp-group.github.io/SeeAct

Uses#

Direct Use#

  • Evaluating web agents’ ability to generalize to new tasks on familiar websites

  • Benchmarking LMMs and LLMs on web navigation tasks

  • Training and fine-tuning models for web navigation

  • Testing model performance on tasks that require following multi-step instructions

Out-of-Scope Use#

  • Developing web agents for harmful purposes (as stated in the paper’s impact statement)

  • Automating actions that could violate website terms of service

  • Creating agents that access users’ personal profiles or perform sensitive operations without consent

Dataset Structure#

  • Contains 177 tasks across 17 domains and 64 websites

  • Tasks average 7.6 actions each

  • Average 4,172 visual tokens per task

  • Average 607 HTML elements per task

  • Average 123,274 HTML tokens per task

  • Each example includes task descriptions, HTML structure, operations (CLICK, TYPE, SELECT), target elements with attributes, and action histories

FiftyOne Dataset Structure#

Basic Info: 1,338 web UI screenshots with task-based annotations

Core Fields:

  • action_uid: StringField - Unique action identifier

  • annotation_id: StringField - Annotation identifier

  • target_action_index: IntField - Index of target action in sequence

  • ground_truth: EmbeddedDocumentField(Detection) - Element to interact with:

    • label: Action type (TYPE, CLICK)

    • bounding_box: a list of relative bounding box coordinates in [0, 1] in the following format: <top-left-x>, <top-left-y>, <width>, <height>]

    • target_action_reprs: String representation of target action

  • website: EmbeddedDocumentField(Classification) - Website name

  • domain: EmbeddedDocumentField(Classification) - Website domain category

  • subdomain: EmbeddedDocumentField(Classification) - Website subdomain category

  • task_description: StringField - Natural language description of the task

  • full_sequence: ListField(StringField) - Complete sequence of actions for the task

  • previous_actions: ListField - Actions already performed in the sequence

  • current_action: StringField - Action to be performed

  • alternative_candidates: EmbeddedDocumentField(Detections) - Other possible elements

Dataset Creation#

Curation Rationale#

The Cross-Task split was specifically designed to evaluate an agent’s ability to generalize to new tasks on websites and domains it has already encountered during training.

Source Data#

Data Collection and Processing#

  • Based on the original MIND2WEB dataset

  • Each HTML document is aligned with its corresponding webpage screenshot image

  • Underwent human verification to confirm element visibility and correct rendering for action prediction

Who are the source data producers?#

Web screenshots and HTML were collected from 64 websites across 17 domains that were also represented in the training data.

Annotations#

Annotation process#

Each task includes annotated action sequences showing the correct steps to complete the task. These were likely captured through a tool that records user actions on websites.

Who are the annotators?#

Researchers from The Ohio State University NLP Group or hired annotators, though specific details aren’t provided in the paper.

Personal and Sensitive Information#

The dataset focuses on non-login tasks to comply with user agreements and avoid privacy issues.

Bias, Risks, and Limitations#

  • Performance on this split is generally better than Cross-Website and Cross-Domain, as models can leverage knowledge of website structures

  • Supervised fine-tuning methods show advantages on this split compared to in-context learning

  • The dataset may contain biases present in the original websites

  • Website layouts and functionality may change over time, affecting the validity of the dataset

Citation#

BibTeX:#

@article{zheng2024seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=piecKJ2DlB},
}

@inproceedings{deng2023mindweb,
  title={Mind2Web: Towards a Generalist Agent for the Web},
  author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=kiYqbO3wqw}
}

APA:#

Zheng, B., Gou, B., Kil, J., Sun, H., & Su, Y. (2024). GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv preprint arXiv:2401.01614.

Dataset Card Contact#

GitHub: https://github.com/OSU-NLP-Group/SeeAct