Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

GUI-Actor FiftyOne Integration#

A FiftyOne integration for Microsoft’s GUI-Actor vision-language models, enabling GUI automation and visual interface analysis with rich attention visualization.

Overview#

GUI-Actor is a multimodal foundation model designed for GUI automation tasks. This integration brings GUI-Actor’s capabilities to FiftyOne, allowing you to:

  • Predict interaction points on GUI screenshots with confidence scores

  • Visualize attention maps showing where the model focuses

  • Analyze GUI understanding across datasets of interface screenshots

  • Evaluate model performance on GUI automation tasks

Features#

  • Keypoint Detection: Identifies optimal interaction points for GUI automation

  • Attention Heatmaps: Automatically stores attention maps on samples for visualization

  • Multiple Model Sizes: Support for both 3B and 7B parameter variants

  • Flexible Prompting: Use custom prompts or dataset instruction fields

  • Seamless Integration: Works with FiftyOne’s dataset management and visualization

Installation#

# Install FiftyOne
pip install fiftyone

Quick Start#

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import load_from_hub

# Load a GUI dataset
dataset = load_from_hub("Voxel51/ScreenSpot-v2", shuffle=True)

# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/gui_actor")

# Load the GUI-Actor model
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")

# Apply model to dataset
# Keypoints are stored in "guiactor_output"
# Attention heatmaps are automatically stored in "gui_actor_heatmap"
dataset.apply_model(
    model, 
    prompt_field="instruction",  # Use dataset's instruction field
    label_field="guiactor_output"
)

# Visualize results
session = fo.launch_app(dataset)

Model Variants#

| Model | Parameters | Description |
| --- | --- | --- |
| `microsoft/GUI-Actor-3B-Qwen2.5-VL` | 3B | Lightweight version for faster inference |
| `microsoft/GUI-Actor-7B-Qwen2.5-VL` | 7B | Full-size model with best performance |

Output Format#

The model stores two fields on each sample:

  • Keypoints (label_field): Interaction points with confidence scores

  • Attention Heatmap (gui_actor_heatmap): Attention map stored as fo.Heatmap
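When working with the predictions downstream, a common first step is picking the most confident interaction point. A minimal sketch, using plain dictionaries that mirror the keypoint structure below (the values here are made up for illustration; on a real dataset you would read these from the sample's label field):

```python
# Keypoint-like records mirroring the integration's output format.
# The sample values are illustrative, not real model output.
keypoints = [
    {"label": "top_interaction_point", "points": [[0.42, 0.17]], "confidence": [0.91]},
    {"label": "top_interaction_point", "points": [[0.10, 0.80]], "confidence": [0.33]},
]

# Select the candidate with the highest confidence score
best = max(keypoints, key=lambda kp: kp["confidence"][0])
x, y = best["points"][0]  # normalized [0, 1] coordinates
```

The same `max` over `confidence` applies when the model returns multiple candidate regions for a single instruction.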

Keypoint Structure#

fo.Keypoint(
    label="top_interaction_point",
    points=[[x, y]],  # Normalized coordinates [0,1]
    confidence=[confidence_score],  # Model confidence
    reasoning="the model's output text"  # Custom attribute
)
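Because the keypoints are normalized to `[0, 1]`, driving an actual click requires scaling them by the screenshot's pixel dimensions. A minimal sketch (the helper name and example resolution are assumptions, not part of the integration):

```python
def to_pixels(point, width, height):
    """Convert a normalized [0, 1] keypoint to integer pixel coordinates.

    This helper is illustrative; it is not part of the plugin's API.
    """
    x, y = point
    return round(x * width), round(y * height)

# e.g. a point at (0.5, 0.25) on a hypothetical 1920x1080 screenshot
px, py = to_pixels([0.5, 0.25], 1920, 1080)
```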

Attention Heatmap#

  • Stored automatically as gui_actor_heatmap field on each sample

  • Contains normalized attention scores in [0, 1] range

  • Stored at native model resolution (FiftyOne handles resizing for visualization)

  • Visualize in the FiftyOne App as a heatmap overlay
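The integration handles this normalization internally, but the `[0, 1]` scaling above is just a min-max rescale of the raw attention scores. A minimal sketch of that step, using plain Python lists (raw scores here are invented for illustration):

```python
def normalize_attention(scores):
    """Min-max rescale raw attention scores into the [0, 1] range.

    Illustrative sketch only; the plugin performs its own normalization.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Flat attention map: no location stands out, so return all zeros
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

normalized = normalize_attention([0.2, 1.0, 1.8])
```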

Advanced Usage#

Custom Prompts#

# Use a custom prompt instead of dataset field
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")
model.prompt = "Click the login button"

# Apply to dataset (apply_model writes results in place and returns nothing)
dataset.apply_model(model, label_field="custom_predictions")

Integration Details#

Model Architecture#

  • Based on Qwen2.5-VL with pointer generation capabilities

  • Uses attention-based grounding for spatial understanding

  • Supports multiple candidate region detection

  • Implements specialized pointer tokens for coordinate generation

License#

This integration is licensed under the Apache 2.0 License. The GUI-Actor models are licensed under the MIT License.

Citation#

@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}