Note
This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
GUI-Actor FiftyOne Integration#
A FiftyOne integration for Microsoft’s GUI-Actor vision-language models, enabling GUI automation and visual interface analysis with rich attention visualization.
Overview#
GUI-Actor is a multimodal foundation model designed for GUI automation tasks. This integration brings GUI-Actor’s capabilities to FiftyOne, allowing you to:
Predict interaction points on GUI screenshots with confidence scores
Visualize attention maps showing where the model focuses
Analyze GUI understanding across datasets of interface screenshots
Evaluate model performance on GUI automation tasks
Features#
Keypoint Detection: Identifies optimal interaction points for GUI automation
Attention Heatmaps: Automatically stores attention maps on samples for visualization
Multiple Model Sizes: Support for both 3B and 7B parameter variants
Flexible Prompting: Use custom prompts or dataset instruction fields
Seamless Integration: Works with FiftyOne’s dataset management and visualization
Installation#
# Install FiftyOne
pip install fiftyone
Quick Start#
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import load_from_hub
# Load a GUI dataset
dataset = load_from_hub("Voxel51/ScreenSpot-v2", shuffle=True)
# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/gui_actor")
# Load the GUI-Actor model
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")
# Apply model to dataset
# Keypoints are stored in "guiactor_output"
# Attention heatmaps are automatically stored in "gui_actor_heatmap"
dataset.apply_model(
    model,
    prompt_field="instruction",  # Use dataset's instruction field
    label_field="guiactor_output",
)
# Visualize results
session = fo.launch_app(dataset)
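Once predictions are stored, a common way to score GUI grounding is to check whether each predicted point lands inside the target element's bounding box. The sketch below is a minimal illustration on plain Python values, not part of this integration. It assumes boxes follow FiftyOne's `[top-left-x, top-left-y, width, height]` convention in normalized coordinates. The helper names `point_in_box` and `hit_rate` are hypothetical, and how the ground-truth boxes are stored in your dataset is also an assumption you should verify against its schema.

```python
def point_in_box(point, box):
    """Return True if a normalized (x, y) point lies inside a box.

    `box` is assumed to be [top-left-x, top-left-y, width, height],
    with all values in [0, 1].
    """
    x, y = point
    bx, by, bw, bh = box
    return bx <= x <= bx + bw and by <= y <= by + bh


def hit_rate(points, boxes):
    """Fraction of predicted points that fall inside their target box."""
    pairs = list(zip(points, boxes))
    if not pairs:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in pairs)
    return hits / len(pairs)


# Toy data: the first prediction is inside its box, the second is not
predicted = [(0.30, 0.40), (0.90, 0.10)]
targets = [[0.25, 0.35, 0.10, 0.10], [0.10, 0.10, 0.20, 0.20]]
print(hit_rate(predicted, targets))
```

To apply this to real samples, you would read the first entry of each predicted keypoint's `points` list and the corresponding ground-truth box from whichever field your dataset uses.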
Model Variants#
| Model | Parameters | Description |
|---|---|---|
| | 3B | Lightweight version for faster inference |
| microsoft/GUI-Actor-7B-Qwen2.5-VL | 7B | Full-size model with best performance |
Output Format#
The model stores two fields on each sample:
Keypoints (label_field): Interaction points with confidence scores
Attention Heatmap (gui_actor_heatmap): Attention map stored as fo.Heatmap
Keypoint Structure#
fo.Keypoint(
    label="top_interaction_point",
    points=[[x, y]],  # Normalized coordinates [0,1]
    confidence=[confidence_score],  # Model confidence
    reasoning="the model's output text",  # Custom attribute
)
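Because the points are normalized, an automation tool that expects pixel coordinates needs them scaled by the screenshot's size. The following is a minimal sketch of that conversion. The helper name `to_pixels` is hypothetical, and the clamping behavior is a design choice made here, not something the integration prescribes.

```python
def to_pixels(point, width, height):
    """Convert a normalized (x, y) point to integer pixel coordinates."""
    x, y = point
    # Clamp so out-of-range values, or a point at exactly 1.0,
    # still map to a valid pixel index
    px = max(0, min(int(x * width), width - 1))
    py = max(0, min(int(y * height), height - 1))
    return px, py


print(to_pixels((0.5, 0.25), 1920, 1080))  # (960, 270)
```

If image dimensions are not already populated, FiftyOne's `dataset.compute_metadata()` fills `sample.metadata.width` and `sample.metadata.height` for image samples.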
Attention Heatmap#
Stored automatically as gui_actor_heatmap field on each sample
Contains normalized attention scores in [0, 1] range
Stored at native model resolution (FiftyOne handles resizing for visualization)
Visualize in the FiftyOne App as a heatmap overlay
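Since the heatmap is kept at the model's native resolution, any position read from it has to be expressed relative to the grid size before it can be compared with the normalized keypoints. The sketch below finds the highest-scoring cell in a 2D grid and returns its center in normalized coordinates. It works on a plain nested list; the helper name `peak_location` is hypothetical. Whether this integration stores the scores in the `map` array of the `fo.Heatmap` or on disk via `map_path` is not stated here, so check your samples before adapting it.

```python
def peak_location(heatmap):
    """Return ((x, y), score) for the highest-scoring cell.

    `heatmap` is any 2D grid of scores indexed as heatmap[row][col];
    (x, y) is the normalized center of the winning cell.
    """
    n_rows = len(heatmap)
    n_cols = len(heatmap[0])
    best_row, best_col, best_score = 0, 0, heatmap[0][0]
    for r in range(n_rows):
        for c in range(n_cols):
            if heatmap[r][c] > best_score:
                best_row, best_col, best_score = r, c, heatmap[r][c]
    # Use the cell center so the result does not depend on grid resolution
    x = (best_col + 0.5) / n_cols
    y = (best_row + 0.5) / n_rows
    return (x, y), best_score


# Toy 2x4 grid with its peak in row 1, column 2
grid = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
]
print(peak_location(grid))
```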
Advanced Usage#
Custom Prompts#
# Use a custom prompt instead of dataset field
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")
model.prompt = "Click the login button"
# Apply to dataset
# apply_model() writes predictions to the dataset in place; it does not return them
dataset.apply_model(model, label_field="custom_predictions")
Integration Details#
Model Architecture#
Based on Qwen2.5-VL with pointer generation capabilities
Uses attention-based grounding for spatial understanding
Supports multiple candidate region detection
Implements specialized pointer tokens for coordinate generation
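To give a rough intuition for attention-based grounding with multiple candidate regions, the toy sketch below turns raw per-patch scores into attention weights with a softmax and keeps the top-k patches as candidate points. This is an illustration only, not GUI-Actor's actual implementation. The function names, the row-major patch layout, and the score values are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_patches(patch_scores, grid_w, k=2):
    """Return the k highest-weight patches as (weight, (x, y)) tuples.

    Patches are assumed to be laid out row by row on a grid `grid_w`
    patches wide; (x, y) is the normalized center of each patch.
    """
    weights = softmax(patch_scores)
    grid_h = len(patch_scores) // grid_w
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    candidates = []
    for i in ranked[:k]:
        row, col = divmod(i, grid_w)
        center = ((col + 0.5) / grid_w, (row + 0.5) / grid_h)
        candidates.append((weights[i], center))
    return candidates


# Hypothetical scores for a 2x3 grid of image patches
scores = [0.1, 2.0, 0.3,
          0.0, 1.5, 0.2]
for weight, center in top_k_patches(scores, grid_w=3, k=2):
    print(round(weight, 3), center)
```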
License#
This integration is licensed under the Apache 2.0 License. The GUI-Actor models are licensed under the MIT License.
Citation#
@article{wu2025gui,
title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
journal={arXiv preprint arXiv:2506.03143},
year={2025}
}