Note
This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
GUI-Actor FiftyOne Integration#
A FiftyOne integration for Microsoft’s GUI-Actor vision-language models, enabling GUI automation and visual interface analysis with rich attention visualization.
Overview#
GUI-Actor is a multimodal foundation model designed for GUI automation tasks. This integration brings GUI-Actor’s capabilities to FiftyOne, allowing you to:
Predict interaction points on GUI screenshots with confidence scores
Visualize attention maps showing where the model focuses
Analyze GUI understanding across datasets of interface screenshots
Evaluate model performance on GUI automation tasks
Features#
Keypoint Detection: Identifies optimal interaction points for GUI automation
Attention Visualization: Generates heatmaps showing model attention patterns
Multiple Model Sizes: Support for both 3B and 7B parameter variants
Flexible Prompting: Use custom prompts or dataset instruction fields
Seamless Integration: Works with FiftyOne’s dataset management and visualization
Installation#
# Install FiftyOne
pip install fiftyone
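GUI-Actor ships as 3B and 7B parameter vision-language models, so a CUDA GPU is recommended for reasonable inference speed. Assuming torch is pulled in as a model dependency, you can quickly verify GPU availability:
import torch

# Inference on the 7B model is impractical without a GPU
print(torch.cuda.is_available())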
Quick Start#
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import load_from_hub
# Load a GUI dataset
dataset = load_from_hub("Voxel51/ScreenSpot-v2", shuffle=True)
# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/gui_actor")
# Load the GUI-Actor model
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")
# Apply model to dataset
dataset.apply_model(
    model,
    prompt_field="instruction",  # Use dataset's instruction field
    label_field="guiactor_output",
)
# Add attention heatmaps (saved to disk as a side effect of inference)
from pathlib import Path

for sample in dataset:
    original_path = Path(sample.filepath)
    heatmap_path = original_path.parent / f"{original_path.stem}_attention.png"
    if heatmap_path.exists():
        sample["attention_heatmap"] = fo.Heatmap(map_path=str(heatmap_path))
        sample.save()
# Visualize results
session = fo.launch_app(dataset)
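The heatmap loop above saves each sample individually. As a minor variant using standard FiftyOne API (not plugin-specific), iter_samples(autosave=True) batches the writes:
from pathlib import Path

import fiftyone as fo

# Same heatmap-attachment loop as above, with batched sample saves
for sample in dataset.iter_samples(autosave=True):
    path = Path(sample.filepath)
    heatmap_path = path.parent / f"{path.stem}_attention.png"
    if heatmap_path.exists():
        sample["attention_heatmap"] = fo.Heatmap(map_path=str(heatmap_path))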
Model Variants#
| Model | Parameters | Description |
|---|---|---|
| microsoft/GUI-Actor-3B-Qwen2.5-VL | 3B | Lightweight version for faster inference |
| microsoft/GUI-Actor-7B-Qwen2.5-VL | 7B | Full-size model with best performance |
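To trade some accuracy for speed, the lighter variant loads the same way; the zoo name below assumes the same naming pattern as the 7B model used in the Quick Start:
# Load the 3B variant for faster inference
model_3b = foz.load_zoo_model("microsoft/GUI-Actor-3B-Qwen2.5-VL")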
Output Format#
The model returns structured predictions containing:
Keypoints: Interaction points with confidence scores
Attention Heatmaps: Visual attention maps saved as PNG files
Text Output: Model reasoning and explanations
Keypoint Structure#
fo.Keypoint(
    label="top_interaction_point",
    points=[[x, y]],  # Normalized coordinates in [0, 1]
    confidence=[confidence_score],  # Model confidence
    reasoning="<the model's output text; may or may not be useful>",  # Custom attribute
)
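As a minimal sketch of reading these predictions back, assuming the Quick Start's guiactor_output field holds an fo.Keypoints label:
# Inspect the keypoints predicted for the first sample
sample = dataset.first()
for kp in sample["guiactor_output"].keypoints:
    x, y = kp.points[0]  # normalized [0, 1] coordinates
    print(kp.label, (x, y), kp.confidence[0], kp.reasoning)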
Attention Heatmaps#
Saved automatically as {image_name}_attention.png
Rendered with the jet colormap (blue = low attention, red = high attention)
Resized to match the original image resolution
Applied using matplotlib's visualization pipeline
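For reference, an overlay like this can be reproduced with a generic matplotlib recipe; the sketch below shows the general approach, not the plugin's exact internals:
import numpy as np
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from PIL import Image

def save_attention_overlay(image_path, attention, out_path, alpha=0.5):
    """Resize a [0, 1] attention grid to the image size and save a jet overlay."""
    img = np.asarray(Image.open(image_path).convert("RGB")) / 255.0
    att = Image.fromarray((attention * 255).astype(np.uint8))
    att = np.asarray(att.resize((img.shape[1], img.shape[0]))) / 255.0
    heat = cm.jet(att)[..., :3]  # jet colormap: blue = low, red = high
    plt.imsave(out_path, (1 - alpha) * img + alpha * heat)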
Advanced Usage#
Custom Prompts#
# Use a custom prompt instead of dataset field
model = foz.load_zoo_model("microsoft/GUI-Actor-7B-Qwen2.5-VL")
model.prompt = "Click the login button"
# Apply to dataset; predictions are written to the label field in-place
dataset.apply_model(model, label_field="custom_predictions")
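Because apply_model also works on views, a custom prompt can be sanity-checked on a small slice before processing the full dataset (the slice size here is arbitrary):
# Test the prompt on 25 random samples first
view = dataset.take(25, seed=51)
view.apply_model(model, label_field="custom_predictions")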
Integration Details#
Model Architecture#
Based on Qwen2.5-VL with pointer generation capabilities
Uses attention-based grounding for spatial understanding
Supports multiple candidate region detection
Implements specialized pointer tokens for coordinate generation
License#
This integration is licensed under the Apache 2.0 License. The GUI-Actor models are licensed under the MIT License.
Citation#
@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}