Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

OS-Atlas FiftyOne Integration#

A robust FiftyOne model integration for OS-Atlas vision-language models, designed specifically for GUI agents and UI understanding tasks.

Features#

  • Multiple Vision Tasks: Detection, OCR, keypoint detection, classification, VQA, and agentic actions

  • Robust Parsing: Handles inconsistent model output formats automatically

  • Flexible Prompting: Support for custom prompts and system prompts

  • Production Ready: Built-in error handling and graceful degradation

Supported Operations#

Operation   Description                                 Output Format
detect      Object/UI element detection                 fo.Detections
ocr         Grounded text detection and recognition     fo.Detections
point       Keypoint detection for UI elements          fo.Keypoints
classify    UI classification and categorization        fo.Classifications
vqa         Visual question answering                   Raw text
agentic     GUI agent action planning                   fo.Keypoints with metadata

Installation#

Register the Model Source#

import fiftyone.zoo as foz

# Register the OS-Atlas model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/os_atlas", 
    overwrite=True
)

Load a Model#

Base Model (recommended for development):

model = foz.load_zoo_model(
    "OS-Copilot/OS-Atlas-Base-7B",
    install_requirements=True
)

Quick Start#

Load a UI Dataset#

import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load sample UI dataset
dataset = fouh.load_from_hub(
    "Voxel51/GroundUI-18k",
    max_samples=100
)

Basic Usage Examples#

Visual Question Answering#

model.operation = "vqa"
model.prompt = "Describe this screenshot and what the user might be doing."
dataset.apply_model(model, label_field="vqa_results")

UI Element Detection#

model.operation = "detect"
model.prompt = "Find all buttons and interactive elements"
dataset.apply_model(model, label_field="ui_detections")

Grounded OCR#

model.operation = "ocr"
model.prompt = "Extract all text from UI elements like buttons, menus, and labels"
dataset.apply_model(model, label_field="ocr_results")

Keypoint Detection#

model.operation = "point"
model.prompt = "Find the search button"
dataset.apply_model(model, label_field="keypoints")

Classification#

model.operation = "classify"
model.prompt = "Classify this UI as: mobile app, web browser, desktop application, or other"
dataset.apply_model(model, label_field="ui_type")

Agentic Actions#

model.operation = "agentic"
model.prompt = "Click on the login button"
dataset.apply_model(model, label_field="agent_actions")

Advanced Usage#

Using Dataset Fields as Prompts#

Use existing fields in your dataset as dynamic prompts:

# Use the 'instruction' field from your dataset
dataset.apply_model(
    model, 
    prompt_field="instruction",  # Field containing prompts
    label_field="results"
)

Custom System Prompts#

Override default system prompts for specialized behavior:

# Clear existing system prompt
model.system_prompt = None

# Set custom system prompt
model.system_prompt = """
You are a specialized UI accessibility analyzer. 
Focus on identifying elements that may be difficult 
for users with visual impairments to interact with.
"""

model.operation = "detect"
dataset.apply_model(model, label_field="accessibility_analysis")

Dynamic Classification#

Generate classification prompts from dataset metadata:

# Extract unique platforms from dataset
platforms = dataset.distinct("platform")

# Create dynamic classification prompt
model.operation = "classify"
model.prompt = f"Which platform is this from? Choose exactly one: {platforms}"
dataset.apply_model(model, label_field="platform_classification")

Understanding Outputs#

Detection Results#

  • Bounding boxes: Normalized coordinates in FiftyOne format [x, y, width, height]

  • Labels: Descriptive labels for detected UI elements

  • Confidence: Automatic confidence scoring
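
For example, the stored fo.Detections from the UI element detection example above can be read back per sample (the field name ui_detections is the one used earlier; adjust it to your own label_field):

sample = dataset.first()

# Each fo.Detection carries a label, a [x, y, width, height] box in
# relative (0-1) coordinates, and an optional confidence score
for det in sample.ui_detections.detections:
    print(det.label, det.bounding_box, det.confidence)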

Keypoint Results#

  • Points: Normalized [x, y] coordinates

  • Labels: Descriptive labels for interaction points

  • Metadata: Additional context for agentic actions
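
A minimal sketch for reading keypoints back, assuming the keypoints field from the Quick Start example:

sample = dataset.first()

# Each fo.Keypoint stores a label and a list of [x, y] points in
# relative (0-1) coordinates
for kp in sample.keypoints.keypoints:
    print(kp.label, kp.points)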

Agentic Action Metadata#

Agentic operations include rich metadata:

  • action: Type of action (click, type, scroll, etc.)

  • thought: Model’s reasoning process

  • sequence_idx: Action order in multi-step plans

  • Action-specific parameters (content for typing, direction for scrolling, etc.)
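
As an illustrative sketch, assuming the agent_actions field from the earlier example and that the metadata above is stored as custom attributes on each keypoint:

sample = dataset.first()

# Each planned action is a fo.Keypoint; the attribute names below
# follow the metadata list above (assumed to be custom attributes)
for action in sample.agent_actions.keypoints:
    print(action.label, action.points)
    print(action["action"], action["sequence_idx"], action["thought"])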

Classification Results#

  • Labels: Predicted categories

  • Thought: Model’s reasoning (when available)
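
A short sketch for reading classifications back, assuming the ui_type field from the earlier example:

sample = dataset.first()

# fo.Classifications holds a list of fo.Classification labels;
# the model's reasoning, when returned, is assumed to be stored
# as a custom "thought" attribute
for cls in sample.ui_type.classifications:
    print(cls.label)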

Model Behavior & Robustness#

This integration handles several model output inconsistencies automatically:

  • Format Variations: Supports both direct arrays [{...}] and wrapped objects {"key": [{...}]}

  • Coordinate Formats: Handles tuples, lists, strings, and malformed coordinate syntax

  • Truncated Output: Recovers partial results from incomplete JSON

  • Mixed Dimensionality: Converts 2D points to bounding boxes when needed
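
The snippet below is purely illustrative of the first bullet (it is not the integration's actual parser): it shows how a wrapped object can be reduced to the same list form as a direct array before further processing.

import json

def normalize_output(raw):
    """Illustrative only: accept either [{...}] or {"key": [{...}]}
    and always return a list of dicts."""
    data = json.loads(raw)
    if isinstance(data, dict):
        # Wrapped object: take the first list value found
        for value in data.values():
            if isinstance(value, list):
                return value
        return [data]
    return data

# Both forms normalize to the same list of elements
print(normalize_output('[{"bbox": [10, 20, 30, 40], "label": "button"}]'))
print(normalize_output('{"elements": [{"bbox": [10, 20, 30, 40], "label": "button"}]}'))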

Visualization#

Launch the FiftyOne App to visualize results:

import fiftyone as fo

# Launch interactive app
session = fo.launch_app(dataset)

# View results for first sample
sample = dataset.first()
print(f"VQA: {sample.vqa_results}")
print(f"Detections: {len(sample.ui_detections.detections)} objects found")

Requirements#

Automatic Installation#

The integration automatically installs the required dependencies when you load a model with install_requirements=True:

  • huggingface-hub - Model downloading and management

  • transformers>=4.30.0 - Transformer model support

  • torch>=1.12.0 - PyTorch backend

  • torchvision - Computer vision utilities

  • qwen-vl-utils - Qwen vision-language utilities

  • accelerate - Model acceleration and optimization

Manual Installation#

If you prefer manual dependency management:

pip install huggingface-hub transformers torch torchvision qwen-vl-utils accelerate

License#

This integration and the OS-Atlas models are released under the Apache 2.0 License, making them suitable for both commercial and non-commercial use. See the original model repositories for complete license details.

Contributing#

Contributions are welcome! Please submit issues and pull requests to the main repository.

Citation#

If you use this integration in your research, please cite the original OS-Atlas paper and FiftyOne:

@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}