Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

GitHub Repo

UI-TARS FiftyOne Integration#

A comprehensive integration of the UI-TARS vision-language model with FiftyOne for GUI agent development and analysis. UI-TARS is an end-to-end native GUI agent model that can perceive screenshots and perform human-like interactions through unified action modeling.

IMPORTANT NOTE:#

I deviated from the original prompts so that this integration stays as consistent as possible with my other FiftyOne integrations.

Results may improve if you use the original prompts from the paper and repository. If there is enough interest, I can switch to the original prompts and update the parsing code accordingly.

Contributions are also welcome.

<img src="uitars-hq.gif">

  • Multi-Modal Operations: Support for 5 different operation modes

  • Unified Action Space: Standardized actions across desktop, mobile, and web platforms

  • Advanced Reasoning: System-2 thinking with explicit thought generation

  • Precise Grounding: State-of-the-art coordinate prediction and element localization

  • FiftyOne Integration: Seamless dataset management and visualization

Installation#

# Install FiftyOne
pip install fiftyone

# Register the UI-TARS model source
import fiftyone.zoo as foz
foz.register_zoo_model_source("https://github.com/harpreetsahota204/UI_TARS", overwrite=True)

# Load the model
model = foz.load_zoo_model(
    "ByteDance-Seed/UI-TARS-1.5-7B",
    # install_requirements=True,  # pass this to ensure all requirements are installed
)

Quick Start#

Load Your Dataset#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load UI dataset
dataset = load_from_hub(
    "Voxel51/ShowUI_Web",
    max_samples=200,
    shuffle=True
)

# Launch FiftyOne App
fo.launch_app(dataset)

Basic Usage#

Operation Modes#

| Mode | Description | Output Format | Use Cases |
| --- | --- | --- | --- |
| vqa | Visual Question Answering | Text response | UI description, analysis |
| ocr | Text detection and recognition | Keypoints with text | Text extraction, element identification |
| point | Keypoint detection | Keypoints with actions | Action planning, element targeting |
| classify | UI classification | Classifications | Platform detection, UI categorization |
| agentic | Full agent actions | Keypoints with metadata | Complete automation workflows |

# Visual Question Answering
model.operation = "vqa"
model.prompt = "Describe this screenshot and what the user might be doing."
dataset.apply_model(model, label_field="vqa_results")

# OCR - Extract text and UI elements
model.operation = "ocr"
model.prompt = "Point to any buttons, icons, and input fields in this UI"
dataset.apply_model(model, label_field="ocr_results")

# Keypoint Detection
model.operation = "point"
model.prompt = "Identify clickable elements for navigation"
dataset.apply_model(model, label_field="ui_keypoints")

Agent Actions#

# Full agentic mode with reasoning
model.operation = "agentic"
dataset.apply_model(model, prompt_field="instructions", label_field="agentic_output")

# Custom system prompts
model.system_prompt = "You are a GUI testing assistant. Focus on accessibility and usability."

Detailed Operation Examples#

Visual Question Answering (vqa)#

Perfect for understanding UI context and user intent.

model.operation = "vqa"
model.prompt = "What type of application is this and what can users do here?"
dataset.apply_model(model, label_field="ui_analysis")

Output: Natural language description of the interface
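
To inspect a response, read the populated field directly off a sample. A minimal sketch, assuming the answer is stored as a plain string in the ui_analysis field used above:

# Print the VQA response for the first sample
sample = dataset.first()
print(sample["ui_analysis"])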

OCR Text Detection (ocr)#

Extracts and localizes text elements with UI categorization.

model.operation = "ocr"
model.prompt = "Find all interactive text elements"
dataset.apply_model(model, label_field="text_elements")

Output: Keypoints with text content and categories (button, link, input, etc.)
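
To work with specific element types, you can filter the keypoints by their category label. A minimal sketch, assuming the category is stored as the keypoint label and that "button" is one of the emitted categories:

import fiftyone as fo
from fiftyone import ViewField as F

# Keep only keypoints whose category label is "button" (label name is an assumption)
buttons_view = dataset.filter_labels("text_elements", F("label") == "button")
session = fo.launch_app(buttons_view)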

Keypoint Detection (point)#

Identifies actionable elements and suggests interactions.

model.operation = "point"
model.prompt = "Locate elements needed to complete a purchase"
dataset.apply_model(model, label_field="purchase_points")

Output: Keypoints with action types and reasoning

Classification (classify)#

Categorizes UI characteristics for automated analysis.

model.operation = "classify"
model.prompt = "Classify the platform type and primary function"
dataset.apply_model(model, label_field="ui_categories")

Output: Multiple classification labels with confidence
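
Once the labels are written to the dataset, you can summarize them across samples. A minimal sketch, assuming ui_categories holds a Classifications label:

# Count how often each predicted category appears across the dataset
counts = dataset.count_values("ui_categories.classifications.label")
print(counts)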

Agentic Actions (agentic)#

Complete automation with multi-step reasoning and complex actions.

model.operation = "agentic"
# Uses instructions from dataset field
dataset.apply_model(model, prompt_field="task_instructions", label_field="agent_actions")

Output: Sequence of actions with coordinates, parameters, and reasoning
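
You can iterate over the predicted actions on a sample to review the suggested steps. A minimal sketch, assuming the output is stored as keypoints whose labels and attributes carry the action metadata (attribute names beyond label/points are illustrative):

sample = dataset.first()
agent_output = sample["agent_actions"]

# Each keypoint carries an action plus any metadata the model attached
for kp in agent_output.keypoints:
    print(kp.label, kp.points)  # action label and normalized coordinates
    # Inspect kp directly to see which additional attributes are available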

Custom System Prompts#

# Clear default prompt
model.system_prompt = None

# Set domain-specific prompt
model.system_prompt = """
You are a web accessibility auditor. Focus on identifying:
- ARIA labels and roles
- Keyboard navigation paths  
- Color contrast issues
- Screen reader compatibility
"""

model.operation = "point"
model.prompt = "Identify accessibility issues in this interface"

Using Dataset Fields#

Leverage existing dataset annotations as prompts:

# Use existing instruction field
dataset.apply_model(model, prompt_field="user_instructions", label_field="responses")

# Combine with custom prompts
model.prompt = "Based on the instruction, identify the next action to take"
dataset.apply_model(model, prompt_field="context", label_field="next_actions")
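
If your dataset does not yet have an instruction field, you can populate one before calling apply_model. A minimal sketch, assuming you want the same illustrative instruction on every sample and reuse the field name from the example above:

# Write a per-sample instruction field that prompt_field can read from
instructions = ["Find the search bar and submit a query"] * len(dataset)
dataset.set_values("user_instructions", instructions)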

Action Space Reference#

UI-TARS supports a comprehensive action space for cross-platform automation:

| Action Type | Parameters | Description |
| --- | --- | --- |
| click | point_2d | Single click at coordinates |
| left_double | point_2d | Double-click action |
| right_single | point_2d | Right-click for context menus |
| long_press | point_2d | Long press (mobile) |
| drag | start_point, end_point | Drag gesture between points |
| scroll | point_2d, direction | Scroll in specified direction |
| type | content | Text input with escape characters |
| hotkey | key | Keyboard shortcuts (e.g., "ctrl c") |
| wait | - | Pause for dynamic content |
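
As an illustration of how these actions could drive an automation backend, here is a minimal sketch that maps a parsed action to pyautogui calls. The structure of the parsed_action dict is an assumption for illustration, not the integration's exact output format:

import pyautogui

def execute_action(parsed_action, screen_w, screen_h):
    """Illustrative dispatcher; the dict keys below are assumptions."""
    kind = parsed_action["action"]
    if kind == "click":
        x, y = parsed_action["point_2d"]  # normalized [0, 1] coordinates
        pyautogui.click(x * screen_w, y * screen_h)
    elif kind == "type":
        pyautogui.write(parsed_action["content"])
    elif kind == "hotkey":
        pyautogui.hotkey(*parsed_action["key"].split())  # e.g. "ctrl c" -> ("ctrl", "c")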

Best Practices#

Coordinate Handling#

  • All coordinates are normalized to the [0, 1] range (see the conversion sketch after this list)

  • Automatic transformation between model and original image space

  • Smart resizing maintains aspect ratios
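
Since keypoints are stored with normalized coordinates, converting them back to pixels only requires the image dimensions. A minimal sketch using FiftyOne's sample metadata and the ui_keypoints field from the Basic Usage example:

# Ensure width/height metadata is populated
dataset.compute_metadata()

sample = dataset.first()
w, h = sample.metadata.width, sample.metadata.height

# Convert each normalized (x, y) keypoint to pixel coordinates
for kp in sample["ui_keypoints"].keypoints:
    for x, y in kp.points:
        print(kp.label, (x * w, y * h))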

Prompt Engineering#

  • Be specific about desired actions and elements

  • Use task-oriented language for agentic mode

  • Leverage system prompts for domain adaptation

Error Handling#

  • Model includes built-in reflection and error recovery

  • JSON parsing handles malformed outputs gracefully

  • Coordinate validation prevents out-of-bounds errors

Citation#

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}

Resources#