Note
This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
Qwen2.5-VL FiftyOne Remote Model Zoo Implementation#
Implementing Qwen2.5-VL as a Remote Zoo Model for FiftyOne
NOTE: Due to recent changes in Transformers 4.50.0 (which are to be patched by Hugging Face), please ensure you have transformers<=4.49.0 installed before running the model.
Features#
Based on the documentation, here’s a comprehensive primer on the tasks supported by Qwen2.5-VL:
Qwen2.5-VL Supported Tasks#
Visual Question Answering (VQA)
Answers natural language questions about images
Returns text responses in English
Can be used for general image understanding and description
Object Detection
Locates and identifies objects in images
Returns normalized bounding box coordinates and object labels
Can be prompted to find specific types of objects
Optical Character Recognition (OCR)
Reads and extracts text from images
Particularly useful for documents, signs, and text-containing images
Keypoint Detection
Identifies specific points of interest in images
Returns normalized point coordinates with labels
Useful for pose estimation and landmark detection
Image Classification
Categorizes images into predefined classes
Can identify image quality issues
Returns classification labels
Advanced Grounded Operations#
The model also supports two sophisticated grounded operations that build upon VQA results:
Grounded Detection
Links textual descriptions with specific object locations
Returns detection boxes grounded in the VQA response
Grounded Pointing
Associates text descriptions with specific points in the image
Returns keypoints grounded in the VQA response
The model is highly flexible: you can switch between these tasks simply by changing the operation mode and prompt, which makes it a versatile tool for a wide range of computer vision and multimodal applications.
Technical Details#
The model implementation:
Supports multiple devices (CUDA, MPS, CPU)
Uses bfloat16 precision on CUDA devices for optimal performance
Handles various output formats including JSON parsing and coordinate normalization
Provides comprehensive system prompts for each operation type
Converts outputs to FiftyOne-compatible formats (Detections, Keypoints, Classifications); a minimal sketch of this conversion is shown below
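The sketch below illustrates the kind of post-processing the JSON parsing, coordinate normalization, and label conversion involve. The JSON field names (bbox_2d, label) and the assumption of absolute pixel coordinates are illustrative guesses, not a description of the plugin's internals:

import json
from fiftyone.core.labels import Detection, Detections

def to_detections(raw_json: str, img_width: int, img_height: int) -> Detections:
    """Parses a JSON string of absolute-pixel boxes into FiftyOne Detections.

    Assumes entries shaped like {"bbox_2d": [x1, y1, x2, y2], "label": "cat"}
    (hypothetical field names for illustration only).
    """
    detections = []
    for obj in json.loads(raw_json):
        x1, y1, x2, y2 = obj["bbox_2d"]  # assumed field name
        detections.append(
            Detection(
                label=obj.get("label", "object"),
                # FiftyOne expects [top-left-x, top-left-y, width, height] in [0, 1]
                bounding_box=[
                    x1 / img_width,
                    y1 / img_height,
                    (x2 - x1) / img_width,
                    (y2 - y1) / img_height,
                ],
            )
        )
    return Detections(detections=detections)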
Installation#
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/qwen2_5_vl")

# Download the model
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/qwen2_5_vl",
    model_name="Qwen/Qwen2.5-VL-3B-Instruct",
)
Usage Examples#
Loading the model#
model = foz.load_zoo_model(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    # install_requirements=True,  # if you are using the model for the first time and need to install its requirements
    # ensure_requirements=True,  # ensure any requirements are installed before loading the model
)
Available Checkpoints#
These checkpoints come in different sizes (3B, 7B, 32B, and 72B parameters), and each size has two variants:
Regular version (with -Instruct suffix)
AWQ quantized version (with -Instruct-AWQ suffix)
Qwen/Qwen2.5-VL-3B-Instruct
Qwen/Qwen2.5-VL-3B-Instruct-AWQ
Qwen/Qwen2.5-VL-7B-Instruct
Qwen/Qwen2.5-VL-7B-Instruct-AWQ
Qwen/Qwen2.5-VL-32B-Instruct
Qwen/Qwen2.5-VL-32B-Instruct-AWQ
Qwen/Qwen2.5-VL-72B-Instruct
Qwen/Qwen2.5-VL-72B-Instruct-AWQ
The AWQ versions require an additional package, autoawq==0.2.7.post3, but offer a more memory-efficient alternative to the regular versions while maintaining performance.
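For example, a quantized checkpoint can be loaded the same way as the regular one (a minimal sketch, assuming autoawq==0.2.7.post3 is installed and the checkpoint was downloaded as shown in the Installation section):

model = foz.load_zoo_model("Qwen/Qwen2.5-VL-3B-Instruct-AWQ")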
Switching Between Operations#
The same model instance can be used for different operations by simply changing its properties:
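The examples below assume a FiftyOne dataset with images is already loaded; for instance (the quickstart dataset is used purely for illustration):

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart", max_samples=10)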
Visual Question Answering#
model.operation = "vqa"
model.prompt = "List all objects in this image separated by commas"
dataset.apply_model(model, label_field="q_vqa")
Object Detection#
model.operation = "detect"
model.prompt = "Locate the objects in this image."
dataset.apply_model(model, label_field="qdets")
OCR with Detection#
model.prompt = "Read all the text in the image."
dataset.apply_model(model, label_field="q_ocr")
Keypoint Detection#
model.operation = "point"
model.prompt = "Detect the keypoints in the image."
dataset.apply_model(model, label_field="qpts")
Image Classification#
model.operation = "classify"
model.prompt = "List the potential image quality issues in this image."
dataset.apply_model(model, label_field="q_cls")
Grounded Operations#
The model also supports grounded detection and pointing, using the results of a previous VQA run (passed via prompt_field) as per-sample prompts:
# Grounded Detection: run the detect operation with the stored VQA responses as prompts
model.operation = "detect"
dataset.apply_model(model, label_field="grounded_qdets", prompt_field="q_vqa")

# Grounded Pointing: run the point operation with the stored VQA responses as prompts
model.operation = "point"
dataset.apply_model(model, label_field="grounded_qpts", prompt_field="q_vqa")
Please refer to the example notebook for more details.
Output Formats#
Each operation returns results in a specific format:
VQA (Visual Question Answering): Returns str (natural language text responses in English)
Detection: Returns fiftyone.core.labels.Detections with normalized bounding box coordinates in [0, 1] x [0, 1] and object labels
Keypoint Detection: Returns fiftyone.core.labels.Keypoints with normalized point coordinates in [0, 1] x [0, 1] and point labels
Classification: Returns fiftyone.core.labels.Classifications with class labels
Grounded Operations: Return the same format as the base operation:
Grounded Detection: fiftyone.core.labels.Detections
Grounded Pointing: fiftyone.core.labels.Keypoints
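After running the examples above, a quick way to sanity-check these outputs (the field names follow the examples in this README) is:

sample = dataset.first()

print(sample["q_vqa"])   # str
print(sample["qdets"])   # fiftyone.core.labels.Detections
print(sample["qpts"])    # fiftyone.core.labels.Keypoints
print(sample["q_cls"])   # fiftyone.core.labels.Classifications

# Or explore everything visually in the FiftyOne App
session = fo.launch_app(dataset)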
Citation#
@article{Qwen2.5-VL,
title={Qwen2.5-VL Technical Report},
author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
journal={arXiv preprint arXiv:2502.13923},
year={2025}
}