Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

PaliGemma2 Mix for FiftyOne#

This repository integrates Google DeepMind’s PaliGemma2 Mix models into the FiftyOne computer vision platform. PaliGemma2 Mix is a set of vision-language models fine-tuned on diverse tasks, designed to work out-of-the-box for a variety of computer vision applications.

Features#

PaliGemma2 Mix models can perform:

Image captioning (multiple detail levels)
Object detection
Semantic segmentation (Not perfect, but good for initial exploration)
Optical character recognition (OCR)
Visual question answering
Zero-shot classification

Available Models#

Model	Size	Resolution	Source
`paligemma2-3b-mix-224`	3B	224×224	HuggingFace
`paligemma2-10b-mix-224`	10B	224×224	HuggingFace
`paligemma2-28b-mix-224`	28B	224×224	HuggingFace
`paligemma2-3b-mix-448`	3B	448×448	HuggingFace
`paligemma2-10b-mix-448`	10B	448×448	HuggingFace
`paligemma2-28b-mix-448`	28B	448×448	HuggingFace

Requirements#

FiftyOne
PyTorch
Transformers (>=4.50)
Huggingface Hub
JAX/FLAX (for segmentation masks)
NumPy
PIL

Installation#

Install required packages:

pip install fiftyone torch torchvision transformers huggingface-hub jax flax

Register the model repository:

import fiftyone.zoo as foz
foz.register_zoo_model_source("https://github.com/harpreetsahota204/paligemma2")

Download your chosen model:

foz.download_zoo_model(
    "https://github.com/harpreetsahota204/paligemma2",
    model_name="google/paligemma2-10b-mix-448", 
)

Usage Examples#

Load a dataset#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load a sample dataset
dataset = load_from_hub(
    "voxel51/hand-keypoints",
    name="hands_subset",
    max_samples=10
)

Load the model#

import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "google/paligemma2-10b-mix-448",
    # install_requirements=True #if you are using for the first time and need to download reuirement,
    # ensure_requirements=True #  ensure any requirements are installed before loading the model
)

Image Captioning#

# Set operation and detail level
model.operation = "caption"
model.detail_level = "coco-style"  # Options: "short", "coco-style", "detailed"

# Apply to dataset
dataset.apply_model(model, label_field="captions")

Object Detection#

# Set operation and classes to detect
model.operation = "detection"
model.prompt = ["person", "hand", "face"]  # List of classes to detect
# Alternative format: model.prompt = "person; hand; face"

# Apply to dataset
dataset.apply_model(model, label_field="detections")

Semantic Segmentation#

# Set operation and classes to segment
model.operation = "segment"
model.prompt = ["person", "hand"]  # List of classes to segment
# Alternative format: model.prompt = "person; hand"

# Apply to dataset
dataset.apply_model(model, label_field="segmentations")

OCR (Optical Character Recognition)#

# Set operation for OCR
model.operation = "ocr"

# Apply to dataset
dataset.apply_model(model, label_field="text")

Zero-Shot Classification#

# Set operation for classification
model.operation = "classify"
model.prompt = ["indoor", "outdoor", "close-up", "wide-angle"]  # Potential classes

# Apply to dataset
dataset.apply_model(model, label_field="classifications")

Visual Question Answering#

# Set operation for answering questions
model.operation = "answer"
model.prompt = "How many people are in this image?"

# Apply to dataset
dataset.apply_model(model, label_field="answers")

Visualize Results#

# Launch the FiftyOne App to visualize the results
session = fo.launch_app(dataset)

Using Different Resolution Models#

For higher quality results (at the cost of speed), use higher resolution models:

# Lower resolution, faster
small_model = foz.load_zoo_model("google/paligemma2-3b-mix-224")

# Higher resolution, better quality
large_model = foz.load_zoo_model("google/paligemma2-28b-mix-448")

License#

PaliGemma2 models are subject to the Gemma license. Please review the license terms before using these models.

Citation#

@article{
    title={PaliGemma 2: A Family of Versatile VLMs for Transfer},
    author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
    year={2024},
    journal={arXiv preprint arXiv:2412.03555}
}