Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

Molmo2 - FiftyOne Model Zoo Integration#

Molmo2 is a family of open vision-language models developed by the Allen Institute for AI (Ai2) that support image, video, and multi-image understanding and grounding. Molmo2 models are trained on publicly available third-party datasets and Molmo2 data, a collection of highly curated image-text and video-text pairs. The models achieve state-of-the-art performance among multimodal models of similar size.

Use Cases#

Molmo2’s core strength is grounding — precise pixel-level localization and tracking over time. While standard models might say “a car passed by,” Molmo2 can point to the exact car, follow it through a crowd, and mark the exact second it crossed a line.

| Domain | Applications |
| --- | --- |
| Robotics & Automation | Spatial affordance prediction, action counting ("How many times did the robot grasp the block?"), event localization |
| Autonomous Driving | Vehicle tracking, traffic monitoring, pedestrian detection through occlusions |
| Sports Analytics | Athlete tracking, action spotting (goals, fouls), multi-player trajectory analysis |
| Document & Business | Chart/table understanding, multi-image reasoning for policy analysis, administrative workflow automation |
| Media & Accessibility | Dense video captioning (924 words avg), video search metadata, assistive narration for visually impaired |
| Generative AI QA | Localizing visual artifacts in AI-generated videos (vanishing subjects, physical incongruities) |

Model Checkpoints#

| Model | Base LLM | Vision Backbone | Notes |
| --- | --- | --- | --- |
| allenai/Molmo2-O-7B | Olmo3-7B-Instruct | SigLIP 2 | Outperforms others on short videos, counting, and captioning |
| allenai/Molmo2-4B | Qwen3-4B-Instruct | SigLIP 2 | Compact model with competitive performance |
| allenai/Molmo2-8B | Qwen3-8B | SigLIP 2 | Balanced size and performance |
| allenai/Molmo2-VideoPoint-4B | Qwen3-4B-Instruct | SigLIP 2 | Finetuned on Molmo2-VideoPoint data only for video pointing and counting |

All models are competitive on long videos.

Installation#

Important: Requires transformers==4.57.1

pip install transformers==4.57.1
pip install fiftyone umap-learn
pip install einops accelerate decord2 molmo_utils

Usage#

Load a Dataset#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/qualcomm-interactive-video-dataset",
    max_samples=20,
    overwrite=True
)

# REQUIRED: Compute metadata (needed for timestamp-to-frame conversion)
dataset.compute_metadata()

Load the Model#

import fiftyone.zoo as foz

model = foz.load_zoo_model("allenai/Molmo2-4B")

Operations#

| Operation | Prompt Template | Output Field Type |
| --- | --- | --- |
| pointing | "Point to the {prompt}." | Frame-level fo.Keypoints |
| tracking | "Track the {prompt}." | Frame-level fo.Keypoints with fo.Instance linking |
| describe | Uses prompt directly | Sample-level string |
| temporal_localization | Fixed prompt (finds activity events) | Sample-level fo.TemporalDetections |
| comprehensive | Fixed prompt (full video analysis) | Sample-level mixed (see below) |

Output Fields by Operation#

pointing / tracking:

  • Frame-level fo.Keypoints stored on each frame

  • Each keypoint has label, points (normalized x, y), and index (object ID)

  • For tracking, keypoints share fo.Instance objects to link across frames

describe:

  • Sample-level string field containing the model’s text response

temporal_localization:

  • Sample-level fo.TemporalDetections with start/end frame numbers and event descriptions

comprehensive:

  • summary: Sample-level string

  • events: fo.TemporalDetections

  • objects: fo.TemporalDetections (with first/last appearance times)

  • text_content: fo.TemporalDetections (for any text detected in video)

  • scene_info_*: fo.Classification fields (setting, time_of_day, location_type)

  • activities_*: fo.Classification or fo.Classifications fields
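
For example, the sketch below reads these fields back after inference. It assumes the illustrative field names used later on this page (prompted_describe for a description and point_pred for frame-level keypoints):

sample = dataset.first()

# Sample-level string field written by the describe operation
print(sample["prompted_describe"])

# Frame-level keypoints written by the pointing/tracking operations
for frame_number, frame in sample.frames.items():
    keypoints = frame["point_pred"]
    if keypoints is None:
        continue
    for kp in keypoints.keypoints:
        # Each keypoint carries a label, normalized (x, y) points, and an object index
        print(frame_number, kp.label, kp.points, kp.index)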

Embeddings#

Generate fixed-dimension vector embeddings for each video. Useful for similarity search, clustering, and visualization.

Added field: molmo_embeddings — a numpy array of shape (hidden_dim,) on each sample.

Pooling strategy: Controls how variable-length hidden states are collapsed into a fixed-size vector:

  • mean (default) — average across all tokens

  • max — max pooling across tokens

  • cls — use the first (CLS) token only

model.pooling_strategy = "mean"  # or "max" or "cls"

dataset.compute_embeddings(
    model,
    batch_size=8,
    num_workers=2,
    embeddings_field="molmo_embeddings",
    skip_failures=False
)
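
The stored embeddings can also back a similarity index via FiftyOne Brain. The sketch below is a minimal example, assuming the molmo_embeddings field computed above (the molmo_sim brain key is an arbitrary name):

import fiftyone.brain as fob

# Build a similarity index over the precomputed embeddings
fob.compute_similarity(
    dataset,
    embeddings="molmo_embeddings",
    brain_key="molmo_sim",
)

# Sort the dataset by similarity to a query sample
query_id = dataset.first().id
view = dataset.sort_by_similarity(query_id, k=10, brain_key="molmo_sim")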

Visualize Embeddings#

Use FiftyOne Brain to project embeddings into 2D/3D for visualization in the App.

import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    method="umap",  # Also supports "tsne", "pca"
    brain_key="molmo_viz",
    embeddings="molmo_embeddings",
    num_dims=2  # or 3 for 3D
)
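
Once the run completes, the projection can be explored interactively in the App's Embeddings panel by selecting the molmo_viz brain key.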

Describe#

Generate free-form text descriptions, captions, or answers to questions about videos. The prompt is used directly without any template.

Added field: A string field (e.g., prompted_describe or answer_pred) containing the model’s text response on each sample.

# With a global prompt
model.operation = "describe"
model.prompt = "Provide a short description for what is happening in the video"

dataset.apply_model(
    model,
    "prompted_describe",
    batch_size=16,
    num_workers=4,
    skip_failures=False
)

# With per-sample prompts from a field (e.g., for VQA)
model.operation = "describe"

dataset.apply_model(
    model,
    prompt_field="question",
    label_field="answer_pred",
    batch_size=16,
    num_workers=1,
    skip_failures=False
)
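
The generated text can then be queried like any other field. A minimal sketch, assuming the prompted_describe field from above (the keyword is only an example):

from fiftyone import ViewField as F

# Samples whose description mentions a keyword
view = dataset.match(F("prompted_describe").contains_str("person"))
print(view.count())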

Pointing#

Point to objects in video frames. The model identifies where instances of the specified object appear across frames and returns their coordinates.

Added field: Frame-level fo.Keypoints on each frame where the object is detected. Each keypoint contains normalized (x, y) coordinates and an index identifying the object instance.

Tip: If pointing/counting is your primary use case, consider using allenai/Molmo2-VideoPoint-4B which is specifically finetuned for video pointing and counting tasks.

model.operation = "pointing"
model.prompt = "person's nose"

dataset.apply_model(
    model,
    "point_pred",
    batch_size=16,
    num_workers=1,
    skip_failures=False
)
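
A minimal sketch of querying the results, assuming the point_pred field name used above:

from fiftyone import ViewField as F

# Total number of predicted keypoints across all frames
print(dataset.count("frames.point_pred.keypoints"))

# View containing only the frames where the object was found
frames_view = dataset.match_frames(F("point_pred.keypoints").length() > 0)
print(frames_view)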

Tracking#

Track objects across video frames. Similar to pointing, but keypoints for the same object share an fo.Instance to link them across time.

Added field: Frame-level fo.Keypoints with fo.Instance linking. Objects maintain consistent identity across frames, enabling trajectory analysis.

model.operation = "tracking"
model.prompt = "person's hand"

dataset.apply_model(
    model,
    "track_pred",
    batch_size=16,
    num_workers=1,
    skip_failures=False
)
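
The sketch below groups the tracked keypoints into per-object trajectories. It is a minimal example that keys trajectories on each keypoint's label and index rather than its fo.Instance, and assumes the track_pred field name used above:

from collections import defaultdict

trajectories = defaultdict(list)

sample = dataset.first()
for frame_number, frame in sample.frames.items():
    keypoints = frame["track_pred"]
    if keypoints is None:
        continue
    for kp in keypoints.keypoints:
        # Keypoints for the same object share a label/index across frames
        trajectories[(kp.label, kp.index)].append((frame_number, kp.points))

for obj, points in trajectories.items():
    print(obj, len(points), "frames")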

Comprehensive#

Run a full video analysis that extracts multiple types of information: summary, events, objects, text content, scene info, and activities.

Added fields: Multiple fields are added to each sample:

  • comprehensive_summary — text description

  • comprehensive_events — fo.TemporalDetections for activities/events

  • comprehensive_objects — fo.TemporalDetections with first/last appearance times

  • comprehensive_scene_info_* — fo.Classification fields (setting, time_of_day, location_type)

  • comprehensive_activities_* — fo.Classification or fo.Classifications

model.operation = "comprehensive"

dataset.apply_model(
    model,
    "comprehensive",
    batch_size=2,
    num_workers=1,
    skip_failures=False
)
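
A quick sketch of reading the results back, assuming the comprehensive label-field prefix used above:

sample = dataset.first()

print(sample["comprehensive_summary"])

# Temporal detections for the extracted events
events = sample["comprehensive_events"]
if events is not None:
    for event in events.detections:
        # support holds the [first, last] frame range of the event
        print(event.label, event.support)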

Temporal Localization#

Find and localize activity events in the video with start/end timestamps.

Added field: fo.TemporalDetections on each sample containing detected events with their time intervals and descriptions.

model.operation = "temporal_localization"

dataset.apply_model(
    model,
    "temporal_localization",
    batch_size=2,
    num_workers=1,
    skip_failures=False
)
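
Because the events are stored as fo.TemporalDetections, the dataset can be converted into a clips view in which each detected event becomes its own sample, trimmed to its frame support:

# One sample per detected event in the temporal_localization field
clips = dataset.to_clips("temporal_localization")
print(clips)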

Launch the App#

session = fo.launch_app(dataset, auto=False)

Citation#

If you use Molmo2 in your research, please cite the technical report:

@techreport{molmo2,
  title={Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding},
  author={Clark, Christopher and Zhang, Jieyu and Ma, Zixian and Park, Jae Sung and Salehi, Mohammadreza and Tripathi, Rohun and Lee, Sangho and Ren, Zhongzheng and Kim, Chris Dongjoo and Yang, Yinuo and Shao, Vincent and Yang, Yue and Huang, Weikai and Gao, Ziqi and Anderson, Taira and Zhang, Jianrui and Jain, Jitesh and Stoica, George and Han, Winston and Farhadi, Ali and Krishna, Ranjay},
  institution={Allen Institute for AI},
  year={2025}
}