Note: This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
ColModernVBert for FiftyOne#
Integration of ColModernVBert as a FiftyOne Zoo Model for fine-grained multimodal document retrieval and zero-shot classification.

Overview#
ColModernVBert is a multi-vector vision-language model built on the ModernVBert architecture that generates ColBERT-style embeddings for both images and text. Unlike single-vector models that compress entire images into a single representation, ColModernVBert produces multiple 128-dimensional vectors per input, enabling fine-grained matching between specific image regions and text tokens.
Key Features#
- Multi-Vector Embeddings: Variable-length sequences of 128-dimensional vectors
  - Images: ~884 vectors per image
  - Text: ~13 vectors per query
- MaxSim Scoring: ColBERT-style late interaction for fine-grained matching 
- Pre-Compressed Vectors: No token pooling required (already 128-dim per vector) 
- Dual-Mode Operation: Pooled 128-dim for retrieval, full multi-vectors for classification 
- Zero-Shot Classification: Use text prompts to classify images without training 
- Document Understanding: Optimized for visual document analysis 
Architecture#
Multi-Vector Design#
ColModernVBert uses a multi-vector architecture inspired by ColBERT, where each input (image or text) is represented by multiple vectors rather than a single embedding:
# Image or Text → Processor → Model → (batch, num_vectors, 128)
Benefits of Multi-Vectors:
- Fine-grained matching: Match specific image regions to text tokens 
- Better accuracy: Capture more detailed information than single vectors 
- Late interaction: Efficient MaxSim scoring at query time 
Dual-Mode Operation#
This integration supports two workflows optimized for different use cases:
Mode 1: Retrieval/Similarity Search#
For efficient large-scale search, multi-vectors are pooled to fixed 128-dim embeddings:
Multi-vectors (N, 128) → Final Pooling (mean/max) → 128-dim vector
Use case: Similarity search, embeddings visualization, clustering
Mode 2: Zero-Shot Classification#
For accurate classification, full multi-vectors are used with MaxSim scoring:
Image multi-vectors × Text multi-vectors → MaxSim → Classification scores
Use case: Zero-shot classification, fine-grained document analysis
How MaxSim Works#
MaxSim (Maximum Similarity) is a late interaction scoring mechanism:
- For each text vector, find its maximum similarity with any image vector 
- Sum these maximum similarities across all text vectors 
- Result: A score that captures fine-grained matches between text and image 
This allows the model to match specific keywords to relevant image regions, providing better accuracy than single-vector approaches.
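Conceptually, MaxSim is a small tensor operation. The following is a minimal sketch in plain PyTorch, using random tensors in place of actual ColModernVBert outputs (the shapes mirror the typical ~884 image vectors and ~13 query vectors mentioned above):
import torch
import torch.nn.functional as F

# Dummy multi-vector outputs (illustrative shapes only)
image_vecs = torch.randn(884, 128)  # one image's multi-vectors
query_vecs = torch.randn(13, 128)   # one query's multi-vectors

# L2-normalize so dot products behave like cosine similarities
image_vecs = F.normalize(image_vecs, dim=-1)
query_vecs = F.normalize(query_vecs, dim=-1)

# (13, 884) matrix of query-token vs image-patch similarities
sim = query_vecs @ image_vecs.T

# MaxSim: best-matching image vector per query vector, summed over the query
maxsim_score = sim.max(dim=-1).values.sum()
print(maxsim_score)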
Installation#
Note: This model requires the colpali-engine package, which provides the ColModernVBert implementation.
# Install FiftyOne and ColModernVBert dependencies
pip install fiftyone torch transformers pillow
pip install git+https://github.com/illuin-tech/colpali.git@vbert#egg=colpali-engine
Quick Start#
Load Dataset#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load document dataset from Hugging Face
dataset = load_from_hub(
    "Voxel51/document-haystack-10pages",
    overwrite=True,
    max_samples=250  # Optional: subset for testing
)
Register the Zoo Model#
import fiftyone.zoo as foz
# Register this repository as a remote zoo model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/colmodernvbert",
    overwrite=True
)
Basic Workflow#
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
# Load ColModernVBert model
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="mean"  # or "max"
)
# Compute embeddings for all documents
# Multi-vectors are pooled to 128-dim for storage
dataset.compute_embeddings(
    model=model,
    embeddings_field="colmodernvbert_embeddings"
)
# Check embedding dimensions
print(dataset.first()['colmodernvbert_embeddings'].shape)  # (128,)
# Build similarity index
text_img_index = fob.compute_similarity(
    dataset,
    model="ModernVBERT/colmodernvbert",
    embeddings_field="colmodernvbert_embeddings",
    brain_key="colmodernvbert_sim",
    model_kwargs={"pooling_strategy": "mean"}
)
# Query for specific content
results = text_img_index.sort_by_similarity(
    "invoice from 2024",
    k=10  # Top 10 results
)
# Launch FiftyOne App
session = fo.launch_app(results, auto=False)
Pooling Strategies#
The pooling strategy determines how multi-vectors are compressed to fixed-dimension embeddings for retrieval:
Mean Pooling (Default)#
Averages all vectors to create a holistic representation:
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="mean"
)
Best for:
- General document retrieval 
- Holistic semantic matching 
- When overall content matters more than specific details 
Max Pooling#
Takes the maximum value across vectors for each dimension:
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="max"
)
Best for:
- Keyword-based search 
- Finding specific content or phrases 
- When any matching element is sufficient 
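For intuition, here is a rough sketch of what the two strategies do to a single multi-vector embedding, using a dummy PyTorch tensor rather than the integration's actual code path:
import torch

# Dummy multi-vector embedding for one image: (num_vectors, 128)
multi_vecs = torch.randn(884, 128)

# Mean pooling: average every vector into one holistic 128-dim embedding
mean_pooled = multi_vecs.mean(dim=0)        # shape: (128,)

# Max pooling: keep the strongest activation per dimension across vectors
max_pooled = multi_vecs.max(dim=0).values   # shape: (128,)

print(mean_pooled.shape, max_pooled.shape)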
Advanced Embedding Workflows#
Embedding Visualization with UMAP#
Create 2D visualizations of your document embeddings:
import fiftyone.brain as fob
# First compute embeddings
dataset.compute_embeddings(
    model=model,
    embeddings_field="colmodernvbert_embeddings"
)
# Create UMAP visualization
results = fob.compute_visualization(
    dataset,
    method="umap",  # Also supports "tsne", "pca"
    brain_key="colmodernvbert_viz",
    embeddings="colmodernvbert_embeddings",
    num_dims=2
)
# Explore in the App
session = fo.launch_app(dataset)
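The returned results object also exposes the computed 2D coordinates directly, if you want to work with them outside the App:
# Each sample is reduced to a 2D point: shape (num_samples, 2)
print(results.points.shape)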
Similarity Search#
Build a similarity index for sample-to-sample search:
import fiftyone.brain as fob
results = fob.compute_similarity(
    dataset,
    backend="sklearn",
    brain_key="colmodernvbert_sim",
    embeddings="colmodernvbert_embeddings"
)
# Find similar images
sample_id = dataset.first().id
similar_samples = dataset.sort_by_similarity(
    sample_id,
    brain_key="colmodernvbert_sim",
    k=10
)
# View results
session = fo.launch_app(similar_samples)
Dataset Representativeness#
Score how representative each sample is of your dataset:
import fiftyone.brain as fob
# Compute representativeness scores
fob.compute_representativeness(
    dataset,
    representativeness_field="colmodernvbert_represent",
    method="cluster-center",
    embeddings="colmodernvbert_embeddings"
)
# Find most representative samples
representative_view = dataset.sort_by("colmodernvbert_represent", reverse=True)
Duplicate Detection#
Find and remove near-duplicate documents:
import fiftyone.brain as fob
# Detect duplicates using embeddings
results = fob.compute_uniqueness(
    dataset,
    embeddings="colmodernvbert_embeddings"
)
# Filter to most unique samples
unique_view = dataset.sort_by("uniqueness", reverse=True)
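To actually remove near-duplicates, you can filter on the populated uniqueness field; the threshold below is just an illustrative value to tune for your dataset:
from fiftyone import ViewField as F

# Treat the least unique samples as likely near-duplicates
dupes_view = dataset.match(F("uniqueness") < 0.2)
print(f"{len(dupes_view)} potential near-duplicates")

# Optionally tag them for review before deleting
dupes_view.tag_samples("potential_duplicate")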
Zero-Shot Classification#
ColModernVBert excels at zero-shot classification using multi-vector MaxSim scoring:
import fiftyone.zoo as foz
# Load model with classes for classification
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    classes=["invoice", "receipt", "form", "contract", "other"],
    text_prompt="This document is a",
    pooling_strategy="max"  # Max pooling often works well for classification
)
# Apply model for zero-shot classification
# Uses full multi-vectors with MaxSim (not pooled embeddings)
dataset.apply_model(
    model,
    label_field="document_type_predictions"
)
# View predictions
print(dataset.first()['document_type_predictions'])
session = fo.launch_app(dataset)
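Predictions are stored as standard FiftyOne Classification labels, so you can filter on them like any other field, for example:
from fiftyone import ViewField as F

# View only the documents predicted as invoices
invoices = dataset.match(F("document_type_predictions.label") == "invoice")
print(len(invoices))
session = fo.launch_app(invoices)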
Dynamic Classification with Multiple Tasks#
Reuse the same model for different classification tasks:
import fiftyone.zoo as foz
# Load model once
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="max"
)
# Task 1: Classify document types
model.classes = ["invoice", "receipt", "form", "contract"]
model.text_prompt = "This is a " 
dataset.apply_model(model, label_field="doc_type")
# Task 2: Classify importance
model.classes = ["high_priority", "medium_priority", "low_priority"]
model.text_prompt = "The priority level is "
dataset.apply_model(model, label_field="priority")
# Task 3: Classify language
model.classes = ["english", "spanish", "french", "german", "chinese"]
model.text_prompt = "The document language is "
dataset.apply_model(model, label_field="language")
# Task 4: Classify completeness
model.classes = ["complete", "incomplete", "draft"]
model.text_prompt = "The document status is "
dataset.apply_model(model, label_field="status")
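Because classes and text_prompt are plain attributes, the same workflow can be driven from a loop, which keeps multi-task runs tidy; this is just a sketch using the fields defined above:
# (label_field, text_prompt, classes) triples for each task
tasks = [
    ("doc_type", "This is a ", ["invoice", "receipt", "form", "contract"]),
    ("priority", "The priority level is ", ["high_priority", "medium_priority", "low_priority"]),
    ("language", "The document language is ", ["english", "spanish", "french", "german", "chinese"]),
    ("status", "The document status is ", ["complete", "incomplete", "draft"]),
]

for label_field, prompt, classes in tasks:
    model.classes = classes
    model.text_prompt = prompt
    dataset.apply_model(model, label_field=label_field)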
Technical Details#
FiftyOne Integration Architecture#
Retrieval Pipeline (Pooled Mode):
dataset.compute_embeddings(model, embeddings_field="embeddings")
> embed_images()
    > processor.process_images(imgs)
        > model(**inputs)
            > Multi-vectors (batch, N, 128)
                > Final pooling (mean/max)
                    > Returns (batch, 128) pooled embeddings
                        > Stores in FiftyOne for similarity search
Classification Pipeline (Multi-Vector Mode):
dataset.apply_model(model, label_field="predictions")
> _predict_all()
    > Get image multi-vectors (batch, N, 128)
    > Get text multi-vectors for classes (num_classes, M, 128)
    > processor.score() with MaxSim
        > Returns (batch, num_classes) logits
            > Output processor → Classification labels
Typical Use Cases#
| Use Case | Mode | Pooling Strategy | Notes | 
|---|---|---|---|
| Document retrieval | Pooled | Mean | Efficient for large-scale search | 
| Keyword search | Pooled | Max | Finds specific content matches | 
| Zero-shot classification | Multi-vector | N/A | Highest accuracy with MaxSim | 
| Fine-grained matching | Multi-vector | N/A | Match specific regions | 
| Embeddings visualization | Pooled | Mean | Holistic semantic space | 
| Duplicate detection | Pooled | Mean | Fast similarity computation | 
Combining Embeddings and Classification#
Use the same model for both workflows:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
# Load model once
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="mean"
)
# Step 1: Compute pooled embeddings for similarity search
dataset.compute_embeddings(
    model=model,
    embeddings_field="colmodernvbert_embeddings"
)
# Step 2: Build similarity index
index = fob.compute_similarity(
    dataset,
    model="ModernVBERT/colmodernvbert",
    embeddings_field="colmodernvbert_embeddings",
    brain_key="colmodernvbert_sim"
)
# Step 3: Add zero-shot classification (uses full multi-vectors)
model.classes = ["technical", "financial", "legal", "personal"]
model.text_prompt = "This document category is"
dataset.apply_model(model, label_field="category")
# Step 4: Add more classifications
model.classes = ["urgent", "normal", "low_priority"]
model.text_prompt = "The urgency level is"
dataset.apply_model(model, label_field="urgency")
# Explore combined results
session = fo.launch_app(dataset)
Resources#
- Model Hub: ModernVBERT/colmodernvbert 
- ColPali Engine: colpali-engine 
- FiftyOne Docs: docs.voxel51.com 
- Base Architecture: ModernVBert 
- Inspiration: ColBERT late interaction 
Citation#
If you use ColModernVBert in your research, please cite:
@misc{teiletche2025modernvbertsmallervisualdocument,
      title={ModernVBERT: Towards Smaller Visual Document Retrievers}, 
      author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
      year={2025},
      eprint={2510.01149},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2510.01149}, 
}
License#
- Model: MIT 
- Integration Code: Apache 2.0 (see LICENSE) 
Contributing#
Found a bug or have a feature request? Please open an issue on GitHub!
Acknowledgments#
- ModernVBERT Team for the excellent ColModernVBert model 
- ColPali Engine for the model implementation and processor 
- ColBERT for pioneering multi-vector late interaction 
- Voxel51 for the FiftyOne framework and brain module architecture 
- HuggingFace for model hosting