Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

GitHub Repo

MinerU 2.5 for FiftyOne#

image

A FiftyOne zoo model integration for MinerU2.5, a 1.2B-parameter vision-language model for efficient high-resolution document parsing.

About MinerU 2.5#

MinerU2.5 achieves state-of-the-art document parsing accuracy with a two-stage strategy:

  • Global layout analysis on downsampled images

  • Fine-grained content recognition on native-resolution crops

Key capabilities:

  • Comprehensive layout analysis (headers, footers, lists, code blocks)

  • Complex mathematical formula parsing (including mixed Chinese-English)

  • Robust table parsing (handles rotated, borderless, and partial-border tables)

For more details, see the model card and technical report.

Installation#

pip install fiftyone
pip install "mineru-vl-utils[transformers]"

Optional: Install Caption Viewer Plugin#

For the best experience viewing extracted text, we recommend installing the Caption Viewer plugin, which provides intelligent formatting for OCR outputs and text fields:

fiftyone plugins download https://github.com/harpreetsahota204/caption_viewer

This plugin automatically:

  • Renders line breaks and escape sequences properly

  • Converts HTML tables to markdown

  • Pretty-prints JSON content

  • Shows character counts

Usage in FiftyOne#

Register the Model Source#

import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/mineru_2_5",
    overwrite=True
)

Load the Model#

# Load with default settings (fast layout detection mode)
model = foz.load_zoo_model(
    "opendatalab/MinerU2.5-2509-1.2B",
    operation="layout_detection",  # Fast mode (default)
    batch_size=8  # Process 8 images at once
)

Three Operation Modes#

MinerU now supports three operation modes with different speed/accuracy tradeoffs:

2. OCR Detection (SLOW but Complete)#

Full extraction with bounding boxes AND OCR content:

model.operation = "ocr_detection"
dataset.apply_model(model, label_field="text_detections")

Speed: ~800ms per image (15 blocks Ă— 50ms + overhead)
Returns: fo.Detections with bounding boxes, types, AND extracted text
Use case: Complete document extraction with precise element locations

3. OCR (FAST - Text Only)#

Extract all text content as a single string:

model.operation = "ocr"
dataset.apply_model(model, label_field="text_extraction")

Speed: ~50ms per image
Returns: Plain text string
Use case: Full-text search, content indexing

Detection Format#

All detection modes return fo.Detections with:

  • label: Element type (text, title, table, equation, image, etc.)

  • bounding_box: Normalized coordinates [x, y, width, height]

  • text: Extracted text content (empty for layout_detection mode)

  • angle: Rotation angle (0, 90, 180, or 270)

Complete Example#

import fiftyone as fo
import fiftyone.zoo as foz

# Register and load model
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/mineru_2_5",
    overwrite=True
)

# Load model with batching support (NEW!)
model = foz.load_zoo_model(
    "opendatalab/MinerU2.5-2509-1.2B",
    operation="layout_detection",  # Fast mode (default)
    batch_size=8  # Process 8 images at once
)

# Load your dataset
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/images",
    dataset_type=fo.types.ImageDirectory,
)

# Apply FAST layout detection (with batching!)
model.operation = "layout_detection"
dataset.apply_model(model, label_field="layout")

# Optionally: Apply full OCR to specific samples
# model.operation = "ocr_detection"
# dataset.match(F("layout.detections").length() > 5).apply_model(
#     model, label_field="detailed_ocr"
# )

# Launch the app
session = fo.launch_app(dataset)

Performance Comparison#

Operation

Inference Passes

Speed (per image)

Use Case

layout_detection

1 pass

~50ms

Fast bounding boxes only

ocr

1 pass

~50ms

Fast text extraction

ocr_detection

N+1 passes

~800ms

Complete extraction

Batching speedup: With batch_size=8, you get an additional 3-5x speedup from GPU parallelization!

Model Configuration#

Operations#

  • layout_detection (default): Fast layout with bounding boxes → fo.Detections (1 pass)

  • ocr_detection: Full extraction with content → fo.Detections (N+1 passes)

  • ocr: Plain text only → str (1 pass)

Batching (NEW!)#

The model now supports efficient batching via FiftyOne’s SupportsGetItem interface:

# Load model with initial batch size
model = foz.load_zoo_model(
    "opendatalab/MinerU2.5-2509-1.2B",
    operation="layout_detection",
    batch_size=8,  # Internal MinerU batch size (default: 8)
)

# Batching happens automatically with apply_model()
dataset.apply_model(
    model, 
    label_field="results",
    batch_size=8,      # FiftyOne DataLoader batch size
    num_workers=4      # Parallel data loading workers
)

Why set batch_size in two places?

  • model.batch_size: Controls MinerU’s internal inference batching (how many images the VLM processes at once)

  • apply_model(..., batch_size=...): Controls FiftyOne’s DataLoader batching (how many images are loaded in parallel)

For best performance, set both to the same value:

BATCH_SIZE = 32

# Update model's internal batch size
model.batch_size = BATCH_SIZE
model.operation = "layout_detection"

# Apply with matching DataLoader batch size
dataset.apply_model(
    model, 
    label_field="layout",
    batch_size=BATCH_SIZE,  # Match model.batch_size
    num_workers=8           # 8 workers for parallel I/O
)

Benefits:

  • 5-10x faster than sequential processing

  • 70-90% GPU utilization (vs 20-30% without batching)

  • Parallel data loading with num_workers

  • Optimal throughput when both batch sizes match

Advanced Options#

The model automatically:

  • Selects the best available device (CUDA/MPS/CPU)

  • Uses torch.float16 on CUDA, float32 on CPU/MPS

  • Enables Flash Attention 2 when available on CUDA devices

  • Optimizes model loading (no device_map overhead for small models)

Requirements#

  • Python 3.8+

  • FiftyOne

  • PyTorch

  • transformers >= 4.56.0 (recommended)

  • mineru-vl-utils

Citation#

@misc{niu2025mineru25decoupledvisionlanguagemodel,
    title={MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing}, 
    author={Junbo Niu and Zheng Liu and Zhuangcheng Gu and Bin Wang and Linke Ouyang and Zhiyuan Zhao and Tao Chu and Tianyao He and Fan Wu and Qintong Zhang and Zhenjiang Jin and others},
    year={2025},
    eprint={2509.22186},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.22186}
}

License#

AGPL-3.0 (as per the original MinerU2.5 model)