Note
This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
DeepSeek-OCR FiftyOne Zoo Model#
DeepSeek-OCR is a vision-language model designed for optical character recognition with a focus on “contextual optical compression.” Unlike traditional OCR engines, it uses a dual-encoder architecture (SAM + CLIP) to process documents and convert them to structured text formats like Markdown.

Key Features:
- Supports multiple resolution modes for different document types
- Can process documents with complex layouts, tables, and formulas
- Outputs structured Markdown with bounding box annotations
- Handles multi-page PDFs and various image formats
Requirements#
Important: This model requires specific versions of transformers and tokenizers:
pip install transformers==4.46.3
pip install tokenizers==0.20.3
pip install addict
pip install fiftyone
pip install torch
pip install torchvision
pip install einops
Optional (for GPU acceleration):
pip install flash-attn --no-build-isolation
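After installing, you can sanity-check that the pinned versions are the ones actually active in your environment:
import transformers
import tokenizers
print(transformers.__version__)  # expect 4.46.3
print(tokenizers.__version__)    # expect 0.20.3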
Installation and Setup#
Register the Model Source#
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/deepseek_ocr",
    overwrite=True,  # ensures you're always using the latest implementation
)
Load the Model#
# Load the model
model = foz.load_zoo_model("deepseek-ai/DeepSeek-OCR")
Usage Examples#
Load a Dataset#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/document-haystack-10pages")
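As the comment notes, load_from_hub() also accepts max_samples, which is handy for a quick smoke test before processing the full dataset:
# For a quick test, cap the number of samples loaded:
# dataset = load_from_hub("Voxel51/document-haystack-10pages", max_samples=10)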
Grounding Mode - Extract Text with Bounding Boxes#
# Grounding Mode - Extract text with bounding boxes
model.resolution_mode = "gundam"
model.operation = "grounding"
dataset.apply_model(model, label_field="text_detections")
Free OCR - Text Extraction Only#
# Free OCR
model.operation = "ocr"
dataset.apply_model(model, label_field="text_extraction")
Describe Mode - Document Description#
# Describe mode
model.operation = "describe"
dataset.apply_model(model, label_field="doc_description")
Custom Prompt#
# Custom prompt
model.operation = "grounding"
model.prompt = "<image>\n<|grounding|>Locate <|ref|>The secret<|/ref|> in the image."
dataset.apply_model(model, label_field="custom_detections")
Resolution Modes#
DeepSeek-OCR provides five predefined resolution modes optimized for different document types:
Single-View Modes (crop_mode=False)#
These modes process the entire image as one single view at a fixed resolution:
| Mode | Base Size | Image Size | Crop Mode | Vision Tokens | Description |
|---|---|---|---|---|---|
| Tiny | 512 | 512 | False | 64 | Fastest, for very simple documents |
| Small | 640 | 640 | False | 100 | Fast, for simple receipts/forms |
| Base | 1024 | 1024 | False | 256 | Balanced, for standard documents |
| Large | 1280 | 1280 | False | 400 | Highest quality, slower |
How it works:
Your image (any size) → Resized/padded to [N×N] → Single view → Fixed token count
Multi-View Mode: Gundam (crop_mode=True)#
Gundam mode is the recommended default for complex documents. It processes images using two complementary views:
| Mode | Base Size | Image Size | Crop Mode | Vision Tokens | Description |
|---|---|---|---|---|---|
| Gundam | 1024 | 640 | True | Variable | Multi-view for complex layouts |
How it works:
Your image (any size) → [1024×1024 global view] (overall structure)
                      + [640×640 patches × N] (fine details)
                      → 256 + (N × 100) tokens
The model automatically determines how many 640×640 patches are needed based on your image dimensions.
Visual Example:
For a 2400×3200 pixel image with Gundam mode:
Global View:          Local Patches:
                       1  2  3  4
1024×1024       +      5  6  7  8
                       9 10 11 12
(each patch is 640×640)
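To make the token accounting concrete, here is a minimal sketch of the 256 + (N × 100) arithmetic. The tiling heuristic below (ceiling division with a patch cap) is an illustrative assumption, not the model's exact crop-selection logic:
import math

def estimate_gundam_tokens(width, height, patch=640, max_patches=12):
    # Assumed tiling: ceil-divide each dimension by the patch size,
    # capped at max_patches total (the real heuristic may differ)
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    n = min(cols * rows, max_patches)
    return 256 + n * 100  # global view (256) + 100 tokens per local patch

# The 2400×3200 example above (12 patches): 256 + 12*100 = 1456 tokens
print(estimate_gundam_tokens(2400, 3200))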
Key Parameters#
resolution_mode#
Controls the processing resolution and strategy. Options: "gundam" (default), "base", "small", "large", "tiny".
operation#
Determines the task type and output format:
"grounding"- Returnsfo.Detectionswith bounding boxes"ocr"- Returns text string"describe"- Returns description text string
Custom Prompts#
You can create custom prompts to guide the model toward specific extraction tasks. The model automatically infers the output type based on the prompt content.
Guidelines:
- Always include the <image> placeholder
- Include <|grounding|> for detection output with bounding boxes
- Omit <|grounding|> for text-only output
Examples:
# Grounding with bounding boxes - returns fo.Detections
model.prompt = "<image>\n<|grounding|>Extract all table content."
model.prompt = "<image>\n<|grounding|>Find all headers and section titles."
model.prompt = "<image>\n<|grounding|>Locate all monetary amounts."
# Text-only output - returns string
model.prompt = "<image>\nExtract only phone numbers and email addresses."
model.prompt = "<image>\nSummarize the main points in bullet format."
model.prompt = "<image>\nTranslate the document text to Spanish."
When using custom prompts, the model automatically determines the output format based on whether <|grounding|> is present.
Best Practices and Recommendations#
Choosing the Right Resolution Mode#
| Document Type | Recommended Settings | Rationale |
|---|---|---|
| Complex PDFs, academic papers | Gundam (1024/640/True) | Captures both structure and details |
| Multi-column layouts | Gundam (1024/640/True) | Handles complex spatial relationships |
| Tables and forms | Gundam (1024/640/True) | Preserves table structure |
| Standard single-page docs | Base (1024/1024/False) | Balanced speed and quality |
| Simple receipts | Small (640/640/False) | Fast processing |
| Quick testing/preview | Tiny (512/512/False) | Fastest option |
| High-res scans | Large (1280/1280/False) | Maximum quality |
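If you process mixed document types, a hypothetical helper that encodes the table above as a lookup (the key names are illustrative, not part of the plugin API):
# Hypothetical mapping of document type to the recommended resolution_mode
RECOMMENDED_MODE = {
    "complex_pdf": "gundam",
    "multi_column": "gundam",
    "tables_forms": "gundam",
    "standard_doc": "base",
    "simple_receipt": "small",
    "quick_preview": "tiny",
    "high_res_scan": "large",
}
model.resolution_mode = RECOMMENDED_MODE["standard_doc"]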
Visualization#
For viewing extracted text and captions, install the caption viewer plugin:
fiftyone plugins download https://github.com/mythrandire/caption-viewer
Additional Resources#
Official Repository: https://github.com/deepseek-ai/DeepSeek-OCR
Model Card: https://huggingface.co/deepseek-ai/DeepSeek-OCR
Paper: DeepSeek-OCR: Contexts Optical Compression (arXiv, 2025)
Citation#
@article{wei2024deepseek-ocr,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}