Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

Florence2 FiftyOne Remote Model Zoo Implementation#

As of now, Florence-2 only works with transformers<4.50.0#

This repository provides a FiftyOne Model Zoo implementation for Florence-2, Microsoft’s powerful multimodal model. The implementation allows seamless integration of Florence-2’s capabilities with FiftyOne’s computer vision tools.

NOTE: Due to recent changes in Transformers 4.50.0 (which are expected to be patched by Hugging Face), please ensure you have transformers<=4.49.0 installed before running the model.

Features#

Florence-2 supports multiple vision-language tasks through this implementation:

  1. Image Captioning

    • Three detail levels: basic, detailed, and more_detailed

    • Generates natural language descriptions of images

  2. Optical Character Recognition (OCR)

    • Text extraction from images

    • Optional region-based detection with bounding boxes

  3. Object Detection

    • Multiple detection modes:

      • Standard object detection

      • Dense region captioning

      • Region proposal generation

      • Open vocabulary detection (with custom prompts)

  4. Phrase Grounding

    • Links phrases to specific regions in images

    • Requires a caption or text prompt

  5. Referring Expression Segmentation

    • Segments objects based on natural language descriptions

    • Returns polygon contours for the referenced objects

Installation#

pip install fiftyone
pip install "transformers<=4.49.0"

Usage#

Register and download the model (one-time setup)#


import fiftyone.zoo as foz

foz.register_zoo_model_source("https://github.com/harpreetsahota204/florence2", overwrite=True)

foz.download_zoo_model("https://github.com/harpreetsahota204/florence2", model_name="microsoft/Florence-2-base-ft")

Load the model#

model = foz.load_zoo_model(
    "microsoft/Florence-2-base-ft",
    # install_requirements=True,  # install requirements on first use
    # ensure_requirements=True,  # verify requirements are installed before loading
)

There are four available Florence-2 checkpoints (see the loading example after this list):

  1. microsoft/Florence-2-base - Base model

  2. microsoft/Florence-2-large - Large model

  3. microsoft/Florence-2-base-ft - Fine-tuned base model

  4. microsoft/Florence-2-large-ft - Fine-tuned large model
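
For example, to load the large fine-tuned checkpoint instead (FiftyOne should download it automatically from the source registered above if you haven't fetched it yet):

model = foz.load_zoo_model("microsoft/Florence-2-large-ft")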


Switching Between Operations#

The same model instance can be used for different operations by simply changing its properties:
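
The examples below assume a dataset to run on; any FiftyOne image dataset works. For instance, a small slice of the Dataset Zoo's quickstart dataset:

import fiftyone.zoo as foz

# Any image dataset works; here we grab 25 samples from the quickstart dataset
dataset = foz.load_zoo_dataset("quickstart", max_samples=25)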

Image Captioning#


model.operation = "caption"
model.detail_level = "detailed"  # Options: "basic", "detailed", "more_detailed"
dataset.apply_model(model, label_field="captions")
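
To inspect the result on a single sample (captions are stored as plain strings):

# Print the generated caption for the first sample
sample = dataset.first()
print(sample["captions"])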

OCR#

model.operation = "ocr"
model.store_region_info = True  # True returns Detections with bounding boxes; False returns just the text
dataset.apply_model(model, label_field="text_detections")
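
With store_region_info=True, each detected text region is stored as a Detection whose label is the recognized text, so you can pull out the raw strings directly:

# One list of recognized strings per sample
texts = dataset.values("text_detections.detections.label")
print(texts[0])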

Object Detection#

Florence-2 supports four different types of detection operations, each serving a different purpose:

1. Standard Detection (detection_type="detection")#

model.operation = "detection"
model.detection_type = "detection"
dataset.apply_model(model, label_field="standard_detections")

  • Basic object detection mode

  • Detects common objects in the image

  • Returns bounding boxes with object labels

2. Dense Region Captioning (detection_type="dense_region_caption")#

model.operation = "detection"
model.detection_type = "dense_region_caption"
dataset.apply_model(model, label_field="region_captions")

  • Generates detailed captions for different regions in the image

  • Each region comes with a descriptive caption

  • Useful for understanding scene composition

3. Region Proposal (detection_type="region_proposal")#

model.operation = "detection"
model.detection_type = "region_proposal"
dataset.apply_model(model, label_field="region_proposals")

  • Generates potential regions of interest

  • Identifies areas that might contain objects

  • Useful as a preprocessing step for other tasks

4. Open Vocabulary Detection (detection_type="open_vocabulary_detection")#

model.operation = "detection"
model.detection_type = "open_vocabulary_detection"
model.prompt = "Find all the red cars and blue bicycles"
dataset.apply_model(model, label_field="custom_detections")

  • Detects objects described by a free-form text prompt
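
To sanity-check the output of any of these detection modes, you can aggregate the predicted labels:

# Tally the labels produced by open vocabulary detection
print(dataset.count_values("custom_detections.detections.label"))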

Phrase Grounding#

model.operation = "phrase_grounding"
model.prompt = "person wearing a red hat"
dataset.apply_model(model, label_field="grounding")
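
If your samples already have per-sample captions (e.g., from the captioning operation above), recent FiftyOne versions let you pass them via prompt_field instead of a fixed model.prompt; a sketch, assuming a captions field populated by the captioning example above:

model.operation = "phrase_grounding"

# Ground each sample against its own stored caption
dataset.apply_model(model, label_field="caption_grounding", prompt_field="captions")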

Switch to Segmentation#

model.operation = "segmentation"
model.prompt = "the cat sleeping on the couch"
dataset.apply_model(model, label_field="segments")
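
After running any of the operations above, you can browse the results interactively in the FiftyOne App:

import fiftyone as fo

# Visualize captions, detections, and polylines in the App
session = fo.launch_app(dataset)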

You can look at the example notebook for detailed usage syntax.

Output Formats#

  • Captions: Returns str

    • Natural language text responses in English

  • OCR: Returns str (text only) or fiftyone.core.labels.Detections (when store_region_info=True)

    • Bounding box coordinates are normalized to [0,1] x [0,1]

  • Detection: Returns fiftyone.core.labels.Detections

    • Bounding box coordinates are normalized to [0,1] x [0,1]

  • Phrase Grounding: Returns fiftyone.core.labels.Detections

    • Bounding box coordinates are normalized to [0,1] x [0,1]

  • Segmentation: Returns fiftyone.core.labels.Polylines

    • Polygon point coordinates are normalized to [0,1] x [0,1]
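
Since all coordinates are normalized, converting to pixels only requires the image dimensions. A minimal sketch, assuming the standard_detections field from the detection example above:

# Populate sample.metadata.width/height if not already present
dataset.compute_metadata()

sample = dataset.first()
img_w, img_h = sample.metadata.width, sample.metadata.height

det = sample["standard_detections"].detections[0]
x, y, w, h = det.bounding_box  # [top-left-x, top-left-y, width, height], all in [0, 1]
pixel_box = [x * img_w, y * img_h, w * img_w, h * img_h]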

Device Support#

The implementation automatically selects the appropriate device:

  • CUDA if available

  • Apple M1/M2 MPS if available

  • CPU as fallback
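
A minimal sketch of that selection order (the plugin handles this internally; shown here for reference only):

import torch

# Prefer CUDA, then Apple MPS, then fall back to CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"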

Citation#

@article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
}