Note

This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.

GitHub Repo

Gemini Vision Plugin#


Plugin Overview#

This plugin integrates Google Gemini’s multimodal Vision models (including the latest gemini-3-pro-preview) into your FiftyOne workflows. Prompt with text and one or more images; receive a text response grounded in visual inputs. Now featuring Gemini 3.0 with advanced reasoning capabilities!

Installation#

If you haven’t already, install FiftyOne:

pip install fiftyone

Then, install the plugin:

fiftyone plugins download https://github.com/AdonaiVera/gemini-vision-plugin

To use Gemini Vision, set the following environment variable with your API key:

  • GEMINI_API_KEY

Getting your API Key: Follow this step-by-step guide to create your Gemini API key: Getting Your API Key Guide
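
For example, you can export the variable in your shell before launching FiftyOne, or set it from Python in the session that launches the App. A minimal sketch (fill in your own key):

import os

# Make the key visible to the plugin for this Python session;
# alternatively, run `export GEMINI_API_KEY=...` in your shell first
os.environ["GEMINI_API_KEY"] = "<your-gemini-api-key>"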

Important: You need an active Google Cloud account with billing enabled and credits to use the Gemini API. The free tier has limited quotas. If you encounter quota errors like “Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests”, you’ll need to:

  1. Enable billing on your Google Cloud project

  2. Purchase credits or upgrade to a paid plan

  3. Monitor your usage at: https://ai.dev/usage?tab=rate-limit

Refer to the official docs for pricing and quotas: https://ai.google.dev/gemini-api/docs/rate-limits

Getting your Data into FiftyOne#

To use Gemini Vision, you will need a dataset of images in FiftyOne. If you don’t have one, you can load a small sample dataset, such as the BDDOIA safe/unsafe driving dataset used below:

import fiftyone as fo
import fiftyone.zoo as foz

# Load the BDD unsafe/safe driving dataset from the remote zoo source
dataset = foz.load_zoo_dataset(
    "https://github.com/AdonaiVera/bddoia-fiftyone",
    split="validation",
    max_samples=10,
)
dataset.persistent = True

# View the dataset in the App
session = fo.launch_app(dataset)

Operators#

query_gemini_vision#


Chat with your images using Gemini Vision models.

Inputs:

  • query_text: The text to prompt Gemini with

  • model: Select from available Gemini models (default: gemini-3-pro-preview)

  • thinking_level: NEW in Gemini 3.0 - Control reasoning depth (low for speed/cost, high for complex reasoning)

  • max_tokens: The maximum number of output tokens to generate (up to 64K with Gemini 3.0)

The operator encodes all selected images and sends them along with your text prompt to the Gemini Vision API. The model’s text response is displayed in the output panel. With Gemini 3.0, you get a 1M-token context window and enhanced reasoning!
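Like other FiftyOne operators, this one can also be triggered programmatically via fiftyone.operators.execute_operator(). The operator URI and context below are assumptions for illustration; check the App’s operator browser (or fiftyone plugins list) for the URI this plugin actually registers:

import fiftyone.operators as foo

# Hypothetical operator URI; verify the plugin's actual namespace before running
OPERATOR_URI = "@adonaivera/gemini-vision-plugin/query_gemini_vision"

# The operator works on the currently selected images, so pass their sample IDs
selected_ids = [dataset.first().id]

result = foo.execute_operator(
    OPERATOR_URI,
    ctx=dict(
        view=dataset.view(),
        selected=selected_ids,
        params=dict(
            query_text="Is it safe for the ego vehicle to keep driving here?",
            model="gemini-3-pro-preview",
            thinking_level="low",
            max_tokens=1024,
        ),
    ),
)
print(result)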

text_to_image#


Generate high-quality images from text descriptions using Gemini’s image generation capabilities.

Inputs:

  • prompt: Text description of the image to generate

  • model: Choose between gemini-2.5-flash-image or gemini-3-pro-image-preview (Nano Banana Pro - default, supports 2K/4K)

  • aspect_ratio: Choose from multiple aspect ratios (1:1, 16:9, 9:16, etc.)

The generated image is automatically saved to your dataset with metadata including the prompt and generation type.
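A hedged sketch of invoking this operator from Python, under the same assumptions about the operator URI and context keys as above; after it completes, reload the dataset to pick up the new sample:

import fiftyone.operators as foo

foo.execute_operator(
    "@adonaivera/gemini-vision-plugin/text_to_image",  # assumed URI
    ctx=dict(
        view=dataset.view(),
        params=dict(
            prompt="A rainy city intersection at dusk, photorealistic",
            model="gemini-3-pro-image-preview",
            aspect_ratio="16:9",
        ),
    ),
)

# Pick up the newly generated sample in the App
dataset.reload()
session.refresh()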

image_editing#


Edit existing images using text instructions. Provide an image and use text prompts to add, remove, or modify elements, change the style, or adjust the color grading.

Inputs:

  • prompt: Edit instruction (e.g., “add sunglasses”, “change to watercolor style”)

  • model: NEW - Choose between gemini-2.5-flash-image or gemini-3-pro-image-preview (Nano Banana Pro - default)

  • aspect_ratio: Choose from multiple aspect ratios

Select exactly one image from your dataset. The edited image is automatically saved to your dataset with the original prompt preserved.

multi_image_composition#

Compose a new image from multiple input images. Use multiple images to create a new scene or transfer the style from one image to another.

Inputs:

  • prompt: Composition instruction (e.g., “combine these in a collage”, “transfer style from first to second”)

  • model: NEW - Choose between gemini-2.5-flash-image or gemini-3-pro-image-preview (Nano Banana Pro - default)

  • aspect_ratio: Choose from multiple aspect ratios

Select 2-3 images from your dataset; up to three images works best. The composed image is automatically saved to your dataset.

video_understanding#


Analyze and extract information from videos using Gemini’s video understanding capabilities.

Inputs:

  • task_type: Choose analysis type (describe, segment, extract, question)

  • prompt: Analysis prompt describing what you want to know about the video

  • model: Select from available Gemini models (default: gemini-3-pro-preview)

  • thinking_level: NEW in Gemini 3.0 - Control reasoning depth (low for speed/cost, high for complex reasoning)

  • media_resolution: NEW in Gemini 3.0 - Video frame resolution control:

    • high (1,120 tokens/frame) - Recommended for detailed analysis

    • medium (560 tokens/frame) - Optimal for PDFs

    • low (70 tokens/frame) - Most efficient, fewer tokens consumed

Features:

  • Describe: Get a comprehensive description of video content

  • Segment: Identify and describe different segments within the video

  • Extract: Extract specific information from the video

  • Question: Ask specific questions about video content, including timestamp-based queries (e.g., “What happens at 0:30?”)

Select exactly one video from your dataset. The video must be under 20MB for inline analysis. Analysis results are automatically saved to the video sample’s metadata under the video_analysis field. Gemini 3.0 consumes fewer tokens per video while providing better reasoning!
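For instance, you can verify the size constraint beforehand and read the saved analysis back afterwards. Whether the result lands in a top-level video_analysis field or nested under the sample’s metadata may differ; this sketch assumes a top-level field, so inspect a sample in the App to confirm:

import os

# Assumes `video_dataset` is a FiftyOne video dataset with your videos loaded
video_sample = video_dataset.first()

# The plugin requires videos under 20MB for inline analysis
size_mb = os.path.getsize(video_sample.filepath) / (1024 * 1024)
print(f"Video size: {size_mb:.1f} MB (must be under 20 MB)")

# After running the operator, reload the sample and read the saved results
video_sample.reload()
print(video_sample["video_analysis"])  # assumed field name per the description above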

Happy exploring!

Next Steps#

If you like this plugin and find it useful, please leave a star on the repository!

Future Enhancements#

We’re planning to add more exciting features:

  • Batch Image Generation: Create multiple images from a single query

  • Pipeline Support: Build workflows to generate multiple images with different variations

  • Dynamic Prompting: Use dynamic variables per image for automated, customized generation at scale

Stay tuned for updates!