Note
This is a community plugin, an external project maintained by its respective author. Community plugins are not part of FiftyOne core and may change independently. Please review each plugin’s documentation and license before use.
PDF Loader Plugin#
A FiftyOne plugin for loading PDFs as images.
Installation#
If you haven’t already, install FiftyOne:
pip install fiftyone
Then install the plugin and its dependencies:
fiftyone plugins download https://github.com/brimoor/pdf-loader
brew install poppler
pip install pdf2image
Usage#
Using the App UI#
Launch the App:
import fiftyone as fo
dataset = fo.Dataset()
session = fo.launch_app(dataset)
Press
`
or click theBrowse operations
icon above the gridRun the
pdf_loader
operator
Using the SDK#
You can use the plugin programmatically from Python:
import fiftyone as fo
import fiftyone.operators as foo
import requests
import os
# Download a PDF from a URL (optional - you can use any local PDF)
url = "https://arxiv.org/pdf/2309.11419"
filename = url.split('/')[-1] + ".pdf" # Add .pdf extension
response = requests.get(url)
if response.status_code == 200:
with open(filename, 'wb') as f:
f.write(response.content)
print(f"Downloaded {filename}")
else:
print(f"Failed to download {filename}. Status code: {response.status_code}")
# Load the PDF loader operator
pdf_loader = foo.get_operator("@brimoor/pdf-loader/pdf_loader")
# Create a dataset for the PDF pages
pdf_dataset = fo.Dataset("pdf_dataset")
# Convert PDF to images and add to dataset
pdf_loader(
pdf_dataset,
input_path="./2309.11419.pdf", # Path to your PDF file
output_dir="./pdf_images", # Directory to save the images
dpi=200, # Image quality in DPI
fmt="png", # Image format (png or jpg)
tags=None, # Optional tags for samples
delegate=False # Set to True for async execution
)
What next?#
Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!
Install the plugins and their dependencies:
fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract
<video controls width="100%" style="max-width: 600px; height: auto;"><source src="https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers
Launch a Qdrant server:
docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant" type="video/mp4"></video>
Run the
run_ocr_engine
operator to detect text blocksRun the
create_semantic_document_index
operator to generate a semantic index for the text blocksRun the
semantically_search_documents
operator to perform arbitrary searches against the index!
Implementation#
This plugin is a basically a wrapper around the following code:
import os
from pdf2image import convert_from_path
INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")
dataset.add_images_dir(OUTPUT_DIR)