A Prodigy plugin for PDF annotation

explosion/prodigy-pdf

📄 Prodigy-PDF

This repository contains a Prodigy plugin with recipes for image- and text-based annotation of PDF files, as well as recipes for OCR (Optical Character Recognition) to extract content from documents. The pdf.spans.manual recipe uses spacy-layout and Docling to extract the text contents from PDFs and lets you annotate spans of text, with an optional side-by-side preview of the original document and pre-fetching for faster loading during annotation.

[Screenshots: the pdf.image.manual recipe and the pdf.spans.manual recipe]

You can install this plugin via pip.

pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"

If you want to use the OCR recipes, you'll also want to ensure that tesseract is installed.

# for mac
brew install tesseract

# for ubuntu
sudo apt install tesseract-ocr
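
After installing, it can be worth confirming that the tesseract binary is actually visible on your PATH before you start an OCR recipe. The snippet below is a minimal sketch of such a check: `have_cmd` is a hypothetical helper name introduced here for illustration and is not part of this plugin, while `tesseract --version` is the standard way to print the installed Tesseract version.

```shell
# Hypothetical helper (not part of prodigy-pdf): succeeds if the
# given command can be found on PATH, fails otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tesseract; then
    # Print the installed version to confirm the binary runs.
    tesseract --version
else
    echo "tesseract not found on PATH; install it before using the OCR recipes" >&2
fi
```

If the check reports that tesseract is missing right after installation, opening a new terminal session is usually enough for the updated PATH to be picked up.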

To learn more about this plugin, you can check the Prodigy docs.

Issues?

Are you having trouble with this plugin? Let us know on our support forum and we'll get back to you!