📄 Prodigy-PDF

This repository contains a Prodigy plugin with recipes for image- and text-based annotation of PDF files, as well as recipes for OCR (Optical Character Recognition) to extract content from documents. The pdf.spans.manual recipe uses spacy-layout and Docling to extract the text contents from PDFs and lets you annotate spans of text, with an optional side-by-side preview of the original document and pre-fetching for faster loading during annotation.

You can install this plugin via pip.

pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"
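
To confirm that the install worked, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch, not part of the plugin: the helper name `installed_version` is hypothetical, and the distribution name `prodigy-pdf` is assumed from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "prodigy-pdf"):
    """Return the installed version string, or None if the package is missing.

    The default distribution name is an assumption taken from the pip command.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("prodigy-pdf does not appear to be installed in this environment")
else:
    print(f"prodigy-pdf {found} is installed")
```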

If you want to use the OCR recipes, you'll also want to ensure that tesseract is installed.

# for mac
brew install tesseract

# for ubuntu
sudo apt install tesseract-ocr
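
Before running the OCR recipes, it can help to check that the `tesseract` binary is actually reachable on your `PATH`. The sketch below uses only the Python standard library. The helper name `tesseract_version` is hypothetical and not part of this plugin.

```python
import shutil
import subprocess


def tesseract_version(binary: str = "tesseract"):
    """Return the first line of `<binary> --version`, or None if not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr, so check both.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


found = tesseract_version()
if found is None:
    print("tesseract not found on PATH; install it before using the OCR recipes")
else:
    print(f"Found: {found}")
```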

To learn more about this plugin, you can check the Prodigy docs.

Issues?

Are you having trouble with this plugin? Let us know on our support forum and we'll get back to you!

