A Docker-powered service for extracting text from PDF documents
This project extracts text from PDF files using the output of the pdf-document-layout-analysis service. It relies on that service's segmentation and classification of page content to automate the extraction.
Start the service:
# With GPU support
make start
# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu
Get all the text from a PDF:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text
To stop the server:
make stop
- Docker Desktop 4.25.0
- 4 GB of RAM
- 6 GB of GPU memory (optional; without it, the service runs on the CPU)
As mentioned in the Quick Start, you can get all the text in a PDF like this:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text
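The same request can be made from a script. The following is a minimal sketch in Python using only the standard library. The helper names `build_multipart` and `extract_text` are hypothetical, the `http://` scheme is assumed, and because the response format is not described here, the function simply returns the raw response body as a string.

```python
import uuid
import urllib.request
from pathlib import Path

# Endpoint taken from the curl example; the http:// scheme is an assumption.
SERVICE_URL = "http://localhost:5060/text"


def build_multipart(fields, file_field, file_name, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The file part mirrors curl's -F 'file=@...'.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{file_name}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path, url=SERVICE_URL, fields=None):
    """POST a PDF to the service and return the raw response body."""
    path = Path(pdf_path)
    body, content_type = build_multipart(
        fields or {}, "file", path.name, path.read_bytes()
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, `print(extract_text("/PATH/TO/PDF/pdf_name.pdf"))` should be equivalent to the curl command above. Extra form fields can be passed through the `fields` dictionary.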
You can also specify which types of text to extract:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "types=text, title, section header, list item"
These are the types you can pass:
"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"
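When the `types` value is assembled in a script, it can help to check the names against the list above before sending the request. The following Python sketch does that. The helper name `build_types_field` is hypothetical, and the lowercase, comma-separated output simply mirrors the curl example; how the service itself treats letter case or unknown names is not stated here.

```python
# Type names copied from the list above.
ALLOWED_TYPES = [
    "Caption", "Footnote", "Formula", "List item", "Page footer",
    "Page header", "Picture", "Section header", "Table", "Text", "Title",
]


def build_types_field(types):
    """Return a comma-separated `types` value, rejecting unknown names.

    Names are matched case-insensitively and emitted in lowercase,
    following the format used in the curl example (an assumption about
    what the service expects).
    """
    allowed = {name.lower() for name in ALLOWED_TYPES}
    normalized = []
    for raw in types:
        # Collapse repeated whitespace and ignore letter case.
        name = " ".join(raw.split()).lower()
        if name not in allowed:
            raise ValueError(f"Unknown type: {raw!r}")
        if name not in normalized:  # drop duplicates, keep order
            normalized.append(name)
    return ", ".join(normalized)
```

For example, `build_types_field(["Text", "Title"])` produces a string that can be passed as the value of `-F "types=..."`.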
If you want results faster, at the cost of slightly lower accuracy, run this command:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "fast=true"
For more information about models and this fast method, check this link.
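The options above are all plain form fields, so a script can add them only when needed. The Python sketch below builds the curl argument list for a given combination. The helper name `build_curl_command` is hypothetical, and sending `types` and `fast` in the same request is an assumption, since the examples here only show them separately.

```python
import shlex


def build_curl_command(pdf_path, types=None, fast=False,
                       url="localhost:5060/text"):
    """Build the curl argument list for a text-extraction request."""
    command = ["curl", "-X", "POST", "-F", f"file=@{pdf_path}", url]
    if types:
        # Comma-separated type names, as in the `types` example.
        command += ["-F", f"types={types}"]
    if fast:
        command += ["-F", "fast=true"]
    return command


# Print a shell-ready version of the command for inspection.
print(shlex.join(build_curl_command("/PATH/TO/PDF/pdf_name.pdf", fast=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.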
To stop the server, run:
make stop