huridocs/pdf-text-extraction

PDF Text Extraction

A Docker-powered service for extracting text from PDF documents

This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of text extraction from PDF files.

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get all the text from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text

To stop the server:

make stop

Dependencies

Docker Desktop 4.25.0

Requirements

4 GB RAM
6 GB GPU memory (if no GPU is available, the service runs on the CPU)
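
The GPU requirement maps onto the two make targets from the Quick Start (`make start` and `make start_no_gpu`). As a minimal sketch, the choice between them could be scripted. The helper name `pick_make_target` is hypothetical and not part of this project. Treating the presence of the `nvidia-smi` command as a proxy for a usable GPU is an assumption, and it does not verify the 6 GB figure.

```shell
# Hypothetical helper (not part of this project): print the make target to use.
# Assumption: if the probe command (default: nvidia-smi) is on PATH, a GPU is
# taken to be available. This does NOT check how much GPU memory there is.
pick_make_target() {
  if command -v "${1:-nvidia-smi}" >/dev/null 2>&1; then
    echo "start"          # GPU variant
  else
    echo "start_no_gpu"   # CPU-only variant
  fi
}

# Usage (commented out so that sourcing this file does not start anything):
# make "$(pick_make_target)"
```

The optional argument only exists so the probe can be swapped out when experimenting. In normal use the function is called without arguments.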

Usage

As mentioned in the Quick Start, you can extract all the text from a PDF like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text

You can also restrict the extraction to specific types of content:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "types=text, title, section header, list item"

These are the types you can pass:

   "Caption"
   "Footnote"
   "Formula"
   "List item"
   "Page footer"
   "Page header"
   "Picture"
   "Section header"
   "Table"
   "Text"
   "Title"
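
When the set of types is chosen by a script, the comma-separated value for the `types` field can be assembled from individual names. The following is a rough sketch only. `join_types` is a hypothetical helper, and the `localhost:5060/text` endpoint in the usage comment is assumed to be the same one used by the other requests in this README.

```shell
# Hypothetical helper: join all arguments with ", " to build the value of the
# `types` form field. The body runs in a subshell, so no variables leak out.
join_types() (
  out=""
  for t in "$@"; do
    if [ -z "$out" ]; then
      out="$t"
    else
      out="$out, $t"
    fi
  done
  printf '%s\n' "$out"
)

# Usage (commented out; it needs the running service and a real PDF path):
# curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' \
#      -F "types=$(join_types "text" "title" "section header" "list item")" \
#      localhost:5060/text
```

Quoting each name keeps multi-word types such as "section header" intact as a single entry.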

To get results faster, at the cost of slightly less accurate output, run:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "fast=true"

For more information about models and this fast method, check this link.
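
To switch between the default and the fast mode from a script, the two requests can be wrapped in one function. This is a sketch under stated assumptions: `extract_text`, `CURL`, and `PDF_TEXT_URL` are hypothetical names introduced here for illustration, and the `fast` field is simply omitted in default mode because this README only documents `fast=true`.

```shell
# Hypothetical wrapper around the requests shown in this README.
#   $1: path to the PDF
#   $2: "true" to request the fast mode (anything else: default mode)
# CURL and PDF_TEXT_URL are illustrative overrides, not project settings.
extract_text() {
  if [ "${2:-false}" = "true" ]; then
    "${CURL:-curl}" -X POST -F "file=@$1" -F "fast=true" \
      "${PDF_TEXT_URL:-localhost:5060/text}"
  else
    "${CURL:-curl}" -X POST -F "file=@$1" \
      "${PDF_TEXT_URL:-localhost:5060/text}"
  fi
}

# Usage (commented out; it needs the running service and a real PDF path):
# extract_text /PATH/TO/PDF/pdf_name.pdf true > pdf_name.txt
```

Setting `CURL` to a command that only prints its arguments gives a dry run that shows the request without sending it.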

To stop the server, run:

make stop
