huridocs/pdf-text-extraction

PDF Text Extraction

A Docker-powered service for extracting text from PDF documents


This project extracts text from PDF files using the output of the pdf-document-layout-analysis service. It relies on that service's segmentation and classification of page content to automate text extraction.

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get all the text from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text

To stop the server:

make stop

Dependencies

Requirements

  • 4 GB of RAM
  • 6 GB of GPU memory (optional; without a GPU, the service runs on the CPU)
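
The choice between the two start targets from the Quick Start can be scripted. The following is a minimal sketch, assuming an NVIDIA GPU whose memory can be queried with `nvidia-smi`. The helper names and the reading of "6 GB" as 6144 MiB are assumptions made for illustration and are not part of this project.

```python
import shutil
import subprocess

REQUIRED_GPU_MIB = 6 * 1024  # assumption: "6 GB" is read as 6144 MiB


def parse_gpu_memory(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (in MiB) per non-empty output line."""
    values = []
    for line in nvidia_smi_output.splitlines():
        try:
            values.append(int(line.strip()))
        except ValueError:
            continue  # skip blank lines and values such as "[N/A]"
    return values


def pick_make_target(gpu_memory_mib: list[int], required_mib: int = REQUIRED_GPU_MIB) -> str:
    """Return the make target that matches the detected hardware."""
    if any(mib >= required_mib for mib in gpu_memory_mib):
        return "start"
    return "start_no_gpu"


def detect_gpu_memory() -> list[int]:
    """Query total memory per GPU; return [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Prints the suggested command, for example "make start_no_gpu" on a CPU-only host.
    print("make", pick_make_target(detect_gpu_memory()))
```

The script only suggests a target. Starting the service is still done with the make commands shown in the Quick Start.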

Usage

As mentioned in the Quick Start, you can extract all the text from a PDF like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text
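
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the service is running on `localhost:5060` as in the curl example. The helper names `encode_multipart` and `extract_text` are hypothetical, and because the response format is not described here, the body is returned as raw text.

```python
import os
import urllib.request
import uuid

SERVICE_URL = "http://localhost:5060/text"  # endpoint taken from the curl example


def encode_multipart(fields: dict[str, str], filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body with text fields and one 'file' part.

    Simplification: field names and the filename are not escaped.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path: str, url: str = SERVICE_URL) -> str:
    """POST the PDF to the service and return the response body as text."""
    with open(pdf_path, "rb") as handle:
        body, content_type = encode_multipart({}, os.path.basename(pdf_path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, `print(extract_text("/PATH/TO/PDF/pdf_name.pdf"))` would print whatever the endpoint returns.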

You can also restrict the extraction to specific types of text:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "types=text, title, section header, list item"

These are the types you can pass:

   "Caption"
   "Footnote"
   "Formula"
   "List item"
   "Page footer"
   "Page header"
   "Picture"
   "Section header"
   "Table"
   "Text"
   "Title"

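When the list of types is assembled in code, checking it before sending the request catches typos early. The following is a small sketch. The helper name `build_types_field` is hypothetical, and lowercasing the names simply mirrors the curl example above; whether the service accepts other capitalization is not stated here.

```python
VALID_TYPES = (
    "Caption",
    "Footnote",
    "Formula",
    "List item",
    "Page footer",
    "Page header",
    "Picture",
    "Section header",
    "Table",
    "Text",
    "Title",
)


def build_types_field(requested: list[str]) -> str:
    """Validate type names and return the value for the `types` form field."""
    known = {name.lower() for name in VALID_TYPES}
    normalized = [name.strip().lower() for name in requested]
    unknown = [name for name in normalized if name not in known]
    if unknown:
        raise ValueError(f"Unknown types: {', '.join(unknown)}")
    # Same comma-and-space separator as in the curl example.
    return ", ".join(normalized)


if __name__ == "__main__":
    print(build_types_field(["Text", "Title", "Section header", "List item"]))
```

The returned string can be passed to curl as `-F "types=..."`.
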
To get results faster, at the cost of slightly less accurate output, add the fast option:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/text -F "fast=true"
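
To see how much time the fast option saves on a given document, the two requests can be timed side by side. The following is a rough sketch that wraps the curl commands from this section with `subprocess`. The helper names are hypothetical, and it assumes `curl` is installed and the service is running.

```python
import subprocess
import time


def build_curl_command(pdf_path: str, fast: bool = False, url: str = "localhost:5060/text") -> list[str]:
    """Return the curl argument list for a default or fast extraction request."""
    command = ["curl", "-X", "POST", "-F", f"file=@{pdf_path}", url]
    if fast:
        command += ["-F", "fast=true"]
    return command


def timed_request(pdf_path: str, fast: bool = False) -> tuple[str, float]:
    """Run the request and return (response body, elapsed seconds)."""
    command = build_curl_command(pdf_path, fast=fast)
    started = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout, time.perf_counter() - started


def compare_modes(pdf_path: str) -> None:
    """Print the elapsed time of the default and the fast request."""
    for fast in (False, True):
        _, seconds = timed_request(pdf_path, fast=fast)
        label = "fast" if fast else "default"
        print(f"{label}: {seconds:.1f} s")
```

Calling `compare_modes("/PATH/TO/PDF/pdf_name.pdf")` prints one timing per mode. A single run is only a rough indication, since timings vary with the document and the hardware.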

For more information about models and this fast method, check this link.

To stop the server, run:

make stop
