A library for extracting plain text from documents (files) for further processing (indexing and searching).
Install the required tools before use:
sudo apt-get install zlib1g zlib1g-dev zip rar p7zip-full
Add this line to your application's Gemfile:
gem 'act_as_page_extractor'
And then execute:
bundle
For example, for the Document model in a Rails application, run:
rails g act_as_page_extractor:migration Document category_id user_id
As a result we get two migration files:
class AddPageExtractorFields < ActiveRecord::Migration
  def change
    add_column :documents, :page_extraction_state, :string, default: ''
    add_column :documents, :page_extraction_pages, :integer, default: 0
    add_column :documents, :page_extraction_doctype, :string, default: ''
    add_column :documents, :page_extraction_filesize, :string, default: ''
  end
end

class CreateExtractedPages < ActiveRecord::Migration
  def change
    create_table :extracted_pages do |t|
      t.text :page
      t.integer :document_id
      t.integer :category_id
      t.integer :user_id
      t.integer :page_number
      t.timestamps null: false
    end
    add_index :extracted_pages, :document_id
    add_index :extracted_pages, :category_id
    add_index :extracted_pages, [:document_id, :category_id]
    add_index :extracted_pages, [:document_id, :page_number]
  end
end
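Because extracted_pages stores one row per page, a search can report which pages of which document match. The following is only a minimal sketch in plain Ruby: PageRow and pages_matching are hypothetical names that stand in for the ExtractedPage model and are not part of the gem. A real application would more likely query the database or a search engine.

```ruby
# Hypothetical stand-in for rows of the extracted_pages table.
PageRow = Struct.new(:document_id, :page_number, :page, keyword_init: true)

# Returns a hash of document_id => sorted page numbers for every page
# whose text contains the term (case-insensitive substring match).
def pages_matching(rows, term)
  needle = term.downcase
  rows.select { |row| row.page.to_s.downcase.include?(needle) }
      .group_by(&:document_id)
      .transform_values { |hits| hits.map(&:page_number).sort }
end

rows = [
  PageRow.new(document_id: 1, page_number: 1, page: 'Annual report: revenue grew.'),
  PageRow.new(document_id: 1, page_number: 2, page: 'Costs and Revenue by region.'),
  PageRow.new(document_id: 2, page_number: 1, page: 'Meeting notes.')
]

# Document 1 matches on pages 1 and 2; document 2 has no match.
result = pages_matching(rows, 'revenue')
puts result.keys.inspect
```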
The Document model must have a field that contains the path to the file (archives of various types containing txt, pdf, doc/docx, html, rtf, ... files are supported).
Add the following initialization parameters to the model:
class Document < ActiveRecord::Base
  include ActAsPageExtractor

  act_as_page_extractor options: {
    document_class: 'Document',
    save_as_pdf: true,
    filename: :filename,
    document_id: :document_id,
    additional_fields: [:category_id, :user_id],
    #file_storage: "/full/path/to/documents/storage",
    #pdf_storage: "/full/path/to/extracted/pdf/storage"
  }

  has_many :extracted_pages, dependent: :destroy
end
Now each instance has a few new methods:
document = Document.first
document.page_extract!
document.extracted_pages
document.pdf_path # if option 'save_as_pdf' is 'true'
# Access the pages
ExtractedPage.count
# Import a whole directory of documents
ActAsPageExtractor.import_files('/path/to/folder/with/documents')
# We can use cron to run the processing of all new documents
ActAsPageExtractor.start_extraction
# Get statistics for all documents
ActAsPageExtractor.statistics
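When start_extraction is triggered from cron, a new run may start while the previous one is still working. One way to avoid overlapping runs is a lock file. This is a minimal sketch, not something the gem provides: with_exclusive_lock and the lock path are hypothetical, and only the standard File#flock call is used.

```ruby
# Hypothetical helper: run the block only if no other process holds the lock.
# Returns the block's value, or nil when another run is still in progress.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |file|
    # LOCK_NB makes flock return false immediately instead of waiting.
    return nil unless file.flock(File::LOCK_EX | File::LOCK_NB)

    yield
  end
end

# Possible use in a script that cron starts (for example through `rails runner`):
# with_exclusive_lock('/tmp/page_extraction.lock') do
#   ActAsPageExtractor.start_extraction
# end
```

The lock is released when the file is closed, so a crashed run does not block later ones.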
Initialization parameters of act_as_page_extractor:
- document_class - name of the model (e.g. Document)
- save_as_pdf - boolean (true or false); set to true to keep the temporary PDF
- filename - name of the field that gives access to the file; it should be an object with a 'url' method that returns the path to the file (e.g. a CarrierWave object with 'filename.url')
- document_id - name of the field in which the document id is saved
- additional_fields - additional fields that are added to each extracted page (e.g. for indexing)
- file_storage - path for saving temporary files (default: "public")
- pdf_storage - path for saving PDF files (default: "public/uploads/extracted/pdf")
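The filename parameter relies on duck typing: the field only has to return an object that responds to 'url'. The sketch below illustrates that contract with plain Ruby objects. StoredFile, FakeDocument and extractable? are hypothetical names used for illustration and do not exist in the gem.

```ruby
# Hypothetical stand-in for an uploader object (such as a CarrierWave uploader):
# all that matters here is that it responds to `url`.
StoredFile = Struct.new(:url)

# Hypothetical stand-in for a model that has a :filename field.
FakeDocument = Struct.new(:filename)

# Hypothetical check, not part of the gem: does the field named by the
# :filename option expose a non-empty `url`?
def extractable?(record, field = :filename)
  file = record.public_send(field)
  file.respond_to?(:url) && !file.url.to_s.empty?
end

with_file    = FakeDocument.new(StoredFile.new('/uploads/report.pdf'))
without_file = FakeDocument.new(nil)

puts extractable?(with_file)    # true
puts extractable?(without_file) # false
```

A bare String path would fail this check, since String has no 'url' method; it would need to be wrapped in an object that does.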
Run the tests:
rspec
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
https://github.com/phlowerteam [email protected]
Copyright (c) 2024 PhlowerTeam MIT License