-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add page count back into the PDF parser #130
Comments
My vote would be on the following:
|
So if I understand correctly:
What's the underlying idea behind this? To one day swap out the pdf-reader gem with something custom? |
Not quite. The idea is to make a |
Makes sense. It does feel a bit like an in-between solution for:
Would you like met to pick up the |
Yes, please do 👍 Maybe also worthy to try to do the necessary overrides in pdf-reader first. |
@julik What would be a good approach to do these overrides, in your opinion? Fork it to my github account or fork it to wetransfer's account and make the changes? Or just include the PDF reader gem and override the modules and classes within the gem itself? |
nvm;
I went with this approach. You can follow the progress here: https://github.com/WeTransfer/format_parser_pdf . I'm following the pdf-reader's code to see where the |
The old (byte-by-byte)
page_count
implemenation has been removed from this gem and rightfully so. There are however a few alternatives to consider on how to move forward and reimplement this method.I did a very simple
Benchmark.ips
test* with 36 random PDF files, with 4 different libs/parsers and see which performed better, including the PDF parser as it is on the pdf-parser-rework branch.Results:
In bulk (iterations = 100, so 100x36 = 3600 PDF's parsed per binary/lib)
All-in-all:
I'm a bit amazed by the fact that the PDFReader - although it's size as a dependency - is still almost 2 times as fast as the 'custom' PDFReader.
pdf_info
andimagemagick
both tried to 'automatically repair' the Xref tables for broken PDF's which took an insane amount of time forimagemagick
, less so forpdfinfo
but it still bubbles up.How to move forward?
There are a few options to consider:
page_count
I'm not sure which route is the best route to take in this case, but my preference now leans towards using the PDFReader. I don't know how you feel about it @julik, as you put most work in the custom PDF parser.
*Code I used for this benchmark: https://gist.github.com/grdw/dae430ffb3efe291c0be0c423a5b3597
The text was updated successfully, but these errors were encountered: