Use PLAYA instead of pdfminer #1226
base: develop
Fascinating! Thank you for sharing. An idle thought: What if
This wouldn't be terribly hard to do - it would be a useful exercise as some of the representations used by The goal of PLAYA is just to be a Pythonic and lazy wrapper around the internals of PDF, obviously (I will probably change the recursive acronym to PLAYA is a LAzY Analyzer for PDF 🤣)
I may wish to promote this to a real PR shortly (awaiting a release of PLAYA that will fix a couple of important bugs). PLAYA is much more robust to broken PDFs than pdfminer.six, supports color spaces and patterns more correctly, and is also significantly faster. For a 486-page PDF document, running With PLAYA it takes 1:16 minutes ... a 28% speedup!
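For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the benchmark used above: `time_call` and `percent_faster` are hypothetical helper names, the workload is a stand-in rather than a real PDF extraction, and it assumes "speedup" means the reduction in wall-clock time relative to the baseline.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time (in seconds) over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def percent_faster(baseline_s, candidate_s):
    """Percentage reduction in run time of the candidate relative to the baseline."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def stand_in_workload(n):
    # Placeholder for "extract everything from a PDF"; swap in the real call.
    return sum(range(n))


if __name__ == "__main__":
    elapsed = time_call(stand_in_workload, 100_000)
    print(f"best of 3: {elapsed:.6f} s")
```

To compare two backends, time the same extraction call once per backend and pass both results to `percent_faster`.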
Really neat to see you developing this so rapidly, and great to hear about that speedup.
Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately... these ones are fixed in PLAYA: pdfminer/pdfminer.six#1065 This one isn't yet (and it's kind of nasty since it causes text extraction to simply fail silently on some files):
So... I went ahead and rewrote large parts of pdfminer.six, because I kept having nightmares about being back in Software Engineering 101 every time I looked at its code. The result is PLAYA, which does less stuff than pdfminer.six but I believe does it somewhat better (and about 20% faster).
This PR uses it, and also as a consequence fixes a few longstanding issues due to pdfminer's quirks. Some of these quirks have not been fixed yet (e.g. the placement of things relative to the MediaBox, lack of actual support for pattern color spaces) but should be soon.
On the downside, LAParams no longer exists and thus cannot be used. What it actually did was mostly just change the ordering of items in the page, and do some heuristic detection of whitespace in text, replicating things that pdfplumber was already doing. (In general this is true of all the "layout analysis" pdfminer did.)
I have tried to keep the API reasonable and compact so that it could ultimately be reimplemented on some other PDF parser. Note however that the API is subject to change - this PR is using the "eager" API which is kind of custom made for pdfplumber and also retains some pdfplumber quirks, and thus might not stick around.
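To make the "ordering plus whitespace heuristic" point concrete, here is a rough sketch of that kind of logic. It is neither pdfminer's nor pdfplumber's actual code: `join_chars` is a hypothetical helper, the characters are plain dicts with `text`, `x0` and `x1` keys (an assumption modeled on pdfplumber-style char dicts), and the tolerance value is arbitrary.

```python
def join_chars(chars, x_tolerance=3.0):
    """Join the characters of one text line, guessing where spaces belong.

    Characters are first ordered left to right, then a space is inserted
    wherever the horizontal gap to the previous character exceeds x_tolerance.
    """
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev_x1 = None
    for ch in ordered:
        if prev_x1 is not None and ch["x0"] - prev_x1 > x_tolerance:
            pieces.append(" ")
        pieces.append(ch["text"])
        prev_x1 = ch["x1"]
    return "".join(pieces)


# Characters deliberately listed out of order, as they may be in a content stream.
line = [
    {"text": "y", "x0": 25.0, "x1": 31.0},
    {"text": "H", "x0": 10.0, "x1": 16.0},
    {"text": "o", "x0": 31.2, "x1": 37.0},
    {"text": "i", "x0": 16.5, "x1": 19.0},
]

if __name__ == "__main__":
    print(join_chars(line))
```

Since the consumer can do this itself from character positions, losing the parser-side layout parameters costs little.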
Do not merge this, for obvious reasons! It's here in case you or anyone somehow feel the desire to play with it.