-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Scan time regression in 16.4.3 with --redo-ocr
#1380
Comments
To provide some context, this issue, along with #1378, resulted in a poor initial user experience for me. When working with a 1200-page file, the scan phase alone displayed an estimated time of 35 minutes, but it ended up taking over an hour to complete, when in reality it should have taken only a few seconds. The issue is that pdfminer discards its buffer and then refills it every time @jbarlow83 how do you think this issue should be approached? One option is to wait for pdfminer/pdfminer.six#1030 to land upstream and then revert d35d008. Another one is to lower |
The typical input PDF to ocrmypdf has a lot of images, but not many parseable content streams, which is the only time BUFSIZ comes into play anyway, so this isn't a typical use case. My intention is to wait for the pdfminer.six patch to be released in pdfminer.six. I advise installing ocrmypdf in a virtual environment and using pip to install a patched pdfminer.six and/or ocrmypdf in the meantime. I'm not inclined to lower BUFSIZ, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex ways) than poor performance. I respect pdfminer.six but in the long term I will probably replace that library with something more robust. |
I've been curious about alternatives as well... I gather that pikepdf is not able to replace pdfminer.six since you would have done that already? :) |
I really like the pure-Python-ness of pdfminer.six but it has a lot of quirks, some of which have to be worked around in less-than-robust ways. For its main use case most of its API is superfluous and could be redone in a much more intuitive/Pythonic style... |
Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case? If so I may just rewrite pdfminer/pdfminer.six#1030 to do that... |
Hmm, that's an interesting thought. However, I don't support loading the entire stream into memory for two main reasons:
|
Ah, I understand. Yes, basically the code in pdfminer.six is just very ill-suited to a very large BUFSIZ. My other, less modest, proposal, is to rewrite the I searched for an existing lexer library for Python that will do this properly but can't seem to find one... |
@dhdaines In terms of pikepdf and pdfminer.six -- lately I'm experimenting with rust PDF libraries and in particular pdfium-render seems far beyond what anything in Python is capable in terms of text rendering and layout capabilities. I've had a goal of reducing the number of external binaries ocrmypdf depends on, since that makes installation difficult for less technical users, and generally speaking it looks like the rust ecosystem now has most of what I need. I could pretty much eliminate all mandatory dependencies (tesseract, ghostscript) and some of the optional ones (pngquant) while still allowing optional access to those. So time permitting that is probably the future. qpdf doesn't have code to do what pdfminer.six does, so if I were to add it to pikepdf I'd mostly be copying from pdfminer.six. I'm grateful for pdfminer.six but it is "mature" and dated in many ways. I agree with the buffered reader approach -- separating buffering concerns from lexing and parsing is probably the right appraoch. pdfminer.six's lexer/parser is complex. After examining it a while, I ended up deciding to make BUFSIZ large rather than actually take on the burden of trying to fix it, so I appreciate your effort in improving it. |
Interesting - from what I can see it's exactly equivalent to pypdfium2, i.e. a runtime binding to pdfium. This is certainly good for rendering and also adding objects and annotations (I'm using it for that...). When I checked (in a discussion regaring pdfplumber's use of pdfminer.six) it didn't seem to give low-level access to object and layout information, though. This is what I mean by the "actual use case for pdfminer.six" that could be refactored out. I think the rendering quality is somewhat better with poppler + cairo, though I think there aren't good Python bindings for them and the API is much more limited.
I would say needlessly complex, even... the PostScript subset used in content streams is maybe the only simple and elegant part of the PDF specification :) |
Sorry for spamming this issue with tangential discussions (I'm never sure where to take them though...) but one more question here - OCRmyPDF and pdfplumber seem to be the main users of pdfminer.six at this point, and it appears that they are both using essentially the same part of it, i.e. I think it's useful to have this functionality available in pure Python where it isn't hidden behind extra dependencies and abstractions. On the PDF generation side there is https://github.com/CourtBouillon/pydyf which fills a similar role. So (another question) would a library that does just those parts of pdfminer.six, with various mothballs removed, be useful to you? I'll discuss this with the pdfplumber and pdfminer.six maintainers... |
pdfminer.six has a lot of users: I think it would be doable and an improvement to refactor pdfminer.six to have a backend, and then make that backend either pikepdf or pypdf, so that it retains the higher level features. I think that would be useful work that would remov duplication from pdfminer.six and likely eliminate PDF access bugs like the sort we're seeing. pypdf is under active maintenance again and would be the way to go for a pure Python solution. It's come a long way recently, but I'm still skeptical it's as robust as qpdf (and thus pikepdf). qpdf does so much to quietly fix all kinds of malformed PDFs from obscure PDF generators, or just to support more obscure PDF structures. Mainly this backend separation would amount to creating a class that reimplements pdftypes.py in pdfminer.six to obtain and convert objects differently. Might also be worth checking if borb (AGPL-3, pure Python) has any interesting ideas in this area, but I think they focus a lot more on PDF writing. |
Back to the original issue here. I did go and reimplement the "parser" (lexer) without buffering, and what I found was:
But, more interestingly, it turns out that the vast majority of "parsing" (lexing) in Short of using mmap to access the entire file, which has some portability issues, or decompressing streams on the fly (probably difficult to implement), the optimal solution is to remove the buffering, and have an alternate lexer for in-memory data that just uses a big regex. |
Interested to see how the rewritten parser performs: pdfminer/pdfminer.six#1041 |
After 6h of processing on one file and noticing what seemed like an exponential decrease in speed near the 700-page mark (in a 1300-page doc, 70 MB), I found this ticket. |
Fixed in f77f701 |
I don't think this got fixed! Even after f77f701 the test file takes >3hr to scan and, when reverting the regressed commit, it takes <2min. But at least it is not quadratic in time and one can now rely on the progress bar estimates :) @jbarlow83 wouldn't it be better to have an open ticket here to track the upstream fix and update accordingly when it lands? |
While playing with #1378, I noticed that the initial scanning time with
--redo-ocr
regressed in v16.4.3 in this commit. The newest version takes ~15x longer to scan the same file even when #1378 is fixed locally.I can't upload my local file, but the regression is noticeable with any file with size > 100MB and > 1000 pages. For reference
sample-10k.pdf
showed 10x regression. Starting from sample-100.pdf, it can be prepared with the following commands:The text was updated successfully, but these errors were encountered: