
Runlength decoding allocates too much memory and is slow. #1054

Open
helpmefindaname opened this issue Oct 23, 2024 · 0 comments · May be fixed by #1055

@helpmefindaname

Bug report

Loading a PDF that contains a large run-length-encoded image takes a long time, because the decoding algorithm allocates a new bytes object at every step.

For example, assume the bytes of a completely white RGB image with dimensions (3000, 4000). We can easily construct the run-length encoding with the following code.

    from pdfminer.runlength import rldecode

    # Each (129, 255) pair decodes to 128 bytes of 0xFF, so this encodes
    # a fully white 3000x4000 RGB image (3 * 3000 * 4000 output bytes).
    large_white_image_encoded = bytes([129, 255] * (3 * 3000 * 4000 // 128))
    data = rldecode(large_white_image_encoded)

As the current implementation stores the result as an immutable `bytes` object and concatenates onto it, it allocates a new object at every step, making the decoder quadratic in the output size. A simple fix is to accumulate the decoded chunks in a list (or a mutable `bytearray`) and convert them to `bytes` once at the end.
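To illustrate the fix, here is a minimal sketch of a RunLength decoder that appends to a mutable `bytearray` instead of rebuilding an immutable `bytes` object on every iteration. The name `rldecode_fast` is hypothetical and not part of pdfminer's API; the decoding rules follow the PDF specification's RunLengthDecode filter (length byte 0-127: copy the next length+1 bytes literally; 129-255: repeat the next byte 257-length times; 128: end of data).

```python
def rldecode_fast(data: bytes) -> bytes:
    """Decode RunLengthDecode data, accumulating into a bytearray.

    Appending to a bytearray is amortized O(1), so the whole decode is
    linear in the output size instead of quadratic.
    """
    decoded = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:  # EOD marker: stop decoding
            break
        if length < 128:
            # Copy the next length+1 bytes literally.
            decoded += data[i + 1 : i + 2 + length]
            i += length + 2
        else:
            # Repeat the single following byte 257-length times.
            decoded += bytes([data[i + 1]]) * (257 - length)
            i += 2
    return bytes(decoded)
```

The same asymptotic improvement holds if the chunks are collected in a list and joined with `b"".join(...)` at the end; the key point is avoiding repeated concatenation onto an immutable object.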

On a real PDF - which I'm not allowed to share - I could reduce a processing step from 17 minutes down to 14 seconds just by optimizing that function. I will create a PR with the speed-up.

@helpmefindaname helpmefindaname linked a pull request Oct 23, 2024 that will close this issue