
Runlength decoding allocates too much memory and is slow. #1054

Open
helpmefindaname opened this issue Oct 23, 2024 · 0 comments · May be fixed by #1055

@helpmefindaname

Bug report

Loading a PDF that contains a large run-length-encoded image takes a long time, because the decoding algorithm allocates a new bytes object at every step.

For example, assume the bytes of a completely white RGB image with dimensions (3000, 4000). We can easily construct the run-length encoding with the following code.

    from pdfminer.runlength import rldecode

    # Each (129, 255) pair decodes to 128 bytes of 0xFF, so this encodes
    # a fully white 3000x4000 RGB image (3 * 3000 * 4000 output bytes).
    large_white_image_encoded = bytes([129, 255] * (3 * 3000 * 4000 // 128))
    data = rldecode(large_white_image_encoded)

As the current implementation stores the result as an immutable `bytes` object and concatenates onto it, it allocates a new object at every step, making the decoder quadratic in the output size. A simple fix is to accumulate the decoded chunks in a list (or a mutable `bytearray`) and convert them to `bytes` once at the end.
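To illustrate the fix, here is a minimal sketch of a RunLength decoder that appends to a mutable `bytearray` instead of rebuilding an immutable `bytes` object on every iteration. The name `rldecode_fast` is hypothetical and not part of pdfminer's API; the decoding rules follow the PDF specification's RunLengthDecode filter (length byte 0-127: copy the next length+1 bytes literally; 129-255: repeat the next byte 257-length times; 128: end of data).

```python
def rldecode_fast(data: bytes) -> bytes:
    """Decode RunLengthDecode data, accumulating into a bytearray.

    Appending to a bytearray is amortized O(1), so the whole decode is
    linear in the output size instead of quadratic.
    """
    decoded = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:  # EOD marker: stop decoding
            break
        if length < 128:
            # Copy the next length+1 bytes literally.
            decoded += data[i + 1 : i + 2 + length]
            i += length + 2
        else:
            # Repeat the single following byte 257-length times.
            decoded += bytes([data[i + 1]]) * (257 - length)
            i += 2
    return bytes(decoded)
```

The same asymptotic improvement holds if the chunks are collected in a list and joined with `b"".join(...)` at the end; the key point is avoiding repeated concatenation onto an immutable object.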

On a real PDF - which I'm not allowed to share - I could reduce a processing step from 17 minutes down to 14 seconds just by optimizing that function. I will create a PR with the speed-up.

@helpmefindaname helpmefindaname linked a pull request Oct 23, 2024 that will close this issue