Crash on non-ASCII input. #1032

Open
vk2diy opened this issue Aug 4, 2024 · 4 comments

vk2diy commented Aug 4, 2024

Description

Crash on non-ASCII input: UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Steps to reproduce the bug

To make it easier, this will download mc3362.pdf.

  1. wget https://github.com/user-attachments/files/16489263/mc3362.pdf && pdf2txt.py mc3362.pdf

Error produced

Traceback (most recent call last):
  File "pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
                                        ^^^^^^^^^^^^^^
  File "pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
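The failure in the last frame can be reproduced in isolation. This is a minimal sketch, assuming the keyword bytes begin with 0x85 as the error message states; `keyword` is a hypothetical stand-in for whatever `keyword_name(obj)` returned, not pdfminer's actual object.

```python
# The byte 0x85 lies outside the 7-bit ASCII range (0x00-0x7F),
# so a strict ASCII decode cannot succeed.
keyword = b"\x85"  # hypothetical stand-in for the bytes from keyword_name(obj)

try:
    keyword.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    # Same exception type and message as in the traceback above.
    message = str(exc)
    print(message)
```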

dhdaines commented Aug 6, 2024

What version of pdfminer.six are you using? I can't reproduce this with either Python 3.11 or 3.12 and pdfminer.six v20240706.

vk2diy commented Aug 6, 2024

Looks old.

./lib/python3.12/site-packages/pdfminer-20191125.dist-info

I'm unsure why it would be old; I used pip to install it. I'm not really a Python person.
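One way to see which pdfminer packages an interpreter actually picks up is to ask the standard library's `importlib.metadata`. The following is only a sketch: the helper name `find_pdfminer_distributions` is made up for illustration, and it simply matches any installed distribution whose name starts with "pdfminer".

```python
from importlib import metadata


def find_pdfminer_distributions():
    """Return sorted (name, version) pairs for installed distributions
    whose name starts with 'pdfminer' (hypothetical helper)."""
    found = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name.lower().startswith("pdfminer"):
            found.add((name, str(version)))
    return sorted(found)


installed = find_pdfminer_distributions()
for name, version in installed:
    print(f"{name}=={version}")
```

If both `pdfminer` and `pdfminer.six` show up, or the version is an old one such as 20191125, that would be consistent with a stale install shadowing the current release.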

pietermarsman commented

Closing since @dhdaines can't reproduce. You can probably fix this by removing all versions of pdfminer and pdfminer.six and then installing the latest version from pip.
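The suggested cleanup could look roughly like this. It is a sketch under assumptions: `reinstall_commands` is a hypothetical helper that only builds the pip command lines for the current interpreter and prints them, without running anything.

```python
import shlex
import sys


def reinstall_commands(python=sys.executable):
    """Build (but do not run) pip commands for a clean reinstall.

    Hypothetical helper: removes both package names, then installs
    the latest pdfminer.six for the given interpreter.
    """
    pip = [python, "-m", "pip"]
    return [
        pip + ["uninstall", "-y", "pdfminer", "pdfminer.six"],
        pip + ["install", "--upgrade", "pdfminer.six"],
    ]


commands = reinstall_commands()
for argv in commands:
    # Print each command in a shell-safe form for review.
    print(shlex.join(argv))
```

To actually apply the commands, each list can be passed to `subprocess.run(argv, check=True)`. Using `python -m pip` rather than a bare `pip` keeps the change tied to the interpreter that runs `pdf2txt.py`.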

vk2diy commented Nov 28, 2024

I can confirm the error is gone with a fresh install in a new Python venv. It was probably a bug in the old version.

For reference, here is the output of pip freeze:

cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
pdfminer.six==20240706
pillow==11.0.0
pycparser==2.22

However, I now get no output at all instead of the expected text. Could you tell me whether you can get any text output from the file specified?

I also tried various command-line options, such as pdf2txt.py -A -n -t text mc3362.pdf, with the same result.
