Type Error during extracting pages in some pdfs #720

psrubing · 2022-02-22T14:54:38Z

Hello,

I've encountered a bug while extracting pages with the extract_pages() function from the pdfminer.high_level module. It only happens with some PDFs.
[Image: screenshot of the error]

The PDF below triggers this bug:
pdf_bug.pdf

Environment:
Python - 3.7.11
pdfminer.six - 20201018

pietermarsman · 2022-02-22T20:16:32Z

Can replicate:

$ PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/pdf_bug.pdf 
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 920, in char_width
    return cast(Dict[int, float], self.widths)[cid] * self.hscale
KeyError: 67

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 924, in char_width
    return str_widths[self.to_unichr(cid)] * self.hscale
KeyError: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 313, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 307, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 121, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfdevice.py", line 173, in render_string_horizontal
    x += self.render_char(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/converter.py", line 206, in render_char
    textwidth = font.char_width(cid)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 926, in char_width
    return self.default_width * self.hscale
TypeError: unsupported operand type(s) for *: 'PDFObjRef' and 'float'

pietermarsman · 2022-02-22T20:17:56Z

Probably the solution is to call resolve1 when getting the default width.

psrubing · 2022-02-24T10:22:16Z

Hi, thanks for the response, but I don't understand your comment. As far as I can tell from the documentation, extract_pages() doesn't take any parameter related to resolve1.

Or did I miss something?
Best regards

pietermarsman · 2022-02-26T12:46:28Z

I mean, to fix this issue we have to make a change to pdfminer.six, using resolve1(). This is a bug in the current code.
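
A minimal sketch of the failure in the traceback above and of the kind of fix being described. So that it runs without any particular pdfminer.six version installed, it uses hypothetical stand-ins (FakeObjRef, resolve, FontSketch) that only imitate how an indirect reference wraps a value; in pdfminer.six the corresponding pieces would be PDFObjRef and resolve1 from pdfminer.pdftypes. It is an illustration under those assumptions, not the library's actual code or the eventual patch.

```python
from typing import Any, Dict


class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def resolve(self) -> Any:
        return self.target


def resolve(value: Any) -> Any:
    """Follow references until a plain value remains (imitates resolve1)."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


class FontSketch:
    """Hypothetical font holding only what the width lookup needs."""

    def __init__(self, widths: Dict[int, float], default_width: Any,
                 hscale: float = 0.001) -> None:
        self.widths = widths
        self.default_width = default_width
        self.hscale = hscale

    def char_width_buggy(self, cid: int) -> float:
        try:
            return self.widths[cid] * self.hscale
        except KeyError:
            # Raises TypeError if default_width is still a reference.
            return self.default_width * self.hscale

    def char_width_fixed(self, cid: int) -> float:
        try:
            return self.widths[cid] * self.hscale
        except KeyError:
            # Resolve the reference before doing arithmetic with it.
            return resolve(self.default_width) * self.hscale


# The default width is stored as a reference, as in the failing PDF.
font = FontSketch(widths={65: 600.0}, default_width=FakeObjRef(500))

try:
    font.char_width_buggy(67)
    buggy_error = None
except TypeError as exc:
    buggy_error = exc

fixed_width = font.char_width_fixed(67)
```

The design point is that the value is resolved at the moment it is used, so a font whose default width is a plain number and a font whose default width is an indirect reference both take the same code path.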

psrubing · 2022-02-28T08:25:08Z

Okay, I understand now :) Do you know roughly when a release with this fix will be available?

pietermarsman · 2022-03-13T20:01:12Z

Nobody is working on it as far as I know.

Do you have time to work on this?

psrubing · 2022-03-14T09:49:24Z

Unfortunately I don't :/ I have to work on other projects, but if something changes I will post an update and could look at this bug.

gosiafilipek · 2022-06-28T09:56:44Z

I found another file that generates a similar error.
I have already found a solution, so I will create a pull request.

The PDF below triggers this bug:
pdf_bug2.pdf

self.attrs['MediaBox'] contains values of type PDFObjRef instead of int. I applied resolve1 to every value in self.attrs['MediaBox'] to eliminate the problem.
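
A minimal sketch of resolving every entry of a box such as MediaBox, as described above. FakeObjRef, resolve and resolve_box are hypothetical stand-ins so that the snippet runs on its own; with pdfminer.six the equivalent would presumably be resolve1 from pdfminer.pdftypes applied to each entry. This is not the code from the pull request.

```python
from typing import Any, List


class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def resolve(self) -> Any:
        return self.target


def resolve(value: Any) -> Any:
    """Follow references until a plain value remains (imitates resolve1)."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def resolve_box(box: List[Any]) -> List[Any]:
    """Return a copy of the box with every entry resolved to a plain value."""
    return [resolve(entry) for entry in box]


# Some entries are plain numbers, others are (possibly nested) references.
attrs = {"MediaBox": [0, 0, FakeObjRef(612), FakeObjRef(FakeObjRef(792))]}
attrs["MediaBox"] = resolve_box(attrs["MediaBox"])
```

Entries that are already plain numbers pass through unchanged, so the same call is safe for PDFs that do not use indirect references in the box.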

datatalking · 2022-07-20T00:17:39Z

@pietermarsman @psrubing This was an issue for me in a past project, and they ended up using an OCR solution. I was going to offer to debug this, but I noticed that @gosiafilipek has created a PR with a fix?

If those are done, I have time in the next 2 months to contribute. I didn't see a 'good first issue' label, so I looked back and found these issues that I could start with. Does anyone have requests or recommendations on where I should start?

#470
#154
#499
#497

pietermarsman · 2022-08-08T20:18:57Z

Hi @datatalking,

Thanks for reaching out, and for wanting to help! You can get in touch on gitter.im, in the private or group chat, and we can sync about what to work on.

I'll try and see if I can create a good-first-issue label.

pietermarsman · 2022-08-18T18:42:38Z

Fixed by #772

pietermarsman added component:converter type: bug labels Mar 13, 2022

gosiafilipek added a commit to gosiafilipek/pdfminer.six that referenced this issue Jun 22, 2022

Issue pdfminer#720

5b2ab61

resolve1 when getting the default width.

gosiafilipek mentioned this issue Jun 22, 2022

Issue #720 #772

Merged

pietermarsman added a commit that referenced this issue Jun 25, 2022

Fix TypeError when getting default width of font (#772)

1044fc0

* Issue #720 resolve1 when getting the default width.
* Add CHANGELOG.md

Co-authored-by: Pieter Marsman

gosiafilipek mentioned this issue Jun 28, 2022

Issue #720 PDFObjRef #778

Closed

pietermarsman added the status: accepted label Aug 7, 2022

pietermarsman added the component: converter label (Related to any PDFLayoutAnalyzer) and removed the component:converter label Aug 8, 2022

pietermarsman closed this as completed Aug 18, 2022
