PSBaseParser can't handle tokens split across end of buffer #1025

jbarlow83 · 2024-07-31T09:11:08Z

If a token parsed by a PSParser subclass is split across the boundary between buffers, a keyword token will be incorrectly split into two separate tokens, producing the wrong keyword and derailing all subsequent parsing. BUFSIZ is just 4096, so any data stream longer than 4096 bytes can potentially suffer from this issue.

A simple solution is to increase the buffer size to a much larger value (gigabytes); in practice the impact on performance should be negligible, since most PDFs fit within available RAM anyway. Alternatively (it seems to me), all _parse* functions would need to be adjusted to handle the case where we hit the end of the buffer and the bytes in the next buffer might contain the rest of a token.
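To make the second option concrete, here is a minimal sketch of the idea. It is not pdfminer.six's actual code: `naive_tokens` and `carrying_tokens` are hypothetical names, and a "token" is simplified to any run of non-whitespace bytes. The first function tokenizes each buffer independently and reproduces the split; the second holds back a token that touches the end of the buffer until either more data or the real end of the stream arrives.

```python
import io
import re

BUFSIZ = 4096  # the value cited in this issue; everything else here is illustrative

# Simplification: a token is any run of non-whitespace bytes.
TOKEN = re.compile(rb"\S+")


def naive_tokens(stream, bufsize=BUFSIZ):
    """Tokenize each buffer on its own (reproduces the split-token bug)."""
    while True:
        buf = stream.read(bufsize)
        if not buf:
            return
        for m in TOKEN.finditer(buf):
            yield m.group()


def carrying_tokens(stream, bufsize=BUFSIZ):
    """Carry a token that touches the end of the buffer into the next read."""
    pending = b""
    while True:
        chunk = stream.read(bufsize)
        at_eof = not chunk  # end of stream, as opposed to end of buffer
        buf = pending + chunk
        pending = b""
        for m in TOKEN.finditer(buf):
            if m.end() == len(buf) and not at_eof:
                # The token may continue in the next buffer: hold it back.
                pending = m.group()
                break
            yield m.group()
        if at_eof:
            return


# Place "beginbfchar" at offset 4093 so that it straddles the 4096-byte boundary.
data = b"x" * 4089 + b" 49 beginbfchar <00c8>"
print(list(naive_tokens(io.BytesIO(data)))[1:])
# [b'49', b'beg', b'inbfchar', b'<00c8>']
print(list(carrying_tokens(io.BytesIO(data)))[1:])
# [b'49', b'beginbfchar', b'<00c8>']
```

The key point is that the end of a buffer is only treated as the end of a token when the stream itself is exhausted.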

The attached PDF demonstrates the issue when trying to parse its cmaps, some of which are longer than 4096 bytes.

Here is log output from parsing the attached PDF's fonts. Note that the token beginbfchar is incorrectly split into two tokens, beg and inbfchar, because it happens to straddle the end of the buffer. This causes all subsequent tokens to be misinterpreted. Increasing BUFSIZ mitigates the issue.

  DEBUG pdfminer.psparser - nexttoken: (4079, /b'endbfchar')
  DEBUG pdfminer.psparser - do_keyword: pos=4079, token=/b'endbfchar', stack=[(2151, b'\x00d'), (2158, b'\x00e'), (2166, b'\x00e'), (2173, b'\x00e\x00,'), (2185, b'\x00f'), (2192, b'\x00e\x00,\x00e\x00o\x00.\x00.'), (2220, b'\x00g'), (2227, b'\x00e\x00o\x00.\x00.'), (2247, b'\x00h'), (2254, b'\x00e\x00p\x00o'), (2270, b'\x00i'), (2277, b'\x00e\x00r'), (2289, b'\x00j'), (2296, b'\x00e\x00r'), (2308, b'\x00k'), (2315, b'\x00e\x00t'), (2327, b'\x00l'), (2334, b'\x00f\x00u'), (2346, b'\x00m'), (2353, b'\x00f\x00u\x00e'), (2369, b'\x00n'), (2376, b'\x00g'), (2384, b'\x00o'), (2391, b'\x00h'), (2399, b'\x00p'), (2406, b'\x00h'), (2414, b'\x00q'), (2421, b'\x00h'), (2429, b'\x00r'), (2436, b'\x00h\x00_\x00:\x00)'), (2456, b'\x00s'), (2463, b'\x00h\x00e'), (2475, b'\x00t'), (2482, b'\x00h\x00e\x00o\x00.\x00d'), (2506, b'\x00u'), (2513, b'\x00h\x00o\x00c\x00.\x00K'), (2537, b'\x00v'), (2544, b'\x00i'), (2552, b'\x00w'), (2559, b'\x00i'), (2567, b'\x00x'), (2574, b'\x00i'), (2582, b'\x00y'), (2589, b'\x00i'), (2597, b'\x00z'), (2604, b'\x00i'), (2612, b'\x00{'), (2619, b'\x00i'), (2627, b'\x00|'), (2634, b'\x00i'), (2642, b'\x00}'), (2649, b'\x00i'), (2657, b'\x00~'), (2664, b'\x00i\x00o'), (2676, b'\x00\x7f'), (2683, b'\x00i\x00o\x00.\x00.'), (2703, b'\x00\x80'), (2710, b'\x00i\x00t'), (2722, b'\x00\x81'), (2729, b'\x00l'), (2737, b'\x00\x82'), (2744, b'\x00l'), (2752, b'\x00\x83'), (2759, b'\x00l'), (2767, b'\x00\x84'), (2774, b'\x00l\x00e'), (2786, b'\x00\x85'), (2793, b'\x00m'), (2801, b'\x00\x86'), (2808, b'\x00m'), (2816, b'\x00\x87'), (2823, b'\x00m'), (2831, b'\x00\x88'), (2838, b'\x00m\x00a\x00r'), (2854, b'\x00\x89'), (2861, b'\x00m\x00e\x00t'), (2877, b'\x00\x8a'), (2884, b'\x00n'), (2892, b'\x00\x8b'), (2899, b'\x00n'), (2907, b'\x00\x8c'), (2914, b'\x00n\x00.'), (2926, b'\x00\x8d'), (2933, b'\x00n\x00_\x009'), (2949, b'\x00\x8e'), (2956, b'\x00n\x00e'), (2968, b'\x00\x8f'), (2975, b'\x00n\x00k\x00<\x00>'), (2995, b'\x00\x90'), (3002, 
b'\x00n\x00o\x00.\x00d\x00.'), (3026, b'\x00\x91'), (3033, b'\x00n\x00s\x00t'), (3049, b'\x00\x92'), (3056, b'\x00o'), (3064, b'\x00\x93'), (3071, b'\x00o'), (3079, b'\x00\x94'), (3086, b'\x00o'), (3094, b'\x00\x95'), (3101, b'\x00o'), (3109, b'\x00\x96'), (3116, b'\x00o'), (3124, b'\x00\x97'), (3131, b'\x00o'), (3139, b'\x00\x98'), (3146, b'\x00o'), (3154, b'\x00\x99'), (3161, b'\x00o\x00.\x00.'), (3177, b'\x00\x9a'), (3184, b'\x00o\x00.\x00.'), (3200, b'\x00\x9b'), (3207, b'\x00o\x00.\x00.'), (3223, b'\x00\x9c'), (3230, b'\x00o\x00.\x00d'), (3246, b'\x00\x9d'), (3253, b'\x00o\x00.\x00e'), (3269, b'\x00\x9e'), (3276, b'\x00o\x00.\x00n'), (3292, b'\x00\x9f'), (3299, b'\x00o\x00.\x00n'), (3315, b'\x00\xa0'), (3322, b'\x00o\x00.\x00n'), (3338, b'\x00\xa1'), (3345, b'\x00o\x00.\x00n\x00t'), (3365, b'\x00\xa2'), (3372, b'\x00o\x00.\x00n\x00u\x00s'), (3396, b'\x00\xa3'), (3403, b'\x00o\x00.\x00p'), (3419, b'\x00\xa4'), (3426, b'\x00o\x00.\x00r'), (3442, b'\x00\xa5'), (3449, b'\x00o\x00.\x00r'), (3465, b'\x00\xa6'), (3472, b'\x00o\x00.\x00r\x00r'), (3492, b'\x00\xa7'), (3499, b'\x00o\x00.\x00w'), (3515, b'\x00\xa8'), (3522, b'\x00o\x00J'), (3534, b'\x00\xa9'), (3541, b'\x00o\x00J'), (3553, b'\x00\xaa'), (3560, b'\x00o\x00J'), (3572, b'\x00\xab'), (3579, b'\x00o\x00d'), (3591, b'\x00\xac'), (3598, b'\x00o\x00e'), (3610, b'\x00\xad'), (3617, b'\x00o\x00l'), (3629, b'\x00\xae'), (3636, b'\x00o\x00s'), (3648, b'\x00\xaf'), (3655, b'\x00o\x00s'), (3667, b'\x00\xb0'), (3674, b'\x00o\x00t'), (3686, b'\x00\xb1'), (3693, b'\x00o\x00\xa5'), (3705, b'\x00\xb2'), (3712, b'\x00o\x00\xa5'), (3724, b'\x00\xb3'), (3731, b'\x00p'), (3739, b'\x00\xb4'), (3746, b'\x00p'), (3754, b'\x00\xb5'), (3761, b'\x00p\x00<\x00>'), (3777, b'\x00\xb6'), (3784, b'\x00p\x00<\x00>'), (3800, b'\x00\xb7'), (3807, b'\x00p\x00e'), (3819, b'\x00\xb8'), (3826, b'\x00p\x00r'), (3838, b'\x00\xb9'), (3845, b'\x00r'), (3853, b'\x00\xba'), (3860, b'\x00r'), (3868, b'\x00\xbb'), (3875, b'\x00r'), (3883, b'\x00\xbc'), 
(3890, b'\x00r'), (3898, b'\x00\xbd'), (3905, b'\x00r'), (3913, b'\x00\xbe'), (3920, b'\x00r'), (3928, b'\x00\xbf'), (3935, b'\x00r'), (3943, b'\x00\xc0'), (3950, b'\x00r'), (3958, b'\x00\xc1'), (3965, b'\x00r'), (3973, b'\x00\xc2'), (3980, b'\x00r'), (3988, b'\x00\xc3'), (3995, b'\x00r'), (4003, b'\x00\xc4'), (4010, b'\x00r'), (4018, b'\x00\xc5'), (4025, b"\x00r\x00'\x009"), (4041, b'\x00\xc6'), (4048, b'\x00r\x00.'), (4060, b'\x00\xc7'), (4067, b'\x00r\x00\\')]
  DEBUG pdfminer.psparser - nexttoken: (4090, 49)
  DEBUG pdfminer.psparser - nexttoken: (4093, /b'beg')   <--- WRONG!!!
  DEBUG pdfminer.psparser - do_keyword: pos=4093, token=/b'beg', stack=[(4090, 49)]
  DEBUG pdfminer.psparser - nexttoken: (4096, /b'inbfchar')   <--- SECOND HALF OF TOKEN
  DEBUG pdfminer.psparser - do_keyword: pos=4096, token=/b'inbfchar', stack=[(4090, 49), (4093, /b'beg')]
  DEBUG pdfminer.psparser - nexttoken: (4106, b'\x00\xc8')
  DEBUG pdfminer.psparser - nexttoken: (4113, b'\x00r\x00o\x00d')
  DEBUG pdfminer.psparser - nexttoken: (4129, b'\x00\xc9')
  DEBUG pdfminer.psparser - nexttoken: (4136, b'\x00r\x00r\x00m')
  DEBUG pdfminer.psparser - nexttoken: (4152, b'\x00\xca')
  DEBUG pdfminer.psparser - nexttoken: (4159, b'\x00r\x00r\x00m\x00e')
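The positions in the log line up with the buffer boundary. As a quick check, assuming the keyword starts at offset 4093 (the position logged for the first fragment) and the buffer ends at BUFSIZ = 4096:

```python
BUFSIZ = 4096
token = b"beginbfchar"
start = 4093  # position the log reports for the first fragment

cut = BUFSIZ - start  # bytes of the token that fit before the buffer ends
head, tail = token[:cut], token[cut:]
print(head, tail, start + cut)  # b'beg' b'inbfchar' 4096
```

Only three bytes of the keyword fit in the first buffer, and the remainder is reported at position 4096, the first byte of the next buffer.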

1361.pdf

Originating issue:
ocrmypdf/OCRmyPDF#1361


dhdaines · 2024-08-01T14:34:43Z

Indeed ... the buffering code in the parser is brittle and opaque and I'm not surprised it contains an error like this!

dhdaines · 2024-08-01T21:15:09Z

This change is the culprit: #885, as it doesn't distinguish between the end of the stream and the end of the buffer.

Ideally the PSParser code should be replaced with a lexer based on a more robust and well-tested codebase, but for the moment we can simply fix that fix, which I'll do in a second.

pietermarsman · 2024-11-26T18:26:31Z

Thanks for figuring this out. I'll look at the PR at some point.

dhdaines · 2024-11-27T13:12:16Z

Thanks! The second PR (adding a second parser implementation) may be a bit too disruptive (and prone to bugs).

dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Aug 1, 2024

fix: fix the fix to pdfminer#884 to fix pdfminer#1025

5e38342

dhdaines linked a pull request Aug 1, 2024 that will close this issue

fix: fix the fix to #884 to fix #1025 #1030

Open

dhdaines mentioned this issue Sep 19, 2024

Rewrite PSBaseParser and add an optimized in-memory version #1041

Open

5 tasks

pietermarsman added type: bug component:parser Related to PDFParser status: accepted labels Nov 26, 2024

dhdaines mentioned this issue Dec 6, 2024

Unstructured unnecessarily "repairs" then falls back to OCR on extremely large documents Unstructured-IO/unstructured#3815

Open
