CID characters when extracting text from Korean pdf #1035

mk-docenty · 2024-08-22T03:41:10Z

Hi,

I am testing a PDF file, and when I extract text from it with pdfminer.six the characters come out broken. The PDF is encoded with /UniKS-UTF16-H.
This is the output I get:
(cid:53)(cid:51)(cid:53)(cid:54)(cid:15434)(cid:4738)(cid:11182)(cid:6530)(cid:35) (cid:11206)(cid:9838)(cid:11542)(cid:35) (cid:9967)(cid:4794)(cid:8882)(cid:4766)(cid:9946)

(cid:977)(cid:20) (cid:20) (cid:1923)
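
Each `(cid:N)` token in the output above stands for a glyph whose character ID was not mapped to a Unicode character. A minimal sketch for measuring how much of an extracted string is affected; the helper name `cid_tokens` is hypothetical and not part of pdfminer.six:

```python
import re

# Matches the placeholder emitted for an unmapped glyph, e.g. "(cid:15434)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_tokens(text: str) -> list[int]:
    """Return the CID numbers of all (cid:N) placeholders in text, in order."""
    return [int(n) for n in CID_TOKEN.findall(text)]


# Second line of the output quoted above.
sample = "(cid:977)(cid:20) (cid:20) (cid:1923)"
cids = cid_tokens(sample)
print(f"{len(cids)} unmapped glyphs, distinct CIDs: {sorted(set(cids))}")
```

If nearly every glyph on a page comes back as a placeholder, the problem is more likely in the font's character mapping than in a handful of unusual characters.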

Here is my environment and pdfminer version:

pip show pdfminer.six
Name: pdfminer.six
Version: 20240706

Test PDF

2023_..9..-7-12.pdf

code


from pdfminer.high_level import extract_text

file_path = "2023_._.9._.-7-12.pdf"  # Replace with your PDF file path

# codec is the output text codec; it is not expected to select the font's CMap.
text = extract_text(file_path, codec='UniKS-UTF16-H')

output_file = "output_pdfminer.md"
print("\n\n", text)
# Save the extracted text; characters that cannot be encoded are dropped.
with open(output_file, "w", encoding="utf-16", errors="ignore") as md_file:
    md_file.write(text)

print(f"Text extracted and saved as {output_file}")

dhdaines · 2024-09-19T16:00:43Z

Same problem as #1036 - again, try to copy and paste text out of the file and you will see that the mappings are just nonsense.

nnurmano · 2024-10-03T04:47:38Z

Not sure what the problem is. I copied text from the PDF and it does indeed come out as squares, but when I tried the same PDF with llamaparse it returned the text as it appears in the PDF itself. Could it be something else?

mk-docenty · 2024-10-04T04:18:46Z

@nnurmano I guess it's an encoding issue.

dhdaines · 2024-10-07T16:11:33Z

Oh, it could be that pdfminer has an old or broken version of UniKS-UTF16-H encoding - @mk-docenty can you try copying/pasting from Adobe Acrobat? (I just tried poppler and Chrome, which could have the same problems of incorrect encoding definitions)

dhdaines · 2024-10-07T16:13:09Z

Also @nnurmano this is the first I have heard of llamaparse. It appears to maybe be proprietary? Do you know what they are actually using to extract text from PDF?

nnurmano · 2024-10-07T17:43:36Z

> Also @nnurmano this is the first I have heard of llamaparse. It appears to maybe be proprietary? Do you know what they are actually using to extract text from PDF?

No idea, but I will try to find out.

dhdaines · 2024-10-07T18:15:02Z

Some more digging in that PDF - the UniKS-UTF16-H encoding is only used by the font MalgunGothic, which is only used in the Form XObject on the first page. FontForge refuses to even open this font saying it is "corrupted beyond repair".

Most of the text in the tables (on page 3 for instance) is actually in IIBDIK+HCRDotum or IIBEIJ+GulimChe. These both use an "Identity" mapping - off the top of my head I don't recall exactly what this means, if anything, for text extraction. Here's GulimChe for instance:

67 0 obj
<<
/Ascent 858
/CIDSet 68 0 R
/CapHeight 0
/Descent -141
/Flags 4
/FontBBox [0 -150 1000 863]
/FontFile2 69 0 R
/FontName /IIBEIJ+GulimChe
/ItalicAngle 0
/StemV 0
/Type /FontDescriptor
/XHeight 0
>>
endobj
66 0 obj
<<
/BaseFont /IIBEIJ+GulimChe
/CIDSystemInfo <<
/Ordering (Identity)
/Registry (Adobe)
/Supplement 0
>>
/CIDToGIDMap /Identity
/DW 1000
/FontDescriptor 67 0 R
/Subtype /CIDFontType2
/Type /Font
/W [1659 [500] 1664 1685 500 1688 [500] 1692 1698 500 1700 [500] 1703 [500] 1705 1707 500 1709 1711 500 1714 1715 500 1718 [500] 1720 [500] 1722 1724 500 1731 [500] 1734 [500] 1736 [500] 1747 [500] 1750 [500] 1752 1753 500]
>>
endobj
65 0 obj
<<
/BaseFont /IIBEIJ+GulimChe
/DescendantFonts [66 0 R]
/Encoding /Identity-H
/Subtype /Type0
/Type /Font
>>
endobj
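
For context, `/Encoding /Identity-H` means that each two-byte, big-endian character code in a text string is used directly as the CID, and `/CIDToGIDMap /Identity` means the CID is in turn used as the glyph index in the embedded TrueType font. Neither step says anything about Unicode. That information would normally come from a `/ToUnicode` CMap, and none is present in the font dictionary shown above. A minimal sketch of the code-to-CID step and of reading the `/W` width array; the helper names are hypothetical and the input bytes are made up for illustration:

```python
def identity_h_cids(data: bytes) -> list[int]:
    """Split a string into 2-byte big-endian codes; under Identity-H, code == CID."""
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def parse_w(w: list) -> dict[int, float]:
    """Expand a /W array into {cid: width}.

    Two entry forms are allowed:
      first [w1 w2 ...]   consecutive CIDs starting at first
      first last w        every CID in first..last has width w
    """
    widths: dict[int, float] = {}
    i = 0
    while i < len(w):
        first = w[i]
        if isinstance(w[i + 1], list):
            for offset, width in enumerate(w[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            last, width = w[i + 1], w[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


# First few entries of the /W array from object 66 above.
W = [1659, [500], 1664, 1685, 500, 1688, [500]]
DEFAULT_WIDTH = 1000  # the /DW value from object 66

widths = parse_w(W)
cids = identity_h_cids(bytes.fromhex("067b0680"))  # made-up sample string
print(cids)
print([widths.get(cid, DEFAULT_WIDTH) for cid in cids])
```

The sketch stops at CIDs on purpose: without a `/ToUnicode` CMap or a usable `cmap` table inside the font program, there is no defined step from a CID to a character.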

Looking at this in FontForge, it appears to have a collection of precomposed Hangul blocks and jamo at specific code points, which can be assumed to mean something, just not what pdfminer.six, PDFium and Poppler expect them to mean ;-) Do these code points mean anything to you?

pietermarsman added the labels "type: bug", "component: characters" (anything with encodings, character mappings or CJK languages) and "status: needs more info", and added then removed "component: converter" (related to any PDFLayoutAnalyzer), on Nov 26, 2024.
