The get_dest method of PDFDocument looks up a named destination, but unfortunately the type of its argument depends on the document: for PDF 1.1 documents it takes a str, while for PDF 1.2 documents it takes a bytes. This is because in PDF 1.2 and later the named destinations are stored not in a dictionary but in a name tree, and (PDF 1.7, page 88):
A name tree serves a similar purpose to a dictionary—associating keys and values—but by different means. A name tree differs from a dictionary in the following important ways:
• Unlike the keys in a dictionary, which are name objects, those in a name tree are strings.
What this means in practice is that while pdfminer.six (dubiously) converts the keys of a dictionary to str (because they are name objects and thus kinda-sorta UTF-8, since PDF 1.2, see PDF 1.7 page 16), it cannot reasonably do this for the keys of a name tree as they are undifferentiated blobs of 8-bit data. In practice they can and will be various things including UTF-16 with a BOM (see the EmbeddedFiles in https://github.com/pdfminer/pdfminer.six/blob/master/samples/contrib/issue-625-identity-cmap.pdf).
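To make the problem concrete, here is a minimal sketch using only the standard library (no pdfminer.six calls). The destination name "Intro" and the `name_tree_keys` mapping are made up for illustration; the point is only that one human-readable name can have more than one byte representation, so a bytes lookup succeeds only if the caller guesses the encoding used in the file.

```python
# Hypothetical destination name, encoded two ways a PDF producer might use.
name = "Intro"
key_single_byte = name.encode("ascii")              # plain 8-bit string
key_utf16 = b"\xfe\xff" + name.encode("utf-16-be")  # UTF-16BE with a BOM

# Stand-in for the name tree: this file happens to store the UTF-16 form.
name_tree_keys = {key_utf16: "destination for Intro"}

# The lookup only works with the exact bytes stored in the file.
found_single_byte = key_single_byte in name_tree_keys
found_utf16 = key_utf16 in name_tree_keys
print(found_single_byte, found_utf16)
```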
This means that get_dest isn't really very useful since you have to know what the named destination is and how it's encoded before you can look it up. A better approach would be to allow the user to iterate over the destinations.
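As a rough sketch of what iterating over the destinations could look like, the generator below walks a name tree depth first. It is not pdfminer.six code: the tree is modelled with plain Python dicts and lists, `iter_name_tree` is a hypothetical helper name, and the sample keys and values are invented. A real implementation would also have to resolve indirect object references before reading each node.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, value) pairs from a name tree node, depth first."""
    # Leaf entries are a flat array: [key1, value1, key2, value2, ...]
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # Intermediate nodes list their children under Kids.
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


# Hypothetical tree with keys in two different encodings.
tree = {
    "Kids": [
        {"Names": [b"chapter1", "dest A", b"\xfe\xff\x00c\x00h\x002", "dest B"]},
        {"Names": [b"index", "dest C"]},
    ]
}

dests = list(iter_name_tree(tree))
for key, value in dests:
    print(key, value)
```

With something like this, the caller sees every key exactly as it is stored and can decide how to decode or match it, instead of having to guess the bytes up front.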
Another note: PDF 1.7 specifies (page 367), with respect to the names of destinations:
The keys in the name tree may be treated as text strings for display purposes.
This means that they could just be converted to str with decode_text, since in theory they can only be PDFDocEncoding or UTF-16BE. (In practice they are almost certainly other things as well...)
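The decoding rule described here can be approximated with the standard library alone. This is a sketch, not the decode_text implementation: `decode_dest_name` is a hypothetical name, and Latin-1 is used as a stand-in for PDFDocEncoding, which is an assumption that holds for printable ASCII but not for every byte value.

```python
def decode_dest_name(raw: bytes) -> str:
    """Decode a destination key as a PDF text string (approximation)."""
    # A leading FE FF byte order mark signals UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding here.
    return raw.decode("latin-1")


print(decode_dest_name(b"index"))
print(decode_dest_name(b"\xfe\xff\x00c\x00h\x002"))
```

Keys that are neither PDFDocEncoding nor UTF-16BE would still come out as some string under this rule, just not necessarily a meaningful one.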