`get_dest` works differently for PDF 1.2 vs PDF 1.1 #1053

dhdaines · 2024-10-21T15:00:25Z

The get_dest method of PDFDocument is defined as:

    def get_dest(self, name: Union[str, bytes]) -> Any:

Unfortunately, in practice this means that for PDF 1.1 documents it takes a str, while for PDF 1.2 documents it takes bytes. This is because in PDF 1.2 and later the destination dictionary is not a dictionary but a name tree, and (PDF 1.7, page 88):

A name tree serves a similar purpose to a dictionary—associating keys and values—but by different means. A name tree differs from a dictionary in the following important ways:
• Unlike the keys in a dictionary, which are name objects, those in a name tree are strings.

What this means in practice is that while pdfminer.six (dubiously) converts the keys of a dictionary to str (because they are name objects and thus kinda-sorta UTF-8, since PDF 1.2, see PDF 1.7 page 16), it cannot reasonably do this for the keys of a name tree as they are undifferentiated blobs of 8-bit data. In practice they can and will be various things including UTF-16 with a BOM (see the EmbeddedFiles in https://github.com/pdfminer/pdfminer.six/blob/master/samples/contrib/issue-625-identity-cmap.pdf).

This means that get_dest isn't really very useful since you have to know what the named destination is and how it's encoded before you can look it up. A better approach would be to allow the user to iterate over the destinations.
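To illustrate what iterating over the destinations could look like, here is a minimal sketch of a depth-first walk over a name tree. It is not pdfminer.six code: `iter_name_tree` is a hypothetical helper, and it assumes the nodes have already been resolved into plain Python dicts with `Names` and `Kids` entries (in a real document these would be indirect references that need resolving first).

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, value) pairs from a name tree node, depth first.

    Hypothetical helper: assumes `node` and all of its kids are plain
    dicts, i.e. indirect object references were resolved beforehand.
    """
    # Leaf nodes (and a root without kids) hold a flat /Names array
    # of the form [key1, value1, key2, value2, ...].
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # Intermediate nodes hold /Kids, each of which is another node.
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)
```

With something like this the caller gets the raw bytes keys back and can decide how to decode them, instead of having to guess the encoding up front.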


dhdaines · 2024-10-21T17:31:02Z

Another note: PDF 1.7 specifies (page 367), with respect to the names of destinations:

The keys in the name tree may be treated as text strings for display purposes.

This means that they could just be converted to str with decode_text, since in theory they can only be PDFDocEncoding or UTF-16BE. (In practice they are almost certainly other things as well...)
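A rough sketch of that decoding rule, for illustration only. `decode_dest_name` is a hypothetical name, not part of pdfminer.six, and Latin-1 is used here as a stand-in for PDFDocEncoding. That is an assumption: the two agree on ASCII but differ for some other byte values, so a real implementation would need the proper PDFDocEncoding table.

```python
def decode_dest_name(key: bytes) -> str:
    """Decode a name tree key as a PDF text string (approximation).

    Hypothetical helper: a UTF-16BE byte order mark selects UTF-16BE;
    anything else is treated as PDFDocEncoding, approximated by Latin-1.
    """
    if key.startswith(b"\xfe\xff"):
        # Skip the 2-byte BOM; replace malformed sequences instead of raising.
        return key[2:].decode("utf-16-be", errors="replace")
    # Latin-1 never fails, but it only approximates PDFDocEncoding.
    return key.decode("latin-1")
```

Keys that are neither encoding (the "other things" case) still come out as some string rather than raising, which may or may not be what a caller wants.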
