get_dest works differently for PDF 1.2 vs PDF 1.1 #1053

Open
dhdaines opened this issue Oct 21, 2024 · 1 comment

@dhdaines
Contributor

dhdaines commented Oct 21, 2024

The get_dest method of PDFDocument is defined as:

    def get_dest(self, name: Union[str, bytes]) -> Any:

Unfortunately, in practice this means that for PDF 1.1 documents it takes a str, while for PDF 1.2 and later documents it takes bytes. This is because in PDF 1.2 and later the destinations are stored not in a dictionary but in a name tree, and (PDF 1.7, page 88):

A name tree serves a similar purpose to a dictionary—associating keys and values—but by different means. A name tree differs from a dictionary in the following important ways:
• Unlike the keys in a dictionary, which are name objects, those in a name tree are strings.
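To make the mismatch concrete, here is a minimal sketch that uses plain Python containers as stand-ins for the two storage schemes. Nothing here is pdfminer.six API: `dests_dict`, `dests_name_tree` and `find` are hypothetical names chosen only for illustration.

```python
# Hypothetical stand-ins: plain Python containers, not pdfminer.six objects.

# PDF 1.1 style: a dictionary whose keys are name objects (exposed as str).
dests_dict = {"chapter1": "dest-for-chapter1"}

# PDF 1.2+ style: a name tree whose keys are strings (left as bytes).
dests_name_tree = {b"chapter1": "dest-for-chapter1"}


def find(container, name):
    """Return the destination stored under name, or None if there is no such key."""
    return container.get(name)


# The same logical name only matches when its Python type matches the storage.
print(find(dests_dict, "chapter1"))        # found
print(find(dests_dict, b"chapter1"))       # not found: bytes never equals str
print(find(dests_name_tree, b"chapter1"))  # found
print(find(dests_name_tree, "chapter1"))   # not found
```

So a caller has to know which scheme the document uses before choosing between `str` and `bytes`.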

What this means is that pdfminer.six (dubiously) converts the keys of a dictionary to str, because they are name objects and thus kinda-sorta UTF-8 since PDF 1.2 (see PDF 1.7, page 16). It cannot reasonably do the same for the keys of a name tree, as they are undifferentiated blobs of 8-bit data. In practice they can and will be various things, including UTF-16 with a BOM (see the EmbeddedFiles in https://github.com/pdfminer/pdfminer.six/blob/master/samples/contrib/issue-625-identity-cmap.pdf).

This means that get_dest isn't really very useful since you have to know what the named destination is and how it's encoded before you can look it up. A better approach would be to allow the user to iterate over the destinations.
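As a rough sketch of what iterating over the destinations could look like, the following walks a name tree depth first and yields every key/value pair. It assumes the tree has already been resolved into plain Python dicts and lists (no indirect references), and `iter_name_tree` is a hypothetical helper, not an existing pdfminer.six function.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, value) pairs from a name tree node, depth first."""
    # Leaf nodes hold a flat Names array: [key1, value1, key2, value2, ...]
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # Intermediate nodes hold Kids, each of which is another node.
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


# A toy tree (assumed, already-resolved data) with two leaves; one key is
# UTF-16BE with a BOM, the others are plain ASCII bytes.
tree = {
    "Kids": [
        {"Limits": [b"a", b"b"], "Names": [b"a", 1, b"b", 2]},
        {"Limits": [b"\xfe\xff\x00c", b"\xfe\xff\x00c"],
         "Names": [b"\xfe\xff\x00c", 3]},
    ]
}

for key, value in iter_name_tree(tree):
    print(key, value)
```

With an iterator like this, the caller sees the raw keys as they are stored and can decide how to decode or match them, rather than having to guess the encoding up front.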

@dhdaines
Contributor Author

Another note: PDF 1.7 specifies (page 367), with respect to the names of destinations:

The keys in the name tree may be treated as text strings for display purposes.

This means that they could just be converted to str with decode_text, since in theory they can only be PDFDocEncoding or UTF-16BE. In practice they are almost certainly other things as well.
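The "text string" rule can be sketched in a few lines of standalone Python. This is not decode_text itself: `decode_name_key` is a hypothetical helper, and it uses Latin-1 as an approximation of PDFDocEncoding, which is an assumption that holds for ASCII but not for every byte value.

```python
import codecs


def decode_name_key(key: bytes) -> str:
    """Decode a name tree key as a PDF text string (approximate sketch)."""
    # A leading FE FF byte order mark signals UTF-16BE.
    if key.startswith(codecs.BOM_UTF16_BE):
        return key[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding here. The two agree
    # on ASCII but differ for some other byte values.
    return key.decode("latin-1")


print(decode_name_key(b"chapter1"))
print(decode_name_key(b"\xfe\xff\x00c\x00h\x001"))
```

Keys in any other encoding would come out garbled from a helper like this, which is the practical caveat noted above.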
