UNICODE EXIF UserComment tag read as BigEndian Unicode results in incorrect decoding #423

Open
RupertAvery opened this issue Jun 12, 2024 · 3 comments
Labels
format-exif help wanted image-queue Actionable issue with sample image

Comments

@RupertAvery
Contributor

RupertAvery commented Jun 12, 2024

The metadata in the attached image has a UserComment tag in the Exif SubIFD directory that contains UNICODE-encoded text holding JSON. The existing code decodes this text as BigEndianUnicode, which produces incorrect output.

If the UNICODE entry of encodingMap in TagDescriptor is set to Encoding.Unicode, the text decodes properly.

Should this be just Unicode? Is there a discriminator that determines which endianness should be used?
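
To illustrate the symptom, here is a minimal sketch, independent of MetadataExtractor, of what happens when little-endian UTF-16 bytes are decoded with the wrong byte order. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of MetadataExtractor.
public static class EndiannessDemo
{
    // Encodes the text as UTF-16LE (Encoding.Unicode, no BOM) and then
    // decodes those same bytes with whichever encoding the caller passes in.
    public static string DecodeLittleEndianBytesWith(string text, Encoding decoder)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        return decoder.GetString(bytes);
    }

    // True when the text survives the round trip through the given decoder.
    public static bool RoundTrips(string text, Encoding decoder)
    {
        return DecodeLittleEndianBytesWith(text, decoder) == text;
    }
}
```

With a JSON string such as `{"a":1}`, `RoundTrips` should hold for `Encoding.Unicode` and fail for `Encoding.BigEndianUnicode`, because every byte pair is swapped and each ASCII character becomes an unrelated code point.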

@RupertAvery
Contributor Author

MetadataExtractor/TagDescriptor.cs @ L.370

            // TODO use ByteTrie here
            // Someone suggested "ISO-8859-1".
            var encodingMap = new Dictionary<string, Encoding>
            {
                ["ASCII"] = Encoding.ASCII,
                ["UTF8"] = Encoding.UTF8,
#pragma warning disable SYSLIB0001 // Type or member is obsolete
                ["UTF7"] = Encoding.UTF7,
#pragma warning restore SYSLIB0001 // Type or member is obsolete
                ["UTF32"] = Encoding.UTF32,
                // Affected code
                ["UNICODE"] = Encoding.Unicode,
            };

@drewnoakes
Owner

It's a good question. There might not be one true answer, unfortunately. Perhaps the endianness of the TIFF data stream should be used. However, I doubt that different cameras/software handle this consistently.

Generally in this case I run the code before/after on the regression test suite to see whether it helps more than it hurts.

A workaround is to extract the comment bytes (StringValue) and use an explicit encoding directly.
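
The workaround could look roughly like the sketch below. It assumes the raw UserComment bytes have already been obtained from the directory, and that they begin with the 8-byte Exif character code (such as `UNICODE\0`). `UserCommentWorkaround` and both of its methods are hypothetical names, not part of the library. The second helper only illustrates the TIFF byte order idea mentioned above; as noted, writers may not follow it consistently.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of MetadataExtractor.
public static class UserCommentWorkaround
{
    // An Exif UserComment value starts with an 8-byte character code.
    private const int IdCodeLength = 8;

    // Decodes the payload after the character code with a caller-chosen encoding.
    public static string DecodeUserComment(byte[] commentBytes, Encoding encoding)
    {
        if (commentBytes is null) throw new ArgumentNullException(nameof(commentBytes));
        if (encoding is null) throw new ArgumentNullException(nameof(encoding));
        if (commentBytes.Length <= IdCodeLength) return string.Empty;

        return encoding.GetString(commentBytes, IdCodeLength, commentBytes.Length - IdCodeLength);
    }

    // Assumption: the comment uses the same byte order as the TIFF header,
    // where "II" marks little-endian data and "MM" marks big-endian data.
    public static Encoding EncodingForTiffByteOrder(byte[] tiffHeader)
    {
        if (tiffHeader is null || tiffHeader.Length < 2)
            throw new ArgumentException("Need at least two header bytes.", nameof(tiffHeader));

        if (tiffHeader[0] == (byte)'I' && tiffHeader[1] == (byte)'I') return Encoding.Unicode;
        if (tiffHeader[0] == (byte)'M' && tiffHeader[1] == (byte)'M') return Encoding.BigEndianUnicode;

        throw new ArgumentException("Not a TIFF byte order mark.", nameof(tiffHeader));
    }
}
```

A caller who already knows the file's writer uses little-endian text would pass `Encoding.Unicode` directly. The byte order helper is only worth using if it holds up against real sample files.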

@RupertAvery
Contributor Author

An ugly hack, but I used this as a workaround for my purposes.

                    if (idCode == "UNICODE")
                    {
                        // Heuristic: if the first payload byte is zero, assume the high byte
                        // comes first (big-endian); otherwise assume little-endian.
                        // The length check avoids reading past an empty payload.
                        var encoding = commentBytes.Length > 8 && commentBytes[8] == 0
                            ? Encoding.BigEndianUnicode
                            : Encoding.Unicode;
                        var text = encoding.GetString(commentBytes, 8, commentBytes.Length - 8);
                        return text;
                    }
                    else
                    {
                        if (encodingMap.TryGetValue(idCode, out var encoding))
                        {
                            var text = encoding.GetString(commentBytes, 8, commentBytes.Length - 8);
                            if (encoding == Encoding.ASCII)
                                text = text.Trim('\0', ' ');
                            return text;
                        }
                    }
