UNICODE EXIF UserComment tag read as BigEndian Unicode results in incorrect decoding #423

RupertAvery · 2024-06-12T06:19:32Z

The file's metadata has a UserComment tag in the Exif SubIFD directory whose value is UNICODE-encoded text containing JSON. The existing code decodes this text as BigEndianUnicode, which produces incorrect text.

If the UNICODE entry of encodingMap in TagDescriptor is set to Encoding.Unicode, the text decodes properly.
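
To illustrate the mismatch, here is a minimal sketch written as a C# 9+ top-level program. The sample string is a hypothetical stand-in for the JSON payload, not data taken from the actual file. It only shows what happens when little-endian UTF-16 bytes are decoded as big-endian.

```csharp
using System;
using System.Text;

// Hypothetical sample payload: a small JSON string stored as little-endian UTF-16.
string original = "{\"a\":1}";
byte[] payload = Encoding.Unicode.GetBytes(original);

// Decoding with the wrong byte order swaps the two bytes of every code unit,
// so '{' (bytes 7B 00) is read as U+7B00 instead of U+007B.
string wrong = Encoding.BigEndianUnicode.GetString(payload);
string right = Encoding.Unicode.GetString(payload);

Console.WriteLine(right == original);     // True
Console.WriteLine(wrong[0] == '\u7B00');  // True
```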

Should this be just Unicode? Is there a discriminator that determines which endianness it should use?

RupertAvery · 2024-06-12T14:43:58Z

MetadataExtractor/TagDescriptor.cs @ L.370

            // TODO use ByteTrie here
            // Someone suggested "ISO-8859-1".
            var encodingMap = new Dictionary<string, Encoding>
            {
                ["ASCII"] = Encoding.ASCII,
                ["UTF8"] = Encoding.UTF8,
#pragma warning disable SYSLIB0001 // Type or member is obsolete
                ["UTF7"] = Encoding.UTF7,
#pragma warning restore SYSLIB0001 // Type or member is obsolete
                ["UTF32"] = Encoding.UTF32,
                // Affected code
                ["UNICODE"] = Encoding.Unicode,
            };

drewnoakes · 2024-06-19T00:33:22Z

It's a good question. There might not be one true answer, unfortunately. Perhaps the endianness of the TIFF data stream should be used. However, I doubt that different cameras/software handle this consistently.

Generally in this case I run the code before/after on the regression test suite to see whether it helps more than it hurts.

A workaround is to extract the comment bytes (StringValue) and use an explicit encoding directly.
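
A rough sketch of that workaround, as a C# 9+ top-level program. `DecodeUserComment` is a hypothetical helper, not part of MetadataExtractor. It assumes the raw bytes follow the usual UserComment layout (an 8-byte character code followed by the text) and that the caller already knows which UTF-16 byte order the file uses. Obtaining the raw bytes from the library is left out.

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes raw UserComment bytes, using a caller-chosen
// encoding for the "UNICODE" character code.
string DecodeUserComment(byte[] commentBytes, Encoding unicodeEncoding)
{
    const int IdCodeLength = 8;
    if (commentBytes.Length < IdCodeLength)
        return string.Empty;

    // The first 8 bytes name the character code, padded with NULs or spaces.
    string idCode = Encoding.ASCII.GetString(commentBytes, 0, IdCodeLength).TrimEnd('\0', ' ');
    int count = commentBytes.Length - IdCodeLength;

    return idCode switch
    {
        "UNICODE" => unicodeEncoding.GetString(commentBytes, IdCodeLength, count),
        "ASCII" => Encoding.ASCII.GetString(commentBytes, IdCodeLength, count).Trim('\0', ' '),
        _ => throw new NotSupportedException($"Unhandled character code '{idCode}'.")
    };
}

// Usage with a hand-built sample: "UNICODE\0" followed by "{}" in little-endian UTF-16.
byte[] sample = new byte[8 + 4];
Encoding.ASCII.GetBytes("UNICODE\0").CopyTo(sample, 0);
Encoding.Unicode.GetBytes("{}").CopyTo(sample, 8);
Console.WriteLine(DecodeUserComment(sample, Encoding.Unicode)); // {}
```

Passing the encoding in explicitly keeps the byte-order decision with the caller, who may know more about the producing software than the library does.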

RupertAvery · 2024-08-26T17:30:01Z

An ugly hack, but I used this as a workaround for my purposes.

                    if (idCode == "UNICODE")
                    {
                        // Guess the byte order from the first payload byte:
                        // a zero byte here suggests big-endian UTF-16.
                        var encoding = commentBytes[8] == 0 ? Encoding.BigEndianUnicode : Encoding.Unicode;
                        var text = encoding.GetString(commentBytes, 8, commentBytes.Length - 8);
                        return text;
                    }
                    else
                    {
                        if (encodingMap.TryGetValue(idCode, out var encoding))
                        {
                            var text = encoding.GetString(commentBytes, 8, commentBytes.Length - 8);
                            if (encoding == Encoding.ASCII)
                                text = text.Trim('\0', ' ');
                            return text;
                        }
                    }

RupertAvery changed the title ~~UNICODE EXIF UserComment tag read as BigEndian Unicode. Should this be just Unicode? Is there a discriminator that says how it should be read?~~ UNICODE EXIF UserComment tag read as BigEndian Unicode results in incorrect decoding Jun 12, 2024

drewnoakes added the help wanted, image-queue (Actionable issue with sample image), and format-exif labels Jun 19, 2024
