Question: How to extract thumbnail bytes from Jpeg? #276

Closed
activityworkshop opened this issue Jul 6, 2017 · 9 comments

Comments

@activityworkshop

Hi,
I'm using ImageMetadataReader to read metadata from a jpeg file, and I'd like to gain access to the thumbnail bytes.

Previously, using version 2.7, I could do this:

ExifThumbnailDirectory thumbDir = metadata.getDirectory(ExifThumbnailDirectory.class);
if (thumbDir.hasThumbnailData())
{
	byte[] tdata = thumbDir.getThumbnailData();
}

and could then have access to the bytes of the thumbnail image.

With version 2.10, these methods are no longer available, so I try this:

ExifThumbnailDirectory thumbDir = metadata.getFirstDirectoryOfType(ExifThumbnailDirectory.class);
if (thumbDir.containsTag(ExifThumbnailDirectory.TAG_COMPRESSION))
{
	Integer offset = thumbDir.getInteger(ExifThumbnailDirectory.TAG_THUMBNAIL_OFFSET);
	Integer length = thumbDir.getInteger(ExifThumbnailDirectory.TAG_THUMBNAIL_LENGTH);
}

so I can get the offset (relative to the start of this directory?) and the length of the data block, but I cannot see how to get access to the data itself.

I have tried reading the jpeg file afterwards (using a FileInputStream) and extracting the bytes from it, but that only works if I add the value from TAG_THUMBNAIL_OFFSET to a further offset measured from the start of the file, and the right value to add depends on which jpeg I'm using. So I guess I need to ask the thumbDir what its offset is relative to the start of the file and add that to the value of TAG_THUMBNAIL_OFFSET. Is that correct? Does the Directory class have any way to know where it is? Or can the Metadata object tell where a directory (or its data) can be found?
Or am I missing something and the thumbnail bytes are accessible in another way?

@payton
Collaborator

payton commented Jul 6, 2017

You can actually avoid going back through the file by modifying the ExifReader class. The JPEG thumbnail is stored in the APP1 segment. In the readJpegSegments method, the APP1 segment is passed to the TiffReader as a byte array. I just added a quick snippet to that method, right after the metadata is extracted, that goes back through the array and grabs the actual thumbnail data:

ExifReader.java

public void readJpegSegments(@NotNull final Iterable<byte[]> segments, @NotNull final Metadata metadata, @NotNull final JpegSegmentType segmentType)
{
    assert(segmentType == JpegSegmentType.APP1);

    for (byte[] segmentBytes : segments) {
        // Filter any segments containing unexpected preambles
        if (segmentBytes.length < JPEG_SEGMENT_PREAMBLE.length() || !new String(segmentBytes, 0, JPEG_SEGMENT_PREAMBLE.length()).equals(JPEG_SEGMENT_PREAMBLE))
            continue;
        extract(new ByteArrayReader(segmentBytes), metadata, JPEG_SEGMENT_PREAMBLE.length());

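        /** Added code (FileUtils comes from Apache Commons IO) **/
        try {
            ExifThumbnailDirectory thumbDir = metadata.getFirstDirectoryOfType(ExifThumbnailDirectory.class);
            // Skip images without an embedded thumbnail instead of failing with a NullPointerException
            if (thumbDir != null
                    && thumbDir.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_OFFSET)
                    && thumbDir.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_LENGTH)) {
                ByteArrayReader reader = new ByteArrayReader(segmentBytes);
                int offset = thumbDir.getInt(ExifThumbnailDirectory.TAG_THUMBNAIL_OFFSET);
                int length = thumbDir.getInt(ExifThumbnailDirectory.TAG_THUMBNAIL_LENGTH);
                System.out.println("Offset: " + offset);
                System.out.println("Length: " + length);
                // The offset counts from the TIFF header, which follows the preamble in the segment
                FileUtils.writeByteArrayToFile(new File("thumbnail.jpg"), reader.getBytes(offset + JPEG_SEGMENT_PREAMBLE.length(), length));
            }
        } catch (MetadataException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        /****************/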
    }
}

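The thumbnail offset we are given is relative to the beginning of the TIFF data (the II/MM byte-order marker). Within the APP1 segment, the TIFF data starts right after the JPEG segment preamble, so adding the preamble length to the offset gives the position of the thumbnail in the segment.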

I'm sure there is a better way to incorporate this into the library, but hopefully that helps to get you started.

@activityworkshop
Author

Thanks for your feedback.
But I'm looking for a way that doesn't involve changing the ExifReader class. I'm treating the libmetadata-extractor as an external library and I'd like to use an unmodified jar of the latest released version.

Do you think that version 2.10.1 would allow me to do that or would I need to wait for a future version?

@payton
Collaborator

payton commented Jul 10, 2017

Ah, I see... Unfortunately, there doesn't seem to be a way to get the bytes directly (see #262). Looking back at this commit, you can see that the functionality was removed to avoid holding so much data in memory. The comments on that commit discuss finding the offset you are looking for and some of the problems that came up. There are also some issues that Drew links to in his dotnet library that might help.

If you are willing to go back to reading the JPEG file afterwards, you would just need to find the APP1 segment (marker 0xFFE1) and from there get to the TIFF header (II/MM); see JPEG Structure Info. I know it isn't ideal, but there doesn't seem to be a better way without modifying the library as it is now.
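
The file-scanning approach described above could be sketched roughly as follows. This is only an illustration, not part of metadata-extractor: the class name ExifThumbnailLocator and both method names are made up for this example. It assumes the whole JPEG fits in memory, that the Exif data sits in a single APP1 segment starting with the "Exif\0\0" preamble, and that every marker before the image data carries a length field.

```java
import java.util.Arrays;

// Hypothetical helper, not part of metadata-extractor.
final class ExifThumbnailLocator {
    // Preamble that precedes the TIFF header inside an Exif APP1 segment
    private static final byte[] EXIF_PREAMBLE = {'E', 'x', 'i', 'f', 0, 0};

    /** Returns the absolute offset of the TIFF header (II/MM), or -1 if no Exif APP1 segment is found. */
    public static int findTiffHeaderOffset(byte[] jpeg) {
        // Every JPEG starts with the SOI marker 0xFFD8
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return -1;
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                return -1; // not positioned on a marker
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) {
                pos++; // fill byte before the real marker
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) {
                return -1; // start of scan or end of image: no more metadata segments
            }
            // Big-endian segment length; it includes the two length bytes themselves
            int segLen = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (segLen < 2) {
                return -1;
            }
            int payload = pos + 4;
            if (marker == 0xE1 && startsWith(jpeg, payload, EXIF_PREAMBLE)) {
                return payload + EXIF_PREAMBLE.length;
            }
            pos += 2 + segLen;
        }
        return -1;
    }

    /** Copies the thumbnail bytes, treating thumbnailOffset as relative to the TIFF header. Returns null on failure. */
    public static byte[] extractThumbnail(byte[] jpeg, int thumbnailOffset, int thumbnailLength) {
        int tiffStart = findTiffHeaderOffset(jpeg);
        if (tiffStart < 0 || thumbnailOffset < 0 || thumbnailLength < 0) {
            return null;
        }
        long from = (long) tiffStart + thumbnailOffset;
        long to = from + thumbnailLength;
        if (to > jpeg.length) {
            return null; // offset or length points past the end of the file
        }
        return Arrays.copyOfRange(jpeg, (int) from, (int) to);
    }

    private static boolean startsWith(byte[] data, int from, byte[] prefix) {
        if (from + prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[from + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the file loaded through Files.readAllBytes, the values read from TAG_THUMBNAIL_OFFSET and TAG_THUMBNAIL_LENGTH can be passed straight to extractThumbnail. For very large files, a streaming variant that only reads the segment headers would be kinder to memory.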

I don't believe the storage of the bytes would be added again, but adding a tag for the offset from start of image wouldn't be a bad idea. Looking at the issues, there doesn't seem to be anybody actively working on this, though.

@activityworkshop
Author

That's great, thanks for the feedback.
It looks like I'll need to do a bit more digging to find the best way to fix this.

@lfcnassif

lfcnassif commented Jan 3, 2019

So, that method was removed and we are also affected by this backwards-incompatible change. At the very least, I think the library should point to the new way to get thumbnail bytes from images so that it doesn't break its users.

@Nadahar
Contributor

Nadahar commented Jan 3, 2019

We also need this ability to come back, and have made temporary workarounds in our code (using other methods to get the thumbnail) in the meantime.

@kwhopper
Collaborator

kwhopper commented Jan 3, 2019

This PR lays the groundwork for pointer-based parsing, which could be enhanced to grab thumbnails. Most cases of byte array storage are removed and replaced with ReaderInfo pointers. These contain the global starting position and other data that could be used to go back and read thumbnails after the fact. A good start would be a new Thumbnail directory that holds a list of ReaderInfo pointers to thumbnail locations, filled in as the other readers do their thing. This might be relatively straightforward, although that's always easy to say.

The Java PR version is a port of a similar PR from the .NET version, which is still being reviewed. I think this kind of parsing (pointers instead of byte arrays) is the only way to do certain actions, including giving access to desirable sets of bytes without actually reading those bytes into memory.
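
To make the "pointers instead of byte arrays" idea concrete, here is a minimal sketch. It is purely hypothetical: ByteRangePointer is a name invented for this example and is not the ReaderInfo class from the PR, whose actual API is not shown in this thread. The sketch only assumes that a pointer records a global start position and a length, and that the bytes are read from the file when someone asks for them.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical illustration only; not the ReaderInfo class from the PR.
final class ByteRangePointer {
    private final Path file;
    private final long globalStart; // absolute position from the start of the file
    private final int length;       // number of bytes the pointer covers

    public ByteRangePointer(Path file, long globalStart, int length) {
        if (globalStart < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must not be negative");
        }
        this.file = file;
        this.globalStart = globalStart;
        this.length = length;
    }

    public long globalStart() {
        return globalStart;
    }

    public int length() {
        return length;
    }

    /** Reads the referenced bytes on demand; nothing is held in memory until this is called. */
    public byte[] read() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (globalStart + length > raf.length()) {
                throw new IOException("range exceeds file length");
            }
            byte[] out = new byte[length];
            raf.seek(globalStart);
            raf.readFully(out);
            return out;
        }
    }
}
```

Under this assumption, a thumbnail directory would only need to keep one such pointer per thumbnail, and callers that never ask for the bytes would pay almost nothing for them.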

@haumacher

haumacher commented Aug 20, 2020

Looks like it is still not fixed, see also #149.

For anyone stumbling over this issue: it is possible to dynamically "hack" the ExifReader without modifying the library, along the lines @payton suggested. Place the following lines in the class that accesses ImageMetadataReader first:

	public static final int TAG_THUMBNAIL_DATA = 0x10000;
	
	static {
		List<JpegSegmentMetadataReader> allReaders = (List<JpegSegmentMetadataReader>) JpegMetadataReader.ALL_READERS;
		for (int n = 0, cnt = allReaders.size(); n < cnt; n++) {
			if (allReaders.get(n).getClass() != ExifReader.class) {
				continue;
			}
			
			allReaders.set(n, new ExifReader() {
				@Override
				public void readJpegSegments(@NotNull final Iterable<byte[]> segments, @NotNull final Metadata metadata, @NotNull final JpegSegmentType segmentType) {
					super.readJpegSegments(segments, metadata, segmentType);

				    for (byte[] segmentBytes : segments) {
				        // Filter any segments containing unexpected preambles
				        if (!startsWithJpegExifPreamble(segmentBytes)) {
				        	continue;
				        }
				        
				        // Extract the thumbnail
				        try {
				            ExifThumbnailDirectory tnDirectory = metadata.getFirstDirectoryOfType(ExifThumbnailDirectory.class);
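				            if (tnDirectory != null
				                    && tnDirectory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_OFFSET)
				                    && tnDirectory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_LENGTH)) {
				            	int offset = tnDirectory.getInt(ExifThumbnailDirectory.TAG_THUMBNAIL_OFFSET);
				            	int length = tnDirectory.getInt(ExifThumbnailDirectory.TAG_THUMBNAIL_LENGTH);
				            	int available = segmentBytes.length - JPEG_SEGMENT_PREAMBLE.length();

				            	// Ignore corrupt offsets or lengths that point outside this segment
				            	if (offset >= 0 && length > 0 && offset <= available - length) {
				            		byte[] tnData = new byte[length];
				            		System.arraycopy(segmentBytes, JPEG_SEGMENT_PREAMBLE.length() + offset, tnData, 0, length);
				            		tnDirectory.setObject(TAG_THUMBNAIL_DATA, tnData);
				            	}
				            }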
				        } catch (MetadataException e) {
				            e.printStackTrace();
				        }
				    }
				}				
			});
			break;
		}
	}

This makes the reader create another fake tag in the ExifThumbnailDirectory called TAG_THUMBNAIL_DATA containing the bytes of the thumbnail image.

You can access the thumbnail data in the following way:

Metadata metadata = ImageMetadataReader.readMetadata(resource.getFile());
ExifThumbnailDirectory tnDirectory = metadata.getFirstDirectoryOfType(ExifThumbnailDirectory.class);
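// getFirstDirectoryOfType returns null when the image has no Exif thumbnail directory
byte[] data = tnDirectory != null ? (byte[]) tnDirectory.getObject(TAG_THUMBNAIL_DATA) : null;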
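
Once the bytes are available, a quick sanity check is to try decoding them. The sketch below uses only the standard javax.imageio API. ThumbnailDecoder is a made-up helper name, and the approach assumes the thumbnail is JPEG-compressed (Exif Compression value 6); an uncompressed thumbnail is raw pixel data and would not decode this way.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for checking extracted thumbnail bytes.
final class ThumbnailDecoder {
    /** Returns the decoded thumbnail, or null if the bytes are missing or not a decodable image. */
    public static BufferedImage decode(byte[] data) throws IOException {
        if (data == null || data.length == 0) {
            return null;
        }
        // ImageIO.read returns null when no registered reader recognizes the data
        return ImageIO.read(new ByteArrayInputStream(data));
    }
}
```

A null result here usually means that the offset was applied relative to the wrong base, or that the thumbnail is not stored as a JPEG.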

@lfcnassif

Thanks @haumacher! I'll try your code so that we can upgrade our old metadata-extractor version without breaking things...
