Avoid reading unused bytes from PNG streams #535
Conversation
Check added in […]. Tested only with the file attached to #534.
As I do not intend to mock the […]
@@ -86,7 +86,7 @@
     // Process the next chunk.
     int chunkDataLength = reader.getInt32();

-    if (chunkDataLength < 0)
+    if ((chunkDataLength < 0) || (chunkDataLength > reader.available()))
I'm afraid that this isn't a valid use of available(). Here's the JavaDoc for the method:

Returns an estimate of the number of bytes that can be read (or skipped over) from this SequentialReader without blocking by the next invocation of a method for this input stream. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.

Note that while some implementations of SequentialReader like SequentialByteArrayReader will return the total remaining number of bytes in the stream, others will not. It is never correct to use the return value of this method to allocate a buffer intended to hold all data in this stream.

Returns: an estimate of the number of bytes that can be read (or skipped over) from this SequentialReader without blocking, or 0 when it reaches the end of the input stream.
This is a more fundamental "problem": the nature of a stream is that the total size might be unknown. For SequentialByteArrayReader, using available() would yield the correct results, while for StreamReader (and potentially other implementations) it would not.

I don't know how a "sane maximum value" could be deduced, but I guess some more finesse is needed.
OK, you are right about this. There are implementations of InputStream which might return unexpected results for available(). I didn't realize that when I implemented my fix. I just thought available() was the best chance there is for estimating stream size, and only thought about metadata-extractor as a tool for bounded streams.

What do you suggest then? Using available() might fail for some crazy streams. And technically there is nothing wrong with allocating an array up to Integer.MAX_VALUE; it can cause some exceptions which are impossible to catch/prevent from happening outside of the library.
I'm not sure I have a suggestion off the top of my head. I guess a check could be made with instanceof, and your logic used if the reader is a SequentialByteArrayReader, but this still leaves the question of what to do in all the other cases.

Regarding available(), it's not just "crazy streams" where this fails. Generally, for FileInputStream it returns the remaining bytes of the file, and I think that's the reason why many seem to think it reports "the remainder of the stream". For network streams, for example, available() will generally just report the number of bytes in the buffer at that time, which will probably be a relatively small value.
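To see the difference concretely, here is a minimal, self-contained sketch using plain JDK streams (the class name is made up for illustration):

import java.io.ByteArrayInputStream;
import java.io.IOException;

public class AvailableDemo {
    public static void main(String[] args) throws IOException {
        // For an in-memory stream, available() is exact: after consuming
        // 100 of 1024 bytes it reports the 924 that remain.
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[1024]);
        in.read(new byte[100]);
        System.out.println(in.available()); // prints 924

        // A FileInputStream usually behaves the same way (remaining file
        // bytes), which is what misleads people. A socket-backed stream,
        // by contrast, typically reports only what is already buffered
        // locally, so treating available() as "remaining stream size"
        // would badly under-estimate it.
    }
}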
It is true that this might be unnecessary optimization, as allocating a large buffer is generally OK, and reading more bytes from a stream than it actually contains should result in java.io.EOFException: End of data reached.

Let me know, though, if you come up with any solution or a trick for handling this issue. I realize it is not caused by the library, but such a threat of OOM is a really undesirable one.
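For reference, that failure mode can be reproduced with a tiny sketch (assuming the library's StreamReader; exception handling omitted):

import java.io.ByteArrayInputStream;
import com.drew.lang.StreamReader;

// Only 4 bytes of input, but 8 are requested: rather than returning a
// short array, getBytes() runs out of data and throws an EOFException
// with the message "End of data reached."
StreamReader reader = new StreamReader(new ByteArrayInputStream(new byte[4]));
byte[] bytes = reader.getBytes(8); // throws java.io.EOFException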
@@ -86,7 +86,7 @@
     // Process the next chunk.
     int chunkDataLength = reader.getInt32();

-    if (chunkDataLength < 0)
+    if ((chunkDataLength < 0) || (chunkDataLength > reader.available()))
         throw new PngProcessingException("PNG chunk length exceeds maximum");

     PngChunkType chunkType = new PngChunkType(reader.getBytes(4));
For some reason I'm sure makes a lot of sense to GitHub, I can't place a comment any further down than here. I wanted to comment on line 96 below, though.

I'm not saying that this will solve it, but I see one possibility to at least make this less likely to happen. As far as I can tell, chunkData is only used if willStoreChunk is true. It seems very wasteful to me to call getBytes() even if the data isn't to be kept: it will allocate memory and read information that will just be handed off to the GC. I would suggest something like this instead:
byte[] chunkData;
if (willStoreChunk) {
    chunkData = reader.getBytes(chunkDataLength);
} else {
    chunkData = null; // To satisfy the compiler
    reader.skip(chunkDataLength);
}
With that in place, I guess some sane maximum sizes could be deduced based on the chunk type. I'm no PNG expert, but I would assume that only the IDAT chunk could reasonably contain really large data, and I don't think that chunk is ever relevant for Metadata-extractor to read. For the chunk types of interest to Metadata-extractor, it should be possible to come up with some reasonable maximum value where everything above that would be rejected before calling getBytes() (which is where the allocation occurs)?
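Something like the following guard might capture that idea (a sketch only; the helper name, the constant, and its 16 MB value are invented here for illustration, not library code):

import com.drew.imaging.png.PngProcessingException;

// Hypothetical ceiling for chunks that will actually be stored. The value
// is an illustrative guess, not something the PNG spec or library defines.
static final int MAX_STORED_CHUNK_LENGTH = 16 * 1024 * 1024;

static void validateChunkLength(int chunkDataLength, boolean willStoreChunk) throws PngProcessingException {
    if (chunkDataLength < 0)
        throw new PngProcessingException("PNG chunk length exceeds maximum");
    // Only chunks about to be materialized need the tighter bound; skipped
    // chunks simply run into EOF if the declared length is a lie.
    if (willStoreChunk && chunkDataLength > MAX_STORED_CHUNK_LENGTH)
        throw new PngProcessingException("PNG chunk length exceeds sane maximum");
}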
OK, this would be a nice workaround! In the case of this specific file, the problematic chunk is not read, only skipped. So in the end int chunkDataLength = reader.getInt32(); is negative in the next loop iteration and "End of data reached." is thrown. Nice!

I can see a minor complication in the fact that new PngChunkReader().extract(new StreamReader(inputStream), Collections.emptySet()) and new PngChunkReader().extract(new StreamReader(inputStream), null) can yield completely different results if your code snippet is used. Which is expected, I guess, but should be mentioned in documentation/comments.

On the other hand, PngMetadataReader.readMetadata(@NotNull InputStream inputStream) has a static set of PngChunkType defined, so no changes are required there :)
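To make that difference concrete (a hedged sketch, assuming extract() accepts a nullable set of desired chunk types; inputStream is some caller-provided InputStream, and exception handling is omitted):

import java.util.Collections;
import com.drew.imaging.png.PngChunkReader;
import com.drew.imaging.png.PngChunkType;
import com.drew.lang.StreamReader;

// With an empty set nothing is "desired", so under the skip-when-not-stored
// logic every chunk body is skipped instead of allocated.
new PngChunkReader().extract(new StreamReader(inputStream), Collections.<PngChunkType>emptySet());

// With null every chunk type is kept, so each chunk body is allocated and
// read in full; a lying length field behaves very differently here.
new PngChunkReader().extract(new StreamReader(inputStream), null);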
I've implemented the changes you proposed, added a short comment regarding the chunk data reading and also included the corrupted file in the tests. Let me know if this solution is OK with you. Thanks!
- Reverted stream available bytes check
- Updated tests using corrupted png file
@Draczech thanks for following up with this. The change looks sensible.

One funny thing: in other repos I often ask for unit tests to be added. In this repo, I usually ask for them to be removed :) Well, the test is fine, but the image file -- not so much. Rather than add lots of binary files to this repo, we maintain a separate collection of test images and regularly run regression tests against them. We also use these files to drive parity between the Java and .NET implementations.

Please remove the file from this PR (and the tests, if they cannot be updated to work without the file) and we will add the file to this repo instead: https://github.com/drewnoakes/metadata-extractor-images

I can add it, if you don't want to clone that repo. It's quite large.
…repository instead.
OK, I can see your point. The image file is removed now, and so are the tests (they don't make much sense without the file). So the pull request has one changed file only. Please add the file to your image repo. Thank you!
While I think this is a good change, I'd also think that it should be possible to define an absolutely ridiculous maximum size for at least some chunk types, so that encountering something larger would abort parsing/declare the file invalid. @drewnoakes is that something worth considering (the "work" is of course to come up with these upper thresholds)? To explain what I mean, I took a quick look at the PNG spec and can see that:

- the IHDR chunk data is always exactly 13 bytes;
- the PLTE chunk data is at most 768 bytes (3 × 256 palette entries).

I'm pretty sure that similar thresholds can be found for many other chunk types; I just read the first ones in the standard above. The maximum thresholds don't have to be as narrow as the standard defines; they could leave in a lot of "slack" and still provide reasonable protection against memory reservation issues.

Maybe […]. Or, all this might just be overkill 😉
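As a minimal sketch of that idea (the map, helper name, and generous fallback cap are all illustrative assumptions, not library code; only the IHDR and PLTE bounds come straight from the PNG spec):

import java.util.HashMap;
import java.util.Map;

// Per-chunk-type ceilings. Spec-derived where noted; otherwise guesswork.
static final Map<String, Integer> MAX_CHUNK_DATA_LENGTHS = new HashMap<String, Integer>();
static {
    MAX_CHUNK_DATA_LENGTHS.put("IHDR", 13);      // spec: always exactly 13 bytes
    MAX_CHUNK_DATA_LENGTHS.put("PLTE", 3 * 256); // spec: at most 768 bytes
}

static boolean isPlausibleChunkLength(String chunkType, int length) {
    Integer max = MAX_CHUNK_DATA_LENGTHS.get(chunkType);
    // Unknown chunk types fall back to a generous 64 MB cap (a pure guess).
    return length >= 0 && length <= (max != null ? max : 64 * 1024 * 1024);
}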
These seem like reasonable safety rails, though it would be a shame to reject a file that could be parsed just because it had some zero padding that would otherwise be ignored, for example.
Coming up with a number N for which we assume no chunk could ever be larger is a tricky problem. These chunks can store arbitrary data (such as XMP). I tend to prefer the approach of using the length of the whole data stream when it is available, and not otherwise.

Another approach would be to change how we read large byte arrays from the stream. That is, we read in chunks of some amount (say 1 MB) so that it's more likely the stream hits EOF than that we fail to allocate a crazy large array.
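A minimal sketch of that chunked-read idea, assuming the library's SequentialReader (the helper name and slice size are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import com.drew.lang.SequentialReader;

// Reads totalLength bytes in bounded slices. If the declared length lies,
// getBytes() hits EOF and throws long before one huge up-front allocation,
// and the buffer has only grown as far as real data actually arrived.
static byte[] getBytesChunked(SequentialReader reader, int totalLength) throws IOException {
    final int sliceSize = 1024 * 1024; // 1 MB, as suggested above
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int remaining = totalLength;
    while (remaining > 0) {
        int toRead = Math.min(sliceSize, remaining);
        buffer.write(reader.getBytes(toRead));
        remaining -= toRead;
    }
    return buffer.toByteArray();
}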
Thanks very much for this. It's a good change that will help in a range of scenarios, not just the one you originally reported. It's also generated some good discussion for future work.
I'll squash merge this so that the image doesn't end up in the history.
That would be great, but how do you intend to decide if the data stream size is "known" or not? I've tried to solve this in another project, and the result is anything but pretty (I'm even somewhat embarrassed to post it):

public static long getInputStreamSize(@Nullable InputStream inputStream) {
// This method uses whatever means necessary to acquire the source size
// of known InputStream implementations. If possible acquisition from
// other InputStream implementations is found, the code to acquire it
// should be added to this method.
if (inputStream == null) {
return -1L;
}
if (inputStream instanceof FakeInputStream) {
return ((FakeInputStream) inputStream).getSize();
}
if (inputStream instanceof FilterInputStream) {
try {
Field in = FilterInputStream.class.getDeclaredField("in");
in.setAccessible(true);
return getInputStreamSize((InputStream) in.get(inputStream));
} catch (ReflectiveOperationException | IllegalArgumentException e) {
return -1L;
}
}
if (inputStream instanceof PartialInputStream) {
PartialInputStream partial = (PartialInputStream) inputStream;
long result = getInputStreamSize(partial.source);
if (result < 1L) {
return result;
}
if (partial.getEnd() >= 0L) {
result = Math.min(result, partial.getEnd());
}
return partial.getStart() > 0L ? Math.max(result - partial.getStart(), 0L) : result;
}
if (inputStream instanceof SizeLimitInputStream) {
SizeLimitInputStream limited = (SizeLimitInputStream) inputStream;
return limited.in == null ? -1L : Math.min(getInputStreamSize(limited.in), limited.maxBytesToRead);
}
if (inputStream instanceof DLNAImageInputStream) {
return ((DLNAImageInputStream) inputStream).getSize();
}
if (inputStream instanceof DLNAThumbnailInputStream) {
return ((DLNAThumbnailInputStream) inputStream).getSize();
}
if (inputStream instanceof ByteArrayInputStream) {
try {
Field count = ByteArrayInputStream.class.getDeclaredField("count");
count.setAccessible(true);
return count.getInt(inputStream);
} catch (ReflectiveOperationException | IllegalArgumentException e) {
return -1L;
}
}
if (inputStream instanceof FileInputStream) {
try {
return ((FileInputStream) inputStream).getChannel().size();
} catch (IOException e) {
return -1L;
}
}
if ("com.sun.imageio.plugins.common.InputStreamAdapter".equals(inputStream.getClass().getName())) {
try {
Field stream = inputStream.getClass().getDeclaredField("stream");
stream.setAccessible(true);
return ((ImageInputStream) stream.get(inputStream)).length();
} catch (ReflectiveOperationException | IllegalArgumentException | IOException e) {
return -1L;
}
}
return -1L;
}

The above has several problems. There's the use of reflection, which isn't allowed in more recent versions of Java, and the fundamental "logic" is that you have to treat each InputStream implementation individually.

Chunked reading would solve it at the cost of some loss of performance, which of course raises the question of whether it's "worth it" to lose performance for every valid stream to protect against issues with a very few invalid streams. On the other hand, such invalid streams could be specially crafted to trigger such a problem, I guess.
Shameless plug for drewnoakes/metadata-extractor-dotnet#250, or #361 as a starting point for Java -- only because it's (potentially) a stepping stone towards what you're after. It creates a single place to hold most buffers instead of splattering them throughout the library, which should make it easier to branch off under certain conditions. For example, a future optimization could be writing certain types of stream content directly to disk transparently.
Previously the code would read a byte[] for every PNG chunk, irrespective of whether it was going to be stored. With this change, these arrays are only allocated when they will actually be needed. This is the .NET port of drewnoakes/metadata-extractor#535
Trying to address #534