New Reader classes, v3 #250
base: main
Conversation
Thanks for this PR. It's much easier to see how this might work.
Have given it a quick look through and added my first wave of questions.
/// <param name="isSequential">flag indicating if caller is using sequential access</param>
/// <exception cref="BufferBoundsException"/>
/// <exception cref="IOException"/>
public ushort GetUInt16(long index, bool IsMotorolaByteOrder, bool isSequential)
Would be good to move most (if not all) of these Get* methods to extensions.
Casing of IsMotorolaByteOrder needs tweaking (ditto elsewhere).
{
    /// <author>Kevin Mott https://github.com/kwhopper</author>
    /// <author>Drew Noakes https://drewnoakes.com</author>
    public class ReaderInfo
Can we give this a better name than ReaderInfo? To me, the current name implies it's some immutable bit of data rather than a stateful actionable object.
Should this type be internal?
It could potentially be internal if the bits we need from it (like global position) are recorded with metadata items. Then the instances can disappear in GC.
This was one of the early class names and I never changed it. Maybe something like "RegionReader" or "DataRegion", etc. I'm open to whatever.
I'm also coming up short for name suggestions. Will have a think. If it's internal then it's easier to change later as no one will have taken a dependency on the name.
MetadataExtractor/IO/ReaderInfo.cs
/// Skips forward or backward in the sequence. If the sequence ends, an <see cref="IOException"/> is thrown.
/// </remarks>
/// <param name="offset">the number of bytes to seek, in either direction.</param>
/// <param name="isSequential">optional for testing to indicate whether sequential access was used.</param>
I'm unclear on why isSequential is needed for these Skip methods.
SequentialReader classes throw an IOException when too many bytes are requested. IndexedReader classes throw a BufferBoundsException. It was added very recently as a flag to help throw different exceptions to match your original readers, and make Test methods match better.
It isn't necessary at all if you want it to throw a common exception.
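A rough sketch of the behaviour described above, with the flag selecting the exception type. The method and parameter names here are illustrative, not the PR's actual code, and the BufferBoundsException constructor shape is assumed from the existing readers:

```csharp
using System.IO;
using MetadataExtractor.IO;

// Hypothetical sketch: the isSequential flag preserves the historical split
// between SequentialReader (IOException) and IndexedReader (BufferBoundsException).
internal static class RangeValidation
{
    public static void ValidateRange(long index, int count, long length, bool isSequential)
    {
        if (index < 0 || count < 0 || index + count > length)
        {
            if (isSequential)
                throw new IOException($"End of data reached requesting {count} bytes at index {index}.");

            // Indexed access historically threw the more specific exception.
            throw new BufferBoundsException((int)index, count, length);
        }
    }
}
```

Dropping the flag and always throwing BufferBoundsException would collapse this to a single throw.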
Thanks. I think throwing BufferBoundsException everywhere would be fine. It derives from IOException anyway, and includes more information. This parameter seems out of place somehow, and it gets passed around a lot, which will have some perf impact.
It looks like it's controlling some logic in ReadAtGlobal though.
I'm a bit unclear on how ReaderInfo functions in both sequential and indexed access. Can you switch between them for the same instance at leisure? What is the rationale for having a single type versus two different types?
A ReaderInfo instance can be read indexed or sequentially at the same time. You are free to make indexed calls at any time, and only sequential calls will advance LocalPosition.
The reasons for this design are: (a) both types access data from a pre-captured byte array source, so in essence they are the 'same' anyway; (b) it seemed unnecessary to re-implement all the Get* methods simply to separate the method of reading; and (c) it removes ambiguity down the parsing chain, so you don't have to inspect the parent ReaderInfo of a Clone and compare its original parsing method.
In the last PR, where these classes were actually implemented and working throughout the library, ReaderInfos are cloned constantly (to achieve the benefit of fewer byte array copies) and passed into other reading/parsing classes outside the parent's control. It's not possible to know down the line which access method a parser will want, and the method can actually change many times depending on the metadata segments. So it's a better design to have only one type of reader and avoid an extra inspection or clone-type change. It also allows for the rare case when both indexed and sequential access are desirable in a parse.
Attempted to address some of your comments in the last commit. Let me know what you think.
/// <param name="validateIndex">allows for skipping validation if already done by the caller</param>
/// <exception cref="BufferBoundsException"/>
/// <exception cref="IOException"/>
private byte GetByte(long index, bool isSequential, bool validateIndex)
Given how often we will call this method, I think you could make this method GetByteNoValidation and move the validation back to GetByte, then remove the validateIndex parameter.
Implemented, with some other changes
// Motorola - MSB first (big endian)
// Motorola - MSB first
Why remove "big endian" here (and "little endian" below)?
Oh I see we don't have it elsewhere.
var fromChunkIndex = fromOffset / p_chunkLength; // chunk integer key
var fromInnerIndex = fromOffset % p_chunkLength; // index inside the chunk to start reading
In the case where we read from multiple chunks, can we avoid the div/mod for all but the first iteration? Would it work if these were initialised before the loop, then at the end of the loop incremented and set to zero respectively?
I think fromOffset would then not need to be updated either.
Also, could remaining be removed and the loop condition be toIndex < count?
var chunkIndex = index / p_chunkLength;
var innerIndex = index % p_chunkLength;

if (p_chunks.ContainsKey(chunkIndex))
    return p_chunks[chunkIndex][innerIndex];
I am a little concerned that these operations (div, mod, dictionary lookup, array lookup), when repeated for every byte consumed, will degrade performance. Rather than speculate, and while we have both the new/old implementations on the same branch, can we add some benchmarks that compare before/after?
I would recommend using BenchmarkDotNet for this. It's a really amazing library. I can help set that up if you'd like.
For scenarios, I'd look at sequential non-indexed access, sequential indexed access, random access.
This loop is an almost verbatim copy of similar methods in IndexedCapturingReader. The only difference really is they get called in all cases instead of being implicitly requested by the caller (through using that reader class).
I can try to set up some benchmarks if you want. I'll probably need some guidance on BenchmarkDotNet if that's recommended.
if (p_chunks.ContainsKey(chunkIndex))
    return p_chunks[chunkIndex][innerIndex];
else
    return unchecked((byte)-1);
This will be 255, which is indistinguishable by the caller from a real value. Should this just throw?
There were some specific metadata cases where GetByte was expected to return this value... but now I can't remember how that went. This one might have to be tested against the metadata repo again to figure out -- but that's down the line.
It might have had something to do with replicating the Stream.ReadByte() return value and behavior. When it can't read any more data, it returns -1. That isn't an error condition to the Stream, but SequentialStreamReader throws an IOException on that value. I could maybe try to throw the same exception and see if it works.
I went ahead and changed it to throw for now. Will need to see how it works in more extensive testing.
I think GetByte used to return int, which covers the domain of byte (which is unsigned) while allowing the special -1 value to indicate something else. Casting -1 to byte here pushed the value into the range of byte, which means the caller can no longer tell that it was meant to be special. If this is going to return byte, then the exception is that special value.
private const int DefaultChunkLength = 4 * 1024;
private readonly int p_chunkLength;
public Dictionary<long, byte[]> p_chunks = new Dictionary<long, byte[]>();
Why Dictionary<,> and not List<>?
Because chunks aren't necessarily read and stored in a linear sequence. There could be gaps so a Dictionary keyed on the chunk starting position is ideal. If the stream is seekable, note that BytesAvailable will first seek the stream to the desired 'chunkable' position and only read that chunk into the Dictionary; the chunk index becomes the key. This optimization ignores caching all the bytes in between, and is the main reason this implementation is as good or better than the current one on a seekable stream.
For example, suppose you have a FileStream (which is seekable) and the first chunk of the file has everything you need at the moment to start parsing. The Dictionary will have a key of "0" with a byte array the size of the 0-th chunk of the file. Maybe it tells you that the next IFD is 300,000 bytes further into the file. BytesAvailable will Seek to the closest chunk that contains this new position and read that one. Now there will be only two entries in the Dictionary - "0" and "73" if the chunk length is 4096 (assuming my math is right).
So, this avoids caching ~300,000 bytes of unneeded data in the Dictionary. The only way to know whether this chunk is already there is to key it by chunk index relative to the entire stream -- and we're then back to why a keyed structure is better. We might eventually read some or all of those skipped bytes, but at this particular point they aren't needed.
This structure also works just fine if you're reading the whole thing linearly, such as with a NetworkStream. Non-seekable streams will simply read all intervening chunks into the Dictionary instead.
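The chunk-key arithmetic in that example works out as follows, assuming the default 4096-byte chunk length (variable names here are illustrative):

```csharp
const int ChunkLength = 4 * 1024;   // 4096, the default chunk size

long offset = 300_000;              // position of the next IFD in the example
long chunkKey = offset / ChunkLength;    // 73 -- the dictionary key
long innerIndex = offset % ChunkLength;  // 992 -- position within that chunk

// A seekable stream jumps straight to the chunk boundary and reads one chunk,
// so keys 1 through 72 never appear in the dictionary:
// stream.Seek(chunkKey * ChunkLength, SeekOrigin.Begin);
```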
Thanks. Using your example, indexes 1 through 72 in a List<byte[]> could be null, which would take up a trivial amount of space. This would then mean that lookups in the list would be faster (simple pointer arithmetic) than a dictionary lookup.
This kind of thing is why I think it's time to get some benchmarks set up. Performance is a very important feature of this library, particularly in the IO area.
BenchmarkDotNet is an amazing tool and very easy to use. If you haven't tried it before, I think you'll really enjoy it. This page has some copy/paste-able code to start from:
https://benchmarkdotnet.org/articles/guides/getting-started.html
I'd also add [MemoryDiagnoser] to the benchmark class, which will cause memory usage/GC stats to be included in the output.
It's probably best to create a new project in the solution for these benchmarks. I am happy to help with this.
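A minimal skeleton along those lines, assuming the scenarios mentioned above; the benchmark bodies are placeholders for the real reader calls, not the library's actual API usage:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]   // adds allocation and GC columns to the results table
public class ReaderBenchmarks
{
    private byte[] _data;

    [GlobalSetup]
    public void Setup() => _data = new byte[1024 * 1024];   // 1 MB of test data

    [Benchmark(Baseline = true)]
    public long ExistingSequentialReader()
    {
        // Placeholder: read _data with the existing sequential readers.
        long sum = 0;
        foreach (var b in _data) sum += b;
        return sum;
    }

    [Benchmark]
    public long RandomAccessStreamReader()
    {
        // Placeholder: read _data via RandomAccessStream/ReaderInfo.
        long sum = 0;
        foreach (var b in _data) sum += b;
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ReaderBenchmarks>();
}
```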
More review changes pushed. I'll comment on the others, and we'll see if they also need to be addressed.
@drewnoakes - I added a List<byte[]> implementation of RAS. Please let me know what you think. To keep some code around, I also made an interface and a copy of the Dictionary version. You're probably correct that a List will perform better overall and is a little easier to tune and control. But constructing tests for Dictionary vs. List by themselves isn't straightforward or very valuable outside of the (various) ways this library reads data. If this PR makes it into the library, I'm happy to do tests later using the full MetadataExtractor library code, which will put the usage in more proper context.
Thanks @kwhopper. In order to merge this I would want to see those benchmarks. They could be removed prior to merging (though kept in the history for posterity). If you don't have the time to add such benchmarks then I will endeavour to do so but am not sure when that will be. I don't think they'll take long to add. Exploring ideas generated by their results is where the time would likely go.
I got hung up on the best way to approach it without using the library itself. What's a good way to construct the benchmarks? Maybe random reads against a large file - a few megabytes in size? It could be deterministic reads also, like defined sections of the file in both scenarios.
Here's a very basic run with deterministic offsets. Results indicate the List version is ~35-40% "faster":
Are there settings to get other data, like memory usage, GC, etc? Thanks |
Pushed a few code changes to address some benchmark differences. A few may be reverted, but they are at least documented in a commit.
I updated BenchmarkDotNet to 0.12.1 in the next tests. Here's a new benchmark file that includes the List version of RAS, and similar calls for IndexedCapturingReader and IndexedSeekingReader. With those files and the last commit, here are the benchmarks:
The time differences are negligible at this scale and due almost entirely to the use of "long" for index params to some RandomAccessStream methods. For some reason, the 64-bit calculations can be relatively slow. You can view the 'hack' parts in the commit where it's listed as a "micro-optimization" for benchmarks. If I remove those, here are the numbers:
We're still talking microseconds, so even this result isn't terrible. Using "int" index params instead of "long" explains virtually all of the difference. I did some earlier partial tests where those are all bypassed, and RAS performed even better on time against IndexedCapturingReader. So, we can either support files larger than an Int32 in size and take a small hit (on some architectures), or change it back to ints in the RAS classes and probably close the time gap. I think some users have wanted long support for files, so I put it in there from the start.
This looks great. I'll have a play around with the benchmarks later today hopefully. One thing I'll try (in case you want to try it first) is to move the construction of the RAS and ReaderInfo out of the benchmarked methods. I'll probably also generate random bytes and try reading them all sequentially as various data types, comparing both RAS and the existing sequential readers, to add more sequential coverage. I don't know what's best around int/long indexes. We should probably support long. Feel free to push the benchmark code to this branch.
Pushed and added some of what you mentioned. The memory allocations are intriguing but not sure what to make of them without more testing:
Here's the memory when RAS is always instantiated during a benchmark run:
To sum up a little as you look at this... We know that byte arrays will always be read and then sliced and diced as metadata is parsed. The goal here is to read these arrays in reasonable chunks from the source but assign a pointer (a ReaderInfo instance) that specifies location and size, and use that until actual bytes are really needed. This instance can be used as-is to read bytes, or cloned into another pointer instance when a subsection is desired in another parsing class.
The behavior and output you see so far, then, is generally expected and part of the design to reuse existing byte arrays as much as possible. It's a single reader class as well that avoids almost all the byte copying done in the current implementation down the chain. IndexedSeekingReader doesn't allocate a lot of memory in this test, but I don't think that's very representative of reality; most of the time many chunks of a file will be read during parsing, and RAS expects that from the start.
Thanks @kwhopper. I agree with your points and can mostly see the bigger picture when it comes to memory copying. Without implementing parallel versions of the library in terms of old/new it won't be easy to see a more realistic set of values. Still, it's useful to ensure there are no major issues, and track changes over time. I will pull this branch down and play with it to gain a better understanding.
Thanks @drewnoakes. I've hounded you plenty the last few years, so it's enough that you're still open to the idea. Benchmarking parallel libraries isn't entirely out of the question. It's certainly not an overnight proposition, but I already did it once to demo the first attempt. There are many potential advantages of a separate pointer object for reading that always knows its own local and global positions. Here are a few that drove this idea in the first place... and my persistence in lobbying for it:
I'm sure there are other things, but that's enough for now. Thanks
I'm happy to do this if it will help -- implement a test of the entire library using these classes. Most of the work was already done in the first version. I'll need to do fixes for any new additions or changes to the core library.
The issue from my side is free time. Where I live we have a total lockdown, which means I have to spend a large chunk of each day looking after my child while also trying to perform my job, and I am feeling burnt out. I have only occasional bandwidth for small simple tasks like small PR reviews and fixing simple bugs. This PR requires several hours of deep thought. Integrating it across the whole library, while a generous offer, will increase the time requirement. I haven't forgotten about this, and I am hopeful we can get this in without many changes. I just need to take the time to think through the implications, and that requires finding the time. Once done we can plan out integrating the changes across the library.
Thanks for fixing the merge conflicts @kwhopper. |
- Minor input change to BufferBoundsException - Some initial test cases for Ras and ReaderInfo
…y done by the caller - use GetByte instead of Read for Get* methods (similar to old readers) - add the constant RandomAccessStream.UnknownLengthValue - mark NonSeekableStream class as internal (for testing only) - document the RandomAccessStream.Length property - mark RandomAccessStream.ValidateIndex method as internal - document the return value of the RandomAccessStream.ValidateIndex method
…y report a Length - Throw BufferBoundsException instead of IOException in ValidateIndex - Remove "IsSequential" inputs to most RandomAccessStream classes - fix isMotorolaByteOrder casing in RandomAccessStream
- rename overload of GetByte to GetByteNoValidation and remove validateIndex param
…rtial is true - remove allowPartial param from ValidateRange
…t in the Dictionary instead of returning -1
…k buffer - add IRandomAccessStream interface - keep dictionary version of RAS class
- remove nullable designator from RandomAccessStream.p_inputStream (optimization) - convert long index values to int for benchmarks
- leave support for long Stream indexes in RandomAccessStream (and take the mild read hit) - add RASBenchmark class
f757004 to 03bcbc7
RandomAccessStream + ReaderInfo
This PR is the third attempt at simplifying and enhancing the reading capabilities of MetadataExtractor. Very few code changes were made to the initial classes. The two main goals are to show how few changes are needed to bring the core classes into the library, and to provide some test cases.
The actual operation of these classes was implemented in #130 but has been omitted here. I want it to be easier for you to review.
In this PR:
The main advantages:
closes #130
@drewnoakes The area in this PR can be used for further discussion - including whether I should continue to pursue this. I still believe in the general idea and this is a stable first step. Many other things can be added using nothing but the two main classes in this PR, including things like Span under the hood, easily writing out thumbnail byte arrays to disk, a basic catalog of the Exif data byte locations (needed for writing Exif), and selectively buffering large byte arrays to disk (like ICC data) through wrapping with another RandomAccessStream.
To reiterate: most of the code here was taken from existing reader code -- it has simply been consolidated into a couple of classes to support a new way of maintaining position.