Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Reader classes, v3 #250

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

kwhopper
Copy link
Collaborator

@kwhopper kwhopper commented May 3, 2020

RandomAccessStream + ReaderInfo

This PR is the third attempt at simplifying and enhancing the reading capabilities of MetadataExtractor. Very few code changes were made to the initial classes. The two main goals are to show how few changes are needed to bring the core classes into the library, and provide some Test cases.

The actual operation of these classes was implemented in #130 but has been omitted here. I want it to be easier for you to review.

In this PR:

  • Add RandomAccessStream and ReaderInfo classes to project
  • Minor input change to BufferBoundsException
  • Some initial test cases for Ras and ReaderInfo

The main advantages:

  • All sequential, byte, and indexed readers have been replaced by a single RandomAccessStream that handles all core reading operations. It uses a capturing-like structure to buffer data as requested.
  • The ReaderInfo class is the reading layer on top of an underlying RandomAccessStream. It helps simplify calls into RAS and can support Sequential and Indexed operations. Objects know their own global position and describe their own length, byte order, and local position purely by index; few to no bytes are buffered out of the underlying RAS.
  • Instances of ReaderInfo can be cloned at any time against the same RandomAccessStream. This is crucial to maintaining a flow down through the various reading classes without creating multiple copies of byte arrays. Any point in the current ReaderInfo can become the starting point of a new clone.
  • Virtually all "Get" operations were copied from existing methods in the other reading classes, so they have largely been tested in the current live code. Some modifications were needed to support Indexed and Sequential in the same class.
  • Seekable (e.g. FileStream) and Non-seekable (e.g. HttpClient) streams are supported.

closes #130

@drewnoakes The area in this PR can be used for further discussion - including whether I should continue to pursue this. I still believe in the general idea and this is a stable first step. Many other things can be added using nothing but the two main classes in this PR, including things like Span under the hood, easily writing out thumbnail byte arrays to disk, a basic catalog of the Exif data byte locations (needed for writing Exif), and selectively buffering large byte arrays to disk (like ICC data) through wrapping with another RandomAccessStream.

To reiterate: most of the code here was taken from existing reader code -- it has simply been consolidated into a couple of classes to support a new way of maintaining position.

Copy link
Owner

@drewnoakes drewnoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR. It's much easier to see how this might work.

Have given it a quick look through and added my first wave of questions.

MetadataExtractor/IO/RandomAccessStream.cs Outdated Show resolved Hide resolved
MetadataExtractor/IO/RandomAccessStream.cs Show resolved Hide resolved
MetadataExtractor/IO/RandomAccessStream.cs Outdated Show resolved Hide resolved
/// <param name="isSequential">flag indicating if caller is using sequential access</param>
/// <exception cref="BufferBoundsException"/>
/// <exception cref="IOException"/>
public ushort GetUInt16(long index, bool IsMotorolaByteOrder, bool isSequential)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to move most (if not all) of these Get* methods to extensions.

Copy link
Owner

@drewnoakes drewnoakes May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casing of IsMotorolaByteOrder needs tweaking (ditto elsewhere).

MetadataExtractor/IO/RandomAccessStream.cs Outdated Show resolved Hide resolved
{
/// <author>Kevin Mott https://github.com/kwhopper</author>
/// <author>Drew Noakes https://drewnoakes.com</author>
public class ReaderInfo
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give this a better name than ReaderInfo. To me, the current name implies it's some immutable bit of data rather than a stateful actionable object.

Should this type be internal?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could potentially be internal if the bits we need from it (like global position) are recorded with metadata items. Then the instances can disappear in GC.

This was one of the early class names and I never changed it. Maybe something like "RegionReader" or "DataRegion", etc. I'm open to whatever.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also coming up short for name suggestions. Will have a think. If it's internal then it's easier to change later as no one will have taken a dependency on the name.

/// Skips forward or backward in the sequence. If the sequence ends, an <see cref="IOException"/> is thrown.
/// </remarks>
/// <param name="offset">the number of bytes to seek, in either direction.</param>
/// <param name="isSequential">optional for testing to indicate whether sequential access was used.</param>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear on why isSequential is needed for these Skip methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SequentialReader classes throw an IOException when too many bytes are requested. IndexedReader classes throw a BufferBoundsException. It was added very recently as a flag to help throw different exceptions to match your original readers, and make Test methods match better.

It isn't necessary at all if you want it to throw a common exception.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think throwing BufferBoundsException everywhere would be fine. It derives from IOException anyway, and includes more information. This parameter seems out of place somehow, and it gets passed around a lot which will have some perf impact.

It looks like it's controlling some logic in ReadAtGlobal though.

I'm a bit unclear on how ReaderInfo functions in both sequential and indexed access. Can you switch between them for the same instance at leisure? What is the rationale for having a single type versus two different types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ReaderInfo class can be read indexed or sequentially at the same time. You are free to make indexed calls at any time and only sequential calls will advance LocalPosition.

The reasons for this design are: a) both types access data from a pre-captured byte array source so in essence they are the 'same' anyway; b) it seemed unnecessary to re-implement all the Get* methods simply to separate the method of reading; and c) it removes ambiguity down the parsing chain so you don't have to inspect the parent ReaderInfo of a Clone and compare its original parsing method.

In the last PR where these classes were actually implemented and working throughout the library, ReaderInfo's are cloned constantly (to achieve the benefit of fewer byte array copies) and passed into other reading/parsing classes outside the parent's control. It's not possible to know down the line which access method a parser will want, and the method can actually change many times depending on the metadata segments. So, it's a better design to have only one type of reader and avoid an extra inspection or clone type change. It also allows for the rare case when indexed and sequential are desirable in a parse.

MetadataExtractor/IO/RandomAccessStream.cs Show resolved Hide resolved
MetadataExtractor/IO/RandomAccessStream.cs Show resolved Hide resolved
@kwhopper
Copy link
Collaborator Author

kwhopper commented May 6, 2020

Attempted to address some of your comments in the last commit. Let me know what you think.

/// <param name="validateIndex">allows for skipping validation if already done by the caller</param>
/// <exception cref="BufferBoundsException"/>
/// <exception cref="IOException"/>
private byte GetByte(long index, bool isSequential, bool validateIndex)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how often we will call this method I think you could make this method GetByteNoValidation and move the validation back to GetByte, then remove the validateIndex parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented, with some other changes

Comment on lines 279 to 297
// Motorola - MSB first (big endian)
// Motorola - MSB first
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove "big endian" here (and "little endian" below)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see we don't have it elsewhere.

Comment on lines +134 to +144
var fromChunkIndex = fromOffset / p_chunkLength; // chunk integer key
var fromInnerIndex = fromOffset % p_chunkLength; // index inside the chunk to start reading
Copy link
Owner

@drewnoakes drewnoakes May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where we read from multiple chunks, can we avoid the div/mod for all but the first iteration? Would it work if these were initialised before the loop, then at the end of the loop incremented and set to zero respectively?

I think fromOffset would then not need to be updated either.

Also could remaining be removed and the loop condition be toIndex < count?

Comment on lines 174 to 187
var chunkIndex = index / p_chunkLength;
var innerIndex = index % p_chunkLength;

if (p_chunks.ContainsKey(chunkIndex))
return p_chunks[chunkIndex][innerIndex];
Copy link
Owner

@drewnoakes drewnoakes May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little concerned that these operations (div, mod, dictionary lookup, array lookup), when repeated for every byte consumed, will degrade performance. Rather than speculate, and while we have both the new/old implementations on the same branch, can we add some benchmarks that compare before/after?

I would recommend using BenchmarkDotNet for this. It's a really amazing library. I can help set that up if you'd like.

For scenarios, I'd look at sequential non-indexed access, sequential indexed access, random access.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop is an almost verbatim copy of similar methods in IndexedCapturingReader. The only difference really is they get called in all cases instead of being implicitly requested by the caller (through using that reader class).

I can try to setup some benchmarks if you want. I'll probably need some guidance on BenchmarkDotNet if that's recommended.

if (p_chunks.ContainsKey(chunkIndex))
return p_chunks[chunkIndex][innerIndex];
else
return unchecked((byte)-1);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be 255 which is indistinguishable by the caller from a real value. Should this just throw?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were some specific metadata cases where GetByte was expected to return this value... but now I can't remember how that went. This one might have to be tested against the metadata repo again to figure out -- but that's down the line.

It might have had something to do with replicating the Stream.ReadByte() return value and behavior. When it can't read any more data, it returns -1. That isn't an error condition to the Stream, but SequentialStreamReader throws an IOException on that value. I could maybe try to throw the same exception and see if it works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and changed it to throw for now. Will need to see how it works in more extensive testing.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GetByte used to return int, which covers the domain of byte (which is unsigned) while allowing the special -1 value to indicate something else. Casting -1 to byte here pushed the value into the range of byte which means the caller can no longer tell that it was meant to be special. If this is going to return byte, then the exception is that special value.


private const int DefaultChunkLength = 4 * 1024;
private readonly int p_chunkLength;
public Dictionary<long, byte[]> p_chunks = new Dictionary<long, byte[]>();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Dictionary<,> and not List<>?

Copy link
Collaborator Author

@kwhopper kwhopper May 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because chunks aren't necessarily read and stored in a linear sequence. There could be gaps so a Dictionary keyed on the chunk starting position is ideal. If the stream is seekable, note that BytesAvailable will first seek the stream to the desired 'chunkable' position and only read that chunk into the Dictionary; the chunk index becomes the key. This optimization ignores caching all the bytes in between, and is the main reason this implementation is as good or better than the current one on a seekable stream.

For example, suppose you have a FileStream (which is seekable) and the first chunk of the file has everything you need at the moment to start parsing. The Dictionary will have a key of "0" with a byte array the size of the 0-th chunk of the file. Maybe it tells you that the next IFD is 300,000 bytes further into the file. BytesAvailable will Seek to the closest chunk that contains this new position and read that one. Now there will be only two entries in the Dictionary - "0" and "73" if the chunk length is 4096 (assuming my math is right).

So, this avoids caching ~300,000 bytes of unneeded data in the Dictionary. The only way to know whether this chunk is already there is to key it by chunk index relative to the entire stream -- and we're then back to why a keyed structure is better. We might eventually read some or all of those skipped bytes, but at this particular point they aren't needed.

This structure also works just fine if you're reading the whole thing linearly, such as with a NetworkStream. Non-seekable streams will simply read all intervening chunks into the Dictionary instead.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Using your example, indexes 1 through 72 in a List<byte[]> could be null which would take up a trivial amount of space. This would then mean that lookups in the list would be faster (simple pointer arithmetic) than a dictionary lookup.

This kind of thing is why I think it's time to get some benchmarks set up. Performance is a very important feature of this library, particularly in the IO area.

BenchmarkDotNet is an amazing tool and very easy to use. If you haven't tried it before, I think you'll really enjoy it. This page has some copy/paste-able code to start from:

https://benchmarkdotnet.org/articles/guides/getting-started.html

I'd also add [MemoryDiagnoser] to the benchmark class, which will cause memory usage/GC stats to be included in the output.

It's probably best to create a new project in the solution for these benchmarks. I am happy to help with this.

@kwhopper
Copy link
Collaborator Author

kwhopper commented May 9, 2020

More review changes pushed. I'll comment on the others, and we'll see if they also need to be addressed

@drewnoakes drewnoakes mentioned this pull request May 14, 2020
@kwhopper
Copy link
Collaborator Author

@drewnoakes - I added a List<byte[]> implementation of RAS. Please let me know what you think. To keep some code around, I also made an interface and a copy of the Dictionary version.

You're probably correct that a List will perform better overall and is a little easier to tune and control. But constructing tests for Dictionary vs. List by themselves isn't straightforward or very valuable outside of the (various) ways this library reads data. If this PR makes it into the library, I'm happy to do tests later using the full MetadataExtractor library code which will put the usage in more proper context.

@drewnoakes
Copy link
Owner

Thanks @kwhopper.

In order to merge this I would want to see those benchmarks. They could be removed prior to merging (though kept in the history for posterity).

If you don't have the time to add such benchmarks then I will endeavour to do so but am not sure when that will be. I don't think they'll take long to add. Exploring ideas generated by their results is where the time would likely go.

@kwhopper
Copy link
Collaborator Author

I got hung up on the best way to approach it without using the library itself. What's a good way to construct the benchmarks? Maybe random reads against a large file - a few megabytes in size?

It could be deterministic reads also, like defined sections of the file in both scenarios.

@kwhopper
Copy link
Collaborator Author

Here's a very basic run with deterministic offsets:

RASBenchmark.zip

Results indicate the List version is ~35-40% "faster":

BenchmarkDotNet=v0.10.12, OS=Windows 10.0.18363
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical cores and 6 physical cores
  [Host]     : .NET Framework 4.6.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0
  DefaultJob : .NET Framework 4.6.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0


                            Method |     Mean |     Error |    StdDev | Scaled |
---------------------------------- |---------:|----------:|----------:|-------:|
            RASDictionaryBenchmark | 26.80 us | 0.1253 us | 0.1172 us |   1.00 |
                  RASListBenchmark | 16.41 us | 0.1079 us | 0.1010 us |   0.61 |
 RASDictionaryNonseekableBenchmark | 27.95 us | 0.1190 us | 0.1113 us |   1.04 |
       RASListNonseekableBenchmark | 17.26 us | 0.0910 us | 0.0851 us |   0.64 |

Are there settings to get other data, like memory usage, GC, etc? Thanks

@kwhopper
Copy link
Collaborator Author

Pushed a few code changes to address some benchmark differences. A few may be reverted but they are at least documented in a commit.

@kwhopper
Copy link
Collaborator Author

I updated BenchmarkDotNet to 0.12.1 in the next tests. Here's a new benchmark file that includes the List version of RAS, and similar calls for IndexedCapturingReader and IndexedSeekingReader:

RASBenchmark.zip

With those files and the last commit, here are the benchmarks:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |
|-------------------------------- |---------:|--------:|--------:|------:|
|                RASListBenchmark | 376.3 us | 1.56 us | 1.38 us |  1.00 |
| IndexedCapturingReaderBenchmark | 202.1 us | 0.88 us | 0.78 us |  0.54 |
|   IndexedSeekingReaderBenchmark | 240.0 us | 1.33 us | 1.18 us |  0.64 |

The time differences are negligible at this scale and due almost entirely to the use of "long" for index params to some RandomAccessStream methods. For some reason, the 64-bit calculations can be relatively slow. You can view the 'hack' parts in the commit where it's listed as a "micro-optimization" for benchmarks. If I remove those, here are the numbers:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |
|-------------------------------- |---------:|--------:|--------:|------:|
|                RASListBenchmark | 689.4 us | 3.75 us | 3.32 us |  1.00 |
| IndexedCapturingReaderBenchmark | 202.4 us | 1.03 us | 0.91 us |  0.29 |
|   IndexedSeekingReaderBenchmark | 251.2 us | 0.62 us | 0.48 us |  0.36 |

We're still talking micro seconds, so even this result isn't terrible.

Using "int" index params instead of "long" explains virtually all of the difference. I did some earlier partial tests where those are all bypassed and RAS performed even better on time against IndexedCapturingReader. So, we can either support files larger than an int32 in size and take a small hit (on some architectures), or change it back to int's in RAS classes and probably close the time gap.

I think some users have wanted long support in files so I put it in there from the start.

@drewnoakes
Copy link
Owner

This looks great.

I'll have a play around with the benchmarks later today hopefully. One thing I'll try (in case you want to try it first) is to move the construction of the RAS and ReaderInfo out of the [Benchmark] methods (they can be fields which are reused across runs) and add [MemoryDiagnoser] to the benchmark.

I'll probably also generate random bytes and try reading them all sequentially as various data types, comparing both RAS and the existing sequential readers, to add more sequential coverage.

I don't know what's best around int/long indexes. We should probably support long.

Feel free to push the benchmark code to this branch.

@kwhopper
Copy link
Collaborator Author

Pushed and added some of what you mentioned. The memory allocations are intriguing but not sure what to make of them without more testing:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------------------------------- |---------:|--------:|--------:|------:|--------:|-------:|------:|----------:|
|                RASListBenchmark | 685.5 us | 2.06 us | 1.92 us |  1.00 |       - |      - |     - |   2.98 KB |
| IndexedCapturingReaderBenchmark | 216.0 us | 0.73 us | 0.68 us |  0.32 | 10.2539 | 1.9531 |     - |  63.91 KB |
|   IndexedSeekingReaderBenchmark | 238.2 us | 0.71 us | 0.63 us |  0.35 |  0.4883 |      - |     - |   3.03 KB |

@kwhopper
Copy link
Collaborator Author

Here's the memory when RAS is always instantiated during a benchmark run:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------------------------------- |---------:|--------:|--------:|------:|--------:|-------:|------:|----------:|
|                RASListBenchmark | 689.2 us | 3.54 us | 3.31 us |  1.00 |  7.8125 | 0.9766 |     - |  51.85 KB |
| IndexedCapturingReaderBenchmark | 201.4 us | 1.10 us | 1.03 us |  0.29 | 10.2539 | 1.9531 |     - |  63.91 KB |
|   IndexedSeekingReaderBenchmark | 260.7 us | 0.74 us | 0.62 us |  0.38 |  0.4883 |      - |     - |   3.03 KB |

@kwhopper
Copy link
Collaborator Author

kwhopper commented May 25, 2020

To sum up a little as you look at this...

We know that byte arrays will always be read and then sliced and diced as metadata is parsed. The goal here is to read these arrays in reasonable chunks from the source but assign a pointer (a ReaderInfo instance) that specifies location and size and use that until actual bytes are really needed. This instance can be used as-is to read bytes, or cloned into another pointer instance when a subsection is desired in another parsing class.

The behavior and output you see so far, then, is generally expected and part of the design to reuse existing byte arrays as much as possible. It's a single reader class as well that avoids almost all the byte copying done in the current implementation down the chain. IndexedSeekingReader doesn't allocate a lot of memory in this test but I don't think that's very representative of reality; most of the time many chunks of a file will be read during parsing and RAS expects that from the start.

@drewnoakes
Copy link
Owner

Thanks @kwhopper. I agree with your points and can mostly see the bigger picture when it comes to memory copying. Without implementing parallel versions of the library in terms of old/new it won't be easy to see a more realistic set of values. Still it's useful to ensure there are no major issues, and track changes over time.

I will pull this branch down and play with it to gain a better understanding.

@kwhopper
Copy link
Collaborator Author

Thanks @drewnoakes. I've hounded you plenty the last few years so it's enough that you're still open to the idea. Parallel libraries isn't out of the question entirely to benchmark. It's certainly not an overnight proposition, but I already did it once to demo the first attempt.

There are many potential advantages of a separate pointer object for reading that always knows its own local and global positions. Here are a few that drove this idea in the first place... and my persistence in lobbying for it:

  • ReaderInfo's made from an RAS "CreateReader" call could easily be tracked in that RAS instance. This gives you the structure of what was parsed and provides the most basic thing the library always needed to have any hope of writing metadata someday -- some map of the metadata bytes and a way to easily tell what was read and what was skipped.
  • The ReaderInfo class could be given "type" properties for special cases. For example, maybe it's a configuration setting in the future to not read (and buffer) embedded thumbnail bytes directly but only return the ReaderInfo pointer. The user (or even a special method in the code) could use that pointer to go back later and save the thumbnail bytes to disk.
  • NonSeekable streams work fine, although for now it will buffer the entire thing in memory. In the future, the bytes could be buffered to disk in some RAS implementation without changing ReaderInfo, and then the other things become possible and more efficient (saving a thumbnail from above, etc)
  • Large metadata objects like XML and compressed ICC are buffered into byte arrays twice in many cases. Similar to the second item above, small changes to RAS/ReaderInfo could make handling them configurable. Maybe it can say don't keep the parsed bytes in the buffer permanently and write directly to disk or just give me back the ReaderInfo pointer of these bytes and my custom code will read the bytes later.
  • A new RAS instance can be created at anytime, even during a parse. This could help with things like side-car XML files, or buffer larger parses into a local FileStream at some maximum List<byte[]> size and keep only the new RAS object (all handled internally).
  • If we get to a context object, "DefaultChunkLength" is easily override-able by the user
  • "WithShiftedBaseOffset" becomes obsolete as ReaderInfo always knows its global position. It felt like a moderate hack anyway to fix the loss of a global position due to secondary byte array usage.

I'm sure there are other things, but that's enough for now. Thanks

@kwhopper
Copy link
Collaborator Author

Without implementing parallel versions of the library in terms of old/new it won't be easy to see a more realistic set of values.

I'm happy to do this if it will help -- implement a test of the entire library using these classes. Most of the work was already done in the first version. I'll need to do fixes for any new additions or changes to the core library.

@drewnoakes
Copy link
Owner

I'm happy to do this if it will help

The issue from my side is free time. Where I live we have a total lock down which means I have to spend a large chunk of each day looking after my child while also trying to perform my job, and I am feeling burnt out. I have only occasional bandwidth for small simple tasks like small PR reviews and fixing simple bugs. This PR requires several hours of deep thought. Integrating it across the whole library, while a generous offer, will increase the time requirement.

I haven't forgotten about this, and I am hopeful we can get this in without many changes. I just need to take the time to think through the implications, and that requires finding the time. Once done we can plan out integrating the changes across the library.

@drewnoakes
Copy link
Owner

Thanks for fixing the merge conflicts @kwhopper.

kwhopper added 2 commits June 13, 2021 15:18
- Minor input change to BufferBoundsException
- Some initial test cases for Ras and ReaderInfo
…y done by the caller

- use GetByte instead of Read for Get* methods (similar to old readers)
- add the constant RandomAccessStream.UnknownLengthValue
- mark NonSeekableStream class as internal (for testing only)
- document the RandomAccessStream.Length property
- mark RandomAccessStream.ValidateIndex method as internal
- document the return value of the RandomAccessStream.ValidateIndex method
kwhopper added 8 commits June 13, 2021 15:18
…y report a Length

- Throw BufferBoundsException instead of IOException in ValidateIndex
- Remove "IsSequential" inputs to most RandomAccessStream classes
- fix isMotorolaByteOrder casing in RandomAccessStream
- rename overload of GetByte to GetByteNoValidation and remove validateIndex param
…rtial is true

- remove allowPartial param from ValidateRange
…k buffer

- add IRandomAccessStream interface
- keep dictionary version of RAS class
- remove nullable designator from RandomAccessStream.p_inputStream (optimization)
- convert long index values to int for benchmarks
- leave support for long Stream indexes in RandomAccessStream (and take the mild read hit)
- add RASBenchmark class
@kwhopper kwhopper force-pushed the feature/new-reader-v3 branch from f757004 to 03bcbc7 Compare June 13, 2021 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants