
RandomAccessStream #130

Closed
wants to merge 36 commits into from

Conversation

kwhopper
Collaborator

@kwhopper kwhopper commented May 2, 2018

This is a massive PR and I'll explain more about it soon. In a nutshell:

  • All sequential, byte, and indexed readers have been replaced by a single RandomAccessStream that handles all core reading operations. It uses a capturing-like structure to buffer data as requested.
  • ReaderInfo has replaced byte arrays for segment and related operations. This class is insulated from the underlying data through a RandomAccessStream. ReaderInfo objects know their own global position and describe their own length, byte order, and local position purely by index; few if any bytes are buffered out of the underlying RAS. Both sequential and indexed operations are supported.
  • Seekable (e.g. FileStream) and Non-seekable (e.g. HttpClient) streams are supported.
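The capture-style buffering described above might be sketched roughly as follows. This is a simplified illustration only, not the PR's actual RandomAccessStream; the class and member names (ChunkedBuffer, GetByte) are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of capture-style buffering: data is pulled from the
// underlying stream in fixed-size chunks on demand, so any earlier index can
// be re-read without seeking. Not the PR's actual RandomAccessStream.
public sealed class ChunkedBuffer
{
    private const int ChunkSize = 4096;
    private readonly Stream _stream;
    private readonly List<byte[]> _chunks = new List<byte[]>();
    private long _buffered; // count of bytes captured so far

    public ChunkedBuffer(Stream stream) => _stream = stream;

    public byte GetByte(long index)
    {
        // Capture forward-only data until the requested index is covered.
        while (_buffered <= index)
        {
            var chunk = new byte[ChunkSize];
            int filled = 0;
            while (filled < ChunkSize)
            {
                int read = _stream.Read(chunk, filled, ChunkSize - filled);
                if (read == 0) break; // end of stream
                filled += read;
            }
            if (filled == 0)
                throw new EndOfStreamException();
            _chunks.Add(chunk);
            _buffered += filled;
        }
        return _chunks[(int)(index / ChunkSize)][index % ChunkSize];
    }
}
```

A reader view can then resolve any global index through one shared buffer instead of copying bytes into its own array.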

Related to:
#36
#91
#122
#125
#62
#35

kwhopper added 4 commits May 1, 2018 20:48
- use IsCloserToEnd for possible yield in JpegSegmentReader
- Include a 'shifted base offset' test
@kwhopper
Collaborator Author

kwhopper commented May 2, 2018

ToDo: ICC data spread across multiple JPEG segments is not currently implemented. It could be done with a concatenated array, but I'll attempt to do it with ReaderInfo.

*Update - went with the original concatenation method for now.

@kwhopper
Collaborator Author

kwhopper commented May 2, 2018

Non-seekable example:

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using MetadataExtractor;

const string url = "https://github.com/drewnoakes/metadata-extractor-images/raw/master/png/mspaint-8x10.png";

static HttpClient client = new HttpClient();

static async Task RunAsync()
{
    using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
    {
        // The HTTP content stream is non-seekable; it is buffered as needed.
        var directories = ImageMetadataReader.ReadMetadata(streamToReadFrom);
        // display directory contents however you want
    }
}

// call it somewhere in your code:
RunAsync().GetAwaiter().GetResult();

kwhopper added 10 commits May 3, 2018 09:18
… segments

TODO: try to make it work with just ReaderInfo, although that's complicated
- add TotalBytesRead property in RandomAccessStream to track actual bytes Read
- add ImageMetadataReader overload to take a RandomAccessStream
- add null check for tag Description in TextFileOutputHandler
- fix PsdReader clone offset for Photoshop data
@kwhopper kwhopper requested a review from drewnoakes May 4, 2018 15:05
@kwhopper
Collaborator Author

kwhopper commented May 4, 2018

This code was run on the current metadata-extractor-images and no anomalies of large consequence were seen. Some new metadata was extracted that wasn't extracted previously.

@kwhopper
Collaborator Author

kwhopper commented Sep 8, 2018

Last commit is a port from Java of drewnoakes/metadata-extractor#250

@drewnoakes
Owner

@kwhopper I'll be sure to look at this properly asap. It's an impressive chunk of work and I want to understand it properly.

@kwhopper
Collaborator Author

Ok, great. I'll apologize up front for the number of changes in a single PR. As it got rolling, there was just no avoiding it. Feel free to ask or comment about anything. There's certainly a chance I've missed obvious implementation issues.

@drewnoakes
Owner

The changes look good, and I know what kind of a roll one can get into when making changes like this. Let's work out how to move this forward.

I first need to convince myself that the design works in all scenarios we might expect it to. So please give me a bit of time.

Have you read anything about System.IO.Pipelines? As with #136 I'm not sure about moving to spans just yet, but in future it could be wonderful for a library like this.

I'd like to give the user-facing API an overhaul to address some of the classes of feedback we receive. It might make sense to do that and release 3.0 before increasing the minimum version requirements and using span in 4.0.

@kwhopper
Collaborator Author

I have read a bit on Pipelines, but never implemented anything. It does look interesting though. A big motivation for squishing the reader classes down in this PR is how much easier it is to change the hidden machinery. Even if Pipelines proper isn't a good fit for this library, some of the new BCL memory classes it uses could have a benefit.

I agree on the API and am happy to help on it.

@kwhopper
Collaborator Author

If it helps move this (or something like it) along - would you rather make this PR into a new version of the library? That way the current code can still receive updates. There are a number of older issues that can't be implemented in the current code and many of them are significant enhancements (thumb bytes to disk, a context object with settings, etc). This PR feels like the best chance at the moment for some of those to materialize.

There's also a Java port - drewnoakes/metadata-extractor#361 - which brings both libraries closer to parity. Thanks!

@drewnoakes
Owner

@kwhopper I'm sorry for being so slow on this. I will look at it asap. From the description it sounds great. I just want to think through any potential problems before merging.

@drewnoakes
Owner

I've spent some time this morning going through the huge amount of work you've done here.

I'm finding the loss of support for non-seekable streams to be an unsettling regression. This PR will improve perf for some users, but completely break things for other users. As we move towards a more connected, cloud-oriented world, network streams become more and more commonplace and important.

Initially it seems you were supporting seeking, but removed that support more recently. Can you elaborate on why you decided to do so?

I'd like to understand whether we can keep these perf improvements and API simplifications without the capability regression. In cases where we can seek, it seems like it could be a good optimisation (though whether we'd see gains likely depends upon the cost of seeking on the specific stream, and the file type's data organisation).

The two main restrictions imposed by streams that impact reading of metadata from various file formats are:

  1. Not always being able to seek forwards/backwards
  2. Not always knowing the length of the stream

The only workaround I know in both cases is to just keep reading from the stream, buffering what is read just in case we need it again later.

For seeking, there may be some cases where the reading code knows that what's being skipped is completely unneeded (and therefore needn't be buffered), such as when skipping over certain JPEG segments or other blocks of known image data. We could expose this "skip and forget" pattern within our API. I'm not sure this works for TIFF data, such as most camera RAW files, unfortunately. It might work for some video formats.
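The "skip and forget" pattern described above could look roughly like this; SkipAndForget is a hypothetical helper shown only to illustrate the idea, not part of the library's API:

```csharp
using System;
using System.IO;

static class StreamSkipper
{
    // Hypothetical "skip and forget" helper: when the caller knows a region
    // will never be needed again, a seekable stream can jump over it, while a
    // non-seekable stream is drained into a scratch buffer that is discarded.
    public static void SkipAndForget(Stream stream, long count)
    {
        if (stream.CanSeek)
        {
            stream.Seek(count, SeekOrigin.Current);
            return;
        }
        var scratch = new byte[4096];
        while (count > 0)
        {
            int read = stream.Read(scratch, 0, (int)Math.Min(scratch.Length, count));
            if (read == 0)
                throw new EndOfStreamException();
            count -= read;
        }
    }
}
```

Either way the skipped bytes are never retained, which is exactly what makes the pattern unsafe for formats like TIFF where earlier offsets may be revisited.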

The length restriction is particularly nasty because, if you're working with a very large file, the metadata is usually towards the beginning of the file. Buffering all the data is pretty wasteful. Wherever possible we've tried to avoid needing to know the stream's length. It is mostly useful for validation. We regularly see invalid data that leads to out-of-bounds offsets. If we're reading, we get OOM exceptions when we try to allocate enormous arrays. If we're seeking, we end up buffering the whole file before working out that anything was wrong. This is another problem without a good workaround.

These restrictions existed when the Java version was written and influenced its design. That design was ported over to .NET and the IO types changed in a few places to reflect the (numerous) changes between Java and .NET stream classes. The .NET version can be better refined, as this PR shows.

I'll give this some more thought over the next days and play around with the code a little more. I'd be interested to hear your thoughts.


There are several commits that should be pulled out into separate PRs for isolated review/merge. For example, 05dcc2e, 98f15c0, 6ff9026, 60fedc4, 41291c4, 7188c65, e4b1732, 5c4ad9a, 2da922e, 3be1001, b23ab5a, cdd987c


@kwhopper
Collaborator Author

kwhopper commented Nov 11, 2018

Initially it seems you were supporting seeking, but removed that support more recently. Can you elaborate on why you decided to do so?

Use of a seekable stream is still supported. Note the uses of CanSeek in this method:
https://github.com/kwhopper/metadata-extractor-dotnet/blob/c8e990aad2924f35c53d30e855b694faebe952b7/MetadataExtractor/IO/RandomAccessStream.cs#L475

... and this line in particular:

https://github.com/kwhopper/metadata-extractor-dotnet/blob/c8e990aad2924f35c53d30e855b694faebe952b7/MetadataExtractor/IO/RandomAccessStream.cs#L519

What gave you the impression that it had been removed? Maybe I didn't comment something very well. Seek is definitely used to skip over unneeded areas when the stream reports support for it. There are two main goals for the RandomAccessStream class: 1) Do all low-level reading and buffering in one place; and 2) Support both seekable and non-seekable streams in one place.

The reason I added a semi-required Length is for non-seekable streams. There have been various attempts in both libraries over the years to parse with an unknown length. None have really had any lasting success; it's a very difficult problem and tends to be all or nothing. My attempts to check for the end without a length were devolving into a mess that wasn't going to work. Ultimately, it would try to keep reading buffers forever and then fail entirely. As you have noted:

It is mostly useful for validation. We regularly see invalid data that leads to out-of-bounds offsets. If we're reading, we get OOM exceptions when we try to allocate enormous arrays. If we're seeking, we end up buffering the whole file before working out that anything was wrong. This is another problem without a good workaround.

Note too that the length passed can be fake, such as int.MaxValue, as its only real purpose is to give the code some boundary to use. This greatly simplifies the bounds checking for a non-seekable stream and stabilized the code. The trade-off is passing something "reasonable."

No matter what, for non-seekable streams we must buffer either in memory or on disk, since something like a NetworkStream is forward-only. However, note that if the requested index is valid, the non-seekable stream will only buffer as much as is required and not the entire Length of the stream.
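The nominal-length idea can be illustrated like this; ValidateIndex and Demo are hypothetical stand-ins for the bounds checking described, not the PR's actual code:

```csharp
using System;
using System.IO;

static class BoundsSketch
{
    // Hypothetical bounds check: with a nominal length in place, even a
    // non-seekable stream gives out-of-range requests a ceiling to fail
    // against, instead of buffering forever looking for the end.
    public static void ValidateIndex(long index, long bytesRequested, long length)
    {
        if (index < 0 || bytesRequested < 0 || index + bytesRequested > length)
            throw new IOException(
                $"Cannot read {bytesRequested} bytes at index {index} (length {length})");
    }

    public static void Demo()
    {
        Stream stream = new MemoryStream(new byte[64]);

        // A seekable stream reports its true length; a forward-only stream
        // would get the int.MaxValue sentinel as its boundary instead.
        long nominalLength = stream.CanSeek ? stream.Length : int.MaxValue;
        ValidateIndex(0, 16, nominalLength);  // fine
        ValidateIndex(60, 16, nominalLength); // throws: past the end
    }
}
```

With int.MaxValue as the sentinel, genuinely invalid offsets (e.g. negative, or absurdly large) still fail fast, while valid reads only buffer the bytes they actually touch.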

I'm out of town at the moment and will comment more on this later -- unless that explains it. But definitely keep sending questions or comments. I could come up with some ways of testing to verify how buffering works in this PR if that would help. A more direct form of communication would be much better to review this, but we have what we have.

@drewnoakes
Owner

Thanks for the fast response. It's the end of the day here, so just a few quick thoughts in response.

What gave you the impression that it had been removed?

I fired up an HTTP request against an image and passed the stream to ImageMetadataReader.Read and it threw this exception:

https://github.com/drewnoakes/metadata-extractor-dotnet/pull/130/files#diff-8eefcf5a9c8bcffb02d354ccb8f87ba9R62

Note too that the length passed can be fake, such as Int.MaxValue

I hadn't considered that when reading the code. I'll give it another look with that in mind. However it should work out of the box. Pass a stream, get metadata. Users who demand better performance will be motivated to understand what they can do to achieve that. Users who just want metadata will curse and go looking for another library if the most simple API doesn't work for them. Maybe we just use int.MaxValue as the default.

Will take another look soon and keep the dialogue flowing.

@kwhopper
Collaborator Author

I'll go ahead and change it to int.MaxValue if a length isn't provided and seeking isn't possible. That seems reasonable and is probably the best compromise. As you may have said, there is no perfect solution for a non-seekable stream (that I know of). Weren't there other issues or PRs that tried to deal with this directly? I'm certain there were but can't find them at the moment.

I also wanted to say - RandomAccessStream is solely a buffering layer, and the intent is to have either a single class or interface that does it. I'm not against sub-classing it if there's a need. Right now, the current code seems to work in a single class.

The real value in this is the ReaderInfo class. It tracks data requests using only index values and leaves RAS to handle all the heavy buffering work. ReaderInfo makes it possible to avoid using so many byte arrays and some redundant array copying; ReaderInfo instances can have overlapping buffer index values but will read from the same buffered bytes instead of doubling up. Since each one tracks its global position, new tasks are possible down the road. RAS was just a means to this end.
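The index-only view idea might be pictured like this (hypothetical names; a plain byte array stands in for the shared buffering layer, and this is a simplification rather than the actual ReaderInfo class):

```csharp
// Hypothetical sketch of an index-only reader view: each view is just a
// (start, length, local position) triple over one shared source, so clones
// with overlapping ranges read the same bytes rather than copying them.
public sealed class ReaderView
{
    private readonly byte[] _shared; // stands in for the shared buffer layer
    private readonly long _start;    // global offset of this view

    public long Length { get; }
    public long LocalPosition { get; private set; }

    public ReaderView(byte[] shared, long start, long length)
    {
        _shared = shared;
        _start = start;
        Length = length;
    }

    // Indexed access, resolved against the shared source.
    public byte GetByte(long localIndex) => _shared[_start + localIndex];

    // Sequential access, tracked purely as a local index.
    public byte GetNextByte() => GetByte(LocalPosition++);

    // A clone shares the same underlying bytes; no array copy is made.
    public ReaderView Clone(long localOffset, long length)
        => new ReaderView(_shared, _start + localOffset, length);
}
```

Because every view resolves to a global position, overlapping segment readers cost only a few longs each rather than a copied array apiece.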

@kwhopper
Collaborator Author

kwhopper commented Mar 3, 2020

@drewnoakes Editing metadata is a hot topic. I'm convinced the reader structure has to be overhauled before editing is even feasible. I'd like to take another shot at this PR if you have interest in reviewing it.

The RandomAccessStream classes wouldn't have to be used throughout for a good test. I'm thinking it's easier to copy/paste those classes into another PR, make some unit tests to demo what it offers, and leave the rest of the lib alone. Thanks

@drewnoakes drewnoakes deleted the branch drewnoakes:master January 29, 2024 00:39
@drewnoakes drewnoakes closed this Jan 29, 2024