-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RandomAccessStream #130
RandomAccessStream #130
Conversation
… non-seekable data in the same way
- use IsCloserToEnd for possible yield in JpegSegmentReader - Include a 'shifted base offset' test
…Exception on AppVeyor for some reason
ToDo: ICC data spread across multiple JPEG segments is not currently implemented. It could be done with a concatenated array but will attempt to do it with ReaderInfo *Update - went with the original concatenation method for now. |
Non-seekable example: const string url = "https://github.com/drewnoakes/metadata-extractor-images/raw/master/png/mspaint-8x10.png";
static HttpClient client = new HttpClient();
static async Task RunAsync()
{
using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
{
var directories = ImageMetadataReader.ReadMetadata(streamToReadFrom);
// display directory contents however you want
}
}
// call it somewhere in your code:
RunAsync().GetAwaiter().GetResult(); |
… segments TODO: try to make it work with just ReaderInfo, although that's complicated
- add TotalBytesRead property in RandomAccessStream to track actual bytes Read - add ImageMetdataReader overload to take a RandomAccessStream - add null check for tag Description in TextFileOutputHandler
- fix PsdReader clone offset for Photoshop data
- start some documentation
This code was run on the current metadata-extractor-images and no anomalies of large consequence were seen. Some new metadata was extracted that wasn't previously. |
…op (ported from Java)
…he possibility of extending to the full set of bad characters as needed.
Last commit is a port from Java of drewnoakes/metadata-extractor#250 |
@kwhopper I'll be sure to look at this properly asap. It's an impressive chunk of work and I want to understand it properly. |
Ok, great. I'll apologize up front for the number of changes in a single PR. As it got rolling, there was just no avoiding it. Feel free to ask or comment about anything. There's certainly a chance I've missed obvious implementation issues. |
The changes look good, and I know what kind of a roll one can get into when making changes like this. Let's work out how to move this forward. I first need to convince myself that the design works in all scenarios we might expect it to. So please give me a bit of time. Have you read anything about I'd like to give the user-facing API an overhaul to address some of the classes of feedback we receive. It might make sense to do that and release 3.0 before increasing the minimum version requirements and using span in 4.0. |
I have read a bit on Pipelines, but never implemented anything. It does look interesting though. A big motivation for squishing the reader classes down in this PR is how much easier it is to change the hidden machinery. Even if Pipelines proper isn't a good fit for this library, some of the new BCL memory classes it uses could have a benefit. I agree on the API and am happy to help on it. |
…tput in edge cases
If it helps move this (or something like it) along - would you rather make this PR into a new version of the library? That way the current code can still receive updates. There are a number of older issues that can't be implemented in the current code and many of them are significant enhancements (thumb bytes to disk, a context object with settings, etc). This PR feels like the best chance at the moment for some of those to materialize. There's also a Java port - drewnoakes/metadata-extractor#361 - which brings both libraries closer to parity. Thanks! |
@kwhopper I'm sorry for being so slow on this. I will look at it asap. From the description it sounds great. I just want to think through any potential problems before merging. |
I've spent some time this morning going through the huge amount of work you've done here. I'm finding the loss of support for non-seekable streams to be an unsettling regression. This PR will improve perf for some users, but completely break other users. As we move towards a more connected, cloud-oriented world, network streams become more and more commonplace and important. Initially it seems you were supporting seeking, but removed that support more recently. Can you elaborate on why you decided to do so? I'd like to understand whether these perf improvements and API simplifications without the capability regression. In cases where we can seek, it seems like it could be a good optimisation (though whether we'd see gains likely depends upon the cost of seeking on the specific stream, and the file type's data organisation). The two main restrictions imposed by streams that impact reading of metadata from various file formats are:
The only workaround I know in both cases is to just keep reading from the stream, buffering what is read just in case we need it again later. For seeking, there may be some cases where the reading code knows that what's being skipped is completely unneeded (and therefore needn't be buffered), such as when skipping over certain JPEG segments or other blocks of known image data. We could expose this "skip and forget" pattern within our API. I'm not sure this works for TIFF data, such as most camera RAW files, unfortunately. It might work for some video formats. The length restriction is particularly nasty as if you're working with a very large file, the metadata is usually towards the beginning of the file. Buffering all the data is pretty wasteful. Wherever possible we've tried to avoid needing to know the stream's length. It is mostly useful for validation. We regularly see invalid data that leads to out-of-bounds offsets. If we're reading, we get OOM exceptions when we try to allocate enormous arrays. If we're seeking, we end up buffering the whole file before working out that anything was wrong. This is another problem without a good workaround. These restrictions existed when the Java version was written and influenced its design. That design was ported over to .NET and the IO types changed in a few places to reflect the (numerous) changes between Java and .NET stream classes. The .NET version can be better refined, as this PR shows. I'll give this some more thought over the next days and play around with the code a little more. I'd be interested to hear your thoughts. There are several commits that should be pulled out into separate PRs for isolated review/merge. For example, 05dcc2e, 98f15c0, 6ff9026, 60fedc4, 41291c4, 7188c65, e4b1732, 5c4ad9a, 2da922e, 3be1001, b23ab5a, cdd987c |
This comment was marked as off-topic.
This comment was marked as off-topic.
Use of a seekable stream is still supported. Note the uses of CanSeek in this method: ... and this line in particular: What gave you the impression that it had been removed? Maybe I didn't comment something very well. Seek is definitely used to skip over unneeded areas when the stream reports support for it. There are two main goals for the RandomAccessStream class: 1) Do all low-level reading and buffering in one place; and 2) Support both seekable and non-seekable streams in one place. The reason I added a semi-required Length is for non-seekable streams. There have been various attempts in both libraries over the years to parse with an unknown length. None have really had any lasting success; it's a very difficult problem and tends to be all or nothing. My attempts to check for the end without a length were devolving into a mess that wasn't going to work. Ultimately, it would try to keep reading buffers forever and then fail entirely. As you have noted:
Note too that the length passed can be fake, such as Int.MaxValue, as it's only real purpose is to give the code some boundary to use. This greatly simplifies the bounds checking for a non-seekable stream and stabilized the code. The trade-off is passing something "reasonable." No matter what for non-seekable streams, we must buffer either in memory or on disk since something like a NetworkStream is forward-only. However, note that if the index requested is valid the non-seekable stream will only buffer as much as is required and not the entire Length of the stream. I'm out of town at the moment and will comment more on this later -- unless that explains it. But definitely keep sending questions or comments. I could come up with some ways of testing to verify how buffering works in this PR if that would help. A more direct form of communication would be much better to review this, but we have what we have. |
Thanks for the fast response. It's the end of the day here, so just a few quick thoughts in response.
I fired up an HTTP request against an image and passed the stream to
I hadn't considered that when reading the code. I'll give it another look with that in mind. However it should work out of the box. Pass a stream, get metadata. Users who demand better performance will be motivated to understand what they can do to achieve that. Users who just want metadata will curse and go looking for another library if the most simple API doesn't work for them. Maybe we just use int.MaxValue as the default. Will take another look soon and keep the dialogue flowing. |
I'll go ahead and change it to int.MaxValue if a length isn't provided and seeking isn't possible. That seems reasonable and is probably the best compromise. As you may have said, there is no perfect solution for a non-seekable stream (that I know of). Weren't there other issues or PR's that tried to deal with this directly? I'm certain there were but can't find them at the moment. I also wanted to say - RandomAccessStream is solely a buffering layer, and the intent is to have either a single class or interface that does it. I'm not against sub-classing it if there's a need. Right now, the current code seems to work in a single class. The real value in this is the ReaderInfo class. It tracks the data requests using only index values and leaves RAS to handle all the heavy buffering work. ReaderInfo makes it possible to avoid using so many byte arrays and some redundant array copying; ReaderInfo's can have overlapping buffer index values but will read from the same buffered bytes instead of doubling-up. Since it tracks global position new tasks are possible down the road. RAS was just a means to this end. |
@drewnoakes Editing metadata is a hot topic. I'm convinced the reader structure has to be overhauled before editing is even feasible. I'd like to take another shot at this PR if you have interest in reviewing it. The RandomAccessStream classes wouldn't have to be used throughout for a good test. I'm thinking it's easier to copy/paste those classes into another PR, make some unit tests to demo what it offers, and leave the rest of the lib alone. Thanks |
This is a massive PR and I'll explain more about it soon. In a nutshell:
Related to:
#36
#91
#122
#125
#62
#35