
RandomAccessStream #130

Closed
wants to merge 36 commits into from

Conversation

kwhopper
Collaborator

@kwhopper kwhopper commented May 2, 2018

This is a massive PR and I'll explain more about it soon. In a nutshell:

  • All sequential, byte, and indexed readers have been replaced by a single RandomAccessStream that handles all core reading operations. It uses a capturing-like structure to buffer data as requested.
  • ReaderInfo has replaced byte arrays for segment and related operations. This class is insulated from the underlying data through a RandomAccessStream. ReaderInfo objects know their own global position and describe their own length, byte order, and local position purely by index; few if any bytes are buffered out of the underlying RAS. Both sequential and indexed operations are supported.
  • Seekable (e.g. FileStream) and Non-seekable (e.g. HttpClient) streams are supported.
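The capture-style buffering described above might be sketched roughly as follows. This is a simplified illustration only, not the PR's actual RandomAccessStream; the class and member names (ChunkedBuffer, GetByte) are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of capture-style buffering: data is pulled from the
// underlying stream in fixed-size chunks on demand, so any earlier index can
// be re-read without seeking. Not the PR's actual RandomAccessStream.
public sealed class ChunkedBuffer
{
    private const int ChunkSize = 4096;
    private readonly Stream _stream;
    private readonly List<byte[]> _chunks = new List<byte[]>();
    private long _buffered; // count of bytes captured so far

    public ChunkedBuffer(Stream stream) => _stream = stream;

    public byte GetByte(long index)
    {
        // Capture forward-only data until the requested index is covered.
        while (_buffered <= index)
        {
            var chunk = new byte[ChunkSize];
            int filled = 0;
            while (filled < ChunkSize)
            {
                int read = _stream.Read(chunk, filled, ChunkSize - filled);
                if (read == 0) break; // end of stream
                filled += read;
            }
            if (filled == 0)
                throw new EndOfStreamException();
            _chunks.Add(chunk);
            _buffered += filled;
        }
        return _chunks[(int)(index / ChunkSize)][index % ChunkSize];
    }
}
```

A reader view can then resolve any global index through one shared buffer instead of copying bytes into its own array.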

Related to:
#36
#91
#122
#125
#62
#35

kwhopper added 4 commits May 1, 2018 20:48
- use IsCloserToEnd for possible yield in JpegSegmentReader
- Include a 'shifted base offset' test
@kwhopper
Collaborator Author

kwhopper commented May 2, 2018

ToDo: ICC data spread across multiple JPEG segments is not currently implemented. It could be done with a concatenated array, but I'll attempt to do it with ReaderInfo.

*Update - went with the original concatenation method for now.

@kwhopper
Collaborator Author

kwhopper commented May 2, 2018

Non-seekable example:

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using MetadataExtractor;

const string url = "https://github.com/drewnoakes/metadata-extractor-images/raw/master/png/mspaint-8x10.png";

static HttpClient client = new HttpClient();

static async Task RunAsync()
{
    using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
    {
        // The HTTP content stream is non-seekable; it is buffered as needed.
        var directories = ImageMetadataReader.ReadMetadata(streamToReadFrom);
        // display directory contents however you want
    }
}

// call it somewhere in your code:
RunAsync().GetAwaiter().GetResult();

kwhopper added 10 commits May 3, 2018 09:18
… segments

TODO: try to make it work with just ReaderInfo, although that's complicated
- add TotalBytesRead property in RandomAccessStream to track actual bytes Read
- add ImageMetadataReader overload to take a RandomAccessStream
- add null check for tag Description in TextFileOutputHandler
- fix PsdReader clone offset for Photoshop data
@kwhopper kwhopper requested a review from drewnoakes May 4, 2018 15:05
@kwhopper
Collaborator Author

kwhopper commented May 4, 2018

This code was run on the current metadata-extractor-images and no anomalies of large consequence were seen. Some new metadata was extracted that wasn't extracted previously.

@kwhopper
Collaborator Author

kwhopper commented Sep 8, 2018

Last commit is a port from Java of drewnoakes/metadata-extractor#250

@drewnoakes
Owner

@kwhopper I'll be sure to look at this properly asap. It's an impressive chunk of work and I want to understand it properly.

@kwhopper
Collaborator Author

Ok, great. I'll apologize up front for the number of changes in a single PR. As it got rolling, there was just no avoiding it. Feel free to ask or comment about anything. There's certainly a chance I've missed obvious implementation issues.

@drewnoakes
Owner

The changes look good, and I know what kind of a roll one can get into when making changes like this. Let's work out how to move this forward.

I first need to convince myself that the design works in all scenarios we might expect it to. So please give me a bit of time.

Have you read anything about System.IO.Pipelines? As with #136 I'm not sure about moving to spans just yet, but in future it could be wonderful for a library like this.

I'd like to give the user-facing API an overhaul to address some of the classes of feedback we receive. It might make sense to do that and release 3.0 before increasing the minimum version requirements and using span in 4.0.

@kwhopper
Collaborator Author

I have read a bit on Pipelines, but never implemented anything. It does look interesting though. A big motivation for squishing the reader classes down in this PR is how much easier it is to change the hidden machinery. Even if Pipelines proper isn't a good fit for this library, some of the new BCL memory classes it uses could have a benefit.

I agree on the API and am happy to help on it.

@kwhopper
Collaborator Author

If it helps move this (or something like it) along - would you rather make this PR into a new version of the library? That way the current code can still receive updates. There are a number of older issues that can't be implemented in the current code and many of them are significant enhancements (thumb bytes to disk, a context object with settings, etc). This PR feels like the best chance at the moment for some of those to materialize.

There's also a Java port - drewnoakes/metadata-extractor#361 - which brings both libraries closer to parity. Thanks!

@drewnoakes
Owner

@kwhopper I'm sorry for being so slow on this. I will look at it asap. From the description it sounds great. I just want to think through any potential problems before merging.

@drewnoakes
Owner

I've spent some time this morning going through the huge amount of work you've done here.

I'm finding the loss of support for non-seekable streams to be an unsettling regression. This PR will improve perf for some users, but completely break things for other users. As we move towards a more connected, cloud-oriented world, network streams become more and more commonplace and important.

Initially it seems you were supporting seeking, but removed that support more recently. Can you elaborate on why you decided to do so?

I'd like to understand whether we can keep these perf improvements and API simplifications without the capability regression. In cases where we can seek, it seems like it could be a good optimisation (though whether we'd see gains likely depends upon the cost of seeking on the specific stream, and the file type's data organisation).

The two main restrictions imposed by streams that impact reading of metadata from various file formats are:

  1. Not always being able to seek forwards/backwards
  2. Not always knowing the length of the stream

The only workaround I know in both cases is to just keep reading from the stream, buffering what is read just in case we need it again later.

For seeking, there may be some cases where the reading code knows that what's being skipped is completely unneeded (and therefore needn't be buffered), such as when skipping over certain JPEG segments or other blocks of known image data. We could expose this "skip and forget" pattern within our API. I'm not sure this works for TIFF data, such as most camera RAW files, unfortunately. It might work for some video formats.
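The "skip and forget" pattern described above could look roughly like this; SkipAndForget is a hypothetical helper shown only to illustrate the idea, not part of the library's API:

```csharp
using System;
using System.IO;

static class StreamSkipper
{
    // Hypothetical "skip and forget" helper: when the caller knows a region
    // will never be needed again, a seekable stream can jump over it, while a
    // non-seekable stream is drained into a scratch buffer that is discarded.
    public static void SkipAndForget(Stream stream, long count)
    {
        if (stream.CanSeek)
        {
            stream.Seek(count, SeekOrigin.Current);
            return;
        }
        var scratch = new byte[4096];
        while (count > 0)
        {
            int read = stream.Read(scratch, 0, (int)Math.Min(scratch.Length, count));
            if (read == 0)
                throw new EndOfStreamException();
            count -= read;
        }
    }
}
```

Either way the skipped bytes are never retained, which is exactly what makes the pattern unsafe for formats like TIFF where earlier offsets may be revisited.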

The length restriction is particularly nasty because, if you're working with a very large file, the metadata is usually towards the beginning of the file. Buffering all the data is pretty wasteful. Wherever possible we've tried to avoid needing to know the stream's length. It is mostly useful for validation. We regularly see invalid data that leads to out-of-bounds offsets. If we're reading, we get OOM exceptions when we try to allocate enormous arrays. If we're seeking, we end up buffering the whole file before working out that anything was wrong. This is another problem without a good workaround.

These restrictions existed when the Java version was written and influenced its design. That design was ported over to .NET and the IO types changed in a few places to reflect the (numerous) changes between Java and .NET stream classes. The .NET version can be better refined, as this PR shows.

I'll give this some more thought over the next days and play around with the code a little more. I'd be interested to hear your thoughts.


There are several commits that should be pulled out into separate PRs for isolated review/merge. For example, 05dcc2e, 98f15c0, 6ff9026, 60fedc4, 41291c4, 7188c65, e4b1732, 5c4ad9a, 2da922e, 3be1001, b23ab5a, cdd987c


@kwhopper
Collaborator Author

kwhopper commented Nov 11, 2018

Initially it seems you were supporting seeking, but removed that support more recently. Can you elaborate on why you decided to do so?

Use of a seekable stream is still supported. Note the uses of CanSeek in this method:
https://github.com/kwhopper/metadata-extractor-dotnet/blob/c8e990aad2924f35c53d30e855b694faebe952b7/MetadataExtractor/IO/RandomAccessStream.cs#L475

... and this line in particular:

https://github.com/kwhopper/metadata-extractor-dotnet/blob/c8e990aad2924f35c53d30e855b694faebe952b7/MetadataExtractor/IO/RandomAccessStream.cs#L519

What gave you the impression that it had been removed? Maybe I didn't comment something very well. Seek is definitely used to skip over unneeded areas when the stream reports support for it. There are two main goals for the RandomAccessStream class: 1) Do all low-level reading and buffering in one place; and 2) Support both seekable and non-seekable streams in one place.

The reason I added a semi-required Length is for non-seekable streams. There have been various attempts in both libraries over the years to parse with an unknown length. None have really had any lasting success; it's a very difficult problem and tends to be all or nothing. My attempts to check for the end without a length were devolving into a mess that wasn't going to work. Ultimately, it would try to keep reading buffers forever and then fail entirely. As you have noted:

It is mostly useful for validation. We regularly see invalid data that leads to out-of-bounds offsets. If we're reading, we get OOM exceptions when we try to allocate enormous arrays. If we're seeking, we end up buffering the whole file before working out that anything was wrong. This is another problem without a good workaround.

Note too that the length passed can be fake, such as int.MaxValue, as its only real purpose is to give the code some boundary to use. This greatly simplifies the bounds checking for a non-seekable stream and stabilized the code. The trade-off is passing something "reasonable."

No matter what, for non-seekable streams we must buffer either in memory or on disk, since something like a NetworkStream is forward-only. However, note that if the requested index is valid, the non-seekable stream will only buffer as much as is required and not the entire Length of the stream.
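The nominal-length idea can be illustrated like this; ValidateIndex and Demo are hypothetical stand-ins for the bounds checking described, not the PR's actual code:

```csharp
using System;
using System.IO;

static class BoundsSketch
{
    // Hypothetical bounds check: with a nominal length in place, even a
    // non-seekable stream gives out-of-range requests a ceiling to fail
    // against, instead of buffering forever looking for the end.
    public static void ValidateIndex(long index, long bytesRequested, long length)
    {
        if (index < 0 || bytesRequested < 0 || index + bytesRequested > length)
            throw new IOException(
                $"Cannot read {bytesRequested} bytes at index {index} (length {length})");
    }

    public static void Demo()
    {
        Stream stream = new MemoryStream(new byte[64]);

        // A seekable stream reports its true length; a forward-only stream
        // would get the int.MaxValue sentinel as its boundary instead.
        long nominalLength = stream.CanSeek ? stream.Length : int.MaxValue;
        ValidateIndex(0, 16, nominalLength);  // fine
        ValidateIndex(60, 16, nominalLength); // throws: past the end
    }
}
```

With int.MaxValue as the sentinel, genuinely invalid offsets (e.g. negative, or absurdly large) still fail fast, while valid reads only buffer the bytes they actually touch.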

I'm out of town at the moment and will comment more on this later -- unless that explains it. But definitely keep sending questions or comments. I could come up with some ways of testing to verify how buffering works in this PR if that would help. A more direct form of communication would be much better to review this, but we have what we have.

@drewnoakes
Owner

Thanks for the fast response. It's the end of the day here, so just a few quick thoughts in response.

What gave you the impression that it had been removed?

I fired up an HTTP request against an image and passed the stream to ImageMetadataReader.Read and it threw this exception:

https://github.com/drewnoakes/metadata-extractor-dotnet/pull/130/files#diff-8eefcf5a9c8bcffb02d354ccb8f87ba9R62

Note too that the length passed can be fake, such as Int.MaxValue

I hadn't considered that when reading the code. I'll give it another look with that in mind. However it should work out of the box. Pass a stream, get metadata. Users who demand better performance will be motivated to understand what they can do to achieve that. Users who just want metadata will curse and go looking for another library if the most simple API doesn't work for them. Maybe we just use int.MaxValue as the default.

Will take another look soon and keep the dialogue flowing.

@kwhopper
Collaborator Author

I'll go ahead and change it to int.MaxValue if a length isn't provided and seeking isn't possible. That seems reasonable and is probably the best compromise. As you may have said, there is no perfect solution for a non-seekable stream (that I know of). Weren't there other issues or PRs that tried to deal with this directly? I'm certain there were but can't find them at the moment.

I also wanted to say - RandomAccessStream is solely a buffering layer, and the intent is to have either a single class or interface that does it. I'm not against sub-classing it if there's a need. Right now, the current code seems to work in a single class.

The real value in this is the ReaderInfo class. It tracks data requests using only index values and leaves RAS to handle all the heavy buffering work. ReaderInfo makes it possible to avoid using so many byte arrays and some redundant array copying; ReaderInfo instances can have overlapping buffer index values but will read from the same buffered bytes instead of doubling up. Since each one tracks its global position, new tasks are possible down the road. RAS was just a means to this end.
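The index-only view idea might be pictured like this (hypothetical names; a plain byte array stands in for the shared buffering layer, and this is a simplification rather than the actual ReaderInfo class):

```csharp
// Hypothetical sketch of an index-only reader view: each view is just a
// (start, length, local position) triple over one shared source, so clones
// with overlapping ranges read the same bytes rather than copying them.
public sealed class ReaderView
{
    private readonly byte[] _shared; // stands in for the shared buffer layer
    private readonly long _start;    // global offset of this view

    public long Length { get; }
    public long LocalPosition { get; private set; }

    public ReaderView(byte[] shared, long start, long length)
    {
        _shared = shared;
        _start = start;
        Length = length;
    }

    // Indexed access, resolved against the shared source.
    public byte GetByte(long localIndex) => _shared[_start + localIndex];

    // Sequential access, tracked purely as a local index.
    public byte GetNextByte() => GetByte(LocalPosition++);

    // A clone shares the same underlying bytes; no array copy is made.
    public ReaderView Clone(long localOffset, long length)
        => new ReaderView(_shared, _start + localOffset, length);
}
```

Because every view resolves to a global position, overlapping segment readers cost only a few longs each rather than a copied array apiece.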

@kwhopper
Collaborator Author

kwhopper commented Mar 3, 2020

@drewnoakes Editing metadata is a hot topic. I'm convinced the reader structure has to be overhauled before editing is even feasible. I'd like to take another shot at this PR if you have interest in reviewing it.

The RandomAccessStream classes wouldn't have to be used throughout for a good test. I'm thinking it's easier to copy/paste those classes into another PR, make some unit tests to demo what it offers, and leave the rest of the lib alone. Thanks

@drewnoakes drewnoakes deleted the branch drewnoakes:master January 29, 2024 00:39
@drewnoakes drewnoakes closed this Jan 29, 2024