Improve Code Quality / Tiding #390
iamcarbon commented on Jan 29, 2024
- Simplifies string.Join calls (eliminates various array allocations)
- Uses new GetString polyfill
- Removes Empty class (using empty collection expression instead)
- Eliminates various byte array allocations
- Updates ShouldAcceptList to use utf-8 bytes
- Updates StartsWithJpegPreamble to accept ReadOnlySpan
- Switches on integer values to avoid a string allocation
@drewnoakes Ready for review. This is the last of the changes proposed for 2.9.
Looking good. Thanks for fixing all these. Left some comments. See what you think and we'll get this merged.
MetadataExtractor/Formats/Exif/makernotes/OlympusFocusInfoMakernoteDescriptor.cs
MetadataExtractor/Formats/QuickTime/QuickTimeReaderExtensions.cs
…alFlashZoomDescription
MetadataExtractor/Formats/Exif/makernotes/OlympusFocusInfoMakernoteDescriptor.cs
@@ -89,17 +91,24 @@ public static class Iso2022Converter

foreach (var encoding in encodings)
{
    char[] charBuffer = ArrayPool<char>.Shared.Rent(encoding.GetMaxCharCount(bytes.Length));
Nice. There are some other places we could use ArrayPool too actually, like in the reader classes. I'll investigate that separately for 2.9.0, unless you'd like to.
There are also some wins from introducing a few new specialized readers that can operate directly over ReadOnlySpan.
One callout would be replacing SequentialByteArrayReader with an optimized ref struct {LittleEndian/BigEndian}BufferReader(ReadOnlySpan buffer) -- and spanifying the outer method.
This would allow us to operate directly over a span, and eliminate the reader allocation.
There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.
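For illustration, here is a minimal sketch of what such a span-based reader might look like. The type name, its members, and the choice of `BinaryPrimitives` are assumptions made for this example; none of it is existing MetadataExtractor API, and it assumes a target where `System.Buffers.Binary` is available.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch only: not a type that exists in MetadataExtractor.
// A ref struct lives on the stack, so creating one does not allocate on the heap.
public ref struct BigEndianBufferReader
{
    private readonly ReadOnlySpan<byte> _buffer;
    private int _position;

    public BigEndianBufferReader(ReadOnlySpan<byte> buffer)
    {
        _buffer = buffer;
        _position = 0;
    }

    // Number of bytes consumed so far.
    public int Position => _position;

    public byte GetByte() => _buffer[_position++];

    public ushort GetUInt16()
    {
        // Slice throws if fewer than two bytes remain.
        ushort value = BinaryPrimitives.ReadUInt16BigEndian(_buffer.Slice(_position, 2));
        _position += 2;
        return value;
    }

    public uint GetUInt32()
    {
        // Slice throws if fewer than four bytes remain.
        uint value = BinaryPrimitives.ReadUInt32BigEndian(_buffer.Slice(_position, 4));
        _position += 4;
        return value;
    }
}
```

A little-endian counterpart would differ only in calling the `ReadUInt16LittleEndian` and `ReadUInt32LittleEndian` methods. Because a ref struct cannot be stored in a field of a class or captured by a lambda, the calling methods would need to accept spans as well, which is the "spanifying the outer method" part.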
Some more use of ArrayPool in #392.
There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.
My vague idea here is to build on the new span types and @kwhopper's investigations, and actually avoid doing as much parsing work during the extraction phase. Many directories we store could actually be backed by a byte[] (or Memory<byte>) that could be inspected when enumerating through tags. This would be quite a big change, and requires some research before it could be pursued.
> There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.
This sounds promising. We'd need to verify that the compiler doesn't allocate an array behind the scenes.
EDIT: It seems to do the right thing on modern .NET: https://sharplab.io/#v2:EYLgtghglgdgPgAQEwEYCwAoBBmABAlANnyVwGFcBvTXW/PA4hAFlwFkUAKAJQFMIAJgHkYAGwCeAZQAOEGAB4CABgB8+FEoDOAShp1qGOrgC+e2mfrqmrNkk67D+i0Y6cA2gCIAgh4A0uDwAhPwCyDwBdbQBuC1MMYyA===
...but not on .NET Framework: https://sharplab.io/#v2:EYLgHgbALAPgAgJgIwFgBQcDMACOSK4LYDC2A3utlbjngXFNgLJIAUASgKYCGAJgPIA7ADYBPAMoAHboIA8eAAwA+XEgUBnAJSVqFNNWwBfHVRM1V9RkwStt+3WYMtWAbQBEAQTcAabG4BCPn7EbgC6mgDcZsZohkA==
This is a compiler feature, so we'd need to test netstandard2.1 csc output, which sharplab doesn't support afaik.
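One shape this could take, sketched under assumptions: a span-based overload that callers feed with a C# 12 collection expression instead of a params array. The class and method names below are hypothetical stand-ins for the real descriptor helpers. Whether the call site avoids a heap allocation depends on the compiler and target runtime, which is exactly the point that would need verifying.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not the library's actual signature.
public static class DescriptionLookup
{
    // Takes a span rather than "params string[]" so the compiler is free to
    // build the argument without allocating an array, where the target allows it.
    public static string? GetIndexedDescription(int value, ReadOnlySpan<string> descriptions)
    {
        // The unsigned comparison rejects negative and out-of-range indexes in one check.
        return (uint)value < (uint)descriptions.Length ? descriptions[value] : null;
    }
}
```

A call site would then read `DescriptionLookup.GetIndexedDescription(1, ["Off", "On"])`, which requires C# 12 or later.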
> There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.
Thanks. RandomAccessStream+ReaderInfo is an experiment somewhat similar to this Span conversion. It goes a bit further by abstracting away all the buffering (RandomAccessStream) and "span-ifying" (ReaderInfo with byte arrays) entirely; callers then only have to worry about one kind of reader. Side effects are the ability to know your exact physical offset at any time, and support for streamed content.
If you can reach those same goals in this process, that's a great addition and should allow new things in the future. I can also check through the code in those old PRs to see if they could use Spans like you're doing here, if that has some value.
@kwhopper the recent activity has been small incremental improvements. I see your PR as the kind of thing we want for 3.0. It'll be a lot of work to integrate throughout the code though, so we should noodle out the details with sketches and discussion before starting the work of integration. We only want to do that integration work once.
A direction I think would be good is to divide the parsing into two stages:
1. Reading the file in coarse chunks. E.g. for JPEG, this could be just pulling out and labelling the segments we need. These would be allocated in contiguous chunks of memory, with no further processing. I think this phase could mostly be done sequentially. The chunks would remember their offsets relative to the start of the file too, along with whatever metadata is needed for later steps.
2. As the consumer walks through the metadata, we process the chunks of data to produce the tags.
Currently we do both 1 and 2 during the read phase. I'm thinking that, with this, we'd just do step 1 during that phase, and step 2 during the enumeration. This will mean a lot less work and fewer allocations during the first phase, and when that work's done during the second phase, any allocations would be shorter-lived and therefore more likely to be GC'd quickly in gen0. It'd also allow consumers to skip decoding bits they don't actually care about.
I'm hoping to write this up a bit more comprehensively and would really appreciate your input.
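To make the two stages concrete, here is a rough sketch under stated assumptions. `RawChunk`, `TwoStageSketch`, and the trivial ASCII "decoding" are hypothetical placeholders chosen for the example, not a proposal for the actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical types for illustration only.
// A labelled slice of the file that remembers where it came from.
public readonly record struct RawChunk(string Label, long FileOffset, ReadOnlyMemory<byte> Data);

public static class TwoStageSketch
{
    // Stage 1: slice the file into labelled chunks without decoding anything.
    // The layout is supplied by the caller here; a real reader would discover it.
    public static List<RawChunk> ReadChunks(
        ReadOnlyMemory<byte> file,
        IEnumerable<(string Label, int Offset, int Length)> layout)
    {
        var chunks = new List<RawChunk>();
        foreach (var (label, offset, length) in layout)
            chunks.Add(new RawChunk(label, offset, file.Slice(offset, length)));
        return chunks;
    }

    // Stage 2: decode lazily, one chunk at a time, as the consumer enumerates.
    // Chunks the consumer never reaches are never decoded.
    public static IEnumerable<(string Label, string Value)> EnumerateTags(IEnumerable<RawChunk> chunks)
    {
        foreach (var chunk in chunks)
            yield return (chunk.Label, Decode(chunk));
    }

    // Placeholder decoding step: real tag parsing would go here.
    private static string Decode(RawChunk chunk) => Encoding.ASCII.GetString(chunk.Data.Span);
}
```

Because `EnumerateTags` is an iterator, a consumer that stops after the first tag pays only for decoding that one chunk.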
return Encoding.UTF8.GetString(values)
    .Trim('\0', ' ', '\r', '\n', '\t');
I was wondering if we could trim a span rather than a string, but I don't think it's safe to trim these bytes in all UTF-8 strings, so trimming characters is better. It would be possible to use the Encoding to populate a Span<char> and trim that, but I don't think it's worth it.
My plan is to look at some traces and take a data-led approach to the next wave of optimisations. There's too much code to go through.
This is also a good case for the ArrayPool, where we trim the char[] buffer, before materializing the string. Agree that we need to be careful operating on bytes when the string might have multi-byte codepoints.
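A minimal sketch of that idea, assuming a target where the span overloads of `Encoding.GetChars` and `MemoryExtensions.Trim` are available. The helper name is hypothetical. Decoding happens first, so trimming operates on characters and never splits a multi-byte code point.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper for illustration; not part of MetadataExtractor.
public static class PooledDecode
{
    public static string DecodeAndTrim(ReadOnlySpan<byte> bytes, Encoding encoding)
    {
        // Rent a buffer large enough for the worst case; it may be larger than requested.
        char[] buffer = ArrayPool<char>.Shared.Rent(encoding.GetMaxCharCount(bytes.Length));
        try
        {
            int charCount = encoding.GetChars(bytes, buffer);

            // Trim the decoded characters in place, then allocate only the final string.
            var decoded = new ReadOnlySpan<char>(buffer, 0, charCount);
            ReadOnlySpan<char> trimmed = decoded.Trim("\0 \r\n\t".AsSpan());
            return trimmed.ToString();
        }
        finally
        {
            // Always hand the buffer back, even if decoding throws.
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

Compared with `GetString(...).Trim(...)`, this avoids both the intermediate untrimmed string and the params char array passed to `Trim`.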
We should also try to eliminate some of these silently allocated arrays when using functions that accept a params T[] array (as is the case above).
Amazing, cheers!
@@ -651,15 +651,15 @@ public sealed class OlympusCameraSettingsMakernoteDescriptor(OlympusCameraSettin
 if (Directory.GetObject(OlympusCameraSettingsMakernoteDirectory.TagGradation) is not short[] values || values.Length < 3)
     return null;

-var join = $"{values[0]} {values[1]} {values[2]}";
-var ret = join switch
+var ret = (values[0], values[1], values[3]) switch
Oops, this should have been a 2 instead of a 3. I'll push a fix shortly.
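For reference, a small sketch of the tuple-switch pattern with the intended indexes 0, 1 and 2. The class name and the description strings are placeholders invented for this example; the real descriptor's cases are not reproduced here.

```csharp
#nullable enable

// Hypothetical illustration of switching on a tuple of integers instead of
// building a string such as "0 0 0" first. Labels are placeholders.
public static class GradationSketch
{
    public static string? Describe(short[] values)
    {
        // Mirrors the guard in the diff: at least three elements are required.
        if (values.Length < 3)
            return null;

        // The tuple is a value type, so no string is allocated just to match on it.
        return (values[0], values[1], values[2]) switch
        {
            (0, 0, 0) => "Placeholder A",
            (-1, -1, 1) => "Placeholder B",
            (1, -1, 1) => "Placeholder C",
            _ => $"Unknown ({values[0]} {values[1]} {values[2]})"
        };
    }
}
```

With `values.Length < 3` as the only guard, indexing `values[3]` can throw for a three-element array, which is why the index matters beyond matching the wrong element.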
-var sb = new StringBuilder(4);
-sb.Append((char)reader.GetByte());
-sb.Append((char)reader.GetByte());
-sb.Append((char)reader.GetByte());
-sb.Append((char)reader.GetByte());
-return sb.ToString();
+Span<byte> bytes = stackalloc byte[4];
+
+reader.GetBytes(bytes);
+
+return Encoding.ASCII.GetString(bytes);
This causes a behaviour change for some inputs, though I've only seen it on fuzzed files that contain very weird data:
The ASCII encoding replaces some characters with ? which potentially loses information. According to https://en.wikipedia.org/wiki/FourCC non-printable characters are valid. I'm not sure it's a problem in practice, but I think I'll make a change here to restore the old behaviour.
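The difference can be seen in a small sketch. The helper names are hypothetical, and the second method is only one possible way to keep the old per-byte cast while still avoiding the StringBuilder; it is not necessarily the change that was made.

```csharp
using System;
using System.Text;

// Hypothetical helpers for illustration; not part of MetadataExtractor.
public static class FourCcSketch
{
    // New behaviour in the diff: bytes above 0x7F are not valid ASCII
    // and are decoded as '?'.
    public static string ViaAscii(ReadOnlySpan<byte> bytes) => Encoding.ASCII.GetString(bytes);

    // Old behaviour: each byte is widened to a char, so every value survives.
    // Assumes the input holds at least four bytes.
    public static string ViaCharCast(ReadOnlySpan<byte> bytes)
    {
        Span<char> chars = stackalloc char[4];
        for (int i = 0; i < 4; i++)
            chars[i] = (char)bytes[i];
        return new string(chars);
    }
}
```

For the bytes `0x6D 0x70 0x34 0xA9`, `ViaAscii` yields `mp4?` while `ViaCharCast` keeps the last character as U+00A9.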