Add IBufferProtocol to mmap #1866

slozier · 2025-01-06T04:28:45Z

Add IBufferProtocol to mmap. Throws a NotImplementedException if the length does not fit in an int.

For the case of bigger files I think we'll have to rework IPythonBuffer. My initial thoughts on changes would be:

Use nint in place of int where applicable (probably where CPython uses Py_ssize_t).
Change AsSpan/AsReadOnlySpan methods to have an nint start argument.
Fix up everything to use nint arithmetic instead of int.

Marking as draft since I need to do more testing (and maybe write some tests). @BCSharp I'm not sure if you looked at/thought about this before so if you have any feedback would appreciate it.

Related to #1408

BCSharp · 2025-01-08T05:11:25Z

Yes, I have been thinking about the buffer protocol for large blobs, though not in the context of mmap. I didn't know that mmap implements the buffer protocol, but of course, why not.

I was thinking of supporting Numpy's ND arrays, which can be really big too. Also on some .NET versions, CLI arrays can be up to 4GB. The latter is exotic and the former still a little bit away, but mmap will be a good testing ground for large buffers.

The idea is to keep it easy for the various types that implement the buffer protocol to support, and relatively easy for consumers to use. I am wary of using nint too much, since the interesting .NET API is predominantly int/Span based, and I would like to avoid forcing clients into too many narrowing casts just to do useful things with the buffer. The casts would have to be checked, or guarded with ifs, which is ugly and error prone. In the end, however, it all depends on usage. Now that we have the usage of the protocol in a number of places, it will be easier to accommodate the most common usage patterns. For instance, I've noticed that almost every client requests BufferFlags.Simple.

I see the following three major consuming patterns that should be easy for consumers to implement:

1. Some consumers have innate limitations that make them unable to consume buffers larger than 2GB no matter what. For instance, constructors or initializers of builtin types, like bytes, bytearray, etc. For them, the current interface form (int/span-based) is the most convenient. If given a buffer larger than they can handle, OverflowException should be automatically thrown, just as currently an exception is thrown if the requested buffer type is not supported.
2. Some consumers can handle buffers larger than 2GB but prefer to do it in span-size chunks because they are making various .NET API calls with the data. For instance scanning for bytes, regex, reading/writing/copying data, encoding/decoding, encrypting/decrypting, etc. The interface should make it easy and convenient for them to consume a given buffer.
3. Some consumers should be fully capable of handling memory data of any size. For instance memoryview. The interface should allow those clients to easily access the whole blob randomly.
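
For consumers in the first group, the automatic OverflowException could come from nothing more than a checked narrowing conversion. The following is a minimal sketch; BufferLengthSketch and ToInt32Length are hypothetical names used only for illustration and are not part of IronPython.

```csharp
using System;

// On a 64-bit process a native-sized length can exceed int.MaxValue.
Console.WriteLine(BufferLengthSketch.ToInt32Length(1024));
if (IntPtr.Size == 8) {
    nint big = int.MaxValue;
    big += 1; // one byte more than an int-indexed span can address
    try {
        BufferLengthSketch.ToInt32Length(big);
    } catch (OverflowException) {
        Console.WriteLine("too large for a span-based consumer");
    }
}

// Hypothetical helper (an assumption, not an IronPython API).
public static class BufferLengthSketch {
    // Narrows a native-sized length to int; throws OverflowException if it does not fit.
    public static int ToInt32Length(nint length) => checked((int)length);
}
```

The point of the sketch is that the check lives in one place, so individual consumers would not need their own guarded casts.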

I was thinking along these lines:

Add an optional parameter start, and perhaps count too, to AsSpan/AsReadOnlySpan methods, like one of your ideas. This should be easy for exporters to implement, though not always convenient to consume.
OR: Add Apply/ReadOnlyApply that takes a lambda or a delegate and optional start/count, which will repeatedly invoke the lambda with the appropriate and successive span, until the end of the designated data range. This should be convenient for consumers in Group 2, but cumbersome for each exporter to implement.
OR: Go with the optional parameter to AsSpan/AsReadOnlySpan, and provide Apply as an extension method. This will have the best of both worlds.
Extend IBufferProtocol with GetLongBuffer (or GetNativeBuffer?) that returns IPythonLongBuffer, which looks just like IPythonBuffer but everything is nint-based (like another of your ideas). Also AsSpan/AsReadOnlySpan return a new ref struct type that is nint-based. This will be convenient for consumers in Group 3. Provide a helper method to easily implement GetLongBuffer for exporters that never export anything bigger than 2GB. I think we still cannot use default implementations in interfaces, can we?
OR: A modification to the point above: drop AsSpan/AsReadOnlySpan, which are just convenience methods around Pin. Since the consumers in Group 3 are few and far between, it may be just simpler to let them fiddle with unsafe pointers.
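
The third option (start/count parameters on AsReadOnlySpan plus Apply as an extension method) can be sketched roughly as follows. Everything here is an assumption made for illustration: ILargeBufferSketch, ChunkAction, ReadOnlyApply and ArrayBackedBuffer are hypothetical names, and the real IPythonBuffer has a different shape.

```csharp
using System;

// Toy usage: walk a 10-byte buffer in chunks of at most 4 bytes.
var buffer = new ArrayBackedBuffer(new byte[10]);
int calls = 0;
long seen = 0;
buffer.ReadOnlyApply(chunk => { calls++; seen += chunk.Length; }, maxChunk: 4);
Console.WriteLine($"{calls} chunks, {seen} bytes"); // prints "3 chunks, 10 bytes"

// Hypothetical exporter-side interface: the only thing an exporter must implement.
public interface ILargeBufferSketch {
    nint ByteLength { get; }
    // Returns `count` bytes starting at native-sized offset `start`.
    ReadOnlySpan<byte> AsReadOnlySpan(nint start, int count);
}

// A custom delegate is needed because a span cannot be a generic type argument of Action<T>.
public delegate void ChunkAction(ReadOnlySpan<byte> chunk);

public static class LargeBufferExtensions {
    // Invokes `action` with successive spans until the whole buffer has been visited.
    public static void ReadOnlyApply(this ILargeBufferSketch buffer, ChunkAction action, int maxChunk = int.MaxValue) {
        if (maxChunk <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunk));
        nint total = buffer.ByteLength;
        nint offset = 0;
        while (offset < total) {
            nint remaining = total - offset;
            int count = remaining > maxChunk ? maxChunk : (int)remaining;
            action(buffer.AsReadOnlySpan(offset, count));
            offset += count;
        }
    }
}

// Toy exporter backed by a managed array, so its length always fits in an int.
public sealed class ArrayBackedBuffer : ILargeBufferSketch {
    private readonly byte[] _data;
    public ArrayBackedBuffer(byte[] data) => _data = data;
    public nint ByteLength => _data.Length;
    public ReadOnlySpan<byte> AsReadOnlySpan(nint start, int count)
        => new ReadOnlySpan<byte>(_data, checked((int)start), count);
}
```

In this shape the chunking loop is written once, in the extension method, while each exporter only has to provide the offset-based span accessor.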

The file descriptor work seems to be finally coming to an end, so once you merge this PR I could play with these ideas in code to get some better insights.

BCSharp

While I was reviewing this PR I noticed that there is one case that I failed to address in my PR #1891, so this comment is not about the changes you submit, but you may want to fix it together with the changes. It is about the series of ifs on lines 589-603 in TryAddRef.

There is the fourth case missing that should go after the three existing:

if (exclusive && ((oldState & StateBits.RefCount) > StateBits.RefCount)) {
    // mmap in non-exclusive use, temporarily no exclusive use allowed
    reason = StateBits.Exclusive;
    return false;
}

BCSharp · 2025-01-31T21:47:35Z

src/core/IronPython.Modules/mmap.cs

+            private int InterlockedOrState(int value) {
+#if NET5_0_OR_GREATER
+                return Interlocked.Or(ref _state, value);
+#else
+                int current = _state;
+                while (true) {
+                    int newValue = current | value;
+                    int oldValue = Interlocked.CompareExchange(ref _state, newValue, current);
+                    if (oldValue == current) {
+                        return oldValue;
+                    }
+                    current = oldValue;
+                }
+#endif
+            }


I like that you factored it out. The while (true) form is more efficient than do {...} while because it accesses the volatile field only once per loop.

Was not an intentional performance optimization. I just copy/pasted the .NET implementation. 😄

BCSharp · 2025-01-31T22:07:07Z

src/core/IronPython.Modules/mmap.cs

+                    if ((newState & StateBits.RefCount) == StateBits.RefCountOne) {
+                        newState &= ~StateBits.Exporting;
+                    }


The interesting consequence of this implementation is that the Exporting flag may not be reset right after the last export has been released, but when there are still some non-exclusive calls in progress. However, it will be reset as soon as all of them exit mmap. I am OK with that since the non-exclusive calls are supposed to be quick and transient, but there is a corner case when mmap is so intensely being used that this bit may not be reset for a while. As a result, trying to resize in such a state will result in BufferError rather than EAGAIN even when there are no extant exports. Reordering the tests in the MmapLocker constructor would "fix" that, but at the expense of further deviating from CPython's error handling. I think that a 100% fix would require maintaining a separate number of exports counter in the mmap object, which is probably not worth the effort and added complexity.

Indeed. I couldn't think of a way to do it without having a separate counter for exports which would probably result in a lot more complexity. I figured most calls are quick enough that it should drain down to one in most reasonable scenarios.
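
The delayed reset discussed in this thread can be illustrated with a small sketch. The bit layout and the Release method below are assumptions made for the example (the actual StateBits values and release logic in mmap.cs may differ); only the "clear Exporting when the count drops to one" test is taken from the diff above.

```csharp
using System;
using System.Threading;

// Hypothetical state: three references held, Exporting flag set.
int state = 3 * Bits.RefCountOne | Bits.Exporting;

// First release: the count drops to two, so the flag stays set.
Console.WriteLine((Bits.Release(ref state) & Bits.Exporting) != 0); // prints "True"
// Second release: the count drops to one, so the flag is cleared.
Console.WriteLine((Bits.Release(ref state) & Bits.Exporting) != 0); // prints "False"

public static class Bits {
    // Assumed layout: three flag bits, the remaining bits hold the reference count.
    public const int Exporting   = 0b100;
    public const int RefCount    = ~0b111;  // mask over the count bits
    public const int RefCountOne = 1 << 3;  // a count of exactly one

    // Drops one reference with a compare-and-swap loop and returns the new state.
    public static int Release(ref int state) {
        int current = Volatile.Read(ref state);
        while (true) {
            int newState = current - RefCountOne;
            if ((newState & RefCount) == RefCountOne) {
                newState &= ~Exporting; // only reset once the count has drained to one
            }
            int old = Interlocked.CompareExchange(ref state, newState, current);
            if (old == current) return newState;
            current = old; // another thread changed the state; retry with its value
        }
    }
}
```

With this scheme the flag outlives the last export for as long as any other call still holds a reference, which is the corner case described above.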

slozier · 2025-01-31T23:23:34Z

There is the fourth case missing that should go after the three existing:

if (exclusive && ((oldState & StateBits.RefCount) > StateBits.RefCount)) {
    // mmap in non-exclusive use, temporarily no exclusive use allowed
    reason = StateBits.Exclusive;
    return false;
}

Hmm, I thought StateBits.RefCount was negative. Should it be comparing to StateBits.RefCountOne?

BCSharp · 2025-01-31T23:31:43Z

Hmm, I thought StateBits.RefCount was negative. Should it be comparing to StateBits.RefCountOne?

Yes! Sorry...
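
To see why comparing against the mask misbehaves, here is a small sketch. The constant values are assumptions chosen for illustration (the real StateBits layout may differ); what matters is that a mask covering the high bits is negative as an int, while RefCountOne is a small positive value.

```csharp
using System;

int onlyOwnReference = StateBitsSketch.RefCountOne;     // count == 1
int oneOtherUser     = 2 * StateBitsSketch.RefCountOne; // count == 2

// Comparing against the (negative) mask is true for any non-negative count.
Console.WriteLine(StateBitsSketch.ComparesToMask(onlyOwnReference)); // prints "True"
// Comparing against RefCountOne is true only when the count exceeds one.
Console.WriteLine(StateBitsSketch.ComparesToOne(onlyOwnReference));  // prints "False"
Console.WriteLine(StateBitsSketch.ComparesToOne(oneOtherUser));      // prints "True"

public static class StateBitsSketch {
    // Assumed layout: three flag bits, the remaining bits hold the reference count.
    public const int RefCount    = ~0b111;  // mask, equal to -8 when read as an int
    public const int RefCountOne = 1 << 3;  // a count of exactly one

    // The comparison as first suggested in the review.
    public static bool ComparesToMask(int state) => (state & RefCount) > RefCount;

    // The corrected comparison.
    public static bool ComparesToOne(int state) => (state & RefCount) > RefCountOne;
}
```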

slozier · 2025-01-31T23:41:33Z

Just a quick note in case I forget (probably won't but who knows). Noticed an issue with the resize on Linux (presumably also applicable to macOS) not respecting the offset. Hopefully I can do a PR later this evening...

BCSharp · 2025-02-01T00:19:26Z

Hopefully I can do a PR later this evening...

If you are at it with another PR, a suggestion: mark the Windows P/Invoke with SupportedOSPlatform("windows")?
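
As a rough illustration of this suggestion, a Windows-only P/Invoke could be annotated like this. NativeMethodsSketch and the choice of GetCurrentProcessId are assumptions made for the example and are not the declarations used in mmap.cs; the attribute itself is available from .NET 5 onward, so a multi-targeted project would presumably need a conditional-compilation guard around it.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Versioning;

// The platform guard lets the compatibility analyzer see that the call is safe.
if (OperatingSystem.IsWindows()) {
    Console.WriteLine(NativeMethodsSketch.GetCurrentProcessId());
} else {
    Console.WriteLine("not on Windows, skipping the native call");
}

// Hypothetical container class, used only for illustration.
internal static class NativeMethodsSketch {
    // Marks the import as Windows-only so that unguarded call sites get a warning.
    [SupportedOSPlatform("windows")]
    [DllImport("kernel32.dll")]
    internal static extern uint GetCurrentProcessId();
}
```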

slozier mentioned this pull request Jan 8, 2025

Use PosixFileStream for files on POSIX #1855

Merged

slozier force-pushed the mmap_buffer branch from 1ab991f to 2260cdf Compare January 11, 2025 02:54

slozier force-pushed the mmap_buffer branch 3 times, most recently from 0bb4dfe to a472936 Compare January 26, 2025 18:21

BCSharp mentioned this pull request Jan 29, 2025

Thread-safe mmap resize #1891

Merged

Add IBufferProtocol to mmap

167b7fb

slozier force-pushed the mmap_buffer branch from a472936 to 167b7fb Compare January 31, 2025 02:47

Clean up interlocked or

326b6d9

slozier marked this pull request as ready for review January 31, 2025 13:28

slozier requested a review from BCSharp January 31, 2025 13:30

BCSharp approved these changes Jan 31, 2025

Add missing TryAddRef case

d982810

slozier force-pushed the mmap_buffer branch from 13d5409 to d982810 Compare January 31, 2025 23:38

slozier merged commit 144146d into IronLanguages:main Feb 1, 2025
8 checks passed

slozier deleted the mmap_buffer branch February 1, 2025 01:30
