Flat data representation proposal: Enables zero copy shared memory, zero allocation return types, binary serialization #398

Open
cpetig opened this issue Sep 21, 2024 · 10 comments

Comments

@cpetig

cpetig commented Sep 21, 2024

This all started with defining zero copy shared memory over a WIT interface (channel is WIT resource, inspired by iceoryx2):

   let channel = Channel_u32::new("topic");
   loop {
       let message = channel.allocate().await; // WASI 0.3
       message.set(42);
       message.send();
   }

and on the receiver side

   let subscription = Subscription::new("topic");
   loop {
       dbg!(subscription.read().await);
   }

with a WIT definition similar to

   resource object {
       set: func(u32);
       send: static func(object);
   }
   resource channel {
       allocate: func() -> future<object>;
   }
   resource subscription {
       read: func() -> future<u32>;
   }

This all works until you try to place a list<string> inside the shared memory. That sent me on a journey which culminated in this discussion issue, … after I figured out a way to express this in WIT (inspired by FlatBuffers and Cap'n Proto).

Flat marker

Adding a flat<T[, P]> marker, e.g. flat<list<string>, u16>, to arguments or results changes the data representation to a flat binary encoding: all pointers inside list and string values take the second type, P, and are relative to the current position. The same type is used to encode lengths. The default pointer type P could be s32.

Passing an argument will follow the normal ownership rules, so imported functions only pass a view while exported functions pass ownership of the buffer. The flat type is represented by a classical (pointer, length) pair. See https://bytecodealliance.zulipchat.com/#narrow/stream/438936-SIG-Embedded/topic/Sept.2017th.202024.20Meeting/near/470965874 for data encoding examples.
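To make the relative-pointer idea concrete, here is a minimal Rust sketch that reads a flat<list<string>, u16> value in place, without lifting it into Vec or String. The byte layout is an assumption made for illustration (little-endian, unsigned u16 offsets relative to the slot that stores them); the encoding in the linked examples may differ, and the names EXAMPLE, resolve and string_at are hypothetical.

```rust
// Assumed layout (hypothetical, little-endian, P = u16):
//   list header : [rel_ptr: u16][len: u16]  rel_ptr is relative to the header's own offset
//   each string : [rel_ptr: u16][len: u16]  rel_ptr is relative to that slot's own offset
//   payload     : UTF-8 bytes

/// Hand-written encoding of ["hi", "wit"] under the assumed layout.
pub const EXAMPLE: [u8; 17] = [
    4, 0, 2, 0, // list: elements start 4 bytes after offset 0, 2 elements
    8, 0, 2, 0, // string 0: bytes start 8 bytes after offset 4, length 2
    6, 0, 3, 0, // string 1: bytes start 6 bytes after offset 8, length 3
    b'h', b'i', b'w', b'i', b't',
];

/// Reads a little-endian u16 at `pos`, or None if it is out of bounds.
fn read_u16(buf: &[u8], pos: usize) -> Option<u16> {
    let end = pos.checked_add(2)?;
    let bytes = buf.get(pos..end)?;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}

/// Resolves the (relative pointer, length) pair stored at `pos`
/// into an absolute (start offset, length) pair.
fn resolve(buf: &[u8], pos: usize) -> Option<(usize, usize)> {
    let rel = usize::from(read_u16(buf, pos)?);
    let len = usize::from(read_u16(buf, pos.checked_add(2)?)?);
    Some((pos.checked_add(rel)?, len))
}

/// Returns element `index` as a borrowed &str pointing into `buf`.
pub fn string_at(buf: &[u8], index: usize) -> Option<&str> {
    let (elements, count) = resolve(buf, 0)?;
    if index >= count {
        return None;
    }
    let slot = elements.checked_add(index.checked_mul(4)?)?;
    let (start, len) = resolve(buf, slot)?;
    let bytes = buf.get(start..start.checked_add(len)?)?;
    std::str::from_utf8(bytes).ok()
}
```

Because every offset is relative, the same 17 bytes stay valid wherever the buffer is mapped, which is what makes the representation usable across address spaces. Every access is bounds-checked, so a truncated or corrupted buffer yields None rather than an out-of-bounds read.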

Returning a flat data type would instead use a caller-provided (uninitialized) buffer passed as the last argument, again as a (pointer, length) pair. The call returns the used length (0 indicates an error or buffer overflow). This makes the call well defined with respect to (partial) ownership transfer.
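The return convention can be sketched in safe Rust, with a mutable slice standing in for the (pointer, length) pair. This is an illustration under assumptions, not the proposed ABI: write_flat_string and call_with_retry are hypothetical names, and the [len: u16][bytes] payload is chosen only to keep the example small.

```rust
/// Writes `text` as [len: u16 little-endian][UTF-8 bytes] into the
/// caller-provided buffer. Returns the number of bytes used, or 0 if the
/// buffer is too small or the text does not fit a u16 length.
pub fn write_flat_string(text: &str, out: &mut [u8]) -> usize {
    let len = match u16::try_from(text.len()) {
        Ok(len) => len,
        Err(_) => return 0,
    };
    let used = 2 + text.len();
    if out.len() < used {
        return 0;
    }
    out[..2].copy_from_slice(&len.to_le_bytes());
    out[2..used].copy_from_slice(text.as_bytes());
    used
}

/// Caller side: retry with a larger buffer whenever the callee reports 0.
/// The capacity cap keeps the loop finite for inputs that can never fit.
pub fn call_with_retry(text: &str) -> Option<Vec<u8>> {
    let mut capacity = 4;
    while capacity <= 1 << 17 {
        let mut buf = vec![0u8; capacity];
        let used = write_flat_string(text, &mut buf);
        if used != 0 {
            buf.truncate(used);
            return Some(buf);
        }
        capacity *= 2;
    }
    None
}
```

A successful call in this sketch always uses at least two bytes, so a return value of 0 is unambiguous as the error signal. The callee never allocates; all memory is owned by the caller.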

Similarly to async with WASI 0.3 and future<T> this could become a general option to apply to all functions, making #385 unnecessary, because this is more flexible and more storage efficient.

Buffer objects

Obtaining these buffers from the IPC component requires two new WIT return types: buffer-mut<T> and buffer-view<T> (read-only). Both would encode as (pointer, length) and require a drop method to indicate that the buffer/view is no longer in use.
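On the guest side, the two handle types and their drop requirement map naturally onto Rust ownership. The sketch below is an assumption about how bindings might look, not generated code: BufferMut and BufferView are hypothetical stand-ins for buffer-mut<T> and buffer-view<T>, and a shared counter replaces the real notification to the IPC component.

```rust
use std::cell::Cell;

/// Exclusive, writable handle (hypothetical stand-in for buffer-mut<T>).
pub struct BufferMut<'a> {
    data: &'a mut [u8],
    outstanding: &'a Cell<usize>,
}

/// Shared, read-only handle (hypothetical stand-in for buffer-view<T>).
pub struct BufferView<'a> {
    data: &'a [u8],
    outstanding: &'a Cell<usize>,
}

impl<'a> BufferMut<'a> {
    pub fn new(data: &'a mut [u8], outstanding: &'a Cell<usize>) -> Self {
        outstanding.set(outstanding.get() + 1);
        Self { data, outstanding }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        &mut *self.data
    }
}

impl<'a> BufferView<'a> {
    pub fn new(data: &'a [u8], outstanding: &'a Cell<usize>) -> Self {
        outstanding.set(outstanding.get() + 1);
        Self { data, outstanding }
    }

    pub fn as_slice(&self) -> &[u8] {
        self.data
    }
}

// Dropping a handle is the "no longer in use" signal; a real binding would
// call the resource's drop method here instead of decrementing a counter.
impl Drop for BufferMut<'_> {
    fn drop(&mut self) {
        self.outstanding.set(self.outstanding.get() - 1);
    }
}

impl Drop for BufferView<'_> {
    fn drop(&mut self) {
        self.outstanding.set(self.outstanding.get() - 1);
    }
}
```

The borrow checker already enforces the split between the two types in this sketch: only one BufferMut can exist for a given region at a time, while any number of BufferView handles may coexist.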

Side benefits

This data representation can also be used as a disk or network encoding of data expressed in WIT (make sure to version your WIT description).

API considerations

True zero-copy construction of these flat data types requires knowing the size of a list in advance and passing it to the constructor, so that objects can be placed linearly in the buffer. Relative pointers could be unsigned to simplify the encoding logic.
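The reason the list size has to be known up front becomes visible in a small builder: the element table is reserved first, so every payload can then be appended linearly and all relative pointers point forward. This is a sketch under an assumed layout (little-endian (u16 pointer, u16 length) slots, offsets relative to the slot itself); ListBuilder and its methods are hypothetical and are not the API of the linked examples.

```rust
/// Size of one (u16 relative pointer, u16 length) slot in the assumed layout.
const SLOT: usize = 4;

fn put_u16(buf: &mut [u8], pos: usize, value: u16) {
    buf[pos..pos + 2].copy_from_slice(&value.to_le_bytes());
}

/// Builds a flat list of strings directly inside a caller-provided buffer.
pub struct ListBuilder<'a> {
    buf: &'a mut [u8],
    count: usize,  // number of elements announced up front
    filled: usize, // elements written so far
    end: usize,    // next free byte for string payloads
}

impl<'a> ListBuilder<'a> {
    /// Reserves the list header and the element table for `count` strings.
    pub fn new(buf: &'a mut [u8], count: u16) -> Option<Self> {
        let table_end = SLOT + usize::from(count) * SLOT;
        if buf.len() < table_end {
            return None;
        }
        put_u16(buf, 0, SLOT as u16); // elements follow the header directly
        put_u16(buf, 2, count);
        Some(Self { buf, count: usize::from(count), filled: 0, end: table_end })
    }

    /// Appends the next string payload and fills in its slot.
    pub fn push(&mut self, text: &str) -> Option<()> {
        if self.filled == self.count {
            return None;
        }
        let slot = SLOT + self.filled * SLOT;
        let new_end = self.end.checked_add(text.len())?;
        if new_end > self.buf.len() {
            return None;
        }
        // Unsigned offsets suffice because payloads always follow their slot.
        let rel = u16::try_from(self.end - slot).ok()?;
        let len = u16::try_from(text.len()).ok()?;
        self.buf[self.end..new_end].copy_from_slice(text.as_bytes());
        put_u16(self.buf, slot, rel);
        put_u16(self.buf, slot + 2, len);
        self.end = new_end;
        self.filled += 1;
        Some(())
    }

    /// Returns the used length once every announced element was written.
    pub fn finish(self) -> Option<usize> {
        (self.filled == self.count).then_some(self.end)
    }
}
```

If the count were not known in advance, the table would have to grow after payloads had already been placed behind it, which forces either moving data or giving up the linear layout.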

See the links in https://bytecodealliance.zulipchat.com/#narrow/stream/438936-SIG-Embedded/topic/Sept.2017th.202024.20Meeting/near/470497166 for API examples in Rust and C++.

PS: I initially represented read-only flat types by address only (as the length can be calculated from the data), but this feels counterproductive from a verification and storage perspective.

@cpetig
Author

cpetig commented Sep 21, 2024

Of course the lowering of flat POD types would be identical to that of normal POD types; I consider (resource) handles as POD here. So the modifier only applies (recursively) to string and list representations.

Update: (Resource) handles don't serialize well across systems, so this needs more thought on when to forbid them.

@lukewagner
Member

Having a "flat" binary representation of compound values could make a lot of sense and I've tried to imagine different ABI variations too (esp. in the context of streams, which help address the issue of not knowing how much buffer space to allocate since you can always just fill up one buffer, say "not done", and return for the next buffer). However, I've generally thought of this in terms of Canonical ABI options, since it's a low-level representation choice; is there a specific benefit to escalating this detail into the WIT-level type, where it applies to all languages and memory types (e.g., wasm-gc...)?
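The "fill up one buffer, say not done, and return for the next buffer" pattern mentioned above can be sketched as follows. This only illustrates the control flow; ChunkedWriter and Progress are hypothetical names and do not correspond to any Canonical ABI or stream API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Progress {
    /// More data remains; call again with the next buffer.
    NotDone,
    /// Everything has been written.
    Done,
}

/// Produces an already-encoded value piecewise into caller-provided buffers.
pub struct ChunkedWriter<'a> {
    remaining: &'a [u8],
}

impl<'a> ChunkedWriter<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { remaining: data }
    }

    /// Copies as much as fits into `out` and reports how many bytes were
    /// written and whether another buffer is needed.
    pub fn fill(&mut self, out: &mut [u8]) -> (usize, Progress) {
        let n = self.remaining.len().min(out.len());
        out[..n].copy_from_slice(&self.remaining[..n]);
        self.remaining = &self.remaining[n..];
        let progress = if self.remaining.is_empty() {
            Progress::Done
        } else {
            Progress::NotDone
        };
        (n, progress)
    }
}
```

With this shape the caller never has to guess the total size: it keeps handing over fixed-size buffers until the producer reports Done.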

Second, while I can see potential efficiency benefits to a flat binary representation, I don't see how this achieves "zero copy shared memory" -- it seems like the basic requirement to copy between separate components' separate linear memories remains?

Lastly, I wasn't able to follow the "Buffer objects" section and how it relates to the flat type or how buffer-mut<T>/buffer-view<T> compare to, e.g., the readable-buffer<T> and writable-buffer<T> of #369.

@cpetig
Author

cpetig commented Sep 24, 2024

I started with a WIT marker because I assumed that the same interface might mix flat and normal ABI calls, but I am no longer sure about this, especially since flat types offer some unique benefits but are source-incompatible with the normal Vec and String types (in Rust; similar for C++).

Zero copy comes into view if you construct the lowered elements in place in shared memory (you use a buffer located in shared memory to construct everything) and use them on the receiver side without lifting. Of course for wasm you need either multi-memory (shared pages) or mmap support to enable two components to access the same physical memory. Host (mmap) support could enable spatial freedom from interference: either a single component can write to a memory region, or multiple components can read from it, but never both at once. The host would handle the transition between these states (similar to what iceoryx does).
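The host-managed transition between "one exclusive writer" and "many readers" can be modelled as a small state machine. The sketch below is an assumption about what such bookkeeping might look like; the state and method names are hypothetical and it is not how iceoryx or any existing host implements it.

```rust
/// Access state of one shared memory region, as tracked by the host.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Region {
    /// Not mapped into any component.
    Free,
    /// Mapped writable into exactly one component.
    Writing,
    /// Mapped read-only into `readers` components.
    Readable { readers: u32 },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InvalidTransition;

impl Region {
    /// A producer obtains the region for exclusive writing.
    pub fn begin_write(self) -> Result<Region, InvalidTransition> {
        match self {
            Region::Free => Ok(Region::Writing),
            _ => Err(InvalidTransition),
        }
    }

    /// The producer sends the message; the region becomes read-only.
    pub fn publish(self) -> Result<Region, InvalidTransition> {
        match self {
            Region::Writing => Ok(Region::Readable { readers: 0 }),
            _ => Err(InvalidTransition),
        }
    }

    /// A consumer maps the region read-only.
    pub fn begin_read(self) -> Result<Region, InvalidTransition> {
        match self {
            Region::Readable { readers } => readers
                .checked_add(1)
                .map(|readers| Region::Readable { readers })
                .ok_or(InvalidTransition),
            _ => Err(InvalidTransition),
        }
    }

    /// A consumer drops its view.
    pub fn end_read(self) -> Result<Region, InvalidTransition> {
        match self {
            Region::Readable { readers } if readers > 0 => {
                Ok(Region::Readable { readers: readers - 1 })
            }
            _ => Err(InvalidTransition),
        }
    }

    /// The host reclaims the region once no reader is left.
    pub fn recycle(self) -> Result<Region, InvalidTransition> {
        match self {
            Region::Readable { readers: 0 } => Ok(Region::Free),
            _ => Err(InvalidTransition),
        }
    }
}
```

No sequence of successful transitions leaves a region writable while a reader is attached, which is the interference-freedom property described above.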

This assumes that you reached a state where the copying of information between components is more costly than remapping virtual memory. This is typical for large AI tensors and camera images.

The flat buffer types are handles to the shared memory managed by the host logic*, one read-only shareable, one exclusive writable type. The difference to a non-flat read/write buffer is that the flat buffer will also contain all the second and third level allocations, so a list<list<string>> object becomes a single contiguous memory object within a single allocation.

*) Or local buffers pre-allocated and then passed to functions to place the result into.
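To illustrate the "single allocation" point, the sketch below compares the size of one flat buffer for a list<list<string>> value with the number of separate heap blocks its usual Rust representation holds. Both functions are rough estimates under assumptions: 4-byte (u16 pointer, u16 length) slots with no padding, and one heap block per non-empty Vec or String; flat_size and heap_blocks are hypothetical names.

```rust
/// Size of one (u16 relative pointer, u16 length) slot in the assumed layout.
const SLOT: usize = 4;

/// Bytes needed to hold the whole nested value in one contiguous buffer:
/// one slot for the outer list, one per inner list, and one slot plus the
/// payload bytes per string.
pub fn flat_size(value: &[Vec<String>]) -> usize {
    let inner_lists = value.len() * SLOT;
    let strings: usize = value
        .iter()
        .flatten()
        .map(|text| SLOT + text.len())
        .sum();
    SLOT + inner_lists + strings
}

/// Approximate number of separate heap blocks in the nested representation,
/// counting one block per non-empty container.
pub fn heap_blocks(value: &[Vec<String>]) -> usize {
    let inner = value.iter().filter(|list| !list.is_empty()).count();
    let strings = value
        .iter()
        .flatten()
        .filter(|text| !text.is_empty())
        .count();
    usize::from(!value.is_empty()) + inner + strings
}
```

For [["a", "bc"], ["def"]] this gives a single 30-byte buffer instead of six separate blocks, each of which would otherwise have to be placed in, or copied into, shared memory individually.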

@lukewagner
Member

The difference to a non-flat read/write buffer is that the flat buffer will also contain all the second and third level allocations, so a list<list<string>> object becomes a single contiguous memory object within a single allocation.

Ah I see, that's an interesting point. I suppose we have the option to say that a readable-buffer<T>/writable-buffer<T> could use a different, flat ABI for the T. That being said, in some cases, the indirection is actually what you want (considering that in many cases 99% of the bytes are in the "leaves" of a compound value and being able to just point to the pre-existing allocations avoids what would otherwise be an extra copy into the flat buffer). But perhaps there could be a flat canonopt that lets you opt into this flat ABI for buffers?

Of course for wasm you need either multi-memory (shared pages) ...

Many folks have suggested using multi-memory as a solution to avoiding copies over the years, but we keep finding that, in practice, "regular" C/C++/Rust code can only access the default memory so if you use a shared non-default memory to pass values, you'll end up with 2 copies (source → shared → destination). I keep asking someone to show me real code that would achieve zero-copy in practice using multi-memory (b/c hypothetically it's possible), but I haven't seen it yet.

... or mmap support to enable two components to access the same physical memory. [...]. This assumes that you reached a state where the copying of information between components is more costly than remapping virtual memory. This is typical for large AI tensors and camera images.

One way to amortize the cost of establishing shared mappings is creating a long-lived connection between two components which they can use to repeatedly pass chunks of memory. My intuition is that streams might be the right abstraction here (for repeatedly passing a large (flat) element). So perhaps the flat option mentioned above could also apply to streams (which lines up with the idea that streams are just a sequence of buffers).

@cpetig
Author

cpetig commented Sep 26, 2024

🤔 I feel that a proof of concept implementation might be a good idea to see how shared memory and flat types could work together to achieve zero copy. I will give it a try (most likely Rust and wasmtime based).

@lum1n0us

I feel that 'multi-memory' is more convenient for communication between the host and Wasm. I mean, if we give a Wasm module an additional imported memory that is provided by the host, the host can store data in that specific area, and Wasm can access it directly without needing to copy it from the host's memory to Wasm's linear memory.

@cpetig
Author

cpetig commented Sep 28, 2024

@lum1n0us Do you know a good way to model access to a non-zero memory from a clang-compiled language, e.g. C or Rust? Load and store intrinsics could be a solution, but that feels clumsy and such memory cannot be passed via a pointer/reference argument to subroutines; segmented memory means that every load/store will pay a significant penalty when encoding memory indices and offsets separately. I think mmap as an extension of memory-control is the most reasonable strategy I can come up with.

@lukewagner
Member

@cpetig Yes, I think that is the fundamental challenge we're working with here. And to summarize previous discussions: if the solution is to copy from the second-memory into the default-memory, I think we end up with something net worse, both in terms of performance (2 copies instead of one) and portability (since the entire contents of this non-default linear memory are now the host/guest interface, observable at all times at any address -- very likely to expose subtle impl differences that break programs in practice at scale over time).

@cpetig
Author

cpetig commented Oct 5, 2024

I just created a working proof-of-concept crate for parsing and creating flat data at https://github.com/cpetig/flat-types-rust. The API already looks usable but will need a lot of extensions to provide a nice DX. I kept the enum, struct and tuple APIs out of scope for now; a derive macro will likely provide them in a "somewhat" elegant way (set_X, get_X functions).

I will continue my work on the shm wasm interface.

@cpetig
Author

cpetig commented Nov 26, 2024

I started a first prototype of shared memory zero copy at https://github.com/cpetig/wasm-shm-test/blob/main/wit/shm.wit#L12 but haven't completed it yet.

3 participants