Streams are challenging to implement because they return pointers #8

Closed
brendandburns opened this issue Mar 5, 2023 · 11 comments

Comments

@brendandburns
Contributor

As I look to implement streams, it seems like it is excessively challenging to return a pointer from the stream read function.

Typically in stream reading (e.g. fread in C or InputStream.read in Java) the pattern is to pass a buffer into the stream read, rather than have the stream return an array of bytes.
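As a rough illustration of that buffer-passing pattern, here is a minimal Python sketch. It uses the standard library's `readinto` as a stand-in for `fread` or `InputStream.read`; the helper name `read_all_with_buffer` and the tiny 4-byte buffer are assumptions made for the example, not part of any proposed API.

```python
import io


def read_all_with_buffer(stream, buf_size=4):
    """Drain `stream` by repeatedly filling one caller-owned buffer."""
    buf = bytearray(buf_size)          # allocated once, by the caller
    chunks = []
    while True:
        n = stream.readinto(buf)       # the callee writes into the caller's buffer
        if n == 0:                     # zero bytes read means end of stream
            break
        chunks.append(bytes(buf[:n]))  # only the first n bytes are valid
    return chunks


chunks = read_all_with_buffer(io.BytesIO(b"hello world"))
```

The caller owns `buf` for its whole lifetime, so there is no question about who frees it.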

I think that we should change streams to take a buffer from the guest to write data into. That will make the ownership semantics clearer (#7) and will also make it easier to implement.

@stevelr

stevelr commented Mar 5, 2023

Counter-example:
(this is about streams in general, not http-specific, but is based on wasm api use cases I have worked on. I read your content to be about Streams apis in general so I jumped in)

The host manages buffers allocated from guest memory. When the guest calls read, the host returns a window into that memory. Example: a buffered file reader that has a fixed block of memory, say 64 KB. The guest reads chunks until the end of the block, then on the next read the host can page in another 64 KB from the file into the same block.

Another use case that I'm looking at is when multiple shared memories are implemented. Then the host might return a pointer into a shared memory region to avoid copying.
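The windowed reader described above could look roughly like the following toy model. It is only a sketch: `WindowedReader` is a hypothetical name, a `bytearray` stands in for the fixed block in guest memory, and the block is 8 bytes rather than 64 KB so the behavior is easy to follow.

```python
import io


class WindowedReader:
    """Toy host-side reader that hands out views into one fixed block."""

    def __init__(self, source, block_size=8):
        self._source = source
        self._block = bytearray(block_size)  # the single reusable block
        self._filled = 0                     # number of valid bytes in the block
        self._pos = 0                        # next unread offset in the block

    def read(self, max_len):
        """Return a window of up to max_len bytes; empty at end of stream."""
        if self._pos == self._filled:        # block exhausted: page in more data
            self._filled = self._source.readinto(self._block)
            self._pos = 0
        end = min(self._pos + max_len, self._filled)
        window = memoryview(self._block)[self._pos:end]  # no copy is made
        self._pos = end
        return window


reader = WindowedReader(io.BytesIO(b"abcdefghijkl"), block_size=8)
out = []
while True:
    window = reader.read(3)
    if len(window) == 0:
        break
    out.append(bytes(window))  # copy out before the block is overwritten
```

Note that a window is only valid until the block is refilled, which is exactly the kind of ownership rule the API documentation would need to spell out.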

I can also see the use case for the guest calling read_into and passing in a pointer. I would be in favor of both APIs (which admittedly makes resource implementation a little more work). Both read and read_into would have to be implemented, and the documentation might advise API users which one is more efficient.

If I had to choose between the two, I'd favor host-managed memory because of its advantages in large data handling.
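One way to picture an interface offering both calls is sketched below. This is an assumption-laden illustration, not the proposed design: the class names are hypothetical, and giving `read_into` a default implementation in terms of `read` is just one option for reducing the extra implementation work (at the cost of a copy).

```python
class InputStream:
    """Hypothetical stream interface exposing both styles of read."""

    def read(self, max_len):
        """Return a window (memoryview) owned by the stream."""
        raise NotImplementedError

    def read_into(self, buf):
        """Copy into a caller-owned buffer; returns the number of bytes written."""
        window = self.read(len(buf))
        n = len(window)
        buf[:n] = window  # same-length slice assignment, buf is not resized
        return n


class BytesStream(InputStream):
    """Minimal in-memory implementation, used here only for demonstration."""

    def __init__(self, data):
        self._data = memoryview(data)
        self._pos = 0

    def read(self, max_len):
        window = self._data[self._pos:self._pos + max_len]
        self._pos += len(window)
        return window


stream = BytesStream(b"abcdef")
first = bytes(stream.read(4))    # stream-owned window, copied for inspection
buf = bytearray(4)
n = stream.read_into(buf)        # caller-owned buffer
second = bytes(buf[:n])
```

An implementation that can fill a caller's buffer more cheaply than it can hand out a window would override `read_into` directly.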

@brendandburns
Contributor Author

I think that having both options is a good idea. If we are going to do it host side, I think we absolutely need to implement a good library to make it easy for host-side implementors to support streams.

We also probably need to be super clear with people about the ownership semantics (#7) because if it is host managed then you definitely don't want to be calling free(...) on the host-managed memory from within the guest code.

However, the other thing that I'm worried about is accidental guest-side DoS attacks or memory leaks via host memory allocations. I guess that because the host is allocating things within the WASM memory this might not be an issue, but it feels an awful lot like the distinction between kernel memory and userspace memory, and it is always harder to detect/prevent memory leaks in kernel space memory.

@brendandburns
Contributor Author

Admittedly I'm new to WASM and wasmtime, but I'm struggling to see how the host can allocate memory in the guest without the guest exposing some sort of alloc function back to the host.

I can see how you can create a new page of memory, but it's totally unclear how you could pass the right pointer back to the guest (and, more importantly, how the guest would know that that block of memory is in use).

Pointers to how this is supposed to work would be welcome.

@lukewagner
Member

To do this, we could extend the Canonical ABI with a special case for returned lists (either directly or lightly nested, e.g., in a result<list<u8>>). It's certainly possible, but, thus far, we've been just trying to avoid adding too many of these cases until we have things more-fully implemented so we can measure while optimizing.

@brendandburns
Contributor Author

Can we add read-into into the WIT spec then in the meantime? I added it in my own fork and it works as intended...

@brendandburns
Contributor Author

btw, I figured out the cabi_realloc thing, which makes this a little easier. However, some wit-bindgen runtimes (TinyGo for example) do not appear to expose a cabi_realloc function.

@lukewagner
Member

That's a good question, and maybe I'm misunderstanding your idea, but iiuc the challenge with having a read-into in Wit is that Wit and the underlying component model don't have a type to directly express an outparam (for a couple of distinct reasons, one being that, if an import is implemented by another component's export, the callee will have a different linear memory). Thus, we have to say that the high-level semantics is plain old value copy of params/results and it's just an ABI-level optimization to say that, instead of using the usual cabi_realloc to allocate memory in the caller, the runtime will instead use a (param i32 i32)-supplied buffer. It's a subtle distinction, but it ends up avoiding a bunch of otherwise hairy ABI and binding issues.

@brendandburns
Contributor Author

I think the most important thing is that if the host runtime is using cabi_realloc then you are forcing the guest to allocate memory for every read. In many contexts, I may want to allocate a buffer once and then reuse it repeatedly for subsequent reads.

The distinction is whether the guest or the host is in control of the memory allocation in the guest for the purposes of reading.
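To make that distinction concrete, here is a toy Python model of the two ways a returned list of bytes could land in guest memory. Everything in it is an assumption for illustration: `GuestMemory`, `lower_list_via_realloc`, and `lower_list_into_buffer` are hypothetical names, a `bytearray` stands in for linear memory, and the bump allocator only mirrors the argument order of cabi_realloc (old pointer, old size, alignment, new size). It is not how wasmtime or any real runtime is implemented.

```python
class GuestMemory:
    """Toy stand-in for a guest's linear memory with a bump allocator."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self._next = 0        # bump pointer
        self.alloc_calls = 0  # number of allocations performed

    def realloc(self, old_ptr, old_size, align, new_size):
        """Mirrors cabi_realloc's argument order; only fresh allocations are handled."""
        assert old_ptr == 0 and old_size == 0, "toy allocator: no resizing"
        ptr = (self._next + align - 1) // align * align  # round up to alignment
        assert ptr + new_size <= len(self.mem), "out of guest memory"
        self._next = ptr + new_size
        self.alloc_calls += 1
        return ptr


def lower_list_via_realloc(guest, data):
    """Host-driven path: allocate in the guest for every result."""
    ptr = guest.realloc(0, 0, 1, len(data))
    guest.mem[ptr:ptr + len(data)] = data
    return ptr, len(data)


def lower_list_into_buffer(guest, data, buf_ptr, buf_len):
    """Caller-supplied path: write into a (ptr, len) buffer the guest already owns."""
    n = min(len(data), buf_len)
    guest.mem[buf_ptr:buf_ptr + n] = data[:n]
    return n


chunks = (b"abc", b"defg")

# Path 1: one allocation per read.
per_read = GuestMemory()
results = [lower_list_via_realloc(per_read, c) for c in chunks]

# Path 2: the guest allocates once and reuses the buffer for every read.
reused = GuestMemory()
buf_ptr, buf_len = reused.realloc(0, 0, 1, 8), 8
lengths = [lower_list_into_buffer(reused, c, buf_ptr, buf_len) for c in chunks]
```

In this model the first path performs one allocation per read, while the second performs a single allocation up front, which is the reuse behavior being asked for.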

@brendandburns
Contributor Author

And concretely, read-into just takes a u32 value which represents a pointer to the location in the guest's memory that has already been allocated by the guest code.

@lukewagner
Member

Ah, while that is similar in spirit to what Preview-1 interfaces did, with Wit and the component model, the callee function doesn't have access to the caller's memory. E.g., due to virtualizability, the callee might be another component instance with a distinct linear memory. Or, on the Web, the callee can be JS glue code which doesn't have access to the caller's linear memory.

I do like the optimization of allowing the caller to supply a buffer for returned lists though; it's just something we need to add at the Canonical ABI level. I filed component-model/#175 to track and discuss further.

@brendandburns
Contributor Author

brendandburns commented Mar 26, 2023

Closing this in favor of WebAssembly/component-model#175

3 participants