Consider mixing Tokio & Rayon #25
Comments
- Why use …
- @jacobbieker the conversation we had on Wednesday morning about needing to handle use-cases like your ICON GRIB-to-Zarr use-case has pushed me to completely reconsider the API and internal design for …
- Glad it helped! Very excited to see how this progresses!
- Some relevant links: …
- Oh, wait, AsyncIterator already has a from_iter method! https://docs.rs/async-iterator/latest/async_iterator/trait.FromIterator.html Although I need to think how to keep the submission queue topped up, within the async iterator.
- My plan is now for LSIO to "just" focus on async IO. So LSIO won't contain any …
This idea is very early! Right now, I only have a very fuzzy grasp of what I'm trying to achieve, and a fuzzy grasp of how that might be implemented! I'll use this issue to collect links and ideas.
Background and context
At first glance, Tokio should be used for async IO (like building a web server). And Rayon should be used for running CPU-intensive tasks in parallel. The issue is that `light-speed-io` wants to do both of those tasks: loading data from huge numbers of files using non-blocking IO (using `io_uring` on Linux), and also doing lots of CPU-intensive processing on those files (like decompressing in parallel).

The use-case that motivates me to think again about using Tokio with Rayon
Jacob described a use-case, which I summarised in #24. The basic idea is: say we have millions of small files, and we want to combine these files to create a "re-chunked" version of this dataset on disk. In Jacob's example, there are 146,000 GRIB2 files per NWP init time, and all these files need to be saved into a handful of Zarr chunks. Each GRIB2 file has to be decompressed; and then the decompressed files have to be combined and sliced up again, and then those slices are compressed and saved back to disk.
The broad shape of a possible solution
Users would define:

- a `map` function to be applied (in parallel) to each GRIB2 file (to decompress it);
- a `reduce_group` function which receives all the GRIB buffers in a given group. This outputs a vector of buffers (and paths and byte_ranges);
- a `map` function that'll be applied in parallel to these buffers (to compress them).

How Rust async might help
Hopefully users could write code like:
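Something roughly like the sketch below, perhaps. Every type, method, and helper function in it is hypothetical, invented purely to illustrate the shape of a possible user-facing API; none of it exists in LSIO yet.

```rust
// Purely illustrative sketch: none of these types, methods, or helper
// functions exist yet. It only shows the rough shape of a possible
// user-facing API for the GRIB-to-Zarr re-chunking use-case.
use std::path::PathBuf;

fn rechunk(grib_paths: Vec<PathBuf>) {
    lsio::read_files(grib_paths)             // async IO (io_uring on Linux)
        .map(decompress_grib)                // CPU-bound: run on the Rayon pool
        .group_by(zarr_chunk_for)            // which Zarr chunk does each buffer belong to?
        .reduce_group(combine_and_slice)     // one group -> a vector of buffers (+ paths & byte_ranges)
        .map(compress_chunk)                 // CPU-bound: run on the Rayon pool
        .save();                             // async IO: write the Zarr chunks to disk
}
```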
UPDATE: I need to learn more about Rust's Streams (async iterators).
Links
Tutorials / blog posts
- `spawn_blocking` is best suited for wrapping blocking IO, not for CPU-intensive tasks. Instead, use `rayon::spawn` with `tokio::sync::oneshot` (see Alice's blog for more details).
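For reference, here's a minimal, self-contained sketch of that pattern (the parallel sum is just a stand-in for real CPU-heavy work such as decompressing a GRIB2 message):

```rust
use rayon::prelude::*;

/// Run CPU-heavy work on Rayon's thread pool from async code without
/// blocking a Tokio worker thread: spawn the job with `rayon::spawn`
/// and get the result back over a `tokio::sync::oneshot` channel.
async fn parallel_sum(nums: Vec<i64>) -> i64 {
    let (send, recv) = tokio::sync::oneshot::channel();

    rayon::spawn(move || {
        // The parallel sum stands in for real work (e.g. decompression).
        let sum: i64 = nums.par_iter().sum();
        // Ignore the error: it only occurs if the receiver was dropped.
        let _ = send.send(sum);
    });

    // Await the result without blocking the Tokio runtime.
    recv.await.expect("the Rayon task panicked")
}

#[tokio::main]
async fn main() {
    let nums: Vec<i64> = (1..=1_000_000).collect();
    println!("sum = {}", parallel_sum(nums).await);
}
```

I believe the `tokio-rayon` crate (listed below) packages up essentially this pattern.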
Rust crates & PRs
- `tokio-rayon`: "Mix async code with CPU-heavy thread pools using Tokio + Rayon". Last release was in 2021.
- `futures` crate. Contains `Stream`s (which LSIO could use as the source of data; `Stream`s are async iterators) and `Sink`s (for writing data).
- `async_stream`: "Provides two macros, `stream!` and `try_stream!`, allowing the caller to define asynchronous streams of elements. These are implemented using `async` & `await` notation." Allows us to define streams using a very similar approach to Python, using `yield`. Also allows us to write `for await value in input` to implement one stream from another stream (see the sketch after this list).
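A minimal sketch of that `stream!` / `for await` pattern (nothing LSIO-specific; doubling each element stands in for real per-buffer work like decompression):

```rust
use async_stream::stream;
use futures::{pin_mut, Stream, StreamExt};

/// Build a stream with the `stream!` macro and `yield`, much like a
/// Python generator.
fn numbers() -> impl Stream<Item = u32> {
    stream! {
        for i in 0..5 {
            yield i;
        }
    }
}

/// Derive one stream from another using `for await`. Doubling each
/// element stands in for real per-buffer work (e.g. decompression).
fn double(input: impl Stream<Item = u32>) -> impl Stream<Item = u32> {
    stream! {
        for await value in input {
            yield value * 2;
        }
    }
}

#[tokio::main]
async fn main() {
    let doubled = double(numbers());
    pin_mut!(doubled); // Pin the stream so we can call `.next()` on it.
    while let Some(value) = doubled.next().await {
        println!("{value}");
    }
}
```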