
Consider backing Inko processes by OS threads #690

Closed · yorickpeterse opened this issue on Feb 9, 2024 · 6 comments
Labels: compiler, feature, performance, runtime

yorickpeterse commented Feb 9, 2024

Description

Inko's approach to concurrency is similar to that of Erlang and Go: M Inko processes are mapped onto N OS threads. For sockets we use non-blocking IO, and for files we use blocking operations coupled with a backup thread pool in case these operations take too long.

This setup is based on the general belief that M:N scheduling combined with non-blocking IO leads to better performance compared to 1:1 scheduling with blocking operations. These benefits, however, are debatable, highly dependent on the type of workload, and come with their own set of non-trivial trade-offs.

A benefit of green threads with an M:N scheduler is that spawning tasks is fast and efficient, such that you can spawn many of them rapidly. On paper this seems beneficial, but in practice it remains to be seen if it truly is. For example, in a typical transactional application (basically any web application), the amount of concurrency is limited not by how many or how fast you can spawn your tasks, but by the concurrency supported by the external services (e.g. a database) the transaction relies upon. This means it doesn't really matter that you're able to spawn 10 000 processes with ease, if you're still limited to running only 20 concurrently due to your database pool being limited to 20 concurrent connections.

Even if your system somehow supported unbounded/unlimited concurrency, you really don't want that in a production setting, as planning around unbounded concurrency is impossible and bound to lead to problems. In contrast, it's much easier to deal with a system that's limited to, for example, 32 concurrent tasks.

Even if you could somehow solve this, green threading poses additional problems such as:

  1. The additional overhead the scheduler introduces to ensure tasks are run fairly
  2. The additional system calls and locking that come with the use of non-blocking IO
  3. Poor C interop
  4. Platform-specific assembly to support stack swapping, making it more difficult to support multiple platforms (one of the reasons we don't support Windows at this time)
  5. Poor support for C libraries that require thread-local storage or thread pinning (e.g. most GUI libraries), unless we develop a way of pinning Inko processes to OS threads, which in turn adds scheduler complexity
  6. The code complexity that comes with supporting all of this

There are usually two reasons one might want to avoid the typical thread-per-request approach and instead opt for M:N scheduling:

  1. Spawning OS threads is more expensive than spawning green threads
  2. The cost of OS thread context switching is greater than that of green threads

The cost of context switching only really matters in systems where we have fully isolated transactions that don't depend on a fixed-size pool of some sort, i.e. tasks that are purely CPU bound. But for such workloads I suspect that 1:1 scheduling is in fact better, because you don't have the cost of the additional bookkeeping.

The cost of spawning threads is something one should be able to mitigate (or at least improve upon) by reusing threads: you maintain a pool of reusable threads, initially at size zero. When a thread is needed, we check the pool and reuse one if present; if not, we spawn a new one. When a thread finishes, it enters the reusable pool for up to N seconds, after which it stops. Given a sufficiently large upper limit (e.g. 1000 threads), the cost of spawning threads is amortized over time, with the best-case cost being roughly that of locking a mutex and popping from a queue (see the sketch below).
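
A minimal sketch of this reuse scheme in Rust, assuming a timeout-based idle pool; the Pool type, the Job alias, and the timeout value are illustrative and not part of Inko's runtime:

use std::sync::mpsc::channel;
use std::sync::mpsc::Sender;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// A unit of work to run on a (possibly reused) OS thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

struct Pool {
    // Senders belonging to idle worker threads.
    idle: Arc<Mutex<Vec<Sender<Job>>>>,
    timeout: Duration,
}

impl Pool {
    fn new(timeout: Duration) -> Pool {
        Pool { idle: Arc::new(Mutex::new(Vec::new())), timeout }
    }

    fn spawn(&self, mut job: Job) {
        // Try to reuse an idle worker first. A worker may have timed out
        // after parking itself, in which case send() fails and hands the
        // job back so we can try the next worker.
        loop {
            let worker = self.idle.lock().unwrap().pop();

            match worker {
                Some(worker) => match worker.send(job) {
                    Ok(()) => return,
                    Err(error) => job = error.0,
                },
                None => break,
            }
        }

        // No reusable thread is available, so spawn a fresh one.
        let (send, recv) = channel::<Job>();
        let idle = Arc::clone(&self.idle);
        let timeout = self.timeout;

        send.send(job).unwrap();
        thread::spawn(move || {
            // Run jobs until we've been idle for longer than the timeout,
            // then exit, removing this thread from circulation.
            while let Ok(job) = recv.recv_timeout(timeout) {
                job();

                // Park ourselves in the idle pool for reuse.
                idle.lock().unwrap().push(send.clone());
            }
        });
    }
}

fn main() {
    let pool = Pool::new(Duration::from_secs(10));

    pool.spawn(Box::new(|| println!("runs on a new thread")));
    thread::sleep(Duration::from_millis(10));

    // This job most likely reuses the thread spawned above.
    pool.spawn(Box::new(|| println!("runs on a reused thread")));
    thread::sleep(Duration::from_millis(10));
}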

The cost of OS context switching applies even when using M:N scheduling: it's still there, and we have no control over it. In certain scenarios this can make things worse, such as when a process is rescheduled only for its OS thread to then be swapped out by the kernel. In other words, M:N scheduling doesn't remove this cost; it just makes it less common.

I've been thinking about this over the years, but the more I think about it, and the more challenges I encounter with the M:N scheduler, the more I think we should move to a 1:1 scheduler with the above thread reuse mechanism. The benefits are numerous:

  1. We get to remove a ton of code from the compiler and runtime library
  2. We no longer need a special mechanism to deal with thread pinning, thread-local state, etc., making it easier to interact with C libraries that need this
  3. We can get rid of epoll/kqueue/etc and just use blocking IO and let the kernel handle things. Linux is perfectly capable of handling tens of thousands of threads blocking on IO. Even on my laptop I can easily run 100 000 threads or so without additional work (see #540: "When spawning many processes, the runtime crashes when trying to call mprotect() to set up stack guards")
  4. We can (and should) still set the thread stack sizes to something smaller than the default 8 MiB of virtual memory, just as we do now, without the need to manually reuse stack memory (see the first sketch after this list)
  5. We can work towards supporting Windows again more easily, as we no longer need the platform-specific assembly used for swapping processes and stacks
  6. Types such as Channel could be simplified, as we can now just use a regular condition variable and mutex for blocking processes on channels (see the channel sketch further below)
  7. No more primary and blocking thread pools
  8. Sockets can be made smaller, as we no longer need to track the additional state used by the network poller
  9. Debuggers and profilers (e.g. Valgrind) should work better with Inko, as these can get confused when stacks are switched
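
As a minimal illustration of point 4, Rust's standard library already lets you pick the stack size when spawning a thread; the 1 MiB figure below is just an example:

use std::thread;

fn main() {
    // Reserve 1 MiB of virtual memory for the stack, instead of the
    // platform default (typically 8 MiB on Linux).
    let handle = thread::Builder::new()
        .stack_size(1024 * 1024)
        .spawn(|| println!("running on a thread with a smaller stack"))
        .unwrap();

    handle.join().unwrap();
}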

Of course, at the language level nothing would change: processes would still be lightweight processes (they remain lightweight compared to OS processes), and the way you use channels/etc would remain the same. You'd also still spawn processes per transaction where possible; it's just that each process would now be backed by a dedicated OS thread. In other words, the use of 1:1 scheduling is just an implementation detail transparent to the language.
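
To illustrate point 6 above: with every process backed by its own OS thread, blocking on a channel reduces to a mutex, a queue, and a condition variable. A minimal sketch, assuming an unbounded channel; this is an illustrative stand-in, not Inko's actual Channel type:

use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct Channel<T> {
    queue: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> Channel<T> {
    fn new() -> Channel<T> {
        Channel { queue: Mutex::new(VecDeque::new()), ready: Condvar::new() }
    }

    fn send(&self, value: T) {
        self.queue.lock().unwrap().push_back(value);

        // Wake up one thread blocked in recv(), if any.
        self.ready.notify_one();
    }

    fn recv(&self) -> T {
        let mut queue = self.queue.lock().unwrap();

        // The OS thread simply blocks here; no green-thread scheduler
        // needs to be informed or re-entered.
        while queue.is_empty() {
            queue = self.ready.wait(queue).unwrap();
        }

        queue.pop_front().unwrap()
    }
}

fn main() {
    let sender = Arc::new(Channel::new());
    let receiver = Arc::clone(&sender);

    thread::spawn(move || sender.send("hello"));
    println!("{}", receiver.recv());
}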

Related work

Issues we could close

Assuming we drop the use of green threading, the following issues could be closed due to no longer being relevant:

yorickpeterse added the feature, performance, compiler and runtime labels on Feb 9, 2024
yorickpeterse commented:

At some point in the last two years I hacked together a small PoC that replaced the scheduler with a 1:1 scheduler. At the time this resulted in a small increase in execution times for the test suite, but that was when we were still using an interpreter. That setup also didn't reuse any threads, so I suspect most of the extra time was spent just starting threads.

yorickpeterse commented Feb 9, 2024

Here's a simple and admittedly poorly implemented example of amortizing the thread spawn cost by reusing threads:

use std::sync::mpsc::channel;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

// Spawn a brand new thread for every message, and measure the time
// between sending a message and the thread observing it.
fn naive() {
    let mut i = 0;
    let mut fastest = Duration::from_secs(100);

    while i < 50_000 {
        let (input_send, input_rec) = channel();
        let (output_send, output_rec) = channel();

        input_send.send(Instant::now()).unwrap();
        thread::spawn(move || {
            output_send
                .send(input_rec.recv().unwrap().elapsed())
                .unwrap();
        });

        let time = output_rec.recv().unwrap();

        if time < fastest {
            fastest = time;
        }

        i += 1;
    }

    println!("naive: {:?}", fastest);
}

// The same measurement, except threads are parked in a pool after use so
// later iterations can reuse them instead of spawning new ones.
fn reused() {
    let mut i = 0;
    let mut fastest = Duration::from_secs(100);
    let reusable = Mutex::new(Vec::with_capacity(32));

    while i < 50_000 {
        let (input, output) = {
            let mut threads = reusable.lock().unwrap();

            // Reuse a parked thread if one is available, otherwise spawn
            // a new one that keeps handling messages until its input
            // channel is disconnected.
            if let Some(res) = threads.pop() {
                res
            } else {
                let (input_send, input_rec) = channel::<Instant>();
                let (output_send, output_rec) = channel::<Duration>();

                thread::spawn(move || loop {
                    if let Ok(t) = input_rec.recv() {
                        let _ = output_send.send(t.elapsed());
                    } else {
                        break;
                    }
                });

                (input_send, output_rec)
            }
        };

        input.send(Instant::now()).unwrap();

        let time = output.recv().unwrap();

        // Park the thread's channels so the next iteration can reuse it.
        reusable.lock().unwrap().push((input, output));

        if time < fastest {
            fastest = time;
        }

        i += 1;
    }

    println!("reused: {:?}", fastest);
}

fn main() {
    naive();
    reused();
}

In the reused case you can't use join() to get the thread results, so in the interest of comparing apples to apples, both examples use channels for their input and output.

Running this with cargo run --release yields the following on my laptop:

naive: 14.089µs
reused: 653ns

The "reused" time varies a bit between 500 nsec and 1 µsec, but it highlights how easily you can reduce the spawn cost by just reusing threads. Assuming a real and accurate implementation (the above version only ever spawns a single thread and always reuses it) might need some extra bookkeeping, we'd still be looking at a 10x improvement at least.

The context switch cost remains, but I'm willing to bet that for 95% of the applications out there this is a non-issue to begin with.

yorickpeterse commented:

Another point to consider: green threads typically come with smaller, growable stacks, such that the initial amount of (virtual) memory they need is smaller. However, Inko's stack sizes are fixed at 1 MiB by default, as resizing stacks comes with its own overhead and complicates code generation (you have to ensure the stack size check always comes first in every function).

yorickpeterse added this to the Future ideas milestone on Feb 10, 2024
yorickpeterse commented:

An argument against one thread per Inko process is a less consistent experience: running many OS threads requires tuning various /sys settings to avoid running into errors. In addition, macOS applies a limit on the number of threads you can spawn per process, and IIRC that limit is around 2000. In contrast, Inko's scheduler doesn't require any tuning, whether you spawn 1 or 100 000 processes.

yorickpeterse commented:

Another argument against OS threads in the context of FFI:

Pinning an Inko process to an OS thread isn't a great approach to handling C libraries that require running on the same thread, but it's also not that big of a deal. We could instead change the scheduler such that the main process always runs on the same thread, and not offer a generic pinning mechanism. This is easy enough to implement and sufficient for using libraries that must run on the same thread.

yorickpeterse removed this from the Future ideas milestone on Aug 5, 2024
yorickpeterse commented:

I'm going to close this for the time being. As much as I prefer the use of OS threads due to the simplicity it brings, the cost is simply too great at this stage, and that will likely remain the case for years to come.
