Isolate MiriMachine memory from Miri's #4343


Merged
merged 2 commits into rust-lang:master from the discrete-allocator branch on May 29, 2025

Conversation

nia-e
Contributor

@nia-e nia-e commented May 22, 2025

Based on discussion surrounding #4326, this adds the (very simple) discrete allocator that the MiriMachine will use. See the design document linked there for considerations, but in brief: we could pull in an off-the-shelf allocator for this, but performance isn't a massive worry, and doing it this way might make it easier to support multi-seeded runs in the future (without a lot more extraneous plumbing, at least).

@nia-e
Contributor Author

nia-e commented May 22, 2025

@rustbot ready

@rustbot rustbot added the S-waiting-on-review Status: Waiting for a review to complete label May 22, 2025
@nia-e nia-e force-pushed the discrete-allocator branch 4 times, most recently from 6cbc283 to b53ed38 on May 23, 2025 10:39
Member

@RalfJung RalfJung left a comment


Thanks for the PR! I left some first comments, but this is not a full review. I'd rather not reverse-engineer the invariants of MachineAlloc myself, so I'll wait for you to document them, which will make review a lot easier.

Furthermore, all pub fn in discrete_alloc should have proper doc comments, not just a safety comment. Please also add some basic unit tests -- we don't use them much in Miri, but this is one of the cases where they would make sense.

On Zulip you mentioned some benchmarks. Can you put the benchmark results for the variant you ended up going with here in the PR?

@RalfJung
Member

@rustbot author

@rustbot rustbot removed the S-waiting-on-review Status: Waiting for a review to complete label May 24, 2025
@rustbot
Collaborator

rustbot commented May 24, 2025

Reminder, once the PR becomes ready for a review, use @rustbot ready.

@rustbot rustbot added the S-waiting-on-author Status: Waiting for the PR author to address review comments label May 24, 2025
@nia-e
Contributor Author

nia-e commented May 24, 2025

I'll post benchmarks in a bit! I realised there might be some speed gains to be made with very simple changes, so I'll just experiment a little first. Thanks for the comments ^^

@nia-e nia-e force-pushed the discrete-allocator branch from 4b1f50f to 9f1047e on May 24, 2025 15:13
@nia-e
Contributor Author

nia-e commented May 24, 2025

Baseline is having the allocator fully disabled. The new code is only marginally slower in most cases, though it seems to struggle with large allocations. I wonder how much work it would be to improve that, but if we go down the "only machines using this will touch it" path, I hope it's not too bad? I got a slight (~4%) improvement from calling mmap directly instead of alloc::alloc(), but I doubt that's worth it. Is Miri built with optimisations on by default when invoking ./miri bench?

Comparison with baseline (relative speed, lower is better for the new results):
  backtraces: 1.36 ± 0.08
  big-allocs: 32.71 ± 1.60
  mse: 1.05 ± 0.10
  range-iteration: 1.06 ± 0.03
  serde1: 1.00 ± 0.03
  serde2: 1.06 ± 0.03
  slice-chunked: 1.03 ± 0.04
  slice-get-unchecked: 0.93 ± 0.08
  string-replace: 1.06 ± 0.04
  unicode: 1.06 ± 0.05
  zip-equal: 1.08 ± 0.02

@RalfJung
Member

Is Miri built with optimisations on by default when invoking ./miri bench?

Yes.

A 32x slowdown with big allocations is hefty.^^ Shouldn't those just forward to alloc::alloc anyway as part of the huge allocs treatment? Without looking at the code, what I'd assume happens when the allocation request is close to the page size is: round up to multiple of page size, and then just allocate that and use it directly without any kind of tracking of "which parts of this page are used" or so. That should have basically no overhead.
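
For illustration, here is a minimal sketch of that huge-alloc path, assuming a fixed 4096-byte page size; the `alloc_huge` name and signature are illustrative, not Miri's actual API:

```rust
use std::alloc::{self, Layout};

const PAGE_SIZE: usize = 4096; // assumed; real code would query the OS

/// Serve a large request directly: round up to whole pages and hand the
/// region out without any per-page tracking, so the only bookkeeping needed
/// is whatever lets us rebuild the layout on deallocation.
unsafe fn alloc_huge(size: usize, align: usize) -> *mut u8 {
    let rounded = size.div_ceil(PAGE_SIZE) * PAGE_SIZE;
    let layout = Layout::from_size_align(rounded, align).expect("bad layout");
    unsafe { alloc::alloc(layout) }
}
```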

@nia-e
Contributor Author

nia-e commented May 24, 2025

It does mostly do that, which is what's confusing me... I'll try to fix it, I assume I just missed something really obvious.

@nia-e
Contributor Author

nia-e commented May 24, 2025

I checked; it seems the big-allocs test specifically calls alloc_zeroed, which was the reason for the slowdown (apparently ptr.write_bytes() is a really slow way to zero out memory). I changed the logic a bit to make it generic over calling alloc::alloc() vs. alloc::alloc_zeroed(), and these are the (much better!) results:

Comparison with baseline (relative speed, lower is better for the new results):
  backtraces: 1.23 ± 0.06
  big-allocs: 1.03 ± 0.05
  mse: 0.99 ± 0.09
  range-iteration: 1.01 ± 0.03
  serde1: 0.99 ± 0.02
  serde2: 1.03 ± 0.02
  slice-chunked: 0.97 ± 0.04
  slice-get-unchecked: 0.89 ± 0.08
  string-replace: 1.00 ± 0.03
  unicode: 1.01 ± 0.04
  zip-equal: 1.05 ± 0.04

I might be able to squeeze a bit more perf out by actually making the functions generic instead of just passing in a function pointer, but I'm unsure if that's necessary.
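
For illustration, a rough sketch of that function-pointer shape; `allocate_inner` and its signature are assumptions for this example, not the PR's actual code:

```rust
use std::alloc::{self, Layout};

/// The shared allocation path takes the raw routine as a parameter, so the
/// same bookkeeping serves both zeroed and non-zeroed requests. A generic
/// `F: FnOnce(Layout) -> *mut u8` instead of a plain `fn` pointer would let
/// the call be inlined, which is the extra perf mentioned above.
unsafe fn allocate_inner(
    layout: Layout,
    alloc_fn: unsafe fn(Layout) -> *mut u8,
) -> *mut u8 {
    // ...isolated-allocator bookkeeping would go here...
    unsafe { alloc_fn(layout) }
}

// Callers pick the routine:
//   unsafe { allocate_inner(layout, alloc::alloc) }
//   unsafe { allocate_inner(layout, alloc::alloc_zeroed) }
```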

@RalfJung
Member

Ah yes, that is exactly why we added that particular benchmark. :)

@nia-e
Contributor Author

nia-e commented May 24, 2025

What kind of unit tests do you think belong here? I assumed functionality is covered by the usual tests, but I'll happily add in some stuff if you think it's relevant

@RalfJung
Member

What kind of unit tests do you think belong here? I assumed functionality is covered by the usual tests, but I'll happily add in some stuff if you think it's relevant

Similar to range_map: just add some functions with #[test] in the file with the allocator implementation to test some particular corner cases -- or at least a basic smoke test if there are no corner cases.
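
For instance, a hypothetical smoke test along those lines; the `IsolatedAlloc` constructor and `alloc`/`dealloc` methods are assumed names here, not necessarily the implementation's actual interface:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::alloc::Layout;

    // Smoke test: a small allocation round-trips through the allocator and
    // the returned pointer is non-null, aligned, and writable.
    #[test]
    fn smoke() {
        let mut isolated = IsolatedAlloc::new(); // assumed constructor
        let layout = Layout::from_size_align(16, 8).unwrap();
        unsafe {
            let ptr = isolated.alloc(layout);
            assert!(!ptr.is_null());
            assert_eq!(ptr as usize % layout.align(), 0);
            ptr.write_bytes(0xAB, layout.size());
            isolated.dealloc(ptr, layout);
        }
    }
}
```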

@nia-e
Contributor Author

nia-e commented May 24, 2025

Opened the PR on the main repo. I'll get to adding tests if everything there is okay. (It is not okay, I need to do more things, oops.)

@nia-e
Contributor Author

nia-e commented May 24, 2025

Expecting the build to fail for now since it's adapted to the changes from the PR (but also Miri seems to be having trouble on the current upstream master commit, so I guess it's pending that being fixed too)

@nia-e
Contributor Author

nia-e commented May 24, 2025

Tests added :D let me know if there's anything more to do

@nia-e nia-e force-pushed the discrete-allocator branch from 87f2b3f to f7fe286 on May 24, 2025 22:52
@rustbot

This comment has been minimized.

@nia-e nia-e force-pushed the discrete-allocator branch 4 times, most recently from c4ad33f to b832def on May 25, 2025 16:20
@nia-e nia-e force-pushed the discrete-allocator branch 2 times, most recently from 32672ff to 1539771 on May 28, 2025 15:26
@nia-e
Contributor Author

nia-e commented May 28, 2025

My takeaway so far is that I need to get better at double-checking myself after a refactor. I hope I addressed everything brought up, though

jhpratt added a commit to jhpratt/rust that referenced this pull request May 29, 2025
interpret/allocation: Fixup type for `alloc_bytes`

This can be `FnOnce`, which helps us avoid an extra clone in rust-lang/miri#4343

r? RalfJung
@RalfJung
Member

This changed quite a bit since the last benchmark run -- could you re-run the benchmark to see whether that Rc has a significant cost?

rust-timer added a commit to rust-lang/rust that referenced this pull request May 29, 2025
Rollup merge of #141682 - nia-e:fixup-alloc, r=RalfJung

interpret/allocation: Fixup type for `alloc_bytes`

This can be `FnOnce`, which helps us avoid an extra clone in rust-lang/miri#4343

r? RalfJung
@nia-e
Contributor Author

nia-e commented May 29, 2025

Ok! So there's a small problem: the jemalloc implementation Rust uses by default seems to be really slow with page-aligned allocations; the reason perf was good before is that I forgot to actually enforce page alignment in the alloc_huge calls. I checked: even passing in a constant 4096 as the alignment causes big-allocs to see a very significant slowdown, depending on the run (it seems quite variable; it was as low as 8.9x at one point, but I didn't save those results).

However, if we do the huge allocs by directly calling mmap, it drops back down to parity. Here's the benchmark run with the old (jemalloc-based) approach:

Comparison with baseline (relative speed, lower is better for the new results):
  backtraces: 1.18 ± 0.05
  big-allocs: 22.30 ± 1.89
  mse: 1.01 ± 0.07
  range-iteration: 1.08 ± 0.02
  serde1: 1.06 ± 0.03
  serde2: 1.04 ± 0.04
  slice-chunked: 0.98 ± 0.04
  slice-get-unchecked: 1.03 ± 0.04
  string-replace: 1.00 ± 0.06
  unicode: 1.02 ± 0.04
  zip-equal: 1.01 ± 0.03

And this is what I got calling mmap directly; note this means the pointer will only ever be page-aligned at most (I'm unsure whether relying on greater-than-page alignment is even legal, given that it's impossible to guarantee?). A rough sketch of the mmap path follows the numbers:

Comparison with baseline (relative speed, lower is better for the new results):
  backtraces: 1.18 ± 0.05
  big-allocs: 1.07 ± 0.10
  mse: 1.00 ± 0.07
  range-iteration: 1.03 ± 0.02
  serde1: 1.02 ± 0.04
  serde2: 1.00 ± 0.04
  slice-chunked: 0.97 ± 0.05
  slice-get-unchecked: 0.97 ± 0.02
  string-replace: 0.96 ± 0.08
  unicode: 1.01 ± 0.04
  zip-equal: 1.00 ± 0.03
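
For reference, a minimal sketch of what the mmap-backed huge-alloc path could look like (Unix-only, via the `libc` crate; the function name and caller-tracked size are assumptions, not the PR's code):

```rust
/// Map whole pages straight from the OS. The result is page-aligned (and no
/// more than that) and already zeroed; the rounded size must be remembered
/// so it can be passed to `munmap` on deallocation.
fn alloc_huge_mmap(size: usize, page_size: usize) -> *mut u8 {
    let rounded = size.div_ceil(page_size) * page_size;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            rounded,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert!(ptr != libc::MAP_FAILED, "mmap failed");
    ptr.cast::<u8>()
}
```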

@RalfJung
Member

the jemalloc implementation rust uses by default is really slow with page-aligned allocations it seems; the reason perf was good before is that I forgot to actually enforce page-alignment in the alloc_huge calls.

What alignment did you set before?

relying on greater-than-page-alignment is even legal given that it's impossible to guarantee

It's definitely legal, and it's also possible to guarantee. You "just" have to allocate more pages than needed and then round up the pointer you got to the needed alignment. jemalloc likely does that if you ask for a high alignment. It's non-trivial to implement this correctly though, in particular regarding deallocation, so I'd rather not have this in the codebase.
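
For illustration, a small sketch of that over-allocate-and-round arithmetic (the deallocation side, which is the tricky part, is omitted):

```rust
/// Round `addr` up to the next multiple of `align` (a power of two).
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

/// Given a mapping made `align` bytes larger than needed, bump its base
/// pointer up to the requested alignment. Freeing later requires remembering
/// the original base and mapped length.
fn over_aligned(base: *mut u8, align: usize) -> *mut u8 {
    let offset = align_up(base as usize, align) - base as usize;
    base.wrapping_add(offset)
}
```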

Given that the regression only affects native-libs mode, I am also fine with just taking the hit for now and having an issue to track the problem. We can then ping some people there that might know more about this; I am fairly clueless when it comes to allocators.

@nia-e
Contributor Author

nia-e commented May 29, 2025

When refactoring I accidentally just left in whatever align was passed in to us (oops). But if you prefer, I can just leave it as it was before and take the hit, or try to get the mmap implementation to over-allocate like you said, though I expect that will lead to some significant extra complexity (or special-case align <= page_size to use mmap?).

@RalfJung
Member

or special-case max(size, align) <= page_size to use mmap?

I'm pretty sure the offending allocations in big-allocs are the big ones, so that wouldn't help.

I'll be a lot more busy soon than I was recently and thus have less review capacity, so I'd rather land this PR today or tomorrow (in particular the glue code outside the new allocator impl). So I would propose we stick to jemalloc for now, you file an issue for the perf problem, and then if you want to make a follow-up PR that uses mmap you may do so. :) (I thought about it some more and it's not as bad as I thought since the deallocation function has access to the list of all allocations.) I would say that PR should entirely switch the allocator to mmap then, both for small and huge allocs, to avoid mixing the logic for multiple allocators in the same file.

@nia-e nia-e force-pushed the discrete-allocator branch 2 times, most recently from 6b43f8f to 9028dfa on May 29, 2025 16:20
@nia-e
Contributor Author

nia-e commented May 29, 2025

Hopefully this addresses everything, then? I also removed the extra clone in concurrency/thread.rs since the rustc change landed.

@nia-e
Contributor Author

nia-e commented May 29, 2025

Gah I messed up, one sec

@nia-e
Contributor Author

nia-e commented May 29, 2025

There. Is this fine?

@RalfJung
Member

Please squash, I'll take a look later :)

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

allow multiple seeds

use bitsets

fix xcompile

listened to reason and made my life so much easier

fmt

Update src/machine.rs

Co-authored-by: Ralf Jung <[email protected]>

fixups

avoid some clones

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

address review

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

fixup comment

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

address review pt 2

nit

rem fn

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

Update src/alloc/isolated_alloc.rs

Co-authored-by: Ralf Jung <[email protected]>

address review

unneeded unsafe
@nia-e nia-e force-pushed the discrete-allocator branch from d51d7cc to a30ad41 on May 29, 2025 16:37
@RalfJung
Member

I have done some light refactoring; could you take a look and check whether it makes sense to you?

@nia-e
Contributor Author

nia-e commented May 29, 2025

Everything checks out, ty! Should I squash it also?

@RalfJung
Member

Thanks! No need to :)

@RalfJung RalfJung enabled auto-merge May 29, 2025 17:50
@RalfJung RalfJung added this pull request to the merge queue May 29, 2025
Merged via the queue into rust-lang:master with commit 2c35406 May 29, 2025
8 checks passed