Hugepages support for Firecracker #4360

Merged
merged 17 commits into from
Feb 14, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
Expand Up @@ -15,6 +15,9 @@ and this project adheres to
`VcpuExit::MmioRead`, `VcpuExit::MmioWrite`, `VcpuExit::IoIn` and
`VcpuExit::IoOut`. The average for these VM exits is not emitted since it can
be deduced from the available emitted metrics.
+- [#4360](https://github.com/firecracker-microvm/firecracker/pull/4360): Added
+dev-preview support for backing a VM's guest memory by 2M hugetlbfs pages.
+Please see the [documentation](docs/hugepages.md) for more information.

### Changed

55 changes: 55 additions & 0 deletions docs/hugepages.md
@@ -0,0 +1,55 @@
# Backing Guest Memory by Huge Pages

> \[!WARNING\]
>
> Support is currently in **developer preview**. See
> [this section](RELEASE_POLICY.md#developer-preview-features) for more info.

Firecracker supports backing the guest memory of a VM by 2MB hugetlbfs pages.
This can be enabled by setting the `huge_pages` field of `PUT` or `PATCH`
requests to the `/machine-config` endpoint to `2M`.
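
As a rough illustration of the request described above, the sketch below builds a raw HTTP `PUT /machine-config` request that sets `huge_pages` to `2M`. The helper name `machine_config_request` is hypothetical, and the `vcpu_count` and `mem_size_mib` values passed to it are only examples. Sending the request over Firecracker's API Unix socket (for instance with `std::os::unix::net::UnixStream`) is left out.

```rust
/// Builds a raw HTTP/1.1 request for the `/machine-config` endpoint.
/// `machine_config_request` is a hypothetical helper, not part of Firecracker.
fn machine_config_request(vcpu_count: u8, mem_size_mib: usize) -> String {
    // `huge_pages` is the field described above; "2M" selects 2M hugetlbfs pages.
    let body = format!(
        r#"{{"vcpu_count": {}, "mem_size_mib": {}, "huge_pages": "2M"}}"#,
        vcpu_count, mem_size_mib
    );
    // Content-Length must match the body length in bytes (the body is ASCII).
    format!(
        "PUT /machine-config HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    )
}
```

Calling `machine_config_request(2, 1024)` yields a request for a 2 vCPU guest with 1024 MiB of memory; the memory size should be chosen so that the host's huge page pool can back it.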

Backing guest memory by huge pages can improve performance for specific
workloads, because it reduces TLB contention and the overhead of
virtual-to-physical address resolution. It can also help reduce the number of
KVM_EXITS required to rebuild extended page tables after a snapshot restore, as
well as improve boot times (by up to 50%, as measured by Firecracker's
[boot time performance tests](../tests/integration_tests/performance/test_boottime.py)).

Using hugetlbfs requires the host running Firecracker to have a pre-allocated
pool of 2M pages. Should this pool be too small, Firecracker may behave
erratically or receive the `SIGBUS` signal. This is because Firecracker uses the
`MAP_NORESERVE` flag when mapping guest memory. With this flag, the kernel does
not try to reserve sufficient hugetlbfs pages at the time of the `mmap` call,
and instead claims them from the pool on demand. For details on how to manage
this pool, please refer to the [Linux Documentation][hugetlbfs_docs].
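
Because pages are claimed on demand, it can be useful to compare the free pool against the guest size before starting a VM. The following is a minimal sketch under stated assumptions: it parses `/proc/meminfo`-style text for `HugePages_Free`, assumes the host's default huge page size is 2M, and uses hypothetical helper names (`meminfo_field`, `pages_needed`, `pool_is_sufficient`). The check is advisory only, since another process may claim pages between the check and the first guest access.

```rust
/// Size of one 2M huge page, expressed in KiB.
const HUGE_PAGE_KIB: u64 = 2048;

/// Extracts a numeric field such as `HugePages_Free` from `/proc/meminfo`-style text.
fn meminfo_field(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != key {
            return None;
        }
        // The value is the first whitespace-separated token after the colon.
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Number of 2M pages needed to back `mem_size_mib` MiB of guest memory (rounded up).
fn pages_needed(mem_size_mib: u64) -> u64 {
    (mem_size_mib * 1024 + HUGE_PAGE_KIB - 1) / HUGE_PAGE_KIB
}

/// Returns true if the free pool reported in `meminfo` can back the guest.
fn pool_is_sufficient(meminfo: &str, mem_size_mib: u64) -> bool {
    meminfo_field(meminfo, "HugePages_Free")
        .map_or(false, |free| free >= pages_needed(mem_size_mib))
}
```

In practice the `meminfo` argument would come from something like `std::fs::read_to_string("/proc/meminfo")`.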

## Huge Pages and Snapshotting

Restoring a Firecracker snapshot of a microVM backed by huge pages will also use
huge pages to back the restored guest. There is no option to switch between
regular 4K pages and huge pages at restore time. Furthermore, snapshots of
microVMs backed by huge pages can only be restored via UFFD. Lastly, note that
even for guests backed by huge pages, differential snapshots will always track
write accesses to guest memory at 4K granularity.
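
The arithmetic behind the last point can be sketched as follows. This only illustrates the relationship between the two granularities and is not Firecracker's dirty-tracking code; the function names are hypothetical.

```rust
const PAGE_4K: u64 = 4 * 1024;
const PAGE_2M: u64 = 2 * 1024 * 1024;

/// Number of 4K tracking units covered by a single 2M huge page.
fn tracked_units_per_huge_page() -> u64 {
    PAGE_2M / PAGE_4K
}

/// For a write at guest memory byte `offset`, returns the index of the 4K unit
/// that would be marked dirty and the index of the 2M page backing it.
fn dirty_indices(offset: u64) -> (u64, u64) {
    (offset / PAGE_4K, offset / PAGE_2M)
}
```

A single write therefore dirties one of the 512 tracking units inside its huge page, rather than the whole 2M page.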

## Known Limitations

Currently, hugetlbfs support is mutually exclusive with the following
Firecracker features:

- Memory Ballooning via the [Balloon Device](./ballooning.md)
- Initrd

## FAQ

### Why does Firecracker not offer a transparent huge pages (THP) setting?

Firecracker's guest memory is memfd-based. Linux (as of 6.1) does not offer a
way to dynamically enable THP for such memory regions. Additionally, UFFD does
not integrate with THP (no transparent huge pages will be allocated during
userfaulting). Please refer to the [Linux Documentation][thp_docs] for more
information.

[hugetlbfs_docs]: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
[thp_docs]: https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html#hugepages-in-tmpfs-shmem
Original file line number Diff line number Diff line change
Expand Up @@ -161,7 +161,7 @@ connect/send data.
### Example

An example of a handler process can be found
-[here](../../src/firecracker/examples/uffd/valid_handler.rs). The process is
+[here](../../src/firecracker/examples/uffd/valid_4k_handler.rs). The process is
designed to tackle faults on a certain address by loading into memory the entire
region that the address belongs to, but users can choose any other behavior that
suits their use case best.
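
To make the fault-handling step more concrete, here is a minimal sketch of two address computations such a handler typically needs: rounding the faulting address down to the start of its page (the same mask expression that `serve_pf` uses) and finding the region that contains it. `Region`, `page_start` and `region_for` are hypothetical names used for illustration, not the example handler's own types.

```rust
/// Hypothetical, simplified stand-in for a guest memory region mapping.
struct Region {
    base: usize,
    size: usize,
}

/// Rounds `addr` down to the start of its page; `page_size` must be a power of two.
fn page_start(addr: usize, page_size: usize) -> usize {
    assert!(page_size.is_power_of_two());
    addr & !(page_size - 1)
}

/// Finds the region that contains the faulting address, if any.
fn region_for(regions: &[Region], addr: usize) -> Option<&Region> {
    regions
        .iter()
        .find(|r| addr >= r.base && addr < r.base + r.size)
}
```

The same faulting address resolves to different page starts depending on whether the guest is backed by 4K pages or 2M pages, which is why a handler needs to know the page size up front.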
44 changes: 44 additions & 0 deletions resources/overlay/usr/local/bin/fast_page_fault_helper.c
@@ -0,0 +1,44 @@
// Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0

// Helper program for triggering fast page faults after UFFD snapshot restore.
// Allocates a 128M memory area using mmap, touches every page in it using
// memset (setting it to 1) and then calls `sigwait` to wait for a SIGUSR1
// signal. Upon receiving this signal, sets the entire memory area to 2, to
// trigger fast page faults.
// The idea is that an integration test takes a snapshot while the process is
// waiting for the SIGUSR1 signal, and then sends the signal after restoring.
// This way, the second `memset` will trigger a fast page fault for every page
// in the memory region.

#include <stdio.h> // perror
#include <signal.h> // sigwait and friends
#include <string.h> // memset
#include <sys/mman.h> // mmap

// Size of the memory area in bytes (128 MiB).
#define MEM_SIZE_BYTES (128 * 1024 * 1024)

int main(int argc, char *const argv[]) {
    sigset_t set;
    int signal;

    sigemptyset(&set);
    if (sigaddset(&set, SIGUSR1) == -1) {
        perror("sigaddset");
        return -1;
    }
    // sigwait requires the awaited signals to be blocked by the caller.
    if (sigprocmask(SIG_BLOCK, &set, NULL) == -1) {
        perror("sigprocmask");
        return -1;
    }

    void *ptr = mmap(NULL, MEM_SIZE_BYTES, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    // Check for failure before touching the mapping.
    if (MAP_FAILED == ptr) {
        perror("mmap");
        return -1;
    }

    memset(ptr, 1, MEM_SIZE_BYTES);

    sigwait(&set, &signal);

    memset(ptr, 2, MEM_SIZE_BYTES);

    return 0;
}
1 change: 1 addition & 0 deletions resources/rebuild.sh
Expand Up @@ -205,6 +205,7 @@ install_dependencies
BIN=overlay/usr/local/bin
compile_and_install $BIN/init.c $BIN/init
compile_and_install $BIN/fillmem.c $BIN/fillmem
+compile_and_install $BIN/fast_page_fault_helper.c $BIN/fast_page_fault_helper
compile_and_install $BIN/readmem.c $BIN/readmem
if [ $ARCH == "aarch64" ]; then
compile_and_install $BIN/devmemread.c $BIN/devmemread
16 changes: 12 additions & 4 deletions src/firecracker/Cargo.toml
Expand Up @@ -50,12 +50,20 @@ serde_json = "1.0.113"
tracing = ["log-instrument", "seccompiler/tracing", "utils/tracing", "vmm/tracing"]

[[example]]
-name = "uffd_malicious_handler"
-path = "examples/uffd/malicious_handler.rs"
+name = "uffd_malicious_4k_handler"
+path = "examples/uffd/malicious_4k_handler.rs"

[[example]]
-name = "uffd_valid_handler"
-path = "examples/uffd/valid_handler.rs"
+name = "uffd_valid_4k_handler"
+path = "examples/uffd/valid_4k_handler.rs"
+
+[[example]]
+name = "uffd_valid_2m_handler"
+path = "examples/uffd/valid_2m_handler.rs"
+
+[[example]]
+name = "uffd_fault_all_handler"
+path = "examples/uffd/fault_all_handler.rs"

[[example]]
name = "seccomp_harmless"
50 changes: 50 additions & 0 deletions src/firecracker/examples/uffd/fault_all_handler.rs
@@ -0,0 +1,50 @@
// Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0

//! Provides functionality for a userspace page fault handler
//! which loads the whole region from the backing memory file
//! when a page fault occurs.

mod uffd_utils;

use std::fs::File;
use std::os::unix::net::UnixListener;

use uffd_utils::{Runtime, UffdHandler};
use utils::get_page_size;

fn main() {
    let mut args = std::env::args();
    let uffd_sock_path = args.nth(1).expect("No socket path given");
    let mem_file_path = args.next().expect("No memory file given");

    let file = File::open(mem_file_path).expect("Cannot open memfile");

    // Get Uffd from UDS. We'll use the uffd to handle PFs for Firecracker.
    let listener = UnixListener::bind(uffd_sock_path).expect("Cannot bind to socket path");
    let (stream, _) = listener.accept().expect("Cannot listen on UDS socket");

    // On the first page fault, populate every memory region in full from the
    // backing memory file. This is just an example (probably the worst-case
    // latency scenario) of how memory can be loaded into guest RAM.
    // The page size does not matter, since we fault in everything on the first fault.
    let len = get_page_size().unwrap();

    let mut runtime = Runtime::new(stream, file);
    runtime.run(len, |uffd_handler: &mut UffdHandler| {
        // Read an event from the userfaultfd.
        let event = uffd_handler
            .read_event()
            .expect("Failed to read uffd_msg")
            .expect("uffd_msg not ready");

        match event {
            userfaultfd::Event::Pagefault { .. } => {
                // Serve each region in its entirety, starting at its base address.
                for region in uffd_handler.mem_regions.clone() {
                    uffd_handler
                        .serve_pf(region.mapping.base_host_virt_addr as _, region.mapping.size);
                }
            }
            _ => panic!("Unexpected event on userfaultfd"),
        }
    });
}
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ fn main() {
let (stream, _) = listener.accept().expect("Cannot listen on UDS socket");

let mut runtime = Runtime::new(stream, file);
-runtime.run(|uffd_handler: &mut UffdHandler| {
+runtime.run(4096, |uffd_handler: &mut UffdHandler| {
// Read an event from the userfaultfd.
let event = uffd_handler
.read_event()
51 changes: 27 additions & 24 deletions src/firecracker/examples/uffd/uffd_utils.rs
Expand Up @@ -12,7 +12,6 @@ use std::ptr;

use serde::{Deserialize, Serialize};
use userfaultfd::{Error, Event, Uffd};
-use utils::get_page_size;
use utils::sock_ctrl_msg::ScmSocket;

// This is the same with the one used in src/vmm.
Expand All @@ -33,29 +32,35 @@ pub struct GuestRegionUffdMapping {
pub offset: u64,
}

-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, Copy)]
pub enum MemPageState {
Uninitialized,
FromFile,
Removed,
Anonymous,
}

-#[derive(Debug)]
-struct MemRegion {
-mapping: GuestRegionUffdMapping,
+#[derive(Debug, Clone)]
+pub struct MemRegion {
+pub mapping: GuestRegionUffdMapping,
page_states: HashMap<u64, MemPageState>,
}

#[derive(Debug)]
pub struct UffdHandler {
-mem_regions: Vec<MemRegion>,
+pub mem_regions: Vec<MemRegion>,
+page_size: usize,
backing_buffer: *const u8,
uffd: Uffd,
}

impl UffdHandler {
-pub fn from_unix_stream(stream: &UnixStream, backing_buffer: *const u8, size: usize) -> Self {
+pub fn from_unix_stream(
+stream: &UnixStream,
+page_size: usize,
+backing_buffer: *const u8,
+size: usize,
+) -> Self {
let mut message_buf = vec![0u8; 1024];
let (bytes_read, file) = stream
.recv_with_fd(&mut message_buf[..])
Expand All @@ -71,13 +76,15 @@ impl UffdHandler {

// Make sure memory size matches backing data size.
assert_eq!(memsize, size);
+assert!(page_size.is_power_of_two());

let uffd = unsafe { Uffd::from_raw_fd(file.into_raw_fd()) };

-let mem_regions = create_mem_regions(&mappings);
+let mem_regions = create_mem_regions(&mappings, page_size);

Self {
mem_regions,
+page_size,
backing_buffer,
uffd,
}
Expand All @@ -87,21 +94,19 @@ impl UffdHandler {
self.uffd.read_event()
}

-pub fn update_mem_state_mappings(&mut self, start: u64, end: u64, state: &MemPageState) {
+pub fn update_mem_state_mappings(&mut self, start: u64, end: u64, state: MemPageState) {
for region in self.mem_regions.iter_mut() {
for (key, value) in region.page_states.iter_mut() {
if key >= &start && key < &end {
-*value = state.clone();
+*value = state;
}
}
}
}

pub fn serve_pf(&mut self, addr: *mut u8, len: usize) {
-let page_size = get_page_size().unwrap();
-
// Find the start of the page that the current faulting address belongs to.
-let dst = (addr as usize & !(page_size as usize - 1)) as *mut libc::c_void;
+let dst = (addr as usize & !(self.page_size - 1)) as *mut libc::c_void;
let fault_page_addr = dst as u64;

// Get the state of the current faulting page.
Expand All @@ -117,12 +122,12 @@ impl UffdHandler {
// memory from the host (through balloon device)
Some(MemPageState::Uninitialized) | Some(MemPageState::FromFile) => {
let (start, end) = self.populate_from_file(region, fault_page_addr, len);
-self.update_mem_state_mappings(start, end, &MemPageState::FromFile);
+self.update_mem_state_mappings(start, end, MemPageState::FromFile);
return;
}
Some(MemPageState::Removed) | Some(MemPageState::Anonymous) => {
let (start, end) = self.zero_out(fault_page_addr);
-self.update_mem_state_mappings(start, end, &MemPageState::Anonymous);
+self.update_mem_state_mappings(start, end, MemPageState::Anonymous);
return;
}
None => {}
Expand Down Expand Up @@ -152,17 +157,15 @@ impl UffdHandler {
}

fn zero_out(&mut self, addr: u64) -> (u64, u64) {
-let page_size = get_page_size().unwrap();
-
let ret = unsafe {
self.uffd
-.zeropage(addr as *mut _, page_size, true)
+.zeropage(addr as *mut _, self.page_size, true)
.expect("Uffd zeropage failed")
};
// Make sure the UFFD zeroed out some bytes.
assert!(ret > 0);

-(addr, addr + page_size as u64)
+(addr, addr + self.page_size as u64)
}
}

Expand Down Expand Up @@ -211,7 +214,7 @@ impl Runtime {
/// When uffd is polled, page fault is handled by
/// calling `pf_event_dispatch` with corresponding
/// uffd object passed in.
-pub fn run(&mut self, pf_event_dispatch: impl Fn(&mut UffdHandler)) {
+pub fn run(&mut self, page_size: usize, pf_event_dispatch: impl Fn(&mut UffdHandler)) {
let mut pollfds = vec![];

// Poll the stream for incoming uffds
Expand Down Expand Up @@ -246,6 +249,7 @@ impl Runtime {
// Handle new uffd from stream
let handler = UffdHandler::from_unix_stream(
&self.stream,
+page_size,
self.backing_memory,
self.backing_memory_size,
);
Expand All @@ -270,8 +274,7 @@ impl Runtime {
}
}

-fn create_mem_regions(mappings: &Vec<GuestRegionUffdMapping>) -> Vec<MemRegion> {
-let page_size = get_page_size().unwrap();
+fn create_mem_regions(mappings: &Vec<GuestRegionUffdMapping>, page_size: usize) -> Vec<MemRegion> {
let mut mem_regions: Vec<MemRegion> = Vec::with_capacity(mappings.len());

for r in mappings.iter() {
Expand Down Expand Up @@ -314,7 +317,7 @@ mod tests {
let mut uninit_runtime = Box::new(MaybeUninit::<Runtime>::uninit());
// We will use this pointer to bypass a bunch of Rust Safety
// for the sake of convenience.
-let runtime_ptr = uninit_runtime.as_ptr() as *const Runtime;
+let runtime_ptr = uninit_runtime.as_ptr().cast::<Runtime>();

let runtime_thread = std::thread::spawn(move || {
let tmp_file = TempFile::new().unwrap();
Expand All @@ -327,7 +330,7 @@ mod tests {
let (stream, _) = listener.accept().expect("Cannot listen on UDS socket");
// Update runtime with actual runtime
let runtime = uninit_runtime.write(Runtime::new(stream, file));
-runtime.run(|_: &mut UffdHandler| {});
+runtime.run(4096, |_: &mut UffdHandler| {});
});

// wait for runtime thread to initialize itself