Description
Hey, this is an interesting repo. Is this Twitter post your other documentation/discussion?
There are a lot more stats, metrics, and testing over at level1techs suggesting that, regardless of how fast the storage array is, there are currently some software bottlenecks with unbuffered i/o in llama.cpp.
Goal
Myself and at least a few others have been trying to get current llama.cpp CPU inferencing going with larger unsloth quants that don't even fit into system RAM. The idea is that perhaps a fast quad RAID0 striped array of PCIe Gen 5.0 NVMe drives (e.g. Crucial T705 2TB+) could provide enough IOPS to be competitive with DDR4 or slower DDR5 RAM bandwidth.
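For context, a minimal sketch of how such an array might be assembled with mdadm (device names and mount point are illustrative, and creating the array is destructive to whatever is on the drives):
# create a 4-drive RAID0 stripe from the NVMe devices
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# put a filesystem on it and mount it where the GGUF files will live
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /mnt/raid0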
Problem
However, myself and others have also realized there is a bottleneck in the current mmap() implementation, which relies on Linux kernel page cache buffered i/o and is unable to saturate the disks' potential bandwidth. The current implementation works very well in many situations and allows a lot of flexibility across platforms. My impression is that with the bigger DeepSeek R1 unsloth quants weighing in at over 200 GB, the limitations of the underlying buffered page cache begin to come into play.
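A rough way to watch this in action (these are generic Linux tools, not anything built into llama.cpp, and the exact methodology is up to you): drop the page cache between runs, then watch per-device utilization while the model is generating.
# drop the page cache so reads actually hit the disks between runs
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# watch per-device throughput and utilization while llama.cpp is generating
iostat -x 1
# page cache growth is visible here too
watch -d grep Cached /proc/meminfo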
Solutions
There are some possible solutions I've found that could fix this, or a combination of which might at least improve the situation:
- Transparent Huge Pages
- Smart Layer Offloading
- Direct i/o backend
- Possibly, instead of a RAID0 array, manually distribute the ~50GB GGUF file parts across individual drives and load them from there, rather than assuming a single directory of files. Might be able to test this quickly with symlinks without patching code. (I tested and confirmed what another user found: it doesn't help, which implies the bottleneck is in kernel buffered page cache reads?)
Transparent Huge Pages
There are some Transparent Huge Pages (THP) settings that seem like they might offer improved performance, but I haven't seen any major successes, and it is a bit tricky to test as default Ubuntu kernels leave out the experimental CONFIG_READ_ONLY_THP_FOR_FS=y option. You can check with
zcat /proc/config.gz | grep THP_FOR_FS
or
cat /boot/config-6.13.0-061300-generic | grep THP_FOR_FS
Then you can check to see if AnonHugePages or FileHugePages increases. On my Arch 6.13.1-arch1-1 kernel, which has CONFIG_READ_ONLY_THP_FOR_FS=y:
watch -d grep Huge /proc/meminfo
AnonHugePages: 256000 kB
ShmemHugePages: 0 kB
FileHugePages: 550912 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
In theory, using 2M or 1G pages instead of the default 4k page cache granularity could reduce page faults, leading to fewer, larger read requests that better saturate the disk i/o.
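For anyone who wants to experiment, these are the standard THP sysfs knobs I've been poking at (the values shown are experiments, not recommendations):
# current THP policy for anonymous memory
cat /sys/kernel/mm/transparent_hugepage/enabled
# ask the kernel to use THP wherever it can (or use "madvise" to limit it)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# defrag behavior affects how hard the kernel tries to assemble huge pages
cat /sys/kernel/mm/transparent_hugepage/defrag
# then watch whether AnonHugePages / FileHugePages actually grow
watch -d grep Huge /proc/meminfo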
Smart layer offloading
I just saw this one today and will give it a try soon; some folks suggest it improves performance.
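If I understand the idea correctly, it amounts to pinning whatever layers fit into VRAM and letting the rest stream from disk via mmap(). A rough sketch with standard llama.cpp flags (model path, thread count, and layer count are placeholders, not a tested configuration):
# offload however many layers fit in VRAM; the rest stay on the mmap()'d file
./llama-cli -m /mnt/raid0/DeepSeek-R1-UD-IQ1_S-00001-of-00004.gguf \
  --n-gpu-layers 20 --threads 32 -p "hello"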
Direct i/o backend
This would take more work: instead of relying on mmap(), which provides a nice application-layer interface to read arbitrary bytes from arbitrary addresses, any sort of O_DIRECT/libaio type interface would have to manage aligned reads etc. This may give the best performance, but it would likely be less portable and more difficult to implement.
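To get a feel for the potential upside without touching llama.cpp, fio can compare the buffered path against O_DIRECT on the same array (the test file path, size, and block size below are placeholders; this is a synthetic benchmark, not a model load):
# buffered (page cache) sequential reads, roughly what mmap readahead sees
fio --name=buffered --filename=/mnt/raid0/testfile --size=64G --rw=read \
    --bs=2M --ioengine=psync --direct=0 --runtime=30 --time_based
# O_DIRECT + libaio with queue depth, bypassing the page cache entirely
fio --name=direct --filename=/mnt/raid0/testfile --size=64G --rw=read \
    --bs=2M --ioengine=libaio --direct=1 --iodepth=32 --runtime=30 --time_based
If the direct run is dramatically faster at large block sizes and queue depths, that suggests an O_DIRECT backend in llama.cpp could pay off.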
Manually Distribute GGUF files
Just thought of this today; might give it a try to see if a simple symlink will work to redirect the mmap() to a different drive for different model files. It might not be truly "interleaved" depending on how the weights/experts are distributed across the actual GGUF files...
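For the record, the quick symlink test looks something like this (paths and split file names are hypothetical examples; as noted above, it didn't help in my testing):
# move each ~50GB split onto its own NVMe drive
mv DeepSeek-R1-UD-IQ1_S-00002-of-00004.gguf /mnt/nvme1/
mv DeepSeek-R1-UD-IQ1_S-00003-of-00004.gguf /mnt/nvme2/
mv DeepSeek-R1-UD-IQ1_S-00004-of-00004.gguf /mnt/nvme3/
# symlink them back so llama.cpp still sees one directory of splits
ln -s /mnt/nvme1/DeepSeek-R1-UD-IQ1_S-00002-of-00004.gguf .
ln -s /mnt/nvme2/DeepSeek-R1-UD-IQ1_S-00003-of-00004.gguf .
ln -s /mnt/nvme3/DeepSeek-R1-UD-IQ1_S-00004-of-00004.gguf .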
Conclusion
The feeling is that there are a lot of software optimizations that could be made to bring usable generation speeds and context sizes to home users on high-end gaming rigs, and especially to workstation or homelab class setups.
Given that VRAM is becoming unobtainium for home users (especially enough to fit 200+GB models), optimizations like these or other ideas could help keep access to higher quality LLMs available for all!
Cheers!