Description
Hey, this is an interesting repo. Is this Twitter post your other documentation/discussion?
There are a lot more stats, metrics, and testing over at level1techs suggesting that, regardless of how fast the storage array is, there are currently some software bottlenecks with unbuffered i/o in llama.cpp.
Goal
Myself and at least a few others have been trying to get current llama.cpp CPU inferencing going with larger unsloth quants that don't even fit into system RAM. The idea is that perhaps a fast quad RAID0 striped array of PCIe Gen 5.0 NVMe drives (e.g. Crucial T705 2TB+) could provide enough IOPS to be competitive with DDR4 or slower DDR5 RAM bandwidth.
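For context, a minimal sketch of how such an array might be assembled with mdadm (device names and mount point are illustrative, and creating the array is destructive to whatever is on the drives):
# create a 4-drive RAID0 stripe from the NVMe devices
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# put a filesystem on it and mount it where the GGUF files will live
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /mnt/raid0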
Problem
However, myself and others have also realized there is a bottleneck in the current mmap() implementation, which relies on Linux kernel page cache buffered i/o and is unable to saturate the disks' potential bandwidth. The current implementation works very well in many situations and allows a lot of flexibility across platforms. My impression is that with the bigger DeepSeek R1 unsloth quants weighing in at over 200 GB, the limitations of the underlying buffered page cache begin to come into play.
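A rough way to watch this in action (these are generic Linux tools, not anything built into llama.cpp, and the exact methodology is up to you): drop the page cache between runs, then watch per-device utilization while the model is generating.
# drop the page cache so reads actually hit the disks between runs
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# watch per-device throughput and utilization while llama.cpp is generating
iostat -x 1
# page cache growth is visible here too
watch -d grep Cached /proc/meminfo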
Solutions
There are some possible solutions I've found that could fix this, or a combination of which might at least improve the situation:
- Transparent Huge Pages
- Smart Layer Offloading
- Direct i/o backend
- Possibly, instead of a RAID0 array, manually distribute the ~50GB GGUF file parts across individual drives and load them from there, rather than assuming a single directory of files. Might be able to test this quickly with symlinks without patching code. (I tested and confirmed what another user found: it doesn't help, which implies the bottleneck is in kernel buffered page cache reads?)
Transparent Huge Pages
There are some Transparent Huge Pages (THP) settings that seem like they might offer improved performance, but I haven't seen any major successes, and it is a bit tricky to test as default Ubuntu kernels leave out the experimental CONFIG_READ_ONLY_THP_FOR_FS=y option. You can check with
zcat /proc/config.gz | grep THP_FOR_FS
or
cat /boot/config-6.13.0-061300-generic | grep THP_FOR_FS
Then you can check to see if AnonHugePages or FileHugePages increases. On my Arch 6.13.1-arch1-1 kernel, which has CONFIG_READ_ONLY_THP_FOR_FS=y:
watch -d grep Huge /proc/meminfo
AnonHugePages: 256000 kB
ShmemHugePages: 0 kB
FileHugePages: 550912 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
In theory, using 2M or 1G pages instead of the default 4k page cache granularity could reduce page faults, leading to fewer, larger read requests that better saturate the disk i/o.
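For anyone who wants to experiment, these are the standard THP sysfs knobs I've been poking at (the values shown are experiments, not recommendations):
# current THP policy for anonymous memory
cat /sys/kernel/mm/transparent_hugepage/enabled
# ask the kernel to use THP wherever it can (or use "madvise" to limit it)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# defrag behavior affects how hard the kernel tries to assemble huge pages
cat /sys/kernel/mm/transparent_hugepage/defrag
# then watch whether AnonHugePages / FileHugePages actually grow
watch -d grep Huge /proc/meminfo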
Smart layer offloading
I just saw this one today and will give it a try soon; some folks suggest it improves performance.
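If I understand the idea correctly, it amounts to pinning whatever layers fit into VRAM and letting the rest stream from disk via mmap(). A rough sketch with standard llama.cpp flags (model path, thread count, and layer count are placeholders, not a tested configuration):
# offload however many layers fit in VRAM; the rest stay on the mmap()'d file
./llama-cli -m /mnt/raid0/DeepSeek-R1-UD-IQ1_S-00001-of-00004.gguf \
  --n-gpu-layers 20 --threads 32 -p "hello"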
Direct i/o backend
This would take more work: instead of relying on mmap(), which provides a nice application-layer interface to read arbitrary bytes from arbitrary addresses, any sort of O_DIRECT/libaio type interface would have to manage aligned reads etc. This may give the best performance, but it would likely be less portable and more difficult to implement.
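To get a feel for the potential upside without touching llama.cpp, fio can compare the buffered path against O_DIRECT on the same array (the test file path, size, and block size below are placeholders; this is a synthetic benchmark, not a model load):
# buffered (page cache) sequential reads, roughly what mmap readahead sees
fio --name=buffered --filename=/mnt/raid0/testfile --size=64G --rw=read \
    --bs=2M --ioengine=psync --direct=0 --runtime=30 --time_based
# O_DIRECT + libaio with queue depth, bypassing the page cache entirely
fio --name=direct --filename=/mnt/raid0/testfile --size=64G --rw=read \
    --bs=2M --ioengine=libaio --direct=1 --iodepth=32 --runtime=30 --time_based
If the direct run is dramatically faster at large block sizes and queue depths, that suggests an O_DIRECT backend in llama.cpp could pay off.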
Manually Distribute GGUF files
Just thought of this today; might give it a try to see if a simple symlink will work to redirect the mmap() to a different drive for different model files. It might not be truly "interleaved" depending on how the weights/experts are distributed across the actual GGUF files...
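For the record, the quick symlink test looks something like this (paths and split file names are hypothetical examples; as noted above, it didn't help in my testing):
# move each ~50GB split onto its own NVMe drive
mv DeepSeek-R1-UD-IQ1_S-00002-of-00004.gguf /mnt/nvme1/
mv DeepSeek-R1-UD-IQ1_S-00003-of-00004.gguf /mnt/nvme2/
mv DeepSeek-R1-UD-IQ1_S-00004-of-00004.gguf /mnt/nvme3/
# symlink them back so llama.cpp still sees one directory of splits
ln -s /mnt/nvme1/DeepSeek-R1-UD-IQ1_S-00002-of-00004.gguf .
ln -s /mnt/nvme2/DeepSeek-R1-UD-IQ1_S-00003-of-00004.gguf .
ln -s /mnt/nvme3/DeepSeek-R1-UD-IQ1_S-00004-of-00004.gguf .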
Conclusion
The feeling is that there are a lot of software optimizations that could be made to bring usable generation speeds and context sizes to home users on high-end gaming rigs, and especially to workstation or homelab class setups.
Given that VRAM is becoming unobtainium for home users (especially enough to fit 200+GB models), optimizations like these or other ideas could help keep access to higher quality LLMs available for all!
Cheers!