Description
Consider the following program:
package x

var x func(a, b, c, d uint64)

//go:nosplit
//go:noinline
func y(a, b, c, d uint64) {
	x(a, b, c, d)
}

//go:nosplit
func z(a, b, c, d uint64) {
	y(a, b, c, d)
}
When assembled, it looks like this:
TEXT .y(SB), NOSPLIT|ABIInternal, $40-32
	PUSHQ BP
	MOVQ SP, BP
	SUBQ $32, SP
	MOVQ .x(SB), DX
	MOVQ (DX), SI
	PCDATA $1, $0
	CALL SI
	ADDQ $32, SP
	POPQ BP
	RET

TEXT .z(SB), NOSPLIT|ABIInternal, $40-32
	CMPQ SP, 16(R14)
	JLS // morestack
	PUSHQ BP
	MOVQ SP, BP
	SUBQ $32, SP
	CALL .y(SB)
	ADDQ $32, SP
	POPQ BP
	RET
The primary thing to note here is that both .y and .z reserve 32 bytes of spill space for their callees to spill their arguments. However, .z's sole callee is a nosplit function, which therefore does not contain a morestack check. As far as I know, these 32 bytes are never written to in any code path.
This has a few unfortunate side effects, but the itch I'm trying to scratch is that I have a bunch of performance-critical nosplit functions whose arguments/returns fully saturate the argument and return registers, and are never spilled. On x86, I am limited to about 10 stack frames before I hit the nosplit limit in the linker.
This is all well and fine: 10 frames is a lot. Unfortunately, this assumes two things:
- No further stack variables are created. I have a custom build tag that turns on debug instrumentation, which grows the three nested nosplit frames I actually have just enough that my program fails to link.
- No extra calls are inserted. Turning on fuzzing inserts calls to nosplit instrumentation functions, which pushes me past the nosplit limit, and the program fails to link.
I have been working around this in a few different ways, because this is a very niche problem being suffered by a performance weirdo. However, I did notice that 72 bytes of each frame go unused: the morestack spill path.
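To put rough numbers on this, here is a back-of-the-envelope sketch of how much call depth the spill area costs. The 800-byte budget is an assumption for illustration (the real nosplit limit depends on the Go version, platform, and build mode), and the frame model (return address, saved BP, spill area, no locals) is deliberately simplified. The names frameBytes and maxDepth are made up for this example.

```go
package main

import "fmt"

const (
	// Assumed nosplit budget in bytes. The linker's real limit varies by
	// Go version, platform, and build mode.
	nosplitBudget = 800
	wordSize      = 8 // bytes per register or pointer on amd64
	intArgRegs    = 9 // integer argument registers in the amd64 register ABI
)

// frameBytes estimates the stack consumed by one call in this simplified
// model: the return address, the saved frame pointer, and a spill area of
// spillWords machine words. Locals are ignored.
func frameBytes(spillWords int) int {
	returnAddr := wordSize
	savedBP := wordSize
	return returnAddr + savedBP + spillWords*wordSize
}

// maxDepth reports how many frames of size frame fit in budget bytes.
func maxDepth(budget, frame int) int {
	return budget / frame
}

func main() {
	withSpill := frameBytes(intArgRegs) // 72-byte spill area
	withoutSpill := frameBytes(0)       // no spill area reserved

	fmt.Printf("with spill area:    %d bytes/frame, depth %d\n",
		withSpill, maxDepth(nosplitBudget, withSpill))
	fmt.Printf("without spill area: %d bytes/frame, depth %d\n",
		withoutSpill, maxDepth(nosplitBudget, withoutSpill))
}
```

Under these assumptions a fully saturated frame costs 88 bytes, which gives a depth of 9, in line with the "about 10" above. Dropping the spill area would leave 16 bytes per frame in this model, so most of the budget is currently going to space nobody writes to.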
As the ABI documentation observes, there are many options for improving this situation. I'd like to suggest an improvement that should be simple to implement, and will go some way to eliminating redundant stack growth: if a function only calls functions declared as nosplit, treat it like a leaf function for the purposes of the prologue.
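As a sketch of the rule I have in mind, here is the check written against a made-up, heavily simplified model of a function. The Func type and needsSpillArea are hypothetical and are not the compiler's actual data structures. I am also assuming that any indirect call has to be treated as possibly reaching a function with a morestack check, since the target isn't known statically.

```go
package main

import "fmt"

// Func is a hypothetical, simplified stand-in for the compiler's view of
// a function. It is not a real cmd/compile type.
type Func struct {
	Name         string
	Nosplit      bool    // declared with //go:nosplit
	Callees      []*Func // statically known direct calls
	IndirectCall bool    // calls through a func value or interface
}

// needsSpillArea reports whether f must reserve spill space for its
// callees' register arguments. Under the proposed rule, the space is
// needed only if some callee might run a morestack check.
func needsSpillArea(f *Func) bool {
	if f.IndirectCall {
		// Unknown target: assume it may be splittable.
		return true
	}
	for _, c := range f.Callees {
		if !c.Nosplit {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the example program: y calls x through a func variable,
	// z calls only the nosplit function y.
	y := &Func{Name: "y", Nosplit: true, IndirectCall: true}
	z := &Func{Name: "z", Nosplit: true, Callees: []*Func{y}}

	for _, f := range []*Func{y, z} {
		fmt.Printf("%s needs spill area: %t\n", f.Name, needsSpillArea(f))
	}
}
```

For the program at the top, this keeps the spill area in y (it calls through x, which could be anything) and drops it from z.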
Of course, this isn't quite so simple. First, the argument registers no longer have a natural home, so those will need to be allocated if they are in fact necessary. Second, there might be a place in reflect that expects this spill area to be here, but I'm not certain. It also messes up traceback printing, which will need to be aware of spill-space-less functions.
This also only benefits nosplit code, which isn't particularly common. I'm more-or-less hitting a pathological case. The real fix is to modify stack growth to allow callees to reserve their own space, as the ABI document details.