TAS: Kueue crashes with panic when the node running a workload is deleted #3693

mimowo · 2024-11-29T18:39:48Z

What happened:

Kueue panics when a new workload is scheduled after the deletion of a node that was hosting a pod of another TAS workload.

What you expected to happen:

No panic, admission of new workloads continues.

How to reproduce it (as minimally and precisely as possible):

1. Create a TAS workload.
2. Delete one of the nodes hosting one of the pods of the workload.
3. Schedule another workload.

Issue: Kueue panics with the following stack trace:

E1129 18:33:53.869062       1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
	goroutine 478 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x302f5d0, 0x49be620}, {0x27352c0, 0x49259f0})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:107 +0xbc
	k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x302f5d0, 0x49be620}, {0x27352c0, 0x49259f0}, {0x49be620, 0x0, 0x10000000043b665?})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:82 +0x5e
	k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000baf500?})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:59 +0x108
	panic({0x27352c0?, 0x49259f0?})
		/usr/local/go/src/runtime/panic.go:785 +0x132
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorSnapshot).initializeFreeCapacityPerDomain(...)
		/workspace/pkg/cache/tas_flavor_snapshot.go:198
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorSnapshot).addUsage(...)
		/workspace/pkg/cache/tas_flavor_snapshot.go:193
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorCache).snapshotForNodes(0xc001702ba0, {{0x3038760?, 0xc001897800?}, 0x0?}, {0xc0018ac408, 0x3, 0x3?}, {0xc0012c5008, 0x10, 0x10})
		/workspace/pkg/cache/tas_flavor.go:121 +0x730
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorCache).snapshot(0xc001702ba0, {0x302f4f0, 0xc0016c9140})
		/workspace/pkg/cache/tas_flavor.go:105 +0x64e
	sigs.k8s.io/kueue/pkg/cache.(*Cache).Snapshot(0xc000828d20, {0x302f4f0, 0xc0016c9140})
		/workspace/pkg/cache/snapshot.go:103 +0x4c5
	sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule(0xc000a48820, {0x302f4f0, 0xc0015d6120})
		/workspace/pkg/scheduler/scheduler.go:192 +0x225
	sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1()
		/workspace/pkg/util/wait/backoff.go:43 +0x2b
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000c8c078?)
		/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001ec5f30, {0x30018c0, 0xc000c8c078}, 0x0, 0xc001b161c0)
		/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
	sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff({0x302f4f0, 0xc0015d6120}, 0xc000f8c050, {0x3028410, 0xc000ff2000})
		/workspace/pkg/util/wait/backoff.go:42 +0xd3
	sigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff({0x302f4f0, 0xc0015d6120}, 0xc000f8c050)
		/workspace/pkg/util/wait/backoff.go:34 +0x8f
	created by sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).Start in goroutine 1123
		/workspace/pkg/scheduler/scheduler.go:147 +0x131

Anything else we need to know?:

The reason is in snapshotForNodes (pkg/cache/tas_flavor.go:121 in the stack trace above). If a node is deleted, the snapshot no longer initializes the domainID for it, but we still call addUsage for the workload that remains in the cache, and that call hits the nil pointer. I believe the simplest fix is to skip adding usage for such domains. This prevents the panic and allows workloads to be scheduled on the existing nodes.
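A minimal sketch of the proposed guard, assuming the snapshot keeps leaf domains in a map keyed by domain ID. The types `domainID`, `leafDomain`, and `flavorSnapshot`, and the `addUsage` signature below, are hypothetical stand-ins for illustration, not the actual definitions in pkg/cache:

```go
package main

import "fmt"

// domainID and leafDomain are hypothetical stand-ins for the snapshot's
// internal types; they are not Kueue's actual definitions.
type domainID string

type leafDomain struct {
	freeCapacity map[string]int64
}

type flavorSnapshot struct {
	// leaves holds only domains backed by nodes that currently exist.
	leaves map[domainID]*leafDomain
}

// addUsage subtracts usage from a leaf domain's free capacity. It returns
// false and does nothing when the domain is unknown, for example because
// the node was deleted after the workload was admitted.
func (s *flavorSnapshot) addUsage(id domainID, usage map[string]int64) bool {
	leaf, ok := s.leaves[id]
	if !ok || leaf == nil {
		return false // skip instead of dereferencing a nil leaf
	}
	for res, qty := range usage {
		leaf.freeCapacity[res] -= qty
	}
	return true
}

func main() {
	snap := &flavorSnapshot{leaves: map[domainID]*leafDomain{
		"node-a": {freeCapacity: map[string]int64{"cpu": 4}},
	}}
	// Usage on an existing node is applied.
	fmt.Println(snap.addUsage("node-a", map[string]int64{"cpu": 1}))
	// Usage on a deleted node is skipped, with no panic.
	fmt.Println(snap.addUsage("node-deleted", map[string]int64{"cpu": 1}))
	fmt.Println(snap.leaves["node-a"].freeCapacity["cpu"])
}
```

With this shape, usage recorded against a deleted node is simply ignored while building the snapshot, so scheduling can continue on the remaining nodes. Whether the skipped usage should also be logged is a separate design choice that this sketch leaves open.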

mimowo · 2024-11-29T18:40:14Z

cc @PBundyra @mwysokin @mbobrovskyi

mimowo added the kind/bug Categorizes issue or PR as related to a bug. label Nov 29, 2024

swastik959 linked a pull request Nov 30, 2024 that will close this issue

skipped adding usage when there is no leaf domain available #3694

Open
