
TAS: Kueue crashes with panic when the node running a workload is deleted #3693

Open
mimowo opened this issue Nov 29, 2024 · 1 comment · May be fixed by #3694
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

mimowo (Contributor) commented Nov 29, 2024

What happened:

Kueue crashes with a panic when a new workload is scheduled after the deletion of a node that was hosting another workload.

What you expected to happen:

No panic; admission of new workloads continues.

How to reproduce it (as minimally and precisely as possible):

  1. Create a TAS workload.
  2. Delete one of the nodes hosting one of the workload's pods.
  3. Schedule another workload.

Issue: Kueue panics with the following stacktrace:
E1129 18:33:53.869062       1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
	goroutine 478 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x302f5d0, 0x49be620}, {0x27352c0, 0x49259f0})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:107 +0xbc
	k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x302f5d0, 0x49be620}, {0x27352c0, 0x49259f0}, {0x49be620, 0x0, 0x10000000043b665?})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:82 +0x5e
	k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000baf500?})
		/workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:59 +0x108
	panic({0x27352c0?, 0x49259f0?})
		/usr/local/go/src/runtime/panic.go:785 +0x132
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorSnapshot).initializeFreeCapacityPerDomain(...)
		/workspace/pkg/cache/tas_flavor_snapshot.go:198
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorSnapshot).addUsage(...)
		/workspace/pkg/cache/tas_flavor_snapshot.go:193
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorCache).snapshotForNodes(0xc001702ba0, {{0x3038760?, 0xc001897800?}, 0x0?}, {0xc0018ac408, 0x3, 0x3?}, {0xc0012c5008, 0x10, 0x10})
		/workspace/pkg/cache/tas_flavor.go:121 +0x730
	sigs.k8s.io/kueue/pkg/cache.(*TASFlavorCache).snapshot(0xc001702ba0, {0x302f4f0, 0xc0016c9140})
		/workspace/pkg/cache/tas_flavor.go:105 +0x64e
	sigs.k8s.io/kueue/pkg/cache.(*Cache).Snapshot(0xc000828d20, {0x302f4f0, 0xc0016c9140})
		/workspace/pkg/cache/snapshot.go:103 +0x4c5
	sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule(0xc000a48820, {0x302f4f0, 0xc0015d6120})
		/workspace/pkg/scheduler/scheduler.go:192 +0x225
	sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1()
		/workspace/pkg/util/wait/backoff.go:43 +0x2b
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000c8c078?)
		/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001ec5f30, {0x30018c0, 0xc000c8c078}, 0x0, 0xc001b161c0)
		/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
	sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff({0x302f4f0, 0xc0015d6120}, 0xc000f8c050, {0x3028410, 0xc000ff2000})
		/workspace/pkg/util/wait/backoff.go:42 +0xd3
	sigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff({0x302f4f0, 0xc0015d6120}, 0xc000f8c050)
		/workspace/pkg/util/wait/backoff.go:34 +0x8f
	created by sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).Start in goroutine 1123
		/workspace/pkg/scheduler/scheduler.go:147 +0x131

Anything else we need to know?:

The root cause is in TASFlavorSnapshot.addUsage (pkg/cache/tas_flavor_snapshot.go, visible in the stacktrace above). If the node has been deleted, the corresponding domainID is never initialized in the snapshot, yet we still try to add the usage coming from the workload in the cache, which dereferences a nil entry. I believe the simplest fix is to skip adding usage for such missing domains. This prevents the panic and allows workloads to be scheduled on the existing nodes.
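For illustration, here is a minimal, self-contained sketch of the proposed guard. The type and field names below are illustrative stand-ins, not the actual Kueue types; the real change would live in TASFlavorSnapshot.addUsage in pkg/cache/tas_flavor_snapshot.go.

```go
package main

import "fmt"

// domainID identifies a topology domain (for example, a node name).
type domainID string

// domain is a simplified stand-in for the per-domain free-capacity record
// kept by the TAS flavor snapshot; the real Kueue type is richer.
type domain struct {
	freeCapacity int
}

// snapshot is a simplified stand-in for TASFlavorSnapshot.
type snapshot struct {
	domains map[domainID]*domain
}

// addUsage subtracts usage from the domain's free capacity. The guard is the
// essence of the proposed fix: if the domain was never initialized (its node
// was deleted), skip the usage instead of dereferencing a missing entry.
func (s *snapshot) addUsage(id domainID, usage int) {
	d, ok := s.domains[id]
	if !ok || d == nil {
		// The node backing this domain no longer exists; ignore its usage
		// so scheduling on the remaining nodes can continue.
		return
	}
	d.freeCapacity -= usage
}

func main() {
	s := &snapshot{domains: map[domainID]*domain{
		"node-a": {freeCapacity: 8},
		// "node-b" was deleted, so it is absent from the snapshot.
	}}
	s.addUsage("node-a", 2) // normal path
	s.addUsage("node-b", 4) // previously panicked; now safely skipped
	fmt.Println(s.domains["node-a"].freeCapacity) // prints 6
}
```

Skipping the missing domain trades a small amount of accounting accuracy (usage from pods on the deleted node is ignored) for robustness, which matches the intent of letting new workloads schedule on the remaining nodes.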

mimowo added the kind/bug label Nov 29, 2024
mimowo (Contributor, Author) commented Nov 29, 2024

cc @PBundyra @mwysokin @mbobrovskyi
