Nomad 1.9.3: panic: runtime error: slice bounds out of range [:12] with capacity 0 #24441

Closed
HINT-SJ opened this issue Nov 12, 2024 · 7 comments · Fixed by #24442
Labels
hcc/jira · stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/workload-identity · type/bug

Comments

HINT-SJ commented Nov 12, 2024

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

Amazon Linux 2023 (minimal)
AWS EC2 Graviton (t4g.small)

Issue

While attempting to upgrade from Nomad 1.9.1 to 1.9.3 (skipping 1.9.2), the first server node we updated failed to start:

==> Nomad agent started! Log data will stream in below:
    2024-11-12T10:47:11.753Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-11-12T10:47:11.757Z [INFO]  nomad.raft: initial configuration: index=0 servers=[]
    2024-11-12T10:47:11.758Z [INFO]  nomad.raft: entering follower state: follower="Node at XXXXX.11:4647 [Follower]" leader-address= leader-id=
    2024-11-12T10:47:11.761Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-11.XXXXX.compute.internal.global XXXXX.11
    2024-11-12T10:47:11.761Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
    2024-11-12T10:47:11.761Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
    2024-11-12T10:47:11.767Z [INFO]  nomad: adding server: server="ip-XXXXX-11.XXXXX.compute.internal.global (Addr: XXXXX.11:4647) (DC: dc1)"
    2024-11-12T10:47:11.774Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-115.XXXXX.compute.internal.global XXXXX.115
    2024-11-12T10:47:11.775Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-105.XXXXX.compute.internal.global XXXXX.105
    2024-11-12T10:47:11.775Z [INFO]  nomad: adding server: server="ip-XXXXX-115.XXXXX.compute.internal.global (Addr: XXXXX.115:4647) (DC: dc1)"
    2024-11-12T10:47:11.776Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-154.XXXXX.compute.internal.global XXXXX.154
    2024-11-12T10:47:11.776Z [INFO]  nomad: adding server: server="ip-XXXXX-105.XXXXX.compute.internal.global (Addr: XXXXX.105:4647) (DC: dc1)"
    2024-11-12T10:47:11.776Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-53.XXXXX.compute.internal.global XXXXX.53
    2024-11-12T10:47:11.779Z [INFO]  nomad: disabling bootstrap mode because existing Raft peers being reported by peer: peer_name=ip-XXXXX-154.XXXXX.compute.internal.global peer_address=10.169>
    2024-11-12T10:47:11.779Z [INFO]  nomad: adding server: server="ip-XXXXX-154.XXXXX.compute.internal.global (Addr: XXXXX.154:4647) (DC: dc1)"
    2024-11-12T10:47:11.779Z [INFO]  nomad: adding server: server="ip-XXXXX-53.XXXXX.compute.internal.global (Addr: XXXXX.53:4647) (DC: dc1)"
    2024-11-12T10:47:11.783Z [INFO]  nomad: successfully contacted Nomad servers: num_servers=4
    2024-11-12T10:47:11.784Z [WARN]  nomad.raft: failed to get previous log: previous-index=5202671 last-index=0 error="log not found"
    2024-11-12T10:47:11.788Z [INFO]  snapshot: creating new snapshot: path=/opt/nomad/data/server/raft/snapshots/11112-5202550-1731408431788.tmp
    2024-11-12T10:47:11.804Z [INFO]  nomad.raft: snapshot network transfer progress: read-bytes=3154303 percent-complete="100.00%"
    2024-11-12T10:47:11.818Z [INFO]  nomad.raft: copied to local snapshot: bytes=3154303
panic: runtime error: slice bounds out of range [:12] with capacity 0
goroutine 96 [running]:
github.com/hashicorp/go-kms-wrapping/v2/aead.(*Wrapper).Decrypt(0x4000a37d40, {0xd085bc?, 0x40009c55c0?}, 0x4000ec1810, {0x0?, 0x40009c55c0?, 0x400008fd78?})
        github.com/hashicorp/go-kms-wrapping/[email protected]/aead/aead.go:272 +0x1a0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask.func2()
        github.com/hashicorp/nomad/nomad/encrypter.go:481 +0x68
github.com/hashicorp/nomad/helper.WithBackoffFunc({0x3696740, 0x40007d8e60}, 0x3b9aca00, 0x12a05f200, 0x400008ff28)
        github.com/hashicorp/nomad/helper/backoff.go:50 +0xe0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask(0x40009401c0, {0x3696740, 0x40007d8e60}, 0x40009c5550, {0x369bf80, 0x4000a37d40}, 0x0?, 0x40007d8ff0, 0x40010ac780)
        github.com/hashicorp/nomad/nomad/encrypter.go:474 +0x11c
created by github.com/hashicorp/nomad/nomad.(*Encrypter).AddWrappedKey in goroutine 102
        github.com/hashicorp/nomad/nomad/encrypter.go:426 +0x480
nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
nomad.service: Failed with result 'exit-code'.

This may be related to issues #24379 and #24411.

In case it helps: we roll out brand-new EC2 instances during upgrades, so no old data is left on the system.
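The panic frame in go-kms-wrapping's aead.(*Wrapper).Decrypt together with the message "slice bounds out of range [:12] with capacity 0" suggests an empty ciphertext being sliced by the 12-byte AES-GCM nonce length. A minimal Go sketch of that failure class, using only the standard library and hypothetical names (illustrative only, not Nomad's or go-kms-wrapping's actual code):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"log"
)

func main() {
	block, err := aes.NewCipher(make([]byte, 32)) // dummy 256-bit key
	if err != nil {
		log.Fatal(err)
	}
	gcm, err := cipher.NewGCM(block) // standard GCM nonce size is 12 bytes
	if err != nil {
		log.Fatal(err)
	}

	var wrappedCiphertext []byte // empty blob, standing in for a wrapped key with no ciphertext

	// Slicing the nonce off an empty ciphertext panics exactly like the report:
	// "panic: runtime error: slice bounds out of range [:12] with capacity 0"
	nonce := wrappedCiphertext[:gcm.NonceSize()]
	_ = nonce
}
```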

Reproduction steps

Upgrade a 1.9.1 cluster to 1.9.3 ^^
The cluster has been up for several years and across several major versions.

Expected Result

The new server node starts successfully on the new version :)

HINT-SJ (Author) commented Nov 12, 2024

FYI, rolling the node back to 1.9.1 yields exactly the same panic as above:

panic: runtime error: slice bounds out of range [:12] with capacity 0
goroutine 50 [running]:
github.com/hashicorp/go-kms-wrapping/v2/aead.(*Wrapper).Decrypt(0x4000f06000, {0xd085bc?, 0x4000dfa070?}, 0x4000ab8370, {0x0?, 0x4000dfa070?, 0x4000abed78?})
        github.com/hashicorp/go-kms-wrapping/[email protected]/aead/aead.go:272 +0x1a0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask.func2()
        github.com/hashicorp/nomad/nomad/encrypter.go:481 +0x68
github.com/hashicorp/nomad/helper.WithBackoffFunc({0x36835c0, 0x4000f02000}, 0x3b9aca00, 0x12a05f200, 0x4000abef28)
        github.com/hashicorp/nomad/helper/backoff.go:50 +0xe0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask(0x400095d810, {0x36835c0, 0x4000f02000}, 0x4000dfa040, {0x3688d80, 0x4000f06000}, 0x400095d810?, 0x4000f02050, 0x4000a584b0)
        github.com/hashicorp/nomad/nomad/encrypter.go:474 +0x11c
created by github.com/hashicorp/nomad/nomad.(*Encrypter).AddWrappedKey in goroutine 35
        github.com/hashicorp/nomad/nomad/encrypter.go:426 +0x480
nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

jrasell (Member) commented Nov 12, 2024

Hi @HINT-SJ, thanks for raising this issue, and sorry you've hit yet another instance of this class of bug. I have already raised a linked PR to fix it, along with spot-checks to ensure this pattern does not appear elsewhere. I'll work with the rest of the team to get it merged and will look to add additional tests in this area in the future.
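For anyone reading along, the usual guard against this class of panic is a length check before slicing the nonce off the blob. A rough sketch under that assumption (splitNonce and its package are hypothetical, not the actual fix in #24442):

```go
package keyutil

import "fmt"

// splitNonce is a hypothetical helper showing the defensive check that avoids
// the reported panic: validate the blob length before slicing off the nonce.
func splitNonce(wrappedCiphertext []byte, nonceSize int) (nonce, payload []byte, err error) {
	if len(wrappedCiphertext) < nonceSize {
		return nil, nil, fmt.Errorf("wrapped key ciphertext too short: got %d bytes, need at least %d",
			len(wrappedCiphertext), nonceSize)
	}
	return wrappedCiphertext[:nonceSize], wrappedCiphertext[nonceSize:], nil
}
```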

jrasell added the stage/accepted, theme/workload-identity, and hcc/jira labels on Nov 12, 2024
jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage on Nov 12, 2024
HINT-SJ (Author) commented Nov 12, 2024

Thanks for your continued work :)
I felt rather silly when I hit the "similar" panic again and had to double-check the version ^^

Fingers crossed!

jrasell (Member) commented Nov 12, 2024

@HINT-SJ my pleasure. I also had to do a double take on this :D

bfqrst commented Nov 14, 2024

Guys, is there any timeline for when this fix will ship? Our clusters are running thin, to the point where some can no longer elect a leader. Since we can't add new servers, we're caught between a rock and a hard place. I'm not a paying customer and I'm fully aware that I'm using an open-core product; that said, can somebody share a roadmap for this? Thanks

blalor (Contributor) commented Nov 18, 2024

@bfqrst there are a couple of workarounds (building from source, downloading an artifact from CI) discussed in #24442.

bfqrst commented Nov 18, 2024

Thanks @blalor, we did end up doing exactly that... Plus some lessons learned along the way.

Juanadelacuesta unpinned this issue Nov 29, 2024