issue etcd-io/etcd#13466
The following steps are performed on etcd 3.5.0.
- Create a new cluster with 3 members, adding the option "--snapshot-count 10" to each instance on startup.
- Add 15 keys using the command below:
$ for i in {1..15}; do etcdctl put k$i v$i; done
- Remove one member
$ etcdctl member remove fd422379fda50e48
- Restart one of the remaining etcd instances; you will see the instance panic with the following error.
{"level":"panic","ts":"2021-11-19T10:23:36.501+0800","caller":"rafthttp/transport.go:349","msg":"unexpected removal of unknown remote peer","remote-peer-id":"fd422379fda50e48","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:349\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:330\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:2012\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:1852\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:1078\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:900\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:832\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/Users/wachao/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:157"}
panic: unexpected removal of unknown remote peer
goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000186cc0, {0xc00022c740, 0x1, 0x1})
/Users/wachao/go/gopath/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x499
go.uber.org/zap.(*Logger).Panic(0x1b716c0, {0x1c04d22, 0xc0002c91b0}, {0xc00022c740, 0x1, 0x1})
/Users/wachao/go/gopath/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc000174e00, 0xc0002c92d8)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:349 +0x26a
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc000174e00, 0xc00000e018)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:330 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc00010c600, {0x1, 0xfd422379fda50e48, {0x0, 0x0, 0x0}, 0x32697d35fb628418}, 0xc0002b8000, 0x0)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:2012 +0x2c6
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc00010c600, {0xc00033c120, 0x4, 0x0}, 0x0)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:1852 +0x5e5
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc00010c600, 0xc0002b8000, 0xc0003cfb80)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:1078 +0x27d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc00010c600, 0xc0002b8000, 0xc0003cfb80)
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:900 +0x65
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8({0xc00054e790, 0xc00054e758})
/Users/wachao/go/src/go.etcd.io/etcd/server/etcdserver/server.go:832 +0x25
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc0000aa9c0)
/Users/wachao/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:157 +0x119
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
/Users/wachao/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:70 +0x15c
The value of "--snapshot-count" is 10, so at least one snapshot should already have been created by step 2. When performing step 3, the raft log entry (raftpb.ConfChangeRemoveNode) is persisted in the WAL files.
When one etcd instance is stopped and restarted, it loads the member info from the db file (please see cluster.go#L257-L263), so RaftCluster.members contains only 2 members; the removed member is not included in the members map. However, etcd replays the WAL files starting from the latest snapshot, so it applies the ConfChangeRemoveNode entry again and tries to remove the already-removed member, and accordingly the etcd instance panics (please see transport.go#L346).
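To make the mechanism concrete, here is a minimal, self-contained Go sketch of the failure mode. The transport type and removePeer signature below are simplified stand-ins rather than etcd's actual API; only the panic branch mirrors the behaviour referenced at transport.go#L346.

package main

import "fmt"

// peers is built from the membership loaded at startup; on etcd 3.5.0 that
// membership comes from the db file, so the removed member is already absent.
type transport struct {
	peers map[uint64]struct{}
}

// removePeer treats the removal of a peer that is not in the map as a fatal
// inconsistency, which is what the real rafthttp transport panics on.
func (t *transport) removePeer(id uint64) {
	if _, ok := t.peers[id]; !ok {
		panic(fmt.Sprintf("unexpected removal of unknown remote peer %x", id))
	}
	delete(t.peers, id)
}

func main() {
	// Only the two surviving members are known after the restart.
	t := &transport{peers: map[uint64]struct{}{
		0x1111111111111111: {},
		0x2222222222222222: {},
	}}

	// Replaying the WAL from the latest snapshot re-applies the
	// ConfChangeRemoveNode entry for member fd422379fda50e48, which is
	// unknown here, so the instance panics.
	t.removePeer(0xfd422379fda50e48)
}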
The main branch has this issue as well, because it also loads the member info from the db file first; see cluster.go#L264-L270.
Please note that etcd 3.5.1 doesn't have this issue, because it loads the member info from the v2store first (see cluster.go#L259-L265), so RaftCluster.members contains all 3 members, including the removed one.
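The difference in loading order can be paraphrased with another small, self-contained sketch; membersFromBackend and membersFromV2Store are illustrative names rather than etcd's actual functions, and the member sets simply restate the scenario above.

package main

import "fmt"

// The db file already reflects the removal, so only two members are returned.
func membersFromBackend() map[string]bool {
	return map[string]bool{"m1": true, "m2": true}
}

// The v2store still contains the removed member.
func membersFromV2Store() map[string]bool {
	return map[string]bool{"m1": true, "m2": true, "fd422379fda50e48": true}
}

func main() {
	// 3.5.0 and the main branch recover members from the backend first, so the
	// replayed ConfChangeRemoveNode entry targets an unknown member and panics.
	fmt.Println("3.5.0 knows removed member:", membersFromBackend()["fd422379fda50e48"])

	// 3.5.1 recovers members from the v2store first, so the replayed removal
	// finds the member and completes without panicking.
	fmt.Println("3.5.1 knows removed member:", membersFromV2Store()["fd422379fda50e48"])
}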
The easiest fix is to upgrade etcd to 3.5.1. If you don't want to upgrade etcd, you can follow the steps below to work around this issue:
- Back up the etcd binary;
- Manually change the log level from Panic to Warn for the "unexpected removal of unknown remote peer" message in rafthttp's removePeer (the call shown in the stack trace above; see the sketch after this list);
- Build & replace the binary in your running environment, and add "--snapshot-count 2" to the etcd instance
- Start the etcd instance;
- Add a couple of key/value pairs using etcdctl, and then remove them;
- Stop the etcd instance;
- Restore the etcd binary backed up in the first step;
- Start the etcd instance.
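The idea behind the extra writes and "--snapshot-count 2" is that they trigger a new snapshot which already reflects the membership change, so the ConfChangeRemoveNode entry is no longer replayed from the WAL on later restarts; the patched binary only needs to survive the single replay that would otherwise panic. For reference, the only behavioural difference between Logger.Panic and Logger.Warn in this context is whether the process exits after logging. Below is a minimal zap sketch of that difference; the message and field name are taken from the panic log above, everything else is illustrative.

package main

import "go.uber.org/zap"

func main() {
	lg, _ := zap.NewProduction()
	defer lg.Sync()

	// Warn records the message and lets the process continue, which is what
	// the temporarily patched binary relies on while the stale
	// ConfChangeRemoveNode entry is replayed.
	lg.Warn("unexpected removal of unknown remote peer",
		zap.String("remote-peer-id", "fd422379fda50e48"))

	// Panic logs the same message and then panics, crashing the instance;
	// this is the behaviour of the unpatched 3.5.0 binary.
	// lg.Panic("unexpected removal of unknown remote peer",
	// 	zap.String("remote-peer-id", "fd422379fda50e48"))
}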
Please note that you may need to perform the above steps on each etcd member.