Created attachment 1822824 [details]
ansible logs

Description of problem:

While running the https://docs.openshift.com/container-platform/4.9/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member procedure, it may happen that the etcd clusteroperator remains degraded after the forced etcd redeployment:

$ oc get clusteroperators etcd
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.9.0-0.nightly-2021-09-10-170926   True        True          True       16h     DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ostest-g6l2z-master-0 is unhealthy...

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                     READY   STATUS             RESTARTS         AGE
etcd-ostest-g6l2z-master-0               3/4     CrashLoopBackOff   17 (4m49s ago)   69m
etcd-ostest-g6l2z-master-1               4/4     Running            0                75m
etcd-ostest-g6l2z-master-2-replacement   4/4     Running            0                69m

$ oc logs -n openshift-etcd etcd-ostest-g6l2z-master-0 etcd | tail -26
{"level":"panic","ts":"2021-09-13T16:33:25.277Z","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"3c5f54a28a2cc9c9","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:157"}
panic: unexpected removal of unknown remote peer

goroutine 234 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004123c0, 0xc00cb08bc0, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00007c870, 0x1232430, 0x29, 0xc00cb08bc0, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc0000af6c0, 0x3c5f54a28a2cc9c9)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:346 +0x58c
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc0000af6c0, 0x3c5f54a28a2cc9c9)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:329 +0x7d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc0004ad800, 0x1, 0x3c5f54a28a2cc9c9, 0x0, 0x0, 0x0, 0x77397bdc651f6ac9, 0xc0000b8900, 0x0, 0x0, ...)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2301 +0x872
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc0004ad800, 0xc007c44780, 0x121, 0x5d8, 0xc0000b8900, 0xc00ab002f8, 0xc00ab002b0, 0xc74a9d)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2133 +0x59a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc0004ad800, 0xc0000b8900, 0xc009afe790)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1357 +0xe5
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc0004ad800, 0xc0000b8900, 0xc009afe790)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1179 +0x88
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x139f670, 0xc009523680)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1111 +0x3c
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc009543e00)
        /remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
        /remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:70 +0x13b

Version-Release number of selected component (if applicable):
OCP 4.9.0-0.nightly-2021-09-10-170926
OSP 16.1 (RHOS-16.1-RHEL-8-20210818.n.0)
Kuryr network_type, IPI installation.

How reproducible:
Sometimes

Steps to Reproduce:
1. https://docs.openshift.com/container-platform/4.9/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member
2. oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
3.

Actual results:
clusteroperator etcd degraded.

Expected results:
clusteroperator etcd healthy.

Additional info:
- must_gather: http://file.rdu.redhat.com/rlobillo/must-gather-master-replacement.tar.gz
- Playbook running the procedure is attached.
Can you please verify each step you performed against the documented steps? For example, are you sure that you stopped etcd by moving etcd-pod.yaml out of /etc/kubernetes/manifests, then removed the data directory of the failed member (`rm -rf /var/lib/etcd`), next removed the etcd member (`etcdctl member remove $ID`), and only after that forced a new rollout (`oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge`)? Finally, I assume master-0 was the member you replaced? A condensed sketch of that sequence follows below for reference.
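For reference, the sequence being asked about condenses to roughly the following. This is a sketch of the steps listed above, not the full documented procedure; the node and pod names are taken from this report and <MEMBER_ID> is a placeholder for the ID returned by `etcdctl member list`.

# On the failed control plane node (ostest-g6l2z-master-0): stop the static etcd pod
# and wipe its data directory.
$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
$ sudo rm -rf /var/lib/etcd

# From a surviving etcd pod: look up and remove the failed member.
$ oc rsh -n openshift-etcd etcd-ostest-g6l2z-master-1
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl member remove <MEMBER_ID>

# Force a new etcd rollout.
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge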
I see the ansible logs now... reviewing.
I believe this is an upstream bug related to new logic around the handling of membership data [1][2].

[1] https://github.com/etcd-io/etcd/issues/13196
[2] https://github.com/etcd-io/etcd/pull/13348
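If it helps with triage, one hypothetical cross-check (reusing the peer ID from the panic above, and commands already used in this report) is to compare the ID the crashing member tries to remove against what the surviving members currently report:

# Pull the panic line from the crashlooping member.
$ oc logs -n openshift-etcd etcd-ostest-g6l2z-master-0 etcd \
    | grep 'unexpected removal of unknown remote peer' | tail -1
# The panic names remote-peer-id 3c5f54a28a2cc9c9.

# Compare against the current membership seen by a healthy member.
$ oc rsh -n openshift-etcd etcd-ostest-g6l2z-master-1
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl endpoint status --cluster -w table

If 3c5f54a28a2cc9c9 no longer appears in the member list, that would be consistent with the stale membership handling described in [1]: a conf change is re-applied for a peer the transport no longer knows about, and the server panics.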
Thanks Sam. Removing NEEDINFO flag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056