Bug 2003775 - etcd pod in CrashLoopBackOff after master replacement procedure
Summary: etcd pod in CrashLoopBackOff after master replacement procedure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nobody
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 2016174
 
Reported: 2021-09-13 16:54 UTC by rlobillo
Modified: 2022-03-10 16:10 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:10:01 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible logs (266.24 KB, text/plain), 2021-09-13 16:54 UTC, rlobillo


Links
Github etcd-io etcd pull 13348 (open): Fix for v3.5 Ensure that cluster members stored in v2store and backend are in sync (last updated 2021-09-14 12:47:59 UTC)
Github openshift etcd pull 98 (open): Bug 2003775: UPSTREAM: <carry>: server: Fix for v3.5 Ensure that cluster members stored in v2store and backend are in sy... (last updated 2021-10-15 22:15:23 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:10:28 UTC)

Description rlobillo 2021-09-13 16:54:11 UTC
Created attachment 1822824 [details]
ansible logs

Description of problem:

While running the "Replacing an unhealthy etcd member" procedure (https://docs.openshift.com/container-platform/4.9/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member), the etcd clusteroperator can remain degraded after the forced etcd redeployment:

$ oc get clusteroperators etcd
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.9.0-0.nightly-2021-09-10-170926   True        True          True       16h     DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ostest-g6l2z-master-0 is unhealthy...

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                     READY   STATUS             RESTARTS         AGE
etcd-ostest-g6l2z-master-0               3/4     CrashLoopBackOff   17 (4m49s ago)   69m
etcd-ostest-g6l2z-master-1               4/4     Running            0                75m
etcd-ostest-g6l2z-master-2-replacement   4/4     Running            0                69m
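
For triage, the cluster's member view and endpoint health can be checked from one of the healthy pods. A minimal sketch, assuming the pod names from the listing above (etcdctl inside the OpenShift etcd pods is preconfigured with the required endpoints and certificates):

$ oc rsh -n openshift-etcd etcd-ostest-g6l2z-master-1
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl endpoint health --cluster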

$ oc logs -n openshift-etcd etcd-ostest-g6l2z-master-0 etcd | tail -26
{"level":"panic","ts":"2021-09-13T16:33:25.277Z","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"3c5f54a28a2cc9c9","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:157"}
panic: unexpected removal of unknown remote peer

goroutine 234 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004123c0, 0xc00cb08bc0, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00007c870, 0x1232430, 0x29, 0xc00cb08bc0, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc0000af6c0, 0x3c5f54a28a2cc9c9)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:346 +0x58c
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc0000af6c0, 0x3c5f54a28a2cc9c9)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/api/rafthttp/transport.go:329 +0x7d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc0004ad800, 0x1, 0x3c5f54a28a2cc9c9, 0x0, 0x0, 0x0, 0x77397bdc651f6ac9, 0xc0000b8900, 0x0, 0x0, ...)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2301 +0x872
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc0004ad800, 0xc007c44780, 0x121, 0x5d8, 0xc0000b8900, 0xc00ab002f8, 0xc00ab002b0, 0xc74a9d)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:2133 +0x59a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc0004ad800, 0xc0000b8900, 0xc009afe790)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1357 +0xe5
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc0004ad800, 0xc0000b8900, 0xc009afe790)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1179 +0x88
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x139f670, 0xc009523680)
        /remote-source/cachito-gomod-with-deps/app/server/etcdserver/server.go:1111 +0x3c
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc009543e00)
        /remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
        /remote-source/cachito-gomod-with-deps/app/pkg/schedule/schedule.go:70 +0x13b
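
The peer ID named in the panic (3c5f54a28a2cc9c9) can be cross-checked against the cluster's current member list from a healthy member. A minimal sketch, assuming the etcdctl container in the etcd static pods:

$ oc exec -n openshift-etcd etcd-ostest-g6l2z-master-1 -c etcdctl -- \
    etcdctl member list -w table | grep -i 3c5f54a28a2cc9c9 \
    || echo "peer 3c5f54a28a2cc9c9 is not in the current member list"

If the ID is absent, the transport is being asked to remove a peer that the server's membership view no longer knows about, which matches the panic message above.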


Version-Release number of selected component (if applicable):
OCP4.9.0-0.nightly-2021-09-10-170926
OSP16.1 (RHOS-16.1-RHEL-8-20210818.n.0)
Kuryr network_type, IPI installation.


How reproducible: Sometimes


Steps to Reproduce:
1. Follow the "Replacing an unhealthy etcd member" procedure: https://docs.openshift.com/container-platform/4.9/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member
2. oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
3. Check the status of the etcd clusteroperator and pods (see the sketch below).
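
To observe whether the forced rollout converges, the etcd pods and the clusteroperator can be watched. A minimal sketch:

$ oc get pods -n openshift-etcd -l app=etcd -w
$ oc get clusteroperators etcd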

Actual results: clusteroperator etcd degraded.


Expected results: clusteroperator etcd healthy.


Additional info:
- must_gather: http://file.rdu.redhat.com/rlobillo/must-gather-master-replacement.tar.gz
- Playbook running the procedure attached.

Comment 1 Sam Batschelet 2021-09-13 17:55:40 UTC
Can you please verify each step you performed, rather than just a link to the steps?

For example, are you sure that you stopped etcd by moving etcd-pod.yaml out of /etc/kubernetes/manifests?

Then removed the data directory of the failed member.

`rm -rf /var/lib/etcd`

Next removed the etcd member `etcdctl member remove $ID`

Then, after that, forced a new rollout.

`oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge `

Finally, I assume master-0 was the member you replaced?
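
For reference, a consolidated sketch of the sequence above. Assumptions: the node-local commands run as root on the failed control-plane host (e.g. from an oc debug shell after chroot /host), /tmp as the destination for the moved manifest is arbitrary, and $ID is the failed member's ID as shown by etcdctl member list:

# on the failed node (root shell, e.g. `oc debug node/<node>` + `chroot /host`):
$ mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp   # stop the static etcd pod
$ rm -rf /var/lib/etcd                              # remove the failed member's data directory
# from a healthy etcd pod (e.g. `oc rsh -n openshift-etcd <healthy-etcd-pod>`):
$ etcdctl member remove $ID
# then force a new rollout:
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge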

Comment 2 Sam Batschelet 2021-09-13 21:18:01 UTC
I see the ansible logs now... reviewing.

Comment 3 Sam Batschelet 2021-09-14 12:48:00 UTC
I believe this is an upstream bug related to new logic around the handling of membership data[1],[2].

[1] https://github.com/etcd-io/etcd/issues/13196
[2] https://github.com/etcd-io/etcd/pull/13348

Comment 5 rlobillo 2021-09-15 12:58:50 UTC
Thanks Sam. Removing NEEDINFO flag.

Comment 12 errata-xmlrpc 2022-03-10 16:10:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

