Bug 1883686

Summary: When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.
Product: OpenShift Container Platform
Reporter: rvanderp
Component: Documentation
Assignee: Andrea Hoffer <ahoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Xiaoli Tian <xtian>
Severity: low
Docs Contact: Vikram Goyal <vigoyal>
Priority: medium
Version: 4.6
CC: aos-bugs, jokerman, lshilin, sbatsche
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-12-01 13:44:25 UTC
Type: Bug

Description rvanderp 2020-09-29 20:50:13 UTC
Description of problem:
When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.

Version-Release number of selected component (if applicable):
4.5.7, UPI on bare metal

How reproducible:
Consistently

Steps to Reproduce:
1. Power off and destroy a master node by 
2. Re-ignite a master node with the same name
3. Observe that the number of running etcd static pods remains at 2

Actual results:
etcd operator does not recognize that the new master does not have a deployed etcd pod

Expected results:
etcd operator will deploy the etcd static pod once the node rejoins the cluster

Additional info:

I'm honestly torn on whether this is a documentation or functional bug. It did create a confusing situation that we weren't entirely sure how to address.

Link to followed doc: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 2 Sam Batschelet 2020-10-02 18:34:16 UTC
If you are removing a node as documented[1], you must "Remove the unhealthy etcd member by providing the ID to the etcdctl member remove command". I didn't see that in your workflow.
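
For reference, a minimal sketch of that removal step, with placeholder names and IDs (none of these values come from this bug):

# connect to a surviving etcd pod (pod name is a placeholder)
oc rsh -n openshift-etcd etcd-<surviving-master>
# list the members and note the ID of the failed node
etcdctl member list -w table
# remove the failed member by its ID from the table above
etcdctl member remove <member-id>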

I think in general the operator will see the change in nodes and then check whether etcd should be scaled. If you don't remove the old etcd member, it thinks the work is done. Please verify whether the following sequence works (rough commands are sketched after the list):

1.) remove etcd member on node
2.) delete node
3.) create node with same name
4.) observe if new pod starts
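
Roughly (the node name is a placeholder, and step 1 assumes you are rsh'd into a surviving etcd pod as in the sketch above):

# 1) remove the failed etcd member first (see the etcdctl sketch above)
# 2) delete the node object so the operator observes the change
oc delete node <failed-master-node>
# 3) re-create the node with the same name, then wait for it to register
oc get nodes -w
# 4) watch for a new etcd static pod on the replacement node
oc get pods -n openshift-etcd -w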

The workaround would be to roll a new static pod revision for etcd.

oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "replace-node-'"$( date --rfc-3339=ns )"'"}}' --type=merge
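
If you try the forced redeployment, a couple of illustrative ways (not taken from this bug) to confirm the new revision rolls out:

# watch the etcd static pods get replaced at the new revision
oc get pods -n openshift-etcd
# or inspect the operator's per-node revision status (nodeStatuses)
oc get etcd cluster -o yaml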

In the future the operator will be smarter about scaling down.

[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 3 rvanderp 2020-10-09 12:28:41 UTC
My apologies for the delay in responding; I have been on PTO this week. We actually did remove the member per the documentation:

etcdctl member list -w table
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |               NAME               |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
| 4cdbd57e339eefb8 | started | redacted.master1                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
| 5291ed806ee6265e | started | redacted.master2                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+

Until we ran `oc delete node` and rebooted the node we had replaced, the static pod didn't roll out. We had discussed using that patch to force a rollout, but it wasn't in the section specific to replacing an unhealthy member. This issue was encountered by the customer while running through the disaster recovery docs.

Comment 4 Sam Batschelet 2020-10-09 13:47:48 UTC
We will review updating the docs, but it would be great if you could test whether rolling a new revision resolves the issue. I just wonder if the issue could possibly be on the node side. The operator watches for changes in nodes. If the node is never removed, then the operator would not have observability into this action, in which case we would not create a new etcd static pod revision. So the manner in which you deleted the node is also important information.
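
As an aside, a quick illustrative check (not something done in the original report) for whether the operator could have observed the change is to confirm the old Node object is actually gone:

# if the failed master still shows up here, the operator never saw a node change
oc get nodes -l node-role.kubernetes.io/master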

> 1. Power off and destroy a master node by 

Can you flesh out your complete process here?

Comment 5 rvanderp 2020-10-09 15:41:25 UTC
Your explanation makes complete sense. The node to be replaced was shut down, its VM was deleted, and a new node was spun up to replace it. After looking over the PR, I think it addresses the concern. Also, fwiw, it was noted just how much simpler this process has become. Cool stuff. Thanks!