Bug 1883686

Summary: When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.
Product: OpenShift Container Platform
Reporter: rvanderp
Component: Documentation
Assignee: Andrea Hoffer <ahoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Xiaoli Tian <xtian>
Severity: low
Docs Contact: Vikram Goyal <vigoyal>
Priority: medium
Version: 4.6
CC: aos-bugs, jokerman, lshilin, sbatsche
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-12-01 13:44:25 UTC
Type: Bug

Description rvanderp 2020-09-29 20:50:13 UTC
Description of problem:
When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.

Version-Release number of selected component (if applicable):
4.5.7, UPI on bare metal

How reproducible:
Consistently

Steps to Reproduce:
1. Power off and destroy a master node by 
2. Re-ignite a master node with the same name
3. Observe that the number of running etcd static pods remains at 2

Actual results:
etcd operator does not recognize that the new master does not have a deployed etcd pod

Expected results:
etcd operator will deploy the etcd static pod once the node rejoins the cluster

Additional info:

I'm honestly torn on whether this is a documentation or functional bug. It did create a confusing situation that we weren't entirely sure how to address.

Link to followed doc: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 2 Sam Batschelet 2020-10-02 18:34:16 UTC
If you are removing a node as documented[1], you must "Remove the unhealthy etcd member by providing the ID to the etcdctl member remove command". I didn't see that in your workflow.
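
For reference, a minimal sketch of that removal step, with placeholder names and IDs (none of these values come from this bug):

# connect to a surviving etcd pod (pod name is a placeholder)
oc rsh -n openshift-etcd etcd-<surviving-master>
# list the members and note the ID of the failed node
etcdctl member list -w table
# remove the failed member by its ID from the table above
etcdctl member remove <member-id>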

I think in general the operator will see the change in nodes and then check whether etcd should be scaled. If you don't remove the old etcd member, it thinks the work is done. Please verify whether the following sequence works (rough commands are sketched after the list):

1.) remove etcd member on node
2.) delete node
3.) create node with same name
4.) observe if new pod starts
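
Roughly (the node name is a placeholder, and step 1 assumes you are rsh'd into a surviving etcd pod as in the sketch above):

# 1) remove the failed etcd member first (see the etcdctl sketch above)
# 2) delete the node object so the operator observes the change
oc delete node <failed-master-node>
# 3) re-create the node with the same name, then wait for it to register
oc get nodes -w
# 4) watch for a new etcd static pod on the replacement node
oc get pods -n openshift-etcd -w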

The workaround would be to roll a new static pod revision for etcd.

oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "replace-node-'"$( date --rfc-3339=ns )"'"}}' --type=merge
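
If you try the forced redeployment, a couple of illustrative ways (not taken from this bug) to confirm the new revision rolls out:

# watch the etcd static pods get replaced at the new revision
oc get pods -n openshift-etcd
# or inspect the operator's per-node revision status (nodeStatuses)
oc get etcd cluster -o yaml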

In the future the operator will be smarter about scaling down.

[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 3 rvanderp 2020-10-09 12:28:41 UTC
My apologies for the delay in responding; I have been on PTO this week. We actually did remove the member per the documentation:

etcdctl member list -w table
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |               NAME               |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
| 4cdbd57e339eefb8 | started | redacted.master1                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
| 5291ed806ee6265e | started | redacted.master2                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+

Until we ran `oc delete node` and rebooted the node we had replaced, the static pod didn't roll out. We had discussed using that patch to force a rollout, but it wasn't in the section specific to replacing an unhealthy member. This issue was encountered by the customer while running through the disaster recovery docs.

Comment 4 Sam Batschelet 2020-10-09 13:47:48 UTC
We will review updating the docs, but it would be great if you could test whether rolling a new revision resolves the issue. I just wonder if the issue could possibly be on the node side. The operator watches for changes in nodes. If the node is never removed, then the operator would not have observability into this action, in which case we would not create a new etcd static pod revision. So the manner in which you deleted the node is also important information.
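
As an aside, a quick illustrative check (not something done in the original report) for whether the operator could have observed the change is to confirm the old Node object is actually gone:

# if the failed master still shows up here, the operator never saw a node change
oc get nodes -l node-role.kubernetes.io/master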

> 1. Power off and destroy a master node by 

Can you flesh out your complete process here?

Comment 5 rvanderp 2020-10-09 15:41:25 UTC
Your explanation makes complete sense. The node to be replaced was shut down, its VM was deleted, and a new node was spun up to replace it. After looking over the PR, I think it addresses the concern. Also, fwiw, it was noted just how much simpler this process has become. Cool stuff. Thanks!