Bug 1883686 - When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.
Summary: When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrea Hoffer
QA Contact: Xiaoli Tian
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-29 20:50 UTC by rvanderp
Modified: 2020-12-01 13:44 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 13:44:25 UTC
Target Upstream Version:




Links
Github openshift/openshift-docs pull 26283 (closed): Bug 1883686: restore-replace-stopped-etcd-member: add section covering forcing new etcd revision (last updated 2020-12-01 13:40:21 UTC)

Description rvanderp 2020-09-29 20:50:13 UTC
Description of problem:
When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod.

Version-Release number of selected component (if applicable):
4.5.7, UPI on bare metal

How reproducible:
Consistently

Steps to Reproduce:
1. Power off and destroy a master node by 
2. Reignite a master node of the same name
3. The number of running etcd static pods remained at 2

Actual results:
etcd operator does not recognize that the new master does not have a deployed etcd pod

Expected results:
etcd operator will deploy the etcd static pod once the node rejoins the cluster

Additional info:

I'm honestly torn on whether this is a docs bug or a functional bug.  It did create a confusing situation that we weren't entirely sure how to address.

Link to followed doc: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 2 Sam Batschelet 2020-10-02 18:34:16 UTC
If you are removing a node as documented[1], you must "Remove the unhealthy etcd member by providing the ID to the etcdctl member remove command". I didn't see that in your workflow.
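
For reference, a minimal sketch of that removal step (run from an etcd pod on a surviving control plane node, per the linked procedure; the pod name and member ID below are placeholders):

oc rsh -n openshift-etcd <surviving-etcd-pod>   # open a shell in a healthy etcd pod (placeholder name)
etcdctl member list -w table                    # note the ID of the failed member
etcdctl member remove <member-id>               # <member-id> is the failed member's ID from the list above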

I think in general the operator will see the change in nodes and then check whether etcd should be scaled. If you don't remove the old etcd member, it thinks its work is done. Please verify whether the following works:

1.) remove etcd member on node
2.) delete node
3.) create node with same name
4.) observe if new pod starts
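
For step 4, something like the following should show whether a new etcd static pod lands on the replacement node (<new-node-name> is a placeholder):

oc -n openshift-etcd get pods -o wide | grep <new-node-name>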

The workaround would be to roll a new static pod revision for etcd.

oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "replace-node-'"$( date --rfc-3339=ns )"'"}}' --type=merge
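
To confirm the forced rollout, a couple of hedged checks (assuming the NodeInstallerProgressing condition reported by the static pod operators and the default openshift-etcd namespace):

oc get etcd cluster -o jsonpath='{.status.conditions[?(@.type=="NodeInstallerProgressing")].message}'   # reports installer progress for the new revision
oc -n openshift-etcd get pods -w   # or simply watch the etcd static pods cycle to the new revision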

In the future the operator will be smarter about scaling down.

[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Comment 3 rvanderp 2020-10-09 12:28:41 UTC
My apologies for the delay in responding.  I have been on PTO this week.  We actually did remove the member per the documentation:

etcdctl member list -w table
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |               NAME               |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
| 4cdbd57e339eefb8 | started | redacted.master1                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
| 5291ed806ee6265e | started | redacted.master2                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      |      false |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+

Until we ran `oc delete node` and rebooted the node we replaced, the static pod didn't roll out.  We had discussed using that patch to force a rollout, but it wasn't in the section specific to replacing an unhealthy member.  This issue was encountered by the customer while running through the disaster recovery docs.
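
For reference, a minimal sketch of the delete step mentioned above, with <failed-node> as a placeholder for the replaced node's name:

oc delete node <failed-node>   # remove the stale node object so the operator can observe the change
oc get nodes -w                # wait for the replacement node to register and become Ready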

Comment 4 Sam Batschelet 2020-10-09 13:47:48 UTC
We will review updating the docs, but it would be great if you could test whether rolling a new revision resolves the issue. I just wonder if the issue could possibly be on the node side. The operator watches for changes in nodes. If the node is never removed, the operator would have no visibility into this action, in which case we would not create a new etcd static pod revision. So the manner in which you delete the node is also important information.
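
Since the operator keys off node objects, a quick check of what it currently sees (a sketch, assuming the standard control plane role label):

oc get nodes -l node-role.kubernetes.io/master -o wide   # node objects the etcd operator can observe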

> 1. Power off and destroy a master node by 

Can you flesh out your complete process here?

Comment 5 rvanderp 2020-10-09 15:41:25 UTC
Your explanation makes complete sense.  The node to be replaced was shut down, its VM was deleted, and a new node was spun up to replace it.  After looking over the PR, I think it addresses the concern.  Also, FWIW, it was noted just how much simpler this process has become.  Cool stuff.  Thanks!

