Bug 1883686
Summary: | When replacing a failed master node, unless the node is deleted via `oc delete node` and is re-added, the etcd operator does not appear to deploy the etcd static pod. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | rvanderp |
Component: | Documentation | Assignee: | Andrea Hoffer <ahoffer> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Xiaoli Tian <xtian> |
Severity: | low | Docs Contact: | Vikram Goyal <vigoyal> |
Priority: | medium | ||
Version: | 4.6 | CC: | aos-bugs, jokerman, lshilin, sbatsche |
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-12-01 13:44:25 UTC | Type: | Bug |
Description
rvanderp
2020-09-29 20:50:13 UTC
If you are removing a node as documented [1], you must "Remove the unhealthy etcd member by providing the ID to the etcdctl member remove command". I didn't see that in your workflow. I think that, in general, the operator sees the change in nodes and then checks whether etcd should be scaled. If you don't remove the old etcd member, the operator thinks its work is done.

Please verify that this does not work.

1.) remove etcd member on node
2.) delete node
3.) create node with same name
4.) observe if new pod starts

The workaround would be to roll a new static pod revision for etcd.

    oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "replace-node-'"$( date --rfc-3339=ns )"'"}}' --type=merge

In the future the operator will be smarter about scaling down.

[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

My apologies for the delay in responding. I have been on PTO this week. We actually did remove the member per the documentation:

etcdctl member list -w table
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
| ID               | STATUS  | NAME                             | PEER ADDRS                 | CLIENT ADDRS               | IS LEARNER |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+
| 4cdbd57e339eefb8 | started | redacted.master1                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      | false      |
| 5291ed806ee6265e | started | redacted.master2                 | https://10.x.x.x:2380      | https://10.x.x.x:2379      | false      |
+------------------+---------+----------------------------------+----------------------------+----------------------------+------------+

Until we ran `oc delete node` and rebooted the node we replaced, the static pod didn't roll out. We had discussed using that patch to force a rollout, but it wasn't in the section specific to replacing an unhealthy member.
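For reference, the forced-redeployment workaround quoted in this comment can be wrapped in a small helper so the payload can be inspected before it is applied. This is only a sketch: the function name `build_redeploy_patch` is hypothetical, the default timestamp assumes GNU `date`, and the `oc patch` call is left commented out because it needs a logged-in cluster-admin session.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or the etcd operator): builds the
# merge-patch payload used by the workaround in this bug.
# $1 is an optional reason suffix; it defaults to a GNU date timestamp,
# which makes the reason unique so a new revision is rolled each time.
build_redeploy_patch() {
  local stamp="${1:-$(date --rfc-3339=ns)}"
  printf '{"spec": {"forceRedeploymentReason": "replace-node-%s"}}' "$stamp"
}

# Inspect the payload first:
#   build_redeploy_patch
# Then apply it (requires cluster-admin and oc on PATH):
#   oc patch etcd cluster --type=merge -p "$(build_redeploy_patch)"
```

Changing the `forceRedeploymentReason` value is what triggers the rollout, which is why the timestamp is embedded in the string.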
This issue was encountered by the customer while running through the disaster recovery docs. We will review updating the docs, but it would be great if you could test whether rolling a new revision resolves the issue.

I just wonder if the issue could possibly be on the node side. The operator watches for changes in nodes. If the node is never removed, the operator has no visibility into this action, in which case we would not create a new etcd static pod revision. So the manner in which you delete the node is also important information.
> 1. Power off and destroy a master node by
Can you flesh out your complete process here?
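To make the symptom discussed above easier to confirm (a control plane node that has no etcd static pod), the following sketch compares node names against pod names. It assumes the etcd static pods are named `etcd-<node name>` in the `openshift-etcd` namespace; the function name `missing_etcd_pods` is hypothetical, and the `oc` calls are commented out because they require cluster access.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every node name from $1 that has no pod
# called "etcd-<node>" in the whitespace-separated pod list in $2.
missing_etcd_pods() {
  local nodes="$1" pods="$2" node
  for node in $nodes; do
    # Collapse newlines/tabs to single spaces so exact names can be matched.
    case " $(echo $pods) " in
      *" etcd-$node "*) ;;        # an etcd static pod exists for this node
      *) echo "$node" ;;          # no etcd pod found for this node
    esac
  done
}

# Example against a live cluster (assumes oc is logged in):
#   nodes="$(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[*].metadata.name}')"
#   pods="$(oc get pods -n openshift-etcd -o jsonpath='{.items[*].metadata.name}')"
#   missing_etcd_pods "$nodes" "$pods"
```

An empty result would suggest every listed node has an etcd pod; a non-empty result lists the nodes where the rollout has not happened yet.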
Your explanation makes complete sense. The node to be replaced was shut down, its VM was deleted, and a new node was spun up to replace it.

After looking over the PR, I think it addresses the concern. Also, fwiw, it was noted just how much simpler this process has become. Cool stuff. Thanks!

Marking this BZ closed as the PR has been merged and the changes are live here: https://docs.openshift.com/container-platform/4.6/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member