Bug 1941604

Summary: Cluster unstable replacing an unhealthy etcd member
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking
Networking sub component: kuryr
Assignee: Maysa Macedo <mdemaced>
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA
Severity: high
Priority: urgent
CC: adeshpan, akashem, aos-bugs, aship, bleanhar, eduen, gcheresh, hchatter, igarciam, juriarte, mdemaced, mdulko, mfojtik, oarribas, openshift-bugs-escalate, ppostler, sbatsche, scuppett, travi
Version: 4.5
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-04-13 23:43:27 UTC
Bug Depends On: 1938280

Comment 3 Jon Uriarte 2021-03-29 16:44:28 UTC
Verified in OCP 4.5.0-0.nightly-2020-04-24-091134 on top of RHOS-16.1-RHEL-8-20210311.n.1 with OVN-Octavia.

The new master is created successfully, and its port has a different name from the ports of the original masters:

$ openstack port list -c Name -f value| grep master
ostest-pz7zk-master-port-1
ostest-pz7zk-master-3
ostest-pz7zk-master-port-0

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Remove the master-2 machine (ostest-pz7zk-master-2):
$ oc -n openshift-machine-api get machines
NAME                        PHASE     TYPE        REGION      ZONE   AGE
ostest-pz7zk-master-0       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-master-1       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-master-2       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-worker-bmpb2   Running   m4.xlarge   regionOne   nova   18m
ostest-pz7zk-worker-swgx5   Running   m4.xlarge   regionOne   nova   18m
ostest-pz7zk-worker-vfbqn   Running   m4.xlarge   regionOne   nova   18m

$ oc -n openshift-machine-api delete machine ostest-pz7zk-master-2

2. Remove etcd member (ostest-pz7zk-master-2):
$ oc rsh -n openshift-etcd etcd-ostest-pz7zk-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-pz7zk-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 5c13780e866b473b | started | ostest-pz7zk-master-1 |   https://10.196.1.4:2380 |   https://10.196.1.4:2379 |      false |
| 794465c1dc67a32b | started | ostest-pz7zk-master-0 | https://10.196.1.227:2380 | https://10.196.1.227:2379 |      false |
| 8f899f0814a46849 | started | ostest-pz7zk-master-2 | https://10.196.2.247:2380 | https://10.196.2.247:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+

sh-4.4# etcdctl member remove 8f899f0814a46849     
sh-4.4# exit
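
The member ID in step 2 was copied by hand from the table. A minimal sketch of looking it up by member name instead is shown here; member_id_for_name is a hypothetical helper (not part of etcdctl or oc) that only parses the `etcdctl member list -w table` layout shown above, read from stdin:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ID of the etcd member whose NAME column
# matches $1. Input is the output of `etcdctl member list -w table` on stdin.
# Exits non-zero when no member with that name is listed.
member_id_for_name() {
  local name="$1"
  awk -F'|' -v name="$name" '
    NF >= 4 {
      # Columns after splitting on "|": $2 = ID, $3 = STATUS, $4 = NAME
      id = $2; member = $4
      gsub(/[[:space:]]/, "", id)
      gsub(/[[:space:]]/, "", member)
      if (member == name) { print id; found = 1 }
    }
    END { exit(found ? 0 : 1) }
  '
}
```

With the table above on stdin, `member_id_for_name ostest-pz7zk-master-2` prints 8f899f0814a46849, which is the value passed to `etcdctl member remove`. Whether awk is available inside the etcdctl container is an assumption; the table can also be saved and parsed from the workstation.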

3. Remove the etcd secrets of the removed master (ostest-pz7zk-master-2):

$ oc -n openshift-etcd get secrets | grep ostest-pz7zk-master-2
etcd-peer-ostest-pz7zk-master-2              kubernetes.io/tls                     2      31m
etcd-serving-metrics-ostest-pz7zk-master-2   kubernetes.io/tls                     2      31m
etcd-serving-ostest-pz7zk-master-2           kubernetes.io/tls                     2      31m

$ oc -n openshift-etcd delete secret etcd-peer-ostest-pz7zk-master-2 etcd-serving-metrics-ostest-pz7zk-master-2 etcd-serving-ostest-pz7zk-master-2
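
The grep in step 3 matches any secret whose name contains the node name, so it would also match a node such as ostest-pz7zk-master-20. A minimal sketch of a stricter filter follows; secrets_for_node is a hypothetical helper that reads the `oc -n openshift-etcd get secrets` output on stdin and assumes only the three secret name prefixes seen above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the per-node etcd secrets
# (peer, serving, serving-metrics) that belong exactly to node $1.
# Input is the output of `oc -n openshift-etcd get secrets` on stdin.
secrets_for_node() {
  local node="$1"
  awk -v node="$node" '
    $1 == "etcd-peer-" node ||
    $1 == "etcd-serving-" node ||
    $1 == "etcd-serving-metrics-" node { print $1 }
  '
}
```

The printed names can be reviewed and then passed to `oc -n openshift-etcd delete secret`, as in the command above.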


4. Create new master machine:

$ oc get machine ostest-pz7zk-master-0 -n openshift-machine-api -o yaml > new-master-machine.yaml

Edit new-master-machine.yaml: remove the entire status section and the annotations, and change the name field to the new machine name (ostest-pz7zk-master-3).
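
The same three edits can be scripted instead of made by hand. This is only a sketch under assumptions: it needs jq on the workstation, it works on the JSON form of the machine (`-o json`) rather than the YAML file used here, and new_machine_manifest is a hypothetical helper name. It removes exactly the fields listed above and nothing else:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Machine object as JSON on stdin, drop the
# status section and metadata.annotations, and set metadata.name to $1.
# Requires jq (an assumption; it is not used elsewhere in this report).
new_machine_manifest() {
  local new_name="$1"
  jq --arg name "$new_name" '
    del(.status)
    | del(.metadata.annotations)
    | .metadata.name = $name
  '
}
```

For example, the output of `oc get machine ostest-pz7zk-master-0 -n openshift-machine-api -o json` piped through `new_machine_manifest ostest-pz7zk-master-3` could be saved to a file and applied with `oc apply -f`, which accepts JSON as well as YAML.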

$ oc apply -f new-master-machine.yaml
machine.machine.openshift.io/ostest-pz7zk-master-3 created

$ oc get machines -A
NAMESPACE               NAME                        PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pz7zk-master-0       Running   m4.xlarge   regionOne   nova   158m
openshift-machine-api   ostest-pz7zk-master-1       Running   m4.xlarge   regionOne   nova   158m
openshift-machine-api   ostest-pz7zk-master-3       Running   m4.xlarge   regionOne   nova   13m
openshift-machine-api   ostest-pz7zk-worker-bmpb2   Running   m4.xlarge   regionOne   nova   143m
openshift-machine-api   ostest-pz7zk-worker-swgx5   Running   m4.xlarge   regionOne   nova   143m
openshift-machine-api   ostest-pz7zk-worker-vfbqn   Running   m4.xlarge   regionOne   nova   143m

$ openstack port list | grep master
| a9d7e427-6c05-45cd-b7bf-b619afa04611 | ostest-pz7zk-master-port-1                           | fa:16:3e:7d:5d:a9 | ip_address='10.196.1.4', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'     | ACTIVE |
| af2ca250-0bd7-4c16-bc6c-72d5f5710f37 | ostest-pz7zk-master-3                                | fa:16:3e:42:32:02 | ip_address='10.196.1.208', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'   | ACTIVE |
| e90c91fb-32d4-49ad-a260-b5029baa342a | ostest-pz7zk-master-port-0                           | fa:16:3e:f5:6d:ee | ip_address='10.196.1.227', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'   | ACTIVE |


$ oc get nodes
NAME                        STATUS   ROLES    AGE     VERSION
ostest-pz7zk-master-0       Ready    master   157m    v1.18.3+cdb0358
ostest-pz7zk-master-1       Ready    master   157m    v1.18.3+cdb0358
ostest-pz7zk-master-3       Ready    master   5m22s   v1.18.3+cdb0358
ostest-pz7zk-worker-bmpb2   Ready    worker   134m    v1.18.3+cdb0358
ostest-pz7zk-worker-swgx5   Ready    worker   134m    v1.18.3+cdb0358
ostest-pz7zk-worker-vfbqn   Ready    worker   135m    v1.18.3+cdb0358


$ oc rsh -n openshift-etcd etcd-ostest-pz7zk-master-0
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 463e8f41186ae6db | started | ostest-pz7zk-master-3 | https://10.196.1.208:2380 | https://10.196.1.208:2379 |      false |
| 5c13780e866b473b | started | ostest-pz7zk-master-1 |   https://10.196.1.4:2380 |   https://10.196.1.4:2379 |      false |
| 794465c1dc67a32b | started | ostest-pz7zk-master-0 | https://10.196.1.227:2380 | https://10.196.1.227:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
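
The final check above is visual: three members, all started. A minimal sketch of the same check as a script follows; members_all_started is a hypothetical helper that parses the `etcdctl member list -w table` layout from stdin. It only looks at the STATUS column, so it confirms membership, not overall cluster health:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the member table on stdin lists
# exactly $1 members and every one of them has STATUS "started".
members_all_started() {
  local expected="$1"
  awk -F'|' -v expected="$expected" '
    NF >= 7 {
      status = $3
      gsub(/[[:space:]]/, "", status)
      if (status == "STATUS") next   # skip the header row
      total++
      if (status == "started") started++
    }
    END { exit((total == expected && started == expected) ? 0 : 1) }
  '
}
```

With the table above on stdin, `members_all_started 3` exits with status 0; it exits non-zero if a member is missing or not yet started.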

Comment 5 errata-xmlrpc 2021-04-13 23:43:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.37 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1015