Bug 1941604 - Cluster unstable replacing an unhealthy etcd member
Summary: Cluster unstable replacing an unhealthy etcd member
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Maysa Macedo
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On: 1938280
Blocks:
Reported: 2021-03-22 13:28 UTC by OpenShift BugZilla Robot
Modified: 2021-04-13 23:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-13 23:43:27 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1032 | 0 | None | open | [release-4.5] Bug 1941604: Include LB members for Machines created on day-2 operation | 2021-03-25 13:53:07 UTC
Red Hat Product Errata | RHBA-2021:1015 | 0 | None | None | None | 2021-04-13 23:43:32 UTC

Comment 3 Jon Uriarte 2021-03-29 16:44:28 UTC
Verified in OCP 4.5.0-0.nightly-2020-04-24-091134 on top of RHOS-16.1-RHEL-8-20210311.n.1 with OVN-Octavia.

The new master is successfully created with a different port name:

$ openstack port list -c Name -f value| grep master
ostest-pz7zk-master-port-1
ostest-pz7zk-master-3
ostest-pz7zk-master-port-0

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Remove the master-2 machine:
$ oc -n openshift-machine-api get machines
NAME                        PHASE     TYPE        REGION      ZONE   AGE
ostest-pz7zk-master-0       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-master-1       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-master-2       Running   m4.xlarge   regionOne   nova   33m
ostest-pz7zk-worker-bmpb2   Running   m4.xlarge   regionOne   nova   18m
ostest-pz7zk-worker-swgx5   Running   m4.xlarge   regionOne   nova   18m
ostest-pz7zk-worker-vfbqn   Running   m4.xlarge   regionOne   nova   18m

$ oc -n openshift-machine-api delete machine ostest-pz7zk-master-2

2. Remove etcd member (ostest-pz7zk-master-2):
$ oc rsh -n openshift-etcd etcd-ostest-pz7zk-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-pz7zk-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 5c13780e866b473b | started | ostest-pz7zk-master-1 |   https://10.196.1.4:2380 |   https://10.196.1.4:2379 |      false |
| 794465c1dc67a32b | started | ostest-pz7zk-master-0 | https://10.196.1.227:2380 | https://10.196.1.227:2379 |      false |
| 8f899f0814a46849 | started | ostest-pz7zk-master-2 | https://10.196.2.247:2380 | https://10.196.2.247:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+

sh-4.4# etcdctl member remove 8f899f0814a46849
sh-4.4# exit
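
To avoid copying the member ID by hand, the lookup in step 2 can be scripted. This is only a sketch, assuming the table layout shown above (ID in the first column, NAME in the third); the function name etcd_member_id is hypothetical and not part of etcdctl or oc.

```shell
# Print the ID of the etcd member with the given name.
# Reads `etcdctl member list -w table` output on stdin.
# Assumption: columns are laid out as in the table above.
etcd_member_id() {
    awk -F'|' -v name="$1" '
        NF >= 4 {
            # With "|" as separator, $2 is the ID and $4 is the NAME.
            id = $2; member = $4
            gsub(/[ \t]/, "", id)
            gsub(/[ \t]/, "", member)
            if (member == name) print id
        }'
}

# Usage inside the etcdctl container (needs cluster access):
#   etcdctl member list -w table | etcd_member_id ostest-pz7zk-master-2
```

The border lines of the table contain no "|" and the header row has NAME in the third column, so neither matches a real member name.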

3. Remove the etcd secrets of the removed master (ostest-pz7zk-master-2):

$ oc -n openshift-etcd get secrets | grep ostest-pz7zk-master-2
etcd-peer-ostest-pz7zk-master-2              kubernetes.io/tls                     2      31m
etcd-serving-metrics-ostest-pz7zk-master-2   kubernetes.io/tls                     2      31m
etcd-serving-ostest-pz7zk-master-2           kubernetes.io/tls                     2      31m

$ oc -n openshift-etcd delete secret etcd-peer-ostest-pz7zk-master-2 etcd-serving-metrics-ostest-pz7zk-master-2 etcd-serving-ostest-pz7zk-master-2
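
The grep in step 3 would also match a node whose name merely starts with the same string (for example a hypothetical ostest-pz7zk-master-20). A stricter filter could look like the sketch below; etcd_secrets_for_node is a hypothetical helper name, and the three secret name prefixes are taken from the listing above.

```shell
# Print the etcd secret names that belong exactly to the given node.
# Reads `oc -n openshift-etcd get secrets` output on stdin.
# Assumption: the secret names follow the three patterns listed above.
etcd_secrets_for_node() {
    awk -v node="$1" '
        $1 == ("etcd-peer-" node) ||
        $1 == ("etcd-serving-" node) ||
        $1 == ("etcd-serving-metrics-" node) { print $1 }'
}

# Usage (needs cluster access):
#   oc -n openshift-etcd get secrets | etcd_secrets_for_node ostest-pz7zk-master-2
```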


4. Create new master machine:

$ oc get machine ostest-pz7zk-master-0 -n openshift-machine-api -o yaml > new-master-machine.yaml

Edit new-master-machine.yaml: remove the entire status section and the annotations, and change the name field to the new name (ostest-pz7zk-master-3).

$ oc apply -f new-master-machine.yaml
machine.machine.openshift.io/ostest-pz7zk-master-3 created

$ oc get machines -A
NAMESPACE               NAME                        PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pz7zk-master-0       Running   m4.xlarge   regionOne   nova   158m
openshift-machine-api   ostest-pz7zk-master-1       Running   m4.xlarge   regionOne   nova   158m
openshift-machine-api   ostest-pz7zk-master-3       Running   m4.xlarge   regionOne   nova   13m
openshift-machine-api   ostest-pz7zk-worker-bmpb2   Running   m4.xlarge   regionOne   nova   143m
openshift-machine-api   ostest-pz7zk-worker-swgx5   Running   m4.xlarge   regionOne   nova   143m
openshift-machine-api   ostest-pz7zk-worker-vfbqn   Running   m4.xlarge   regionOne   nova   143m

$ openstack port list | grep master
| a9d7e427-6c05-45cd-b7bf-b619afa04611 | ostest-pz7zk-master-port-1                           | fa:16:3e:7d:5d:a9 | ip_address='10.196.1.4', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'     | ACTIVE |
| af2ca250-0bd7-4c16-bc6c-72d5f5710f37 | ostest-pz7zk-master-3                                | fa:16:3e:42:32:02 | ip_address='10.196.1.208', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'   | ACTIVE |
| e90c91fb-32d4-49ad-a260-b5029baa342a | ostest-pz7zk-master-port-0                           | fa:16:3e:f5:6d:ee | ip_address='10.196.1.227', subnet_id='0bc1dbe7-88d2-497d-8d9f-e4d55417c349'   | ACTIVE |


$ oc get nodes
NAME                        STATUS   ROLES    AGE     VERSION
ostest-pz7zk-master-0       Ready    master   157m    v1.18.3+cdb0358
ostest-pz7zk-master-1       Ready    master   157m    v1.18.3+cdb0358
ostest-pz7zk-master-3       Ready    master   5m22s   v1.18.3+cdb0358
ostest-pz7zk-worker-bmpb2   Ready    worker   134m    v1.18.3+cdb0358
ostest-pz7zk-worker-swgx5   Ready    worker   134m    v1.18.3+cdb0358
ostest-pz7zk-worker-vfbqn   Ready    worker   135m    v1.18.3+cdb0358
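
The node check above can also be automated. The following is a minimal sketch, assuming the default `oc get nodes` column order shown above (NAME, STATUS, ROLES); count_ready_masters is a hypothetical helper name.

```shell
# Count nodes that carry the master role and report exactly "Ready".
# Reads `oc get nodes` output on stdin.
# Assumption: STATUS is column 2 and ROLES is column 3, as shown above.
count_ready_masters() {
    awk 'NR > 1 && $2 == "Ready" {
             # ROLES may be a comma-separated list such as "master,worker".
             n = split($3, roles, ",")
             for (i = 1; i <= n; i++)
                 if (roles[i] == "master") { ready++; break }
         }
         END { print ready + 0 }'
}

# Usage (needs cluster access):
#   oc get nodes | count_ready_masters
```

A node in a state such as "Ready,SchedulingDisabled" is deliberately not counted, since the comparison on STATUS is exact.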


$ oc rsh -n openshift-etcd etcd-ostest-pz7zk-master-0
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 463e8f41186ae6db | started | ostest-pz7zk-master-3 | https://10.196.1.208:2380 | https://10.196.1.208:2379 |      false |
| 5c13780e866b473b | started | ostest-pz7zk-master-1 |   https://10.196.1.4:2380 |   https://10.196.1.4:2379 |      false |
| 794465c1dc67a32b | started | ostest-pz7zk-master-0 | https://10.196.1.227:2380 | https://10.196.1.227:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+

Comment 5 errata-xmlrpc 2021-04-13 23:43:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.37 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1015

