Bug 1938280

Summary: Cluster unstable replacing an unhealthy etcd member
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking
Assignee: Maysa Macedo <mdemaced>
Networking sub component: kuryr
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: adeshpan, akashem, aos-bugs, aship, bleanhar, eduen, gcheresh, hchatter, juriarte, mdemaced, mdulko, mfojtik, oarribas, openshift-bugs-escalate, ppostler, sbatsche, scuppett, travi
Version: 4.5   
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-30 17:03:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1935473    
Bug Blocks: 1941604    

Comment 3 Jon Uriarte 2021-03-25 13:46:10 UTC
Verified on OCP 4.6.0-0.nightly-2021-03-21-131139 on top of OSP 13.0.14 (2021-01-20.1) using OVS and amphora provider.

The new master is created successfully, with a different port name:

$ openstack port list -c Name -f value| grep master
ostest-snksz-master-port-0
ostest-snksz-master-port-1
ostest-snksz-master-3

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Create new master manifest:
$ cat new-master-machine.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: ostest-snksz-master-3
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ostest-snksz-rhcos
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-snksz-nodes
            tags: openshiftClusterID=ostest-snksz
      securityGroups:
      - filter: {}
        name: ostest-snksz-master
      serverGroupName: ostest-snksz-master
      serverMetadata:
        Name: ostest-snksz-master
        openshiftClusterID: ostest-snksz
      tags:
      - openshiftClusterID=ostest-snksz
      trunk: true
      userDataSecret:
        name: master-user-data

2. Remove the master-2 machine:
$ oc -n openshift-machine-api get machines
NAME                          PHASE     TYPE        REGION      ZONE   AGE
ostest-snksz-master-0         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-master-1         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-master-2         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-worker-0-9c4rw   Running   m4.xlarge   regionOne   nova   33m
ostest-snksz-worker-0-pjj2x   Running   m4.xlarge   regionOne   nova   33m
ostest-snksz-worker-0-xvlvb   Running   m4.xlarge   regionOne   nova   33m

$ oc -n openshift-machine-api delete machine ostest-snksz-master-2

3. Remove failed etcd member (ostest-snksz-master-2):
$ oc rsh -n openshift-etcd etcd-ostest-snksz-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-snksz-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 4f06537b0009ab3f | started | ostest-snksz-master-1 |   https://10.196.3.3:2380 |   https://10.196.3.3:2379 |      false |
| 98618b9a875b38e8 | started | ostest-snksz-master-2 |  https://10.196.1.99:2380 |  https://10.196.1.99:2379 |      false |
| f6fcd785775989d7 | started | ostest-snksz-master-0 | https://10.196.2.121:2380 | https://10.196.2.121:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+

sh-4.4# etcdctl member remove 98618b9a875b38e8
sh-4.4# exit
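
The member ID passed to `etcdctl member remove` above was copied by hand from the table. As a rough sketch only, it could instead be looked up by member name with a small filter over the table output. The function name `etcd_member_id` is hypothetical and not part of the original procedure, and the sketch assumes the table layout shown above (ID in the first column, NAME in the third).

```shell
# Hypothetical helper (not from the original report): print the ID of the
# etcd member whose NAME column equals $1. Expects the output of
# `etcdctl member list -w table` on stdin.
etcd_member_id() {
  awk -F'|' -v name="$1" '
    NF > 3 {
      id = $2; member = $4
      # Strip the padding that the table format adds around each cell.
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", member)
      if (member == name) print id
    }'
}

# Possible usage, assuming the table is piped in from the etcd pod:
#   oc rsh -n openshift-etcd etcd-ostest-snksz-master-0 \
#     etcdctl member list -w table | etcd_member_id ostest-snksz-master-2
```

Border lines and the header row never match a member name, so only the requested member's ID is printed. If no member matches, the output is empty, which should be checked before calling `etcdctl member remove`.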

4. Remove secrets from failed master:

$ oc -n openshift-etcd get secrets | grep ostest-snksz-master-2
etcd-peer-ostest-snksz-master-2              kubernetes.io/tls                     2      37m
etcd-serving-metrics-ostest-snksz-master-2   kubernetes.io/tls                     2      37m
etcd-serving-ostest-snksz-master-2           kubernetes.io/tls                     2      37m

$ oc -n openshift-etcd delete secret etcd-peer-ostest-snksz-master-2 etcd-serving-metrics-ostest-snksz-master-2 etcd-serving-ostest-snksz-master-2
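
The three secret names deleted above follow one pattern: a fixed prefix plus the node name. A minimal sketch, assuming that pattern holds for the node being replaced, can generate the names instead of typing them. The function name `etcd_secret_names` is hypothetical.

```shell
# Hypothetical helper (not from the original report): print the three etcd
# secret names for the node given as $1, one per line. The prefixes are the
# ones visible in the `oc get secrets` output above.
etcd_secret_names() (
  node="$1"
  for prefix in etcd-peer etcd-serving-metrics etcd-serving; do
    printf '%s-%s\n' "$prefix" "$node"
  done
)

# Possible usage:
#   oc -n openshift-etcd delete secret $(etcd_secret_names ostest-snksz-master-2)
```

The function body runs in a subshell, so its variables do not leak into the calling shell.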


5. Create new master machine:
$ oc apply -f new-master-machine.yaml
machine.machine.openshift.io/ostest-snksz-master-3 created

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-snksz-master-0         Running   m4.xlarge   regionOne   nova   85m
openshift-machine-api   ostest-snksz-master-1         Running   m4.xlarge   regionOne   nova   85m
openshift-machine-api   ostest-snksz-master-3         Running   m4.xlarge   regionOne   nova   5m56s
openshift-machine-api   ostest-snksz-worker-0-9c4rw   Running   m4.xlarge   regionOne   nova   75m
openshift-machine-api   ostest-snksz-worker-0-pjj2x   Running   m4.xlarge   regionOne   nova   75m
openshift-machine-api   ostest-snksz-worker-0-xvlvb   Running   m4.xlarge   regionOne   nova   75m

$ openstack port list | grep master
| 466ca7e9-d282-42c6-a855-37b9bb1e3212 | ostest-snksz-master-port-0 | fa:16:3e:23:09:35 | ip_address='10.196.2.121', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069' | ACTIVE |
| bba49258-571c-49a7-b268-5277f55472f7 | ostest-snksz-master-3      | fa:16:3e:6e:25:68 | ip_address='10.196.1.23', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069'  | ACTIVE |
| fec14ade-774f-407d-b2d6-94fe7069ca77 | ostest-snksz-master-port-1 | fa:16:3e:08:76:58 | ip_address='10.196.3.3', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069'   | ACTIVE |


$ oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
ostest-snksz-master-0         Ready    master   130m   v1.19.0+263ee0d
ostest-snksz-master-1         Ready    master   129m   v1.19.0+263ee0d
ostest-snksz-master-3         Ready    master   48m    v1.19.0+263ee0d
ostest-snksz-worker-0-9c4rw   Ready    worker   116m   v1.19.0+263ee0d
ostest-snksz-worker-0-pjj2x   Ready    worker   112m   v1.19.0+263ee0d
ostest-snksz-worker-0-xvlvb   Ready    worker   116m   v1.19.0+263ee0d
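
The node listing above was checked by eye. As a sketch only, the same check can be scripted by filtering the `oc get nodes` output. The function name `all_nodes_ready` is hypothetical, and the sketch assumes the default column layout shown above, with STATUS in the second column.

```shell
# Hypothetical helper (not from the original report): succeed only if every
# node listed on stdin has a STATUS of exactly "Ready". Expects the default
# `oc get nodes` output, including its header line.
all_nodes_ready() {
  awk 'NR > 1 { n++; if ($2 != "Ready") bad++ }
       END    { exit !(n > 0 && bad == 0) }'
}

# Possible usage:
#   oc get nodes | all_nodes_ready && echo "all nodes Ready"
```

The comparison is exact, so a status such as `Ready,SchedulingDisabled` or `NotReady` makes the check fail, as does an empty listing.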


$ oc rsh -n openshift-etcd etcd-ostest-snksz-master-0
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 19e9bc16baef4507 | started | ostest-snksz-master-3 |  https://10.196.1.23:2380 |  https://10.196.1.23:2379 |      false |
| 4f06537b0009ab3f | started | ostest-snksz-master-1 |   https://10.196.3.3:2380 |   https://10.196.3.3:2379 |      false |
| f6fcd785775989d7 | started | ostest-snksz-master-0 | https://10.196.2.121:2380 | https://10.196.2.121:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
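
The final table shows three members, all `started` and none a learner. A minimal sketch of the same check in script form follows. The function name `etcd_members_healthy` is hypothetical, and the sketch assumes the six-column table layout shown above.

```shell
# Hypothetical helper (not from the original report): succeed only if the
# `etcdctl member list -w table` output on stdin lists exactly $1 members,
# each with STATUS "started" and IS LEARNER "false".
etcd_members_healthy() {
  awk -F'|' -v want="$1" '
    # Data rows are the ones whose first cell is a lowercase hex member ID;
    # this skips the border lines and the header row.
    $2 ~ /^[ \t]*[0-9a-f]+[ \t]*$/ {
      total++
      if ($3 ~ /^[ \t]*started[ \t]*$/ && $7 ~ /^[ \t]*false[ \t]*$/) ok++
    }
    END { exit !(total == want && ok == want) }'
}

# Possible usage, run inside the etcd pod shell:
#   etcdctl member list -w table | etcd_members_healthy 3
```

The status is matched as a whole cell, so a member reported as `unstarted` is not counted as healthy.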

Comment 5 errata-xmlrpc 2021-03-30 17:03:15 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 bug fix update) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0952