Bug 1938280 - Cluster unstable replacing an unhealthy etcd member
Summary: Cluster unstable replacing an unhealthy etcd member
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Maysa Macedo
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On: 1935473
Blocks: 1941604
Reported: 2021-03-12 17:01 UTC by OpenShift BugZilla Robot
Modified: 2021-03-30 17:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-30 17:03:15 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1018 | 0 | None | open | Bug 1938280: Include LB members for Machines created on day-2 operation | 2021-03-15 16:41:49 UTC
Red Hat Product Errata | RHBA-2021:0952 | 0 | None | None | None | 2021-03-30 17:03:29 UTC

Comment 3 Jon Uriarte 2021-03-25 13:46:10 UTC
Verified on OCP 4.6.0-0.nightly-2021-03-21-131139 on top of OSP 13.0.14 (2021-01-20.1) using OVS and amphora provider.

The new master is created successfully, with a different port name:

$ openstack port list -c Name -f value| grep master
ostest-snksz-master-port-0
ostest-snksz-master-port-1
ostest-snksz-master-3

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Create new master manifest:
$ cat new-master-machine.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: ostest-snksz-master-3
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ostest-snksz-rhcos
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-snksz-nodes
            tags: openshiftClusterID=ostest-snksz
      securityGroups:
      - filter: {}
        name: ostest-snksz-master
      serverGroupName: ostest-snksz-master
      serverMetadata:
        Name: ostest-snksz-master
        openshiftClusterID: ostest-snksz
      tags:
      - openshiftClusterID=ostest-snksz
      trunk: true
      userDataSecret:
        name: master-user-data

2. Remove master-2
$ oc -n openshift-machine-api get machines
NAME                          PHASE     TYPE        REGION      ZONE   AGE
ostest-snksz-master-0         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-master-1         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-master-2         Running   m4.xlarge   regionOne   nova   43m
ostest-snksz-worker-0-9c4rw   Running   m4.xlarge   regionOne   nova   33m
ostest-snksz-worker-0-pjj2x   Running   m4.xlarge   regionOne   nova   33m
ostest-snksz-worker-0-xvlvb   Running   m4.xlarge   regionOne   nova   33m

$ oc -n openshift-machine-api delete machine ostest-snksz-master-2

3. Remove failed etcd member (ostest-snksz-master-2):
$ oc rsh -n openshift-etcd etcd-ostest-snksz-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-snksz-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 4f06537b0009ab3f | started | ostest-snksz-master-1 |   https://10.196.3.3:2380 |   https://10.196.3.3:2379 |      false |
| 98618b9a875b38e8 | started | ostest-snksz-master-2 |  https://10.196.1.99:2380 |  https://10.196.1.99:2379 |      false |
| f6fcd785775989d7 | started | ostest-snksz-master-0 | https://10.196.2.121:2380 | https://10.196.2.121:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+

sh-4.4# etcdctl member remove 98618b9a875b38e8     
sh-4.4# exit
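
Step 3 requires copying the member ID out of the table by hand. As a minimal sketch, the lookup could be scripted by parsing the table output; member_id_by_name is a hypothetical helper (not part of etcdctl) and assumes the column layout shown above:

```shell
# Hypothetical helper: print the etcd member ID for a given member name.
# Reads `etcdctl member list -w table` output on stdin.
# Assumes the column order shown above: ID | STATUS | NAME | ...
member_id_by_name() {
  awk -F'|' -v name="$1" '
    NF >= 4 {                       # skip the +----+ border lines
      id = $2; member = $4
      gsub(/[[:space:]]/, "", id)   # strip column padding
      gsub(/[[:space:]]/, "", member)
      if (member == name) print id
    }'
}
```

Piping the table from the step above into `member_id_by_name ostest-snksz-master-2` should print the ID that is passed to `etcdctl member remove`.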

4. Remove secrets from failed master:

$ oc -n openshift-etcd get secrets | grep ostest-snksz-master-2
etcd-peer-ostest-snksz-master-2              kubernetes.io/tls                     2      37m
etcd-serving-metrics-ostest-snksz-master-2   kubernetes.io/tls                     2      37m
etcd-serving-ostest-snksz-master-2           kubernetes.io/tls                     2      37m

$ oc -n openshift-etcd delete secret etcd-peer-ostest-snksz-master-2 etcd-serving-metrics-ostest-snksz-master-2 etcd-serving-ostest-snksz-master-2
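
The three secret names in step 4 follow one pattern per node. A small sketch that builds them from the node name, assuming only the three prefixes listed above exist for the failed master; etcd_secret_names is a hypothetical helper:

```shell
# Hypothetical helper: print the etcd secret names for one master node,
# one per line, using the three prefixes seen in the listing above.
etcd_secret_names() {
  for prefix in etcd-peer etcd-serving-metrics etcd-serving; do
    printf '%s-%s\n' "$prefix" "$1"
  done
}
```

The output could then be passed to the delete command, for example `oc -n openshift-etcd delete secret $(etcd_secret_names ostest-snksz-master-2)`. Checking the names against the `get secrets` listing first is safer than deleting blindly.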


5. Create new master machine:
$ oc apply -f new-master-machine.yaml
machine.machine.openshift.io/ostest-snksz-master-3 created

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-snksz-master-0         Running   m4.xlarge   regionOne   nova   85m
openshift-machine-api   ostest-snksz-master-1         Running   m4.xlarge   regionOne   nova   85m
openshift-machine-api   ostest-snksz-master-3         Running   m4.xlarge   regionOne   nova   5m56s
openshift-machine-api   ostest-snksz-worker-0-9c4rw   Running   m4.xlarge   regionOne   nova   75m
openshift-machine-api   ostest-snksz-worker-0-pjj2x   Running   m4.xlarge   regionOne   nova   75m
openshift-machine-api   ostest-snksz-worker-0-xvlvb   Running   m4.xlarge   regionOne   nova   75m

$ openstack port list | grep master
| 466ca7e9-d282-42c6-a855-37b9bb1e3212 | ostest-snksz-master-port-0                           | fa:16:3e:23:09:35 | ip_address='10.196.2.121', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069'| ACTIVE |
| bba49258-571c-49a7-b268-5277f55472f7 | ostest-snksz-master-3                                | fa:16:3e:6e:25:68 | ip_address='10.196.1.23', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069' | ACTIVE |
| fec14ade-774f-407d-b2d6-94fe7069ca77 | ostest-snksz-master-port-1                           | fa:16:3e:08:76:58 | ip_address='10.196.3.3', subnet_id='fe034996-da63-4b51-a1bc-f1b8452a9069'  | ACTIVE |


(shiftstack) [stack@undercloud-0 ~]$ oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
ostest-snksz-master-0         Ready    master   130m   v1.19.0+263ee0d
ostest-snksz-master-1         Ready    master   129m   v1.19.0+263ee0d
ostest-snksz-master-3         Ready    master   48m    v1.19.0+263ee0d
ostest-snksz-worker-0-9c4rw   Ready    worker   116m   v1.19.0+263ee0d
ostest-snksz-worker-0-pjj2x   Ready    worker   112m   v1.19.0+263ee0d
ostest-snksz-worker-0-xvlvb   Ready    worker   116m   v1.19.0+263ee0d
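
The node check above is done by eye. As a rough sketch, the same check can be scripted against the tabular output; not_ready_nodes is a hypothetical helper and assumes the default `oc get nodes` columns shown above:

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Reads `oc get nodes` output on stdin; the first line is the header.
# Note: a status such as "Ready,SchedulingDisabled" is also reported here.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Running `oc get nodes | not_ready_nodes` prints nothing when every node is Ready, as in the listing above.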


$ oc rsh -n openshift-etcd etcd-ostest-snksz-master-0
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 19e9bc16baef4507 | started | ostest-snksz-master-3 |  https://10.196.1.23:2380 |  https://10.196.1.23:2379 |      false |
| 4f06537b0009ab3f | started | ostest-snksz-master-1 |   https://10.196.3.3:2380 |   https://10.196.3.3:2379 |      false |
| f6fcd785775989d7 | started | ostest-snksz-master-0 | https://10.196.2.121:2380 | https://10.196.2.121:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
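
To confirm the final state without reading the table manually, a sketch like the following counts the members reported as started. started_member_count is a hypothetical helper and assumes STATUS is the second column, as in the table above:

```shell
# Hypothetical helper: count etcd members whose STATUS column is "started".
# Reads `etcdctl member list -w table` output on stdin.
started_member_count() {
  awk -F'|' '
    NF >= 4 {                      # skip the +----+ border lines
      s = $3
      gsub(/[[:space:]]/, "", s)   # strip column padding
      if (s == "started") n++
    }
    END { print n + 0 }'
}
```

For the table above the count is 3, matching the three masters (master-0, master-1, and the replacement master-3).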

Comment 5 errata-xmlrpc 2021-03-30 17:03:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0952

