Bug 1935473 - Cluster unstable replacing an unhealthy etcd member
Summary: Cluster unstable replacing an unhealthy etcd member
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Maysa Macedo
QA Contact: GenadiC
URL:
Whiteboard:
Depends On: 1933269
Blocks: 1938280
Reported: 2021-03-05 01:02 UTC by Maysa Macedo
Modified: 2021-03-25 01:53 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1933269
Environment:
Last Closed: 2021-03-25 01:53:00 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1002 | 0 | None | open | Bug 1935473: Include LB members for Machines created on day-2 operation | 2021-03-09 18:49:35 UTC
Red Hat Product Errata | RHBA-2021:0821 | 0 | None | None | None | 2021-03-25 01:53:17 UTC

Comment 3 rlobillo 2021-03-15 16:38:45 UTC
Verified on OCP4.7.0-0.nightly-2021-03-14-223051 over OSP16.1 (RHOS-16.1-RHEL-8-20201214.n.3) with OVN-Octavia.

The new master is created successfully, and its port has a different name from the install-time master ports:

$ openstack port list -c Name -f value| grep master
ostest-858gf-master-port-0
ostest-858gf-master-port-1
ostest-858gf-master-3
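
A minimal sketch of how the port-name check above could be scripted. In this listing the install-time masters use ports named <cluster>-master-port-N, while the port of the day-2 master carries the Machine name. The helper name day2_master_ports is hypothetical, and the naming pattern is inferred only from the output above.

```shell
# Hypothetical helper: read port names on stdin (one per line) and print
# the master ports that do NOT follow the install-time "-master-port-N"
# pattern seen in the listing above.
day2_master_ports() {
    grep -- '-master' | grep -Ev -- '-master-port-[0-9]+$'
}

# Usage (assumes the openstack CLI is already configured):
#   openstack port list -c Name -f value | day2_master_ports
```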

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Create new master manifest:
$ cat new-master-machine.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: ostest-858gf-master-3
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ostest-858gf-rhcos
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-858gf-nodes
            tags: openshiftClusterID=ostest-858gf
      securityGroups:
      - filter: {}
        name: ostest-858gf-master
      serverGroupName: ostest-858gf-master
      serverMetadata:
        Name: ostest-858gf-master
        openshiftClusterID: ostest-858gf
      tags:
      - openshiftClusterID=ostest-858gf
      trunk: true
      userDataSecret:
        name: master-user-data

2. Remove failed etcd member (ostest-858gf-master-2):
$ oc rsh -n openshift-etcd etcd-ostest-858gf-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-858gf-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| ce181a6303f59023 | started | ostest-858gf-master-0 | https://10.196.3.229:2380 | https://10.196.3.229:2379 |      false |
| daab8b22de58ce9d | started | ostest-858gf-master-2 |  https://10.196.2.78:2380 |  https://10.196.2.78:2379 |      false |
| e945b77b066c2312 | started | ostest-858gf-master-1 | https://10.196.0.178:2380 | https://10.196.0.178:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
sh-4.4# etcdctl member remove daab8b22de58ce9d     
sh-4.4# exit
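
Rather than copying the member ID by hand, the lookup could be sketched as below. The helper name etcd_member_id and the file name members.txt are hypothetical; the sketch only assumes the table layout printed by etcdctl member list -w table above.

```shell
# Hypothetical helper: read `etcdctl member list -w table` output on stdin
# and print the ID of the member whose NAME column equals $1.
etcd_member_id() {
    awk -F'|' -v name="$1" '
        NF >= 4 {
            id = $2; member = $4
            gsub(/ /, "", id); gsub(/ /, "", member)
            if (member == name) print id
        }'
}

# Usage (assumes the table was saved locally as members.txt):
#   etcd_member_id ostest-858gf-master-2 < members.txt
```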


3. Remove secrets from failed master:
$ oc get secrets -n openshift-etcd | grep ostest-858gf-master-2
etcd-peer-ostest-858gf-master-2              kubernetes.io/tls                     2      6h7m
etcd-serving-metrics-ostest-858gf-master-2   kubernetes.io/tls                     2      6h7m
etcd-serving-ostest-858gf-master-2           kubernetes.io/tls                     2      6h7m

$ oc delete secret -n openshift-etcd etcd-peer-ostest-858gf-master-2 etcd-serving-metrics-ostest-858gf-master-2 etcd-serving-ostest-858gf-master-2
secret "etcd-peer-ostest-858gf-master-2" deleted
secret "etcd-serving-metrics-ostest-858gf-master-2" deleted
secret "etcd-serving-ostest-858gf-master-2" deleted
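
The three secret names above follow a fixed prefix plus the node name, so they could be generated instead of typed. This is a sketch only; the helper name etcd_secret_names is hypothetical and the prefixes are taken from the listing above.

```shell
# Hypothetical helper: print the three per-node etcd secret names for the
# node given as $1, using the prefixes shown in the listing above.
etcd_secret_names() {
    node=$1
    for prefix in etcd-peer etcd-serving-metrics etcd-serving; do
        printf '%s-%s\n' "$prefix" "$node"
    done
}

# Usage (word splitting of the command substitution is intentional):
#   oc delete secret -n openshift-etcd $(etcd_secret_names ostest-858gf-master-2)
```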

4. Destroy the failed master and create a new one:
$ oc apply -f new-master-machine.yaml && oc delete machine -n openshift-machine-api ostest-858gf-master-2

$ oc get machines -A
NAMESPACE               NAME                          PHASE         TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-858gf-master-0         Running       m4.xlarge   regionOne   nova   6h19m
openshift-machine-api   ostest-858gf-master-1         Running       m4.xlarge   regionOne   nova   6h19m
openshift-machine-api   ostest-858gf-master-2         Deleting      m4.xlarge   regionOne   nova   6h19m
openshift-machine-api   ostest-858gf-master-3         Provisioned   m4.xlarge   regionOne   nova   112s
openshift-machine-api   ostest-858gf-worker-0-9pgwp   Running       m4.xlarge   regionOne   nova   6h7m
openshift-machine-api   ostest-858gf-worker-0-qtc8n   Running       m4.xlarge   regionOne   nova   6h7m
openshift-machine-api   ostest-858gf-worker-0-w6psd   Running       m4.xlarge   regionOne   nova   6h7m

$ openstack port list | grep master
| 0a1bd5ad-0fb4-405a-af2f-3a2e83acb789 | ostest-858gf-master-port-0                           | fa:16:3e:a6:af:77 | ip_address='10.196.3.229', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |
| 1c3eee82-e07d-48e7-83e0-4c5be72218d5 | ostest-858gf-master-port-1                           | fa:16:3e:fa:b1:b5 | ip_address='10.196.0.178', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |
| 86e14de6-e6e7-436d-a33d-161c7f18e8b5 | ostest-858gf-master-3                                | fa:16:3e:38:3d:6a | ip_address='10.196.0.204', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |
| e613411e-785e-48ed-a4e5-f898b5c6fab3 | ostest-858gf-master-port-2                           | fa:16:3e:3e:1b:e9 | ip_address='10.196.2.78', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'    | DOWN   |
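
The port of the deleted master shows as DOWN in the table above. A small filter for spotting such ports might look like this; the helper name down_ports is hypothetical, and it assumes the default five-column table printed by openstack port list as shown above.

```shell
# Hypothetical helper: read `openstack port list` table output on stdin
# and print the name of every port whose Status column is DOWN.
down_ports() {
    awk -F'|' 'NF >= 6 {
        name = $3; status = $6
        gsub(/ /, "", name); gsub(/ /, "", status)
        if (status == "DOWN") print name
    }'
}

# Usage:
#   openstack port list | grep master | down_ports
```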

5. Wait until the new master is ready:

$ openstack port list | grep master
| 0a1bd5ad-0fb4-405a-af2f-3a2e83acb789 | ostest-858gf-master-port-0                           | fa:16:3e:a6:af:77 | ip_address='10.196.3.229', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |
| 1c3eee82-e07d-48e7-83e0-4c5be72218d5 | ostest-858gf-master-port-1                           | fa:16:3e:fa:b1:b5 | ip_address='10.196.0.178', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |
| 86e14de6-e6e7-436d-a33d-161c7f18e8b5 | ostest-858gf-master-3                                | fa:16:3e:38:3d:6a | ip_address='10.196.0.204', subnet_id='7bbfcc1c-247f-4d72-927a-e188c082848c'   | ACTIVE |

$ oc get nodes
NAME                          STATUS   ROLES    AGE     VERSION
ostest-858gf-master-0         Ready    master   6h31m   v1.20.0+bafe72f
ostest-858gf-master-1         Ready    master   6h30m   v1.20.0+bafe72f
ostest-858gf-master-3         Ready    master   4m40s   v1.20.0+bafe72f
ostest-858gf-worker-0-9pgwp   Ready    worker   6h11m   v1.20.0+bafe72f
ostest-858gf-worker-0-qtc8n   Ready    worker   6h10m   v1.20.0+bafe72f
ostest-858gf-worker-0-w6psd   Ready    worker   6h10m   v1.20.0+bafe72f

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-858gf-master-0         Running   m4.xlarge   regionOne   nova   6h34m
openshift-machine-api   ostest-858gf-master-1         Running   m4.xlarge   regionOne   nova   6h34m
openshift-machine-api   ostest-858gf-master-3         Running   m4.xlarge   regionOne   nova   17m
openshift-machine-api   ostest-858gf-worker-0-9pgwp   Running   m4.xlarge   regionOne   nova   6h22m
openshift-machine-api   ostest-858gf-worker-0-qtc8n   Running   m4.xlarge   regionOne   nova   6h22m
openshift-machine-api   ostest-858gf-worker-0-w6psd   Running   m4.xlarge   regionOne   nova   6h22m
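
The wait in this step was done by re-running the commands by hand. A sketch of how it could be automated follows; the helper name machines_not_running and the 30-second interval are assumptions, and the column positions are taken from the oc get machines -A output above.

```shell
# Hypothetical helper: read `oc get machines -A` output on stdin and print
# "NAME PHASE" for every machine whose PHASE column is not Running.
machines_not_running() {
    awk 'NR > 1 && NF >= 3 && $3 != "Running" { print $2, $3 }'
}

# Usage: poll until every machine reports Running (interval is arbitrary).
#   while [ -n "$(oc get machines -A | machines_not_running)" ]; do
#       sleep 30
#   done
```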


$ oc rsh -n openshift-etcd etcd-ostest-858gf-master-0
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 325baa0d738d4617 | started | ostest-858gf-master-0 | https://10.196.3.229:2380 | https://10.196.3.229:2379 |      false |
| 4482884f4b163114 | started | ostest-858gf-master-3 | https://10.196.0.204:2380 | https://10.196.0.204:2379 |      false |
| e945b77b066c2312 | started | ostest-858gf-master-1 | https://10.196.0.178:2380 | https://10.196.0.178:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
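
As a final check, the table above can be reduced to a count of started, non-learner members, which should be 3 once the replacement has joined. This is a sketch; the helper name etcd_started_voters and the file name members.txt are hypothetical, and the column order is assumed to match the table above.

```shell
# Hypothetical helper: read `etcdctl member list -w table` output on stdin
# and print how many members are "started" and not learners.
etcd_started_voters() {
    awk -F'|' 'NF >= 7 {
        status = $3; learner = $7
        gsub(/ /, "", status); gsub(/ /, "", learner)
        if (status == "started" && learner == "false") count++
    }
    END { print count + 0 }'
}

# Usage (assumes the table was saved locally as members.txt):
#   [ "$(etcd_started_voters < members.txt)" -eq 3 ] && echo "etcd has 3 voting members"
```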

Comment 5 errata-xmlrpc 2021-03-25 01:53:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.3 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0821

