Bug 1933269 - Cluster unstable replacing an unhealthy etcd member
Summary: Cluster unstable replacing an unhealthy etcd member
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Maysa Macedo
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks: 1935473
 
Reported: 2021-02-26 11:58 UTC by oarribas
Modified: 2021-07-27 22:48 UTC
CC: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Neutron ports created for OpenStack master VMs that are recreated as a day-2 operation have a different name pattern than ports created on day 1. Consequence: Kuryr was not able to detect the new Neutron port and create the corresponding load balancer member for the default kubernetes load balancer. Fix: The extra naming pattern was added to Kuryr's detection mechanism. Result: Load balancer members correspond to the actual addresses of the current master VMs, and cluster operators are available after the masters are recreated.
Clone Of:
Clones: 1935473
Environment:
Last Closed: 2021-07-27 22:48:28 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift/cluster-network-operator pull 999 (open): Bug 1933269: Include LB members for Machines created on day-2 operation (last updated 2021-03-03 16:58:30 UTC)
- Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:48:48 UTC)

Description oarribas 2021-02-26 11:58:10 UTC
Description of problem:

Following the procedure to replace an unhealthy etcd member [1], the master node was stuck on deletion, with an issue similar to BZ [2].

Now, the master node has been removed from the OCP cluster. `etcdctl member list` shows only 2 members, but `etcdctl endpoint status` still reports an error for the removed node:

~~~
Failed to get the status of endpoint https://10.0.2.221:2379 (context deadline exceeded)
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.148:2379 | b5a514c4e3b5e34a |   3.4.9 |  211 MB |     false |      false |       558 |  181312010 |          181312010 |        |
|  https://10.0.0.18:2379 | eed24c665c829821 |   3.4.9 |  211 MB |      true |      false |       558 |  181312011 |          181312011 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
~~~

The removed node's address also still shows up in `oc get ep`:
~~~
$ oc get ep -n openshift-etcd
NAME         ENDPOINTS                                                   AGE
etcd         10.0.0.148:2379,10.0.0.148:9979,10.0.0.18:2379 + 1 more...  24d
host-etcd    192.0.2.200:2379,192.0.2.2:2379,192.0.2.3:2379 + 1 more...  24d
host-etcd-2  10.0.2.221:2379,10.0.0.148:2379,10.0.0.18:2379              24d
~~~
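
For reference, a minimal sketch of how the above can be gathered from a healthy etcd pod (pod name as seen on this cluster; the default container is etcdctl, so the environment is already set up):
~~~
$ oc rsh -n openshift-etcd etcd-rhocp-2tt9k-master-3
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl endpoint status -w table
sh-4.4# etcdctl endpoint health
~~~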

The etcd-operator pod, among other operator pods, is not running properly:
~~~
openshift-authentication-operator                 authentication-operator-5d644ccc49-s4mnm                 0/1    Running    299       3d     10.128.78.228   rhocp-2tt9k-master-3
openshift-cloud-credential-operator               cloud-credential-operator-797b4d57fb-zvf4d               0/1    Running    117       9h55m  10.128.25.173   rhocp-2tt9k-master-4
openshift-machine-api                             cluster-autoscaler-operator-56ddf77dd5-8b6cg             1/2    Running    119       10h    10.128.15.236   rhocp-2tt9k-master-3
openshift-machine-api                             machine-api-controllers-6644844f59-g96lg                 0/4    Running    120       10h    10.128.15.106   rhocp-2tt9k-master-4
openshift-config-operator                         openshift-config-operator-68f5db698-bht7f                0/1    Running    583       1d     10.128.60.23    rhocp-2tt9k-master-4
openshift-console-operator                        console-operator-7755f96bfb-wnsjg                        0/1    Running    397       20h    10.128.110.55   rhocp-2tt9k-master-4
openshift-cluster-storage-operator                csi-snapshot-controller-operator-554d97c5df-k2t7m        0/1    Running    291       1d     10.128.72.71    rhocp-2tt9k-master-4
openshift-etcd-operator                           etcd-operator-684f5779d7-g7hb6                           0/1    Running    300       3d     10.128.26.241   rhocp-2tt9k-master-3
openshift-kube-apiserver-operator                 kube-apiserver-operator-6cfdc9fcd6-jjwzq                 0/1    Running    27        2h12m  10.128.85.171   rhocp-2tt9k-master-4
openshift-kube-controller-manager-operator        kube-controller-manager-operator-594bb8c779-8l5l8        0/1    Running    301       3d     10.128.81.244   rhocp-2tt9k-master-3
openshift-kube-scheduler-operator                 openshift-kube-scheduler-operator-7655447b94-fq7rf       0/1    Running    303       3d     10.128.62.31    rhocp-2tt9k-master-3
openshift-kube-storage-version-migrator-operator  kube-storage-version-migrator-operator-6fdb78c7c4-mjv9s  0/1    Running    301       3d     10.128.45.243   rhocp-2tt9k-master-3
openshift-apiserver-operator                      openshift-apiserver-operator-77c7d9c89c-lms26            0/1    Running    301       3d     10.128.28.197   rhocp-2tt9k-master-3
openshift-controller-manager-operator             openshift-controller-manager-operator-647b648c9d-tnf8v   0/1    Running    302       3d     10.128.77.174   rhocp-2tt9k-master-3
openshift-operator-lifecycle-manager              packageserver-bd7956f6c-hhhvj                            0/1    Running    462       3d     10.128.11.145   rhocp-2tt9k-master-4
openshift-operator-lifecycle-manager              packageserver-bd7956f6c-hpkdb                            0/1    Running    467       3d     10.128.10.162   rhocp-2tt9k-master-3
openshift-service-ca-operator                     service-ca-operator-b8845c4f6-4fcwk                      0/1    Running    304       3d     10.128.40.60    rhocp-2tt9k-master-3
openshift-service-ca                              service-ca-7c9f744c76-hzv27                              0/1    Running    303       3d     10.128.98.82    rhocp-2tt9k-master-3
~~~
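
A rough way to pull out this kind of listing (sketch only; it simply prints pods whose READY count is below the declared container count):
~~~
$ oc get pods -A -o wide --no-headers | awk '{split($3, r, "/"); if (r[1] != r[2]) print}'
~~~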


The logs from the etcd-operator:
~~~
[...]
2021-02-26T06:45:17.668169264Z I0226 06:45:17.668148       1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/openshift-etcd-operator/pods/etcd-operator-684f5779d7-g7hb6
[...]
2021-02-26T06:45:26.676399002Z W0226 06:45:26.676264       1 builder.go:206] unable to get owner reference (falling back to namespace): an error on the server ("") has prevented the request from succeeding (get pods etcd-operator-684f5779d7-g7hb6)
2021-02-26T06:45:26.676399002Z I0226 06:45:26.676393       1 builder.go:233] openshift-cluster-etcd-operator version v0.0.0-alpha.0-500-g0aa652d2-0aa652d2bbedbed5a3ced4a9ad806e6b18e19137
2021-02-26T06:45:26.677013024Z I0226 06:45:26.676988       1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
2021-02-26T06:45:26.886336941Z I0226 06:45:26.886231       1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:34.895401748Z I0226 06:45:34.895363       1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:35.89688721Z I0226 06:45:35.896830       1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:36.899388254Z I0226 06:45:36.899254       1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:44.908993366Z I0226 06:45:44.908950       1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:45.910359607Z I0226 06:45:45.910249       1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:46.899511172Z I0226 06:45:46.899468       1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:54.907458919Z I0226 06:45:54.907351       1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:55.908861359Z I0226 06:45:55.908750       1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:56.883110591Z F0226 06:45:56.882987       1 cmd.go:125] unable to load configmap based request-header-client-ca-file: an error on the server ("") has prevented the request from succeeding (get configmaps extension-apiserver-authentication)
~~~
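
The fatal line points at the kube-system/extension-apiserver-authentication configmap; a minimal sanity-check sketch, assuming the API is reachable at all:
~~~
# Is the configmap the operators keep failing to read actually there?
$ oc -n kube-system get configmap extension-apiserver-authentication -o name
# Which endpoints back the 172.30.0.1:443 service the operators are calling?
$ oc -n default get endpoints kubernetes
~~~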



Version-Release number of selected component (if applicable):

OCP 4.5.15




[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-unhealthy-etcd-member
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1856270

Comment 3 oarribas 2021-02-26 13:11:55 UTC
There are several pods with the same error: 'an error on the server ("") has prevented the request from succeeding':

~~~
$ omg logs -n openshift-authentication-operator authentication-operator-5d644ccc49-s4mnm                                                                                 
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-authentication-operator/pods/authentication-operator-5d644ccc49-s4mnm/operator/operator/logs/current.log
2021-02-26T06:41:19.642972796Z Copying system trust bundle
2021-02-26T06:41:19.687685624Z I0226 06:41:19.687549       1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:19.687832982Z I0226 06:41:19.687812       1 cmd.go:196] Using service-serving-cert provided certificates
2021-02-26T06:41:19.688272441Z I0226 06:41:19.688235       1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:28.701796199Z W0226 06:41:28.701738       1 builder.go:206] unable to get owner reference (falling back to namespace): an error on the server ("") has prevented the request from succeeding (get pods authentication-operator-5d644ccc49-s4mnm)
2021-02-26T06:41:28.70190845Z I0226 06:41:28.701895       1 builder.go:233] cluster-authentication-operator version v4.5.0-202010081312.p0-0-g3759b76-
2021-02-26T06:41:28.703646734Z I0226 06:41:28.703599       1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
2021-02-26T06:41:59.071053223Z F0226 06:41:59.070968       1 cmd.go:125] unable to load configmap based request-header-client-ca-file: an error on the server ("") has prevented the request from succeeding (get configmaps extension-apiserver-authentication)
~~~

~~~
$ oc logs -n openshift-cloud-credential-operator cloud-credential-operator-797b4d57fb-zvf4d
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-cloud-credential-operator/pods/cloud-credential-operator-797b4d57fb-zvf4d/cloud-credential-operator/cloud-credential-operator/logs/current.log
2021-02-26T06:42:20.510958266Z Copying system trust bundle
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=debug msg="debug logging enabled"
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=info msg="setting up client for manager"
2021-02-26T06:42:20.532517877Z time="2021-02-26T06:42:20Z" level=info msg="setting up manager"
2021-02-26T06:42:29.553114581Z time="2021-02-26T06:42:29Z" level=fatal msg="unable to set up overall controller manager" error="an error on the server (\"\") has prevented the request from succeeding"
~~~

~~~
$ oc logs -n openshift-machine-api -c machine-controller machine-api-controllers-6644844f59-g96lg
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-machine-api/pods/machine-api-controllers-6644844f59-g96lg/machine-controller/machine-controller/logs/current.log
2021-02-26T06:45:11.732990015Z I0226 06:45:11.732877       1 main.go:87] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
2021-02-26T06:45:20.743109611Z F0226 06:45:20.743073       1 main.go:92] an error on the server ("") has prevented the request from succeeding
~~~
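
To enumerate every pod hitting this error in the must-gather, something like the following works (run from the extracted archive; the path layout matches the excerpts above):
~~~
$ grep -l 'an error on the server ("") has prevented the request from succeeding' \
    inspect.local.*/namespaces/*/pods/*/*/*/logs/current.log
~~~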

Comment 5 Sam Batschelet 2021-02-26 15:18:17 UTC
Update from the etcd side.


# (03:13:41) etcd cluster observes the loss of peer 10.0.2.221 with "no route to host". This is concerning as this is host network.
> 2021-02-25T03:13:41.137464256Z 2021-02-25 03:13:41.137414 W | rafthttp: health check for peer f16689fea56f63e6 could not connect: dial tcp 10.0.2.221:2380: connect: no route to host


# (03:42:43) etcd cluster is scaled down by admin.
> 2021-02-25T03:42:43.498246739Z 2021-02-25 03:42:43.498221 I | etcdserver/membership: removed member f16689fea56f63e6 from cluster b34163542f143c27

From here on out, etcd is generally happy on master-{3,4}; a review of the kubelet logs shows no failures from the quorum-guard health checks.


~~~
$ cat  ./kubelet_service.log | awk '/Readiness probe for \"etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $11, $13, $14; }'
~~~

We do see that the endpoints are stale, but this should not cause disruption. The etcd client balancer will complain that a subconn is down as we see in kube-apiserver logs.


> 2021-02-22T11:55:55.638668206Z W0222 11:55:55.638608       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.2.221:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.2.221:2379: connect: connection refused". Reconnecting...

But the etcd client balancer will fail over instantly to the next active subconn (round-robin to another endpoint); even if only 2 of the 4 endpoints are valid, this should not cause disruption.

> 2021-02-22T11:38:25.651695013Z I0222 11:38:25.651689       1 flags.go:33] FLAG: --etcd-servers="[https://10.0.0.148:2379,https://10.0.0.18:2379,https://10.0.2.221:2379,https://localhost:2379]"
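
A sketch of how to confirm both points from the must-gather (paths follow the same layout as the other excerpts; pod names are globbed):
~~~
# Configured etcd endpoints of each kube-apiserver instance:
$ grep -n 'FLAG: --etcd-servers' openshift-kube-apiserver/pods/*/kube-apiserver/kube-apiserver/logs/current.log
# How often the balancer complained about the dead subconn:
$ grep -c 'addrConn.createTransport failed to connect' openshift-kube-apiserver/pods/*/kube-apiserver/kube-apiserver/logs/current.log
~~~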

My understanding is that if the kubelet fails the healthz check against kube-apiserver (quorum loss) for ~2m(?), it will restart it.

# kas starts 
2021-02-22T11:38:25

# logs end
2021-02-25T22:03:52

More evidence that kube-apiserver can talk to etcd: we see compaction requests every 5m in the etcd log, as expected. The compaction RPC request originates from the kube-apiserver (kas).

grep -rn 'finished scheduled compaction' openshift-etcd/pods/etcd-rhocp-2tt9k-master-3/etcd/etcd/logs/current.log | wc -l
227

I did see TLS errors in the kube-apiserver logs:

> 2021-02-22T15:59:36.020367947Z I0222 15:59:36.020323       1 log.go:172] http: TLS handshake error from 10.0.2.221:50868: remote error: tls: bad certificate
> 2021-02-25T08:50:39.961900648Z I0225 08:50:39.961837       1 log.go:172] http: TLS handshake error from 10.0.0.148:40214: remote error: tls: bad certificate


When I look at openshift-apiserver, I am seeing a ton of:

>2021-02-25T22:12:22.557366311Z E0225 22:12:22.557294       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
>2021-02-25T22:12:22.557389706Z E0225 22:12:22.557370       1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
>2021-02-25T22:12:22.557435275Z E0225 22:12:22.557419       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
>2021-02-25T22:12:22.557484701Z E0225 22:12:22.557466       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
>2021-02-25T22:12:22.55750078Z E0225 22:12:22.557492       1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
>2021-02-25T22:12:22.557543857Z E0225 22:12:22.557534       1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF

These issues could be the result of an upstream bug [1]. I am going to move this to the apiserver team to dig into this more and validate.

[1]https://github.com/kubernetes/kubernetes/pull/95725

Comment 6 Abu Kashem 2021-02-26 16:05:58 UTC
> Must-gather (25/02/2021 23:27 UTC) -> https://attachments.access.redhat.com/hydra/rest/cases/02878649/attachments/f01edfd1-5513-44ec-977f-7f8a69d2ef0b?usePresignedUrl=true

I downloaded the must-gather and did a grep to see if there is any evidence of the panic from the aggregation layer:
> grep -rni "timeout.go:" home/stack/02878649-New-must-gather/inspect.local.8562484343804519940/namespaces/openshift-kube-apiserver/*
>
> 1 timeout.go:132] net/http: abort Handler

I saw 10 of these panics in this capture. Please note that these panics, happening over time, can trigger priority and fairness (p&f) to reject requests once the concurrency limit has been reached.

If we execute the following Prometheus queries over a 48-hour window, I expect to see a steady rise in the number of currently executing requests.
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))
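
One way to run these against the in-cluster Prometheus (a sketch; assumes the standard openshift-monitoring pod/container names and that curl is available in the prometheus container; for the 48-hour trend use /api/v1/query_range with start/end/step instead):
~~~
$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
    curl -s -G http://localhost:9090/api/v1/query \
    --data-urlencode 'query=topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))'
~~~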


We have asked the customer to apply 'Workaround B' from https://bugzilla.redhat.com/show_bug.cgi?id=1908383#c19.
This should stabilize the cluster for now. Upgrading to 4.5.30 should resolve the issue.

Comment 11 rlobillo 2021-03-09 15:16:28 UTC
Verified on OCP 4.8.0-0.nightly-2021-03-08-092651 over OSP 16.1 (RHOS-16.1-RHEL-8-20201214.n.3) using OVN-Octavia.

The new master is successfully created with a different port name:

(shiftstack) [stack@undercloud-0 ~]$ openstack port list -c Name -f value| grep master
ostest-q9d4h-master-port-0
ostest-q9d4h-master-port-1
ostest-q9d4h-master-3
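
With the fix, the members of the load balancer backing the default/kubernetes Service should track the new port. A sketch of how to cross-check this (the <lb-id>/<pool-id> placeholders are illustrative; look up the load balancer that fronts the Kubernetes API):
~~~
(shiftstack) $ openstack loadbalancer list -f value -c id -c name | grep -i kubernetes
(shiftstack) $ openstack loadbalancer pool list --loadbalancer <lb-id> -f value -c id -c name
(shiftstack) $ openstack loadbalancer member list <pool-id> -c name -c address -c operating_status
(shiftstack) $ openstack port list -c Name -c "Fixed IP Addresses" | grep master
~~~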


Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready:

1. Remove the unhealthy member.

oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0 # connect to a healthy one and remove the failing one.
etcdctl member list -w table
sh-4.4# etcdctl member remove 29c283237c03d0a # remove ostest-q9d4h-master-2 member
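
Quick sanity check before moving on (sketch; output omitted):
~~~
sh-4.4# etcdctl member list -w table        # should now list only the two healthy members
sh-4.4# etcdctl endpoint health --cluster   # remaining endpoints should report healthy
~~~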


2. Remove the old secrets for the unhealthy etcd member that was removed.

oc get secrets -n openshift-etcd | grep ostest-q9d4h-master-2 
oc delete secret -n openshift-etcd etcd-peer-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-metrics-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-ostest-q9d4h-master-2

3. Delete and recreate the master machine. After this machine is recreated, a new revision is forced and etcd scales up automatically.

oc get machines -n openshift-machine-api -o wide
oc get machine ostest-q9d4h-master-0 -n openshift-machine-api -o yaml > new-master-machine.yaml
    
Edit the new-master-machine.yaml: Remove the entire status section and the annotations. Change the name field to a new name (ostest-q9d4h-master-3).
	
oc delete machine -n openshift-machine-api ostest-q9d4h-master-2 (~ 15 minutes)
oc apply -f new-master-machine.yaml (API unstable. ~25 minutes)
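
For reference, the hand edit above can also be scripted; a sketch with jq (dropping resourceVersion/uid/creationTimestamp as well is a precaution, not something the documented procedure requires):
~~~
$ oc get machine ostest-q9d4h-master-0 -n openshift-machine-api -o json \
    | jq 'del(.status, .metadata.annotations, .metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)
          | .metadata.name = "ostest-q9d4h-master-3"' > new-master-machine.json
$ oc apply -f new-master-machine.json
~~~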

(shiftstack) [stack@undercloud-0 ~]$ oc get machines -n openshift-machine-api -o wide
NAME                          PHASE     TYPE        REGION   ZONE   AGE    NODE                          PROVIDERID   STATE
ostest-q9d4h-master-0         Running   m4.xlarge            nova   155m   ostest-q9d4h-master-0                      ACTIVE
ostest-q9d4h-master-1         Running   m4.xlarge            nova   155m   ostest-q9d4h-master-1                      ACTIVE
ostest-q9d4h-master-3         Running   m4.xlarge            nova   43m    ostest-q9d4h-master-3                      ACTIVE
ostest-q9d4h-worker-0-2tmsp   Running   m4.xlarge            nova   143m   ostest-q9d4h-worker-0-2tmsp                ACTIVE
ostest-q9d4h-worker-0-lcww5   Running   m4.xlarge            nova   143m   ostest-q9d4h-worker-0-lcww5                ACTIVE
ostest-q9d4h-worker-0-t49h4   Running   m4.xlarge            nova   12m    ostest-q9d4h-worker-0-t49h4                ACTIVE


4. Verify that all etcd pods are running properly:

$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
etcd-ostest-q9d4h-master-0                3/3     Running     0          16m
etcd-ostest-q9d4h-master-1                2/3     Running     0          20s
etcd-ostest-q9d4h-master-3                3/3     Running     0          94s



(shiftstack) [stack@undercloud-0 ~]$ oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-q9d4h-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 84f538fbc1296514 | started | ostest-q9d4h-master-3 | https://10.196.0.150:2380 | https://10.196.0.150:2379 |      false |
| b560e0985124a29c | started | ostest-q9d4h-master-1 | https://10.196.1.146:2380 | https://10.196.1.146:2379 |      false |
| f9a4b687127f5769 | started | ostest-q9d4h-master-0 |  https://10.196.0.18:2380 |  https://10.196.0.18:2379 |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
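
Final sanity check (sketch): confirm that no cluster operator is left unavailable or degraded and that the etcd endpoints now reflect the new member addresses.
~~~
$ oc get clusteroperators | grep -Ev 'True +False +False'   # should print only the header line
$ oc get ep -n openshift-etcd etcd
~~~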

Comment 17 errata-xmlrpc 2021-07-27 22:48:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

