Description of problem:

Following the procedure to replace an unhealthy etcd member [1], the master node was stuck on deletion, with an issue similar to BZ [2]. The master node has now been removed from the OCP cluster. `etcdctl member list` shows only 2 members, but `etcdctl endpoint status` still reports an error for the removed node:

~~~
Failed to get the status of endpoint https://10.0.2.221:2379 (context deadline exceeded)
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.148:2379 | b5a514c4e3b5e34a |   3.4.9 |  211 MB |     false |      false |       558 |  181312010 |          181312010 |        |
|  https://10.0.0.18:2379 | eed24c665c829821 |   3.4.9 |  211 MB |      true |      false |       558 |  181312011 |          181312011 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
~~~

The removed node's IP also still shows up in `oc get ep`:

~~~
$ oc get ep -n openshift-etcd
NAME          ENDPOINTS                                                    AGE
etcd          10.0.0.148:2379,10.0.0.148:9979,10.0.0.18:2379 + 1 more...   24d
host-etcd     192.0.2.200:2379,192.0.2.2:2379,192.0.2.3:2379 + 1 more...   24d
host-etcd-2   10.0.2.221:2379,10.0.0.148:2379,10.0.0.18:2379               24d
~~~

The etcd operator pod, among others, is not running properly:

~~~
openshift-authentication-operator   authentication-operator-5d644ccc49-s4mnm   0/1   Running   299   3d      10.128.78.228   rhocp-2tt9k-master-3
openshift-cloud-credential-operator   cloud-credential-operator-797b4d57fb-zvf4d   0/1   Running   117   9h55m   10.128.25.173   rhocp-2tt9k-master-4
openshift-machine-api   cluster-autoscaler-operator-56ddf77dd5-8b6cg   1/2   Running   119   10h     10.128.15.236   rhocp-2tt9k-master-3
openshift-machine-api   machine-api-controllers-6644844f59-g96lg   0/4   Running   120   10h     10.128.15.106   rhocp-2tt9k-master-4
openshift-config-operator   openshift-config-operator-68f5db698-bht7f   0/1   Running   583   1d      10.128.60.23    rhocp-2tt9k-master-4
openshift-console-operator   console-operator-7755f96bfb-wnsjg   0/1   Running   397   20h     10.128.110.55   rhocp-2tt9k-master-4
openshift-cluster-storage-operator   csi-snapshot-controller-operator-554d97c5df-k2t7m   0/1   Running   291   1d      10.128.72.71    rhocp-2tt9k-master-4
openshift-etcd-operator   etcd-operator-684f5779d7-g7hb6   0/1   Running   300   3d      10.128.26.241   rhocp-2tt9k-master-3
openshift-kube-apiserver-operator   kube-apiserver-operator-6cfdc9fcd6-jjwzq   0/1   Running   27    2h12m   10.128.85.171   rhocp-2tt9k-master-4
openshift-kube-controller-manager-operator   kube-controller-manager-operator-594bb8c779-8l5l8   0/1   Running   301   3d      10.128.81.244   rhocp-2tt9k-master-3
openshift-kube-scheduler-operator   openshift-kube-scheduler-operator-7655447b94-fq7rf   0/1   Running   303   3d      10.128.62.31    rhocp-2tt9k-master-3
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-6fdb78c7c4-mjv9s   0/1   Running   301   3d      10.128.45.243   rhocp-2tt9k-master-3
openshift-apiserver-operator   openshift-apiserver-operator-77c7d9c89c-lms26   0/1   Running   301   3d      10.128.28.197   rhocp-2tt9k-master-3
openshift-controller-manager-operator   openshift-controller-manager-operator-647b648c9d-tnf8v   0/1   Running   302   3d      10.128.77.174   rhocp-2tt9k-master-3
openshift-operator-lifecycle-manager   packageserver-bd7956f6c-hhhvj   0/1   Running   462   3d      10.128.11.145   rhocp-2tt9k-master-4
openshift-operator-lifecycle-manager   packageserver-bd7956f6c-hpkdb   0/1   Running   467   3d      10.128.10.162   rhocp-2tt9k-master-3
openshift-service-ca-operator   service-ca-operator-b8845c4f6-4fcwk   0/1   Running   304   3d      10.128.40.60    rhocp-2tt9k-master-3
openshift-service-ca   service-ca-7c9f744c76-hzv27   0/1   Running   303   3d      10.128.98.82    rhocp-2tt9k-master-3
~~~

The logs from the etcd-operator:

~~~
[...]
2021-02-26T06:45:17.668169264Z I0226 06:45:17.668148 1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/openshift-etcd-operator/pods/etcd-operator-684f5779d7-g7hb6
[...]
2021-02-26T06:45:26.676399002Z W0226 06:45:26.676264 1 builder.go:206] unable to get owner reference (falling back to namespace): an error on the server ("") has prevented the request from succeeding (get pods etcd-operator-684f5779d7-g7hb6)
2021-02-26T06:45:26.676399002Z I0226 06:45:26.676393 1 builder.go:233] openshift-cluster-etcd-operator version v0.0.0-alpha.0-500-g0aa652d2-0aa652d2bbedbed5a3ced4a9ad806e6b18e19137
2021-02-26T06:45:26.677013024Z I0226 06:45:26.676988 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
2021-02-26T06:45:26.886336941Z I0226 06:45:26.886231 1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:34.895401748Z I0226 06:45:34.895363 1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:35.89688721Z I0226 06:45:35.896830 1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:36.899388254Z I0226 06:45:36.899254 1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:44.908993366Z I0226 06:45:44.908950 1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:45.910359607Z I0226 06:45:45.910249 1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:46.899511172Z I0226 06:45:46.899468 1 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
[...]
2021-02-26T06:45:54.907458919Z I0226 06:45:54.907351 1 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
2021-02-26T06:45:55.908861359Z I0226 06:45:55.908750 1 server.go:48] Error initializing delegating authentication (will retry): <nil>
2021-02-26T06:45:56.883110591Z F0226 06:45:56.882987 1 cmd.go:125] unable to load configmap based request-header-client-ca-file: an error on the server ("") has prevented the request from succeeding (get configmaps extension-apiserver-authentication)
~~~

Version-Release number of selected component (if applicable):
OCP 4.5.15

[1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-unhealthy-etcd-member
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1856270
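For anyone hitting the same symptom, a quick way to confirm that the removed member's IP is still referenced by the endpoints objects is a jsonpath query against the openshift-etcd namespace. A minimal sketch only; the endpoint names are the ones shown above and may differ on other clusters:

~~~
# List the addresses each endpoints object still advertises; the removed
# member's IP (10.0.2.221 in this case) appearing under host-etcd-2 confirms it is stale.
for ep in etcd host-etcd host-etcd-2; do
  echo "== ${ep} =="
  oc -n openshift-etcd get ep "${ep}" -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
done
~~~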
There are several pods with the same error: 'an error on the server ("") has prevented the request from succeeding':

~~~
$ omg logs -n openshift-authentication-operator authentication-operator-5d644ccc49-s4mnm
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-authentication-operator/pods/authentication-operator-5d644ccc49-s4mnm/operator/operator/logs/current.log
2021-02-26T06:41:19.642972796Z Copying system trust bundle
2021-02-26T06:41:19.687685624Z I0226 06:41:19.687549 1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:19.687832982Z I0226 06:41:19.687812 1 cmd.go:196] Using service-serving-cert provided certificates
2021-02-26T06:41:19.688272441Z I0226 06:41:19.688235 1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:28.701796199Z W0226 06:41:28.701738 1 builder.go:206] unable to get owner reference (falling back to namespace): an error on the server ("") has prevented the request from succeeding (get pods authentication-operator-5d644ccc49-s4mnm)
2021-02-26T06:41:28.70190845Z I0226 06:41:28.701895 1 builder.go:233] cluster-authentication-operator version v4.5.0-202010081312.p0-0-g3759b76-
2021-02-26T06:41:28.703646734Z I0226 06:41:28.703599 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
2021-02-26T06:41:59.071053223Z F0226 06:41:59.070968 1 cmd.go:125] unable to load configmap based request-header-client-ca-file: an error on the server ("") has prevented the request from succeeding (get configmaps extension-apiserver-authentication)
~~~

~~~
$ oc logs -n openshift-cloud-credential-operator cloud-credential-operator-797b4d57fb-zvf4d
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-cloud-credential-operator/pods/cloud-credential-operator-797b4d57fb-zvf4d/cloud-credential-operator/cloud-credential-operator/logs/current.log
2021-02-26T06:42:20.510958266Z Copying system trust bundle
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=debug msg="debug logging enabled"
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=info msg="setting up client for manager"
2021-02-26T06:42:20.532517877Z time="2021-02-26T06:42:20Z" level=info msg="setting up manager"
2021-02-26T06:42:29.553114581Z time="2021-02-26T06:42:29Z" level=fatal msg="unable to set up overall controller manager" error="an error on the server (\"\") has prevented the request from succeeding"
~~~

~~~
$ oc logs -n openshift-machine-api -c machine-controller machine-api-controllers-6644844f59-g96lg
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-machine-api/pods/machine-api-controllers-6644844f59-g96lg/machine-controller/machine-controller/logs/current.log
2021-02-26T06:45:11.732990015Z I0226 06:45:11.732877 1 main.go:87] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
2021-02-26T06:45:20.743109611Z F0226 06:45:20.743073 1 main.go:92] an error on the server ("") has prevented the request from succeeding
~~~
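To get the full list of affected components rather than spot-checking individual pods, a rough sketch is to search the extracted inspect archive for the error string and report namespace/pod pairs. This assumes the same directory layout as the paths quoted above and is meant to be run from inside the extracted inspect.local.* directory:

~~~
# List every namespace/pod whose current.log contains the API server error.
grep -l 'an error on the server ("") has prevented the request from succeeding' \
  namespaces/*/pods/*/*/*/logs/current.log \
  | awk -F/ '{ print $2, $4 }' | sort -u
~~~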
Update from the etcd side.

# (03:13:41) The etcd cluster observes the loss of peer 10.0.2.221 with "no route to host". This is concerning, as this is the host network.

> 2021-02-25T03:13:41.137464256Z 2021-02-25 03:13:41.137414 W | rafthttp: health check for peer f16689fea56f63e6 could not connect: dial tcp 10.0.2.221:2380: connect: no route to host

# (03:42:43) The etcd cluster is scaled down by the admin.

> 2021-02-25T03:42:43.498246739Z 2021-02-25 03:42:43.498221 I | etcdserver/membership: removed member f16689fea56f63e6 from cluster b34163542f143c27

From here on out etcd is generally happy on master-{3,4}. A review of the kubelet logs shows no failures from the etcd-quorum-guard readiness checks:

~~~
$ cat ./kubelet_service.log | awk '/Readiness probe for \"etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $11, $13, $14; }'
~~~

We do see that the endpoints are stale, but this should not cause disruption. The etcd client balancer will complain that a subconn is down, as we see in the kube-apiserver logs:

> 2021-02-22T11:55:55.638668206Z W0222 11:55:55.638608 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.2.221:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.2.221:2379: connect: connection refused". Reconnecting...

However, the etcd client balancer fails over instantly, round-robin, to the next active subconn on another endpoint, so even with only 2 of the 4 configured endpoints valid this should not cause disruption.

> 2021-02-22T11:38:25.651695013Z I0222 11:38:25.651689 1 flags.go:33] FLAG: --etcd-servers="[https://10.0.0.148:2379,https://10.0.0.18:2379,https://10.0.2.221:2379,https://localhost:2379]"

My understanding is that if the kubelet fails the healthz check against kube-apiserver (quorum loss) for ~2m, it will restart it.

# kas starts 2021-02-22T11:38:25
# logs end 2021-02-25T22:03:52

More evidence that kube-apiserver can talk to etcd: we see compaction requests every 5m in the etcd log, as expected. The compaction RPC request originates from the kube-apiserver (kas).
~~~
$ grep -rn 'finished scheduled compaction' openshift-etcd/pods/etcd-rhocp-2tt9k-master-3/etcd/etcd/logs/current.log | wc -l
227
~~~

I did see TLS errors in the kube-apiserver logs:

> 2021-02-22T15:59:36.020367947Z I0222 15:59:36.020323 1 log.go:172] http: TLS handshake error from 10.0.2.221:50868: remote error: tls: bad certificate
> 2021-02-25T08:50:39.961900648Z I0225 08:50:39.961837 1 log.go:172] http: TLS handshake error from 10.0.0.148:40214: remote error: tls: bad certificate

When I look at openshift-apiserver I am seeing a ton of:

> 2021-02-25T22:12:22.557366311Z E0225 22:12:22.557294 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557389706Z E0225 22:12:22.557370 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557435275Z E0225 22:12:22.557419 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557484701Z E0225 22:12:22.557466 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.55750078Z E0225 22:12:22.557492 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557543857Z E0225 22:12:22.557534 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF

These issues could be the result of an upstream bug [1]. I am going to move this to the apiserver team to dig into this more and validate.

[1] https://github.com/kubernetes/kubernetes/pull/95725
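To see when those EOF bursts happen (and whether they line up with the later aggregation-layer findings), one option is to bucket the openshift-apiserver errors by minute. A sketch only; the log path is assumed to mirror the layout of the other must-gather paths quoted in this bug:

~~~
# Count webhook authorizer EOF errors per minute (first 16 chars of the timestamp = YYYY-MM-DDTHH:MM).
grep -h 'subjectaccessreviews: EOF' \
  namespaces/openshift-apiserver/pods/*/openshift-apiserver/openshift-apiserver/logs/current.log \
  | cut -c1-16 | sort | uniq -c | sort -rn | head -20
~~~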
> Must gather (25/02/2021 23:27 UTC) -> https://attachments.access.redhat.com/hydra/rest/cases/02878649/attachments/f01edfd1-5513-44ec-977f-7f8a69d2ef0b?usePresignedUrl=true

I downloaded the must-gather and did a grep to see if there is evidence of the panic from the aggregation layer:

> grep -rni "timeout.go:" home/stack/02878649-New-must-gather/inspect.local.8562484343804519940/namespaces/openshift-kube-apiserver/*
>
> 1 timeout.go:132] net/http: abort Handler

I saw 10 of these panics in this capture. Please note that these panics happening over time can trigger priority and fairness (P&F) to reject requests once the concurrency limit has been reached. If we execute the following Prometheus queries over a 48-hour window, I expect to see a steady rise in the number of currently executing requests:

> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))

We have asked the customer to apply 'Workaround B' from https://bugzilla.redhat.com/show_bug.cgi?id=1908383#c19. This should stabilize the cluster for now. Upgrading to 4.5.30 should resolve the issue.
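If the web console is not usable, the same queries can be run against the cluster's Thanos querier route via the Prometheus HTTP API. A minimal sketch, assuming a logged-in `oc` session with monitoring-view access and the standard `thanos-querier` route in openshift-monitoring:

~~~
# Run one of the P&F queries through the Thanos querier route.
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
curl -sk -G "https://${HOST}/api/v1/query" \
  -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))'
# For the 48-hour trend, use /api/v1/query_range with start=, end= and step= parameters instead.
~~~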
Verified on OCP 4.8.0-0.nightly-2021-03-08-092651 over OSP 16.1 (RHOS-16.1-RHEL-8-20201214.n.3) using OVN-Octavia. The new master is successfully created with a different port name:

~~~
(shiftstack) [stack@undercloud-0 ~]$ openstack port list -c Name -f value | grep master
ostest-q9d4h-master-port-0
ostest-q9d4h-master-port-1
ostest-q9d4h-master-3
~~~

Procedure: Replacing an unhealthy etcd member whose machine is not running or whose node is not ready.

1. Remove the unhealthy member:

~~~
oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0    # connect to a healthy member and remove the failing one
etcdctl member list -w table
sh-4.4# etcdctl member remove 29c283237c03d0a          # remove the ostest-q9d4h-master-2 member
~~~

2. Remove the old secrets for the unhealthy etcd member that was removed:

~~~
oc get secrets -n openshift-etcd | grep ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-peer-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-metrics-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-ostest-q9d4h-master-2
~~~

3. Delete and recreate the master machine. After this machine is recreated, a new revision is forced and etcd scales up automatically:

~~~
oc get machines -n openshift-machine-api -o wide
oc get machine ostest-q9d4h-master-0 -n openshift-machine-api -o yaml > new-master-machine.yaml
~~~

Edit new-master-machine.yaml: remove the entire status section and the annotations, and change the name field to a new name (ostest-q9d4h-master-3). A scripted variant of this edit is sketched after the member list below.

~~~
oc delete machine -n openshift-machine-api ostest-q9d4h-master-2    # ~15 minutes
oc apply -f new-master-machine.yaml                                 # API unstable, ~25 minutes
~~~

~~~
(shiftstack) [stack@undercloud-0 ~]$ oc get machines -n openshift-machine-api -o wide
NAME                          PHASE     TYPE        REGION   ZONE   AGE    NODE                          PROVIDERID   STATE
ostest-q9d4h-master-0         Running   m4.xlarge            nova   155m   ostest-q9d4h-master-0                      ACTIVE
ostest-q9d4h-master-1         Running   m4.xlarge            nova   155m   ostest-q9d4h-master-1                      ACTIVE
ostest-q9d4h-master-3         Running   m4.xlarge            nova   43m    ostest-q9d4h-master-3                      ACTIVE
ostest-q9d4h-worker-0-2tmsp   Running   m4.xlarge            nova   143m   ostest-q9d4h-worker-0-2tmsp                ACTIVE
ostest-q9d4h-worker-0-lcww5   Running   m4.xlarge            nova   143m   ostest-q9d4h-worker-0-lcww5                ACTIVE
ostest-q9d4h-worker-0-t49h4   Running   m4.xlarge            nova   12m    ostest-q9d4h-worker-0-t49h4                ACTIVE
~~~

4. Verify that all etcd pods are running properly:

~~~
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
etcd-ostest-q9d4h-master-0   3/3   Running   0   16m
etcd-ostest-q9d4h-master-1   2/3   Running   0   20s
etcd-ostest-q9d4h-master-3   3/3   Running   0   94s
~~~

~~~
(shiftstack) [stack@undercloud-0 ~]$ oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ostest-q9d4h-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 84f538fbc1296514 | started | ostest-q9d4h-master-3 | https://10.196.0.150:2380 | https://10.196.0.150:2379 | false      |
| b560e0985124a29c | started | ostest-q9d4h-master-1 | https://10.196.1.146:2380 | https://10.196.1.146:2379 | false      |
| f9a4b687127f5769 | started | ostest-q9d4h-master-0 | https://10.196.0.18:2380  | https://10.196.0.18:2379  | false      |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
~~~
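For reference, the manual YAML edit in step 3 can also be scripted. A sketch using jq; the extra metadata fields removed here and the dropping of spec.providerID (so the machine API provisions a fresh instance) go beyond the manual steps above and are assumptions, so review the generated manifest before applying it:

~~~
# Clone an existing master Machine into a manifest for the replacement master:
# strip status, annotations and server-generated metadata, and set the new name.
oc -n openshift-machine-api get machine ostest-q9d4h-master-0 -o json \
  | jq 'del(.status, .metadata.annotations, .metadata.uid, .metadata.resourceVersion,
            .metadata.creationTimestamp, .metadata.generation, .metadata.managedFields,
            .spec.providerID)
        | .metadata.name = "ostest-q9d4h-master-3"' \
  > new-master-machine.json
oc apply -f new-master-machine.json
~~~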
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438