Bug 1933269
Summary: | Cluster unstable replacing an unhealthy etcd member
---|---
Product: | OpenShift Container Platform | Reporter: | oarribas <oarribas>
Component: | Networking | Assignee: | Maysa Macedo <mdemaced>
Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high
Priority: | urgent | CC: | adeshpan, akashem, andeshmu, aos-bugs, aship, bleanhar, eduen, hchatter, mdemaced, mdulko, mfojtik, michjohn, openshift-bugs-escalate, ppostler, rlobillo, sbatsche, scuppett, travi
Version: | 4.5
Target Milestone: | ---
Target Release: | 4.8.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause:
Neutron ports created for OpenStack master VMs recreated as a day-2 operation have a different name pattern than ports created during the day-1 installation.
Consequence:
Kuryr was not able to detect the newly created Neutron port and create the corresponding load balancer member for the default kubernetes load balancer.
Fix:
The extra naming pattern was added to Kuryr's port detection mechanism.
Result:
Load balancer members correspond to the actual addresses of the current master VMs, and cluster operators are available after master recreation.
|
Story Points: | ---
Clone Of: |
: | 1935473 (view as bug list) | Environment: |
Last Closed: | 2021-07-27 22:48:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1935473
Description
oarribas
2021-02-26 11:58:10 UTC
There are several pods with the same error, 'an error on the server ("") has prevented the request from succeeding':

~~~
$ omg logs -n openshift-authentication-operator authentication-operator-5d644ccc49-s4mnm
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-authentication-operator/pods/authentication-operator-5d644ccc49-s4mnm/operator/operator/logs/current.log
2021-02-26T06:41:19.642972796Z Copying system trust bundle
2021-02-26T06:41:19.687685624Z I0226 06:41:19.687549 1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:19.687832982Z I0226 06:41:19.687812 1 cmd.go:196] Using service-serving-cert provided certificates
2021-02-26T06:41:19.688272441Z I0226 06:41:19.688235 1 observer_polling.go:159] Starting file observer
2021-02-26T06:41:28.701796199Z W0226 06:41:28.701738 1 builder.go:206] unable to get owner reference (falling back to namespace): an error on the server ("") has prevented the request from succeeding (get pods authentication-operator-5d644ccc49-s4mnm)
2021-02-26T06:41:28.70190845Z I0226 06:41:28.701895 1 builder.go:233] cluster-authentication-operator version v4.5.0-202010081312.p0-0-g3759b76-
2021-02-26T06:41:28.703646734Z I0226 06:41:28.703599 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
2021-02-26T06:41:59.071053223Z F0226 06:41:59.070968 1 cmd.go:125] unable to load configmap based request-header-client-ca-file: an error on the server ("") has prevented the request from succeeding (get configmaps extension-apiserver-authentication)
~~~

~~~
$ oc logs -n openshift-cloud-credential-operator cloud-credential-operator-797b4d57fb-zvf4d
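A quick way to gauge how widespread this empty-response error is across an extracted must-gather is a recursive grep with per-file counts. A minimal, self-contained sketch; the directory layout and log lines below are synthetic stand-ins for an extracted must-gather:

```shell
# Count, per log file, occurrences of the opaque empty-response error.
# /tmp/mg-demo is a synthetic stand-in; point the grep at a real
# extracted must-gather directory instead.
mkdir -p /tmp/mg-demo/ns-a /tmp/mg-demo/ns-b
cat > /tmp/mg-demo/ns-a/current.log <<'EOF'
W0226 06:41:28.701738 1 builder.go:206] an error on the server ("") has prevented the request from succeeding (get pods foo)
F0226 06:41:59.070968 1 cmd.go:125] an error on the server ("") has prevented the request from succeeding (get configmaps bar)
EOF
cat > /tmp/mg-demo/ns-b/current.log <<'EOF'
I0226 06:42:20.510958 1 main.go:87] all good here
EOF
grep -rc 'an error on the server ("") has prevented the request from succeeding' /tmp/mg-demo | sort
```

Files with a non-zero count identify the namespaces whose pods are hitting the failing apiserver endpoint.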
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-cloud-credential-operator/pods/cloud-credential-operator-797b4d57fb-zvf4d/cloud-credential-operator/cloud-credential-operator/logs/current.log
2021-02-26T06:42:20.510958266Z Copying system trust bundle
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=debug msg="debug logging enabled"
2021-02-26T06:42:20.532227689Z time="2021-02-26T06:42:20Z" level=info msg="setting up client for manager"
2021-02-26T06:42:20.532517877Z time="2021-02-26T06:42:20Z" level=info msg="setting up manager"
2021-02-26T06:42:29.553114581Z time="2021-02-26T06:42:29Z" level=fatal msg="unable to set up overall controller manager" error="an error on the server (\"\") has prevented the request from succeeding"
~~~

~~~
$ oc logs -n openshift-machine-api -c machine-controller machine-api-controllers-6644844f59-g96lg
/cases/02878649/0100-inspect.local.2845478784944772599.tar.gz/home/stack/inspect.local.2845478784944772599/namespaces/openshift-machine-api/pods/machine-api-controllers-6644844f59-g96lg/machine-controller/machine-controller/logs/current.log
2021-02-26T06:45:11.732990015Z I0226 06:45:11.732877 1 main.go:87] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
2021-02-26T06:45:20.743109611Z F0226 06:45:20.743073 1 main.go:92] an error on the server ("") has prevented the request from succeeding
~~~

Update from the etcd side.

# (03:13:41) The etcd cluster observes the loss of peer 10.0.2.221 with "no route to host". This is concerning, as this is the host network.

> 2021-02-25T03:13:41.137464256Z 2021-02-25 03:13:41.137414 W | rafthttp: health check for peer f16689fea56f63e6 could not connect: dial tcp 10.0.2.221:2380: connect: no route to host

# (03:42:43) The etcd cluster is scaled down by the admin.
> 2021-02-25T03:42:43.498246739Z 2021-02-25 03:42:43.498221 I | etcdserver/membership: removed member f16689fea56f63e6 from cluster b34163542f143c27

From here on, etcd is generally happy on master-{3,4}. A review of the kubelet logs shows no failures from the quorum-guard health checks:

~~~
$ cat ./kubelet_service.log | awk '/Readiness probe for \"etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $11, $13, $14; }'
~~~

We do see that the endpoints are stale, but this should not cause disruption. The etcd client balancer will complain that a subconn is down, as we see in the kube-apiserver logs:

> 2021-02-22T11:55:55.638668206Z W0222 11:55:55.638608 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.2.221:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.2.221:2379: connect: connection refused". Reconnecting...

But the etcd client balancer fails over instantly to the next active subconn, round-robin to another endpoint, so even if only 2 of the 4 endpoints are valid this should not cause disruption.

> 2021-02-22T11:38:25.651695013Z I0222 11:38:25.651689 1 flags.go:33] FLAG: --etcd-servers="[https://10.0.0.148:2379,https://10.0.0.18:2379,https://10.0.2.221:2379,https://localhost:2379]"

My understanding is that if the kubelet fails its healthz check to the kube-apiserver (quorum loss) for about 2 minutes, it will restart it.

# kas starts 2021-02-22T11:38:25
# logs end 2021-02-25T22:03:52

More evidence that the kube-apiserver can talk to etcd: we see compaction requests every 5m, as expected, in the etcd log. The compaction RPC request originates from the kube-apiserver (kas).
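The kubelet-log filter above can be exercised offline against a synthetic journal line; the line below is only illustrative of the probe-message shape and is not taken from this cluster:

```shell
# Feed one synthetic kubelet journal line through the same awk filter used
# above; it keeps the journal timestamp/host fields plus the probe target
# and outcome, dropping the rest of the line.
printf '%s\n' 'Feb 25 03:40:00 master-3 hyperkube[1234]: I0225 03:40:00.000000 1234 prober.go:126] Readiness probe for "etcd-quorum-guard-abc_openshift-etcd(uid):guard" succeeded' |
  awk '/Readiness probe for \"etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $11, $13, $14; }'
```

Any `failed` outcomes surviving this filter would point at quorum-guard probe failures; here there were none.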
~~~
$ grep -rn 'finished scheduled compaction' openshift-etcd/pods/etcd-rhocp-2tt9k-master-3/etcd/etcd/logs/current.log | wc -l
227
~~~

I did see TLS errors in the kube-apiserver logs:

> 2021-02-22T15:59:36.020367947Z I0222 15:59:36.020323 1 log.go:172] http: TLS handshake error from 10.0.2.221:50868: remote error: tls: bad certificate
> 2021-02-25T08:50:39.961900648Z I0225 08:50:39.961837 1 log.go:172] http: TLS handshake error from 10.0.0.148:40214: remote error: tls: bad certificate

When I look at openshift-apiserver, I am seeing a ton of:

> 2021-02-25T22:12:22.557366311Z E0225 22:12:22.557294 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557389706Z E0225 22:12:22.557370 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557435275Z E0225 22:12:22.557419 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557484701Z E0225 22:12:22.557466 1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.55750078Z E0225 22:12:22.557492 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF
> 2021-02-25T22:12:22.557543857Z E0225 22:12:22.557534 1 errors.go:77] Post https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: EOF

These issues could be the result of an upstream bug [1]. I am going to move this to the apiserver team to dig into this further and validate.
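As a rough cross-check on the 5-minute cadence, the 227 compaction entries counted above imply how much wall-clock time the etcd log covers; a trivial sketch of the arithmetic:

```shell
# 227 'finished scheduled compaction' lines at one compaction per ~5 minutes
# implies the log spans roughly 227 * 5 = 1135 minutes (about 18 hours),
# consistent with the kube-apiserver reaching etcd throughout that window.
compactions=227
minutes=$((compactions * 5))
echo "${minutes} minutes (~$((minutes / 60)) hours)"
```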
[1] https://github.com/kubernetes/kubernetes/pull/95725

> Must gather (25/02/2021 23:27 UTC) -> https://attachments.access.redhat.com/hydra/rest/cases/02878649/attachments/f01edfd1-5513-44ec-977f-7f8a69d2ef0b?usePresignedUrl=true

I downloaded the must-gather and ran a grep to look for evidence of the panic from the aggregation layer:

> grep -rni "timeout.go:" home/stack/02878649-New-must-gather/inspect.local.8562484343804519940/namespaces/openshift-kube-apiserver/*
>
> 1 timeout.go:132] net/http: abort Handler

I saw 10 of these panics in this capture. Please note that these panics, happening over time, can trigger API Priority and Fairness (p&f) to reject requests once the concurrency limit has been reached. If we execute the following Prometheus queries over a 48-hour window, I expect to see a steady rise in the number of currently executing requests:

> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))

We have asked the customer to apply 'Workaround B' from https://bugzilla.redhat.com/show_bug.cgi?id=1908383#c19. This should stabilize the cluster for now. Upgrading to 4.5.30 should resolve the issue.

Verified on OCP 4.8.0-0.nightly-2021-03-08-092651 over OSP 16.1 (RHOS-16.1-RHEL-8-20201214.n.3) using OVN-Octavia. The new master is successfully created with a different port name:

~~~
(shiftstack) [stack@undercloud-0 ~]$ openstack port list -c Name -f value | grep master
ostest-q9d4h-master-port-0
ostest-q9d4h-master-port-1
ostest-q9d4h-master-3
~~~

Procedure: replacing an unhealthy etcd member whose machine is not running or whose node is not ready.

1. Remove the unhealthy member.

~~~
oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0      # connect to a healthy member and remove the failing one
etcdctl member list -w table
sh-4.4# etcdctl member remove 29c283237c03d0a            # remove the ostest-q9d4h-master-2 member
~~~

2.
Remove the old secrets for the unhealthy etcd member that was removed.

~~~
oc get secrets -n openshift-etcd | grep ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-peer-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-metrics-ostest-q9d4h-master-2
oc delete secret -n openshift-etcd etcd-serving-ostest-q9d4h-master-2
~~~

3. Delete and recreate the master machine. After this machine is recreated, a new revision is forced and etcd scales up automatically.

~~~
oc get machines -n openshift-machine-api -o wide
oc get machine ostest-q9d4h-master-0 -n openshift-machine-api -o yaml > new-master-machine.yaml
~~~

Edit new-master-machine.yaml: remove the entire status section and the annotations, and change the name field to a new name (ostest-q9d4h-master-3).

~~~
oc delete machine -n openshift-machine-api ostest-q9d4h-master-2   # ~15 minutes
oc apply -f new-master-machine.yaml                                # API unstable, ~25 minutes
~~~

~~~
(shiftstack) [stack@undercloud-0 ~]$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE    TYPE       REGION  ZONE  AGE   NODE                         PROVIDERID  STATE
ostest-q9d4h-master-0        Running  m4.xlarge          nova  155m  ostest-q9d4h-master-0                    ACTIVE
ostest-q9d4h-master-1        Running  m4.xlarge          nova  155m  ostest-q9d4h-master-1                    ACTIVE
ostest-q9d4h-master-3        Running  m4.xlarge          nova  43m   ostest-q9d4h-master-3                    ACTIVE
ostest-q9d4h-worker-0-2tmsp  Running  m4.xlarge          nova  143m  ostest-q9d4h-worker-0-2tmsp              ACTIVE
ostest-q9d4h-worker-0-lcww5  Running  m4.xlarge          nova  143m  ostest-q9d4h-worker-0-lcww5              ACTIVE
ostest-q9d4h-worker-0-t49h4  Running  m4.xlarge          nova  12m   ostest-q9d4h-worker-0-t49h4              ACTIVE
~~~

4. Verify that all etcd pods are running properly:

~~~
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
etcd-ostest-q9d4h-master-0   3/3   Running   0   16m
etcd-ostest-q9d4h-master-1   2/3   Running   0   20s
etcd-ostest-q9d4h-master-3   3/3   Running   0   94s
~~~

~~~
(shiftstack) [stack@undercloud-0 ~]$ oc rsh -n openshift-etcd etcd-ostest-q9d4h-master-0
Defaulting container name to etcdctl.
~~~
Use 'oc describe pod/etcd-ostest-q9d4h-master-0 -n openshift-etcd' to see all of the containers in this pod.

~~~
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |         NAME          |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
| 84f538fbc1296514 | started | ostest-q9d4h-master-3 | https://10.196.0.150:2380 | https://10.196.0.150:2379 |      false |
| b560e0985124a29c | started | ostest-q9d4h-master-1 | https://10.196.1.146:2380 | https://10.196.1.146:2379 |      false |
| f9a4b687127f5769 | started | ostest-q9d4h-master-0 | https://10.196.0.18:2380  | https://10.196.0.18:2379  |      false |
+------------------+---------+-----------------------+---------------------------+---------------------------+------------+
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
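For reference, the replacement steps verified above can be condensed into one dry-run script. This is only a sketch: the member ID and machine names are the example values from this verification, new-master-machine.yaml is assumed to have been edited as described, and every command is echoed rather than executed.

```shell
#!/bin/sh
# Dry-run sketch of the unhealthy-member replacement flow verified above.
# UNHEALTHY and MEMBER_ID are the example values from this bug; the run()
# helper only prints each command instead of executing it.
UNHEALTHY=ostest-q9d4h-master-2
MEMBER_ID=29c283237c03d0a

run() { echo "$@"; }   # swap the body for "$@" to execute for real

# 1. Remove the unhealthy member (from inside a healthy etcd pod).
run etcdctl member remove "$MEMBER_ID"
# 2. Remove the old secrets for the removed member.
for s in etcd-peer etcd-serving-metrics etcd-serving; do
  run oc delete secret -n openshift-etcd "${s}-${UNHEALTHY}"
done
# 3. Delete the old machine and apply the edited manifest (new name, no status).
run oc delete machine -n openshift-machine-api "$UNHEALTHY"
run oc apply -f new-master-machine.yaml
# 4. Verify etcd pods and membership.
run oc get pods -n openshift-etcd
run etcdctl member list -w table
```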