Bug 1883589
Summary: Cascading failure across all openshift-apiservers renders cluster unusable

Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.5
Hardware: x86_64
OS: Unspecified
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.7.0
Reporter: Pedro Amoedo <pamoedom>
Assignee: Abu Kashem <akashem>
QA Contact: Xingxing Xia <xxia>
Keywords: ServiceDeliveryBlocker
Flags: sreichar: needinfo-
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2020-12-11 11:30:54 UTC
Bug Depends On: 1885353, 1885356, 1886442
CC: agawand, aivaras.laimikis, akaris, akashem, alchan, aos-bugs, bleanhar, bmontgom, braander, bshirren, cblecker, cpassare, cruhm, dahernan, ehashman, erich, ffranz, gshereme, gwest, hgomes, inecas, jcromer, jeder, jhuddles, jkaur, jlucky, jminter, jnordell, jwang, kelly.brown1, lmohanty, lszaszki, mabajodu, mfojtik, mike, mleonard, mtleilia, mvardhan, mwoodson, nmalik, nstielau, ocasalsa, openshift-bugs-escalate, palonsor, palshure, pbergene, ppostler, pstrick, rcarrier, rdiscala, rsandu, sburke, sdodson, sople, sreber, sreichar, sttts, tahonen, tmanor, travi, wking, xxia, yjoseph
Description (Pedro Amoedo, 2020-09-29 16:03:22 UTC)
After looking at the metrics I saw some evidence of P&F throttling requests; I am not entirely sure this is the only symptom. The customer has disabled P&F on their cluster just to see (as a test) if this resolves the issue. The next step is to take a must-gather (with audit logs) and do further investigation.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

> Who is impacted?
> If we have to block upgrade edges based on this issue, which edges would need blocking?

- A certain type of workload with a certain level of load (requests in flight) seems to trigger this issue. We are currently investigating to find the root cause.
- If we were to block an upgrade, we would block upgrades to 4.5.*. Upgrading to 4.4.* is okay, since this apiserver feature is enabled in 4.5 and onward.

> What is the impact? Is it serious enough to warrant blocking edges?

The cluster is pretty much unusable, so it is very serious. Based on the conversation with the customer, the cluster is unusable until the customer restarts the master nodes. After a restart, the cluster seems to get into a bad state again after some time.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

There are two approaches:

- Disable the priority and fairness feature of the apiserver using the following command:

  > oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"feature-gates":["APIPriorityAndFairness=false"]}}}}'

  This will trigger a new kube-apiserver revision. Since the kube-apiserver operator might be borked as well, this change will likely not take effect; in that case, you might need to manipulate the static kube-apiserver resources on the host.

- The following 'oc patch' should work seamlessly:

  > oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

  But this is yet to be tested on a real cluster with this issue. If it works, it will not require any apiserver revision or node restart. I am working with the ARO team to test this solution.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

"Priority and Fairness" has been enabled since 4.5.
Updating to 4.6 does not increase exposure/risk.

This has impacted an Azure Red Hat OpenShift customer who upgraded their cluster to 4.5, resulting in a ~10 hour period where they were unable to connect to their cluster to run commands, as `oc login` was failing. I believe that it potentially affects *all* 4.5 clusters with non-trivial control plane load.

Summary
-------

Symptoms to detect the issue, in the order they appeared:

- oc is slow and/or commands are failing on 4.5.x
- "Throttling request" log entries for the kubernetes-apiserver-operator and openshift-apiserver-operator
- Many 429 response codes in the Kubernetes audit logs
- Many "Failed to make webhook authorizer request ... context cancelled" log entries on the kubernetes apiservers
- "\"<COMPONENT>.openshift.io.v1\" is not ready: 503 (the server is currently unable to handle the request)" errors on the openshift-apiserver-operator, but openshift-apiserver pods are running, NOT crashlooping

The fix detailed in the comment above mitigated the issue.

Detailed timeline
-----------------

Customer upgraded the cluster from 4.4.14 to 4.5.11 around 2020-09-24 12:00 UTC. Starting at 12:17, after the upgrade began, we see many instances of "Throttling request" errors from client-go on various components, in particular on the Kubernetes and OpenShift apiserver-operators, in the 1-2s range. Prior to the upgrade, we did see other instances of the "Throttling request" error on various components, but none from the API server components, and the time ranges were significantly smaller (ms vs. seconds range).

From the cluster audit logs, we see that prior to the upgrade, no requests in the past 5 days received a response code of 429. But after the upgrade, beginning at 2020-09-26 03:09 UTC, we begin logging nearly 135k errors with code 429 per hour.

Customer filed a support ticket around 2020-09-28 05:00 UTC reporting that the control plane (oc commands, web console) was slow and that `oc adm top node` commands were failing. Customer responded by rebooting master-1 around 2020-09-29 11:00 UTC, but this did not improve the slowness.

We begin to see "E0929 ... Failed to make webhook authorizer request: Post https://<kubernetes load balancer IP>:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled" errors from the KASes around 2020-09-29 11:06 UTC. At 2020-09-29 11:11 UTC, we begin seeing errors in the openshift-apiserver-operator logs stating e.g. "\"build.openshift.io.v1\" is not ready: 503 (the server is currently unable to handle the request)" for various OpenShift API components (build, image, oauth, etc.). Customer reported that `oc login` was failing at 2020-09-30 13:04 UTC.

We narrowed the issue down to the OpenShift API servers and were finally able to identify the issue as matching this BZ. It is notable that the OpenShift API Server pods appeared to be healthy throughout the incident and liveness probes were *not* failing, making it difficult to detect the issue.

Per Abu's guidance, we were able to deploy the above fix to disable P&F on the KASes. However, we needed to manually edit the static pod manifests for the apiservers, as the cluster was so unhealthy that the install containers were failing to provision. The customer confirmed that this workaround mitigated the issue.

One addition to my previous comment: I see that the openshift-apiserver operator was reporting a degraded/unavailable status for about 35 hours. It was Degraded and Available=False starting from around 2020-09-29 11:00 UTC, approximately when the 503 errors started showing up in the openshift-apiserver-operator logs, through 2020-09-30 22:00 UTC when the fix was applied. (Note the dashboards I'm looking at have limited granularity, so the timestamps are approximate.)
The underlying pods were "healthy", so we did not think to look at the operator status at the time; we were too heads-down in logs to look at dashboards. Prior to 2020-09-29 11:00 UTC, all cluster operators were reporting healthy, despite the significant cluster slowness/error rates we saw above.

Going to assume that to undo the flowschema patch we'd put it back to the prior state. Therefore:

> oc patch flowschema service-accounts --type merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'

We can adjust this as we get more info, but this is my placeholder for OSD.

There is a KCS in progress here [1], but it pretty much repeats the suggested actions here in the BZ:

[1] https://access.redhat.com/solutions/5448851

Created attachment 1719472 [details]
apf configuration for 4.5
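As a rough sketch of how the 429 spike described in the timeline above could be quantified: the audit logs collected via `oc adm must-gather -- /usr/bin/gather_audit_logs` are JSON lines following the Kubernetes audit event schema, so throttled requests can be bucketed per hour. The sample event below is synthetic (illustrative only, not taken from the affected cluster); on a real must-gather you would feed the collected audit log files into the same pipeline.

```shell
# Hypothetical sketch: count HTTP 429 responses per hour in a Kubernetes
# audit log. "sample" stands in for the contents of an audit .log file.
sample='{"requestReceivedTimestamp":"2020-09-26T03:09:12.000000Z","responseStatus":{"code":429},"verb":"get"}'
printf '%s\n' "$sample" | python3 -c '
import json, sys
from collections import Counter

hours = Counter()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    ev = json.loads(line)
    if ev.get("responseStatus", {}).get("code") == 429:
        # bucket by the hour portion of the RFC 3339 timestamp
        hours[ev["requestReceivedTimestamp"][:13]] += 1
for hour, n in sorted(hours.items()):
    print(hour, n)
'
```

On the affected cluster, a per-hour count like this is what surfaced the jump from zero 429s to roughly 135k per hour after the upgrade.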
Hi sople,

Can you please gather data from the cluster so that we can find the root cause?

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Resolution:

Try the following:
- Download the attachment https://bugzilla.redhat.com/attachment.cgi?id=1719472 into a file 'apf-configuration-45.yaml' and run the following command:
> oc apply -f apf-configuration-45.yaml

If the above does not resolve the issue, then try the following:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

For C: if you don't already have a way to collect a prometheus snapshot, you can run the following script:

  #!/usr/bin/env bash

  # Queue a command to run in the background, writing its output to TARGET,
  # while limiting the number of concurrent background jobs to 45.
  function queue() {
    local TARGET="${1}"
    shift
    local LIVE
    LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 45 ]]; do
      sleep 1
      LIVE="$(jobs | wc -l)"
    done
    echo "${@}"
    if [[ -n "${FILTER:-}" ]]; then
      "${@}" | "${FILTER}" >"${TARGET}" &
    else
      "${@}" >"${TARGET}" &
    fi
  }

  ARTIFACT_DIR=$PWD
  mkdir -p "${ARTIFACT_DIR}/metrics"

  echo "Snapshotting prometheus (may take 15s) ..."
  queue "${ARTIFACT_DIR}/metrics/prometheus.tar.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus .
  FILTER=gzip queue "${ARTIFACT_DIR}/metrics/prometheus-target-metadata.json.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c "curl -G http://localhost:9090/api/v1/targets/metadata --data-urlencode 'match_target={instance!=\"\"}'"

  # wait for the backgrounded captures to finish before the script exits
  wait

Related BZs:
- https://bugzilla.redhat.com/show_bug.cgi?id=1885353
- https://bugzilla.redhat.com/show_bug.cgi?id=1885356
- https://bugzilla.redhat.com/show_bug.cgi?id=1885358

Created attachment 1719786 [details]
apf-configuration-with-oauth-server-45.yaml
If a customer is running into this issue, please follow the steps below:
- Capture data from the cluster, if possible.
- Apply the patch.

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Apply the patch:

- Download the attachment https://bugzilla.redhat.com/attachment.cgi?id=1719786 into a file 'apf-configuration.yaml' and run the following command:
> oc apply -f apf-configuration.yaml

If the above does not resolve the issue, then try the following:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

Hi agawand,

That's fine. Please follow the steps outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31 to apply the patch, and let us know if the issue resolves. Also, please advise the customer to try the data collection if the issue is resolved after applying the patch. Thanks!

Hi mleonard,

For the KCS you are working on: if the customer has already taken steps to resolve the issue, they need to undo those changes before applying https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31

Scenario 1: The customer has disabled the P&F feature. In this case the customer needs to reset it and re-enable the P&F feature.
Scenario 2: The customer has applied the following patch:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

In this case the customer can undo it by applying the following patch:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'

Hi Abu,

I published the aforementioned KCS and made a few changes to it: https://access.redhat.com/solutions/5448851. Would you mind going over it quickly and helping out by verifying the provided workaround steps? I'd also like to provide further diagnostic steps, e.g. which logs to check and which log messages we would see when running into this issue. Thanks! Also, we might want to specify rollback steps for the different workarounds 1, 2, 3.

I have a question on the prometheus metrics grab code: how long is it supposed to take? I'm on a cluster with no significant workload and it has been running for 15m. Is that to be expected?

It can take a while if the prometheus db is large.

Hi sople,
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
Even the above "oc patch" did not resolve the customer issue?
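When applying the flowschema patch, a quick sanity check can help rule out a malformed payload or a patch that silently did not stick. A minimal sketch, assuming only python3 locally; the `oc` commands in the comments require a live cluster and are the same ones recommended in this BZ:

```shell
# Hypothetical sketch: validate the merge-patch JSON locally before sending
# it to the cluster with `oc patch`, then confirm the field on the cluster.
PATCH='{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
python3 -c '
import json, sys

doc = json.loads(sys.argv[1])  # raises if the JSON is malformed
print(doc["spec"]["priorityLevelConfiguration"]["name"])
' "$PATCH"
# On a live cluster, apply and then confirm the patch took effect:
#   oc patch flowschema service-accounts --type=merge -p "$PATCH"
#   oc get flowschema service-accounts -o jsonpath='{.spec.priorityLevelConfiguration.name}'
```

If the `oc get` output still shows the previous priority level name, the patch did not take effect and the static-pod/manual remediation path described earlier may be needed.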
*** Bug 1880463 has been marked as a duplicate of this bug. ***

(In reply to Abu Kashem from comment #41)
> Hi sople,
> > oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
> Even the above "oc patch" did not resolve the customer issue?

After patching, it is not working for them. Do we need to reboot the master nodes after patching? Thanks

Hi gwest,

> I tested an upgrade of 4.4.5 -> 4.4.11
> 1. Without the patch mentioned, I saw the issue.
> 2. With the patch the upgrade was clean. (Validated Qty 2 over the weekend.)

I have a couple of questions:
- Which patch are you referring to, the oc apply or the oc patch in https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31?
- This feature has been enabled since 4.5, so an upgrade of "4.4.5 -> 4.4.11" failing is not related to P&F. I am not clear how you were able to apply the patch in 4.4.

Hello sople, akashem. Rebooting the cluster according to the official docs[1] made the cluster responsive again and it appears to be working correctly, with the exception of the cluster operator `image-registry`, which is Available but in a Degraded state. This was done after patching flowschema/service-accounts, and the modification is still present. However, please consider that when the problem occurred for the first time in early September, rebooting the cluster appeared to work, with no patch applied, until it became unresponsive again nearly 4 weeks later.

1. https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html

Hi gwest,

> In 4.4.5, the patch did appear to apply, and did seem to make the upgrade to 4.4.11 work.
> From 4.4.11, it seemed to cause failure of further upgrades.
>
> oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"feature-gates":["APIPriorityAndFairness=false"]}}}}'

The above patch disables the APIPriorityAndFairness feature. In 4.4 this feature is not enabled, so the oc patch command was ineffective. If you check the revision before and after the patch, I expect it to be at the same revision.

Hi rdiscala,

Is it the same cluster that sople and I have been looking at? If so, my hypothesis is that this cluster is experiencing a different issue than what this BZ relates to. I am waiting for the logs to do further investigation.

Created attachment 1721522 [details]
apf-configuration-with-oauth-server-with-exempt-45.yaml
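The revision check suggested in the reply above (confirming that the feature-gate patch on a 4.4 cluster was a no-op) can be sketched as follows. The revision numbers here are example values; on a live cluster they would come from the `oc get` command shown in the comments:

```shell
# Hypothetical sketch: decide whether an `oc patch kubeapiserver ...` change
# rolled out a new kube-apiserver static-pod revision. On a live cluster:
#   oc get kubeapiserver cluster -o jsonpath='{.status.latestAvailableRevision}'
before=12   # example revision observed before the patch
after=12    # example revision observed after the patch
if [ "$before" -eq "$after" ]; then
  echo "no new revision: the patch was a no-op (expected on 4.4, where P&F is not enabled)"
else
  echo "new revision ${after}: the patch rolled out"
fi
```

An unchanged revision is consistent with the explanation above: on 4.4 the feature gate does not exist, so the override changes nothing in the rendered static-pod configuration.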
(We have made some improvements to the P&F rules and so would like the customer to try the following.)

If a customer is running into this issue, please follow the steps below:
- Capture data from the cluster, if possible.
- Tweak the P&F rules.

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Tweak the P&F rules:

- If the "Priority & Fairness" feature is disabled, then you must enable it before the rules can be applied.
- The new attachment yaml can be safely applied over the old attachment (https://bugzilla.redhat.com/attachment.cgi?id=1719786), so if you have already applied the old attachment you don't need to "oc delete" it.

Solution A:
- Download the new attachment https://bugzilla.redhat.com/attachment.cgi?id=1721522 into a file 'apf-configuration.yaml' and run the following command:
> oc apply -f apf-configuration.yaml

Solution B: If the above does not resolve the issue, then try the following 'oc patch' commands:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'

Neither Solution A nor Solution B will block an upgrade from 4.5.x to 4.6.x.

@akashem I will plan the test with the client next week and let you know. Thanks.
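For context on what the assuredConcurrencyShares patches in Solution B change: in the upstream API Priority and Fairness design, each priority level is assured a slice of the server's total concurrency proportional to its shares. A rough sketch of that arithmetic; the server concurrency limit and the third priority level below are example values, not taken from the affected cluster:

```shell
python3 - <<'EOF'
# Hypothetical sketch of the P&F assured-concurrency formula from the
# upstream design:
#   ACV(level) = ceil(SCL * ACS(level) / sum of ACS over all levels)
# where SCL is the server's total concurrency limit and ACS is
# assuredConcurrencyShares. All numbers are illustrative.
import math

server_concurrency_limit = 600  # e.g. max-requests-inflight + max-mutating-requests-inflight
shares = {
    "workload-low": 100,   # value set by the Solution B patch
    "global-default": 20,  # value set by the Solution B patch
    "workload-high": 40,   # example third level
}
total = sum(shares.values())
for level, acs in sorted(shares.items()):
    acv = math.ceil(server_concurrency_limit * acs / total)
    print(f"{level}: {acv}")
EOF
```

Raising the shares of workload-low (where service-account traffic lands after the flowschema patch) grows its assured slice at the expense of the other levels, which is why the patches above relieve the throttling without disabling the feature.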
Hi rdiscala, sople,

I looked at the openshift-apiserver logs.

Instance apiserver-fd95d8b8b-m9q58:
> First entry timestamp: 2020-10-11T15:05:41.965736348Z
> Last entry timestamp: 2020-10-12T13:39:28.058463563Z
> Time span: ~22 hours
> Total lines: 67373
> grep 'http: TLS handshake error from' | wc -l
> 65909

This is the actual error:
> 2020-10-12T13:38:43.252655666Z I1012 13:38:43.252549 1 log.go:172] http: TLS handshake error from 10.130.0.1:36798: EOF

Instance apiserver-fd95d8b8b-wvlgq:
> First entry timestamp: 2020-10-12T01:34:32.525040392Z
> Last entry timestamp: 2020-10-12T13:40:38.602400659Z
> Total lines: 174453
> grep 'http: TLS handshake error from' | wc -l
> 174360

So two log files from the openshift apiserver are completely filled with the above error.

Instance apiserver-fd95d8b8b-qbw2h has a lot of the following errors:
> 2020-10-07T12:23:25.313630894Z E1007 12:23:25.313597 1 errors.go:77] context canceled
> 2020-10-07T12:23:25.314132997Z E1007 12:23:25.314093 1 webhook.go:199] Failed to make webhook authorizer request: context canceled

I grepped for '10.130.0.1' across all log files:

kube-apiserver:
> loading OpenAPI spec for "v1.image.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'read tcp 10.130.0.1:43092->10.129.0.11:8443: read: connection timed out', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

package server (component of OLM):
> namespaces/openshift-operator-lifecycle-manager/pods/packageserver-54868d888b-hbqdk/packageserver/packageserver/logs/current.log:156:2020-10-11T08:25:11.776640335Z I1011 08:25:11.776546 1 log.go:172] http: TLS handshake error from 10.130.0.1:55780: EOF

openshift-monitoring:
> namespaces/openshift-monitoring/pods/prometheus-adapter-5497f8b5bb-9ztgr/prometheus-adapter/prometheus-adapter/logs/current.log:467:2020-10-10T05:24:39.992392312Z I1010 05:24:39.992358 1 log.go:172] http2: panic serving 10.130.0.1:38754: killing connection/stream because serving request timed out and response had been started

I have opened a thread to learn more about '10.130.0.1' (it is used for routing purposes, if I am not mistaken): https://coreos.slack.com/archives/CDCP2LA9L/p1603141588175200

Hi pamoedom,

The solution (https://access.redhat.com/solutions/5448851) needs an update. The recommended steps (with the improved P&F rules) have been captured in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c56. Did the customer apply the new rules from comment 56?

Hi pamoedom,

Now that the cluster in question is stable, can you close this BZ? If the issue(s) reappear, feel free to reopen it. Thanks!

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.