Especially on metal IPI CI, we're starting to see some openshift tests fail with "the server has received too many requests and has asked us to try again later." Here's a sample job [1]. It's starting to impact release-gating jobs, and as we turn on more tests it's becoming more common that we max out the number of watches and get test failures. The problem is not limited to metal IPI: on AWS we can see that the namespace controller ends up with hundreds of thousands of watches over the course of a run.

Here are the stats for the metal IPI run in [2], gathered with the cluster-debug-tools audit plugin [3]:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
 234938x system:serviceaccount:kube-system:namespace-controller
 109663x system:serviceaccount:openshift-machine-api:cluster-baremetal-operator
  72794x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
  61638x system:admin
  52262x system:kube-controller-manager

And here are the top resources being watched in the metal IPI run:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
 144902x v1/secrets
 112169x v1/configmaps
  76924x v1/serviceaccounts
  57605x authorization.k8s.io/v1/subjectaccessreviews
  51910x v1/pods

Here's a comparable run from AWS [4]:

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop | head -n6
had 39622 line read failures
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
  87128x v1/configmaps
  85711x v1/secrets
  72993x v1/pods
  65097x v1/serviceaccounts
  64961x authorization.k8s.io/v1/subjectaccessreviews

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
 320882x system:serviceaccount:kube-system:namespace-controller
  79642x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
  56920x system:kube-controller-manager
  52288x system:admin
  25974x system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1380440176282570752
[2] https://coreos.slack.com/archives/CB48XQ4KZ/p1617815405044500
[3] https://github.com/openshift/cluster-debug-tools
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1380363578401361920
[5] https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L416-L424
It turns out that setting a timeout on the HTTP client affects watch requests. For example, with a 10-second client timeout, watch requests are re-established after exactly 10 seconds even though the default request timeout for watches is ~5 minutes. This is because when multiple timeouts are set, the stdlib applies the smaller one, rendering the others ineffective. For more details see https://github.com/golang/go/blob/a937729c2c2f6950a32bc5cd0f5b88700882f078/src/net/http/client.go#L364. I've opened https://github.com/kubernetes/kubernetes/pull/101022 as a potential fix.
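To make the failure mode concrete, here is a minimal, self-contained Go sketch (my own illustration, not the kubernetes client code): a client-level Timeout of 10 seconds cuts a streaming response off after roughly 10 seconds even though the per-request context allows 5 minutes, because http.Client.Timeout covers the whole exchange, including reading the response body.

// Illustration only: http.Client.Timeout caps the entire request,
// including reading the response body, so a long-lived "watch"-style
// stream is cut off even when the request context allows much longer.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Test server that streams events slowly, like a watch endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		flusher := w.(http.Flusher)
		for i := 0; ; i++ {
			if _, err := fmt.Fprintf(w, "event %d\n", i); err != nil {
				return // client went away
			}
			flusher.Flush()
			time.Sleep(1 * time.Second)
		}
	}))
	defer srv.Close()

	// Client-level timeout of 10s; the stdlib applies the smaller of the
	// two deadlines, so the 5-minute context timeout never comes into play.
	client := &http.Client{Timeout: 10 * time.Second}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	_, err = io.Copy(io.Discard, resp.Body)
	// The stream errors out after ~10s, not ~5m.
	fmt.Printf("stream ended after %s: %v\n", time.Since(start).Round(time.Second), err)
}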
Sorry, there was a typo in my original data-gathering commands (verb=watch instead of --verb=watch), so what you see above is the unfiltered total number of requests per user. I am not sure whether the ~300K namespace-controller requests are expected or not. The high number of requests from cluster-baremetal-operator is a bug, which we're dealing with in https://bugzilla.redhat.com/show_bug.cgi?id=1948787. I'll leave it up to you to decide what to do with this one.
There still seems to be an issue somewhere else inside of OpenShift. Metal jobs continue to fail quite often with various failure modes related to "too many requests". For example, Prometheus errors were among the more common ones, with one of the prometheus-adapter replicas not coming up:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1384071524993994752
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383882320846524416
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383686488096509952

The prometheus pod is just scrolling this at the end:

2021-04-18T08:54:24.096320119Z level=error ts=2021-04-18T08:54:24.096Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"

This is hindering the baremetal IPI team's ability to land bug fixes due to high failure rates. Note: the audit logs are in the must-gather tarballs.
Another manifestation is Ingress failing; the operator logs show:

2021-04-19T13:40:26.228875001Z E0419 13:40:26.228823 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1.Route: the server has received too many requests and has asked us to try again later (get routes.route.openshift.io)

And the haproxy health checks are failing for on-premise; the API server seems overloaded?

2021-04-19T13:22:55.952879999Z time="2021-04-19T13:22:55Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)"

From: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1384125432177102848
I have prepared https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110 to reduce watch requests for the aggregated APIs. Unfortunately, the number of watch requests is still high. I am going to investigate it further - https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110#issuecomment-822602299
It looks like my previous query was broken. The improvement seems significant: 26 watch requests from the kube-apiserver-operator after the fix vs. 1248 before. I am going to identify the components that use delegated AuthN (this is the place that had the bug) and file BZs for them. I think I'm going to use this BZ as the top-level one.
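For reference, here is a hedged Go sketch of the pattern behind these fixes as I understand it from the timeout discussion above. It is not the actual operator or library-go code; the names come from client-go, and the configmap used here is only an example. The idea: a client that maintains long-lived watches should not set a client-wide timeout, and should bound individual non-watch calls with a per-request context instead.

// Hedged sketch, not the actual operator code: avoid a client-wide timeout
// (rest.Config.Timeout), which also caps long-lived watches, and bound
// individual non-watch requests with a context instead.
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func buildClient(base *rest.Config) (*kubernetes.Clientset, error) {
	cfg := rest.CopyConfig(base)
	cfg.Timeout = 0 // do not cap watch streams with a client-wide timeout
	return kubernetes.NewForConfig(cfg)
}

func getAuthConfigMap(client *kubernetes.Clientset) error {
	// Bound this single GET with a context; watches started elsewhere from
	// the same client are unaffected.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_, err := client.CoreV1().ConfigMaps("kube-system").
		Get(ctx, "extension-apiserver-authentication", metav1.GetOptions{})
	return err
}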
*** Bug 1950789 has been marked as a duplicate of this bug. ***
*** Bug 1958974 has been marked as a duplicate of this bug. ***
*** Bug 1960278 has been marked as a duplicate of this bug. ***
Stephen, do you still observe some jobs failing?
I still see occurrences of it today, e.g.:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1394220063090806784
https://search.ci.openshift.org/?search=Too+many+WATCH+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
I did take a look at the failing job. Here are the requests that took more than 2 sec but less than 2 min, obtained from the audit logs (mainly kubelets):

had 70436 line read failures
count: 4653, first: 2021-05-17T11:56:50+02:00, last: 2021-05-17T13:08:26+02:00, duration: 1h11m36.417024s
  1045x system:node:worker-0.ostest.test.metalkube.org
   996x system:node:worker-1.ostest.test.metalkube.org
   361x system:serviceaccount:openshift-monitoring:prometheus-k8s
   250x system:node:master-1.ostest.test.metalkube.org
   244x system:admin
   192x system:kube-controller-manager
   171x system:node:master-0.ostest.test.metalkube.org
   135x system:apiserver
   116x system:node:master-2.ostest.test.metalkube.org
    65x system:kube-scheduler

Further filtering the requests for --user=system:node:worker-0.ostest.test.metalkube.org only shows:

had 70436 line read failures
count: 1045, first: 2021-05-17T12:30:12+02:00, last: 2021-05-17T13:08:02+02:00, duration: 37m49.942062s
  983x v1/secrets
   62x v1/configmaps

I saved the summary lines for that user at https://gist.github.com/p0lyn0mial/8b8a9ce65a75caa209e9584caa5cc0aa. These look like "normal" requests made by the e2e tests. In addition, the attached metrics don't reveal anything suspicious either.

In summary, it looks like the operators have been fixed and no longer storm the API. We are going to merge two PRs that retry on 429, and that should fix the failures you are observing.
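For clarity on what "retry on 429" means here, a minimal, hedged Go sketch of the general technique (this is not the client-go implementation): when the apiserver answers 429, drain the response, wait for the interval in the Retry-After header, and resend the request. It assumes a bodiless GET/watch-style request that is safe to resend, and only handles the seconds form of Retry-After.

// Hedged sketch of retrying on HTTP 429 while honoring Retry-After.
// Not the client-go implementation; it only illustrates the technique
// for bodiless (GET/watch-style) requests that are safe to resend.
package example

import (
	"io"
	"net/http"
	"strconv"
	"time"
)

func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		attemptReq := req.Clone(req.Context())
		resp, err := client.Do(attemptReq)
		if err != nil || resp.StatusCode != http.StatusTooManyRequests || attempt == maxRetries {
			return resp, err
		}

		// Drain and close the body so the connection can be reused,
		// then wait for the interval the server asked for.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()

		delay := time.Second
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, perr := strconv.Atoi(s); perr == nil && secs > 0 {
				delay = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(delay)
	}
}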
@Lukasz when you manually added a link to `Link ID: Github openshift/kubernetes/pull/669/` you broke automation with the trailing slash. Why did you choose to add this manually when the bot could add it for you?
@Steve I'm terribly sorry. I don't know why I did it. Looking on the bright side, we have just uncovered a new bug that shouldn't be hard to fix.
@Stephen the recent CI jobs show no trace of the issue. Please verify it and close the issue.
Is this bug a dup of bug 1947829?
Yes, could you please close https://bugzilla.redhat.com/show_bug.cgi?id=1947829 and verify this BZ, as we will need it for the backport to 4.7.
- The original PR got reverted due to a unit test flake: https://github.com/kubernetes/kubernetes/pull/102581
- We brought the PR back and fixed the test flake: https://github.com/kubernetes/kubernetes/pull/102606

Setting the BZ back to ASSIGNED so we can retest it.
*** Bug 1947829 has been marked as a duplicate of this bug. ***
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-09-023740   True        False         144m    Cluster version is 4.8.0-0.nightly-2021-06-09-023740

$ oc debug node/<master0>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master1>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master2>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

No longer seeing errors matching 'Failed to watch.*the server has received too many requests and has asked us to try again later' in the master nodes' log files, so moving the bug to VERIFIED.
Rolled out kube-apiserver and checked again:

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'

After that, once the kas pods became ready, I ran the following script to check the logs:

$ cat ./check_specific_log.sh
for project in $(oc get project | grep -v NAME | awk '{print $1}')
do
  for pod_name in $(oc get pod -n $project 2> /dev/null | grep -v NAME | awk '{print $1}')
  do
    for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}")
    do
      oc -n $project logs --since-time "$T1" -c $container $pod_name | grep -E "(09|10):[0-9]*.*the server has received too many requests and has asked us to try again later" > a.log
      N=`cat a.log | wc -l`
      if [ $N -gt 0 ]; then
        echo "$project/$pod_name/$container:"
        echo "----- $N lines -----"
        echo
        head -n 2 a.log
        echo "..."
        tail -n 2 a.log
        echo
      fi
    done
  done
done

$ ./check_specific_log.sh
$ cat a.log

No results found, as expected.
Resetting to ASSIGNED because we need to pick this into o/o so CI tests do not flake.
kewang, it turns out we don't need the o/o PR, because o/o master already has the client-go Watch and Stream retry logic - moving this back to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438