Bug 1948311 - Some jobs failing due to excessive watches: the server has received too many requests and has asked us to try again later
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1947829 1950789 1958974 1960278
Depends On: 1952001 1957127
Blocks:
 
Reported: 2021-04-11 18:20 UTC by Stephen Benjamin
Modified: 2021-08-27 20:02 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:58:59 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes pull 101022 0 None open client-go: changes config.Timeout field semantic 2021-06-11 00:56:13 UTC
Github kubernetes kubernetes pull 102107 0 None closed client-go: add retry logic for Watch and Stream 2021-06-02 14:10:54 UTC
Github openshift api pull 901 0 None closed Bug 1948311: DeprecatedAPIRequestStatus.Conditions: makes sure that tags and comments align 2021-04-30 14:07:52 UTC
Github openshift api pull 903 0 None closed Bug 1948311: APIRequestCountStatus.Conditions makes sure that tags and comments align 2021-04-30 14:07:50 UTC
Github openshift aws-ebs-csi-driver-operator pull 121 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:51 UTC
Github openshift aws-ebs-csi-driver-operator pull 126 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-05-14 12:51:01 UTC
Github openshift cluster-authentication-operator pull 436 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:55 UTC
Github openshift cluster-config-operator pull 195 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:55 UTC
Github openshift cluster-csi-snapshot-controller-operator pull 86 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:56 UTC
Github openshift cluster-etcd-operator pull 570 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:57 UTC
Github openshift cluster-kube-apiserver-operator pull 1110 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:54 UTC
Github openshift cluster-kube-controller-manager-operator pull 522 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:55 UTC
Github openshift cluster-kube-scheduler-operator pull 348 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:07:59 UTC
Github openshift cluster-kube-storage-version-migrator-operator pull 56 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-05-13 09:35:28 UTC
Github openshift cluster-openshift-apiserver-operator pull 447 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:08:00 UTC
Github openshift cluster-openshift-controller-manager-operator pull 212 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-05-14 12:51:15 UTC
Github openshift cluster-storage-operator pull 165 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-05-13 09:35:39 UTC
Github openshift console-operator pull 532 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:08:01 UTC
Github openshift insights-operator pull 406 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:08:02 UTC
Github openshift kubernetes pull 654 0 None closed Bug 1948311: UPSTREAM: 100959: DelegatingAuthenticationOptions: TokenReview reques… 2021-04-30 14:08:02 UTC
Github openshift kubernetes pull 669 0 None closed Bug 1948311: Handle 429 errors from watch requests in reflector 2021-06-11 00:56:20 UTC
Github openshift kubernetes pull 773 0 None closed Bug 1964492: UPSTREAM: 102171: Implement support for watch initialization in P&F 2021-06-11 00:56:25 UTC
Github openshift kubernetes pull 783 0 None open Bug 1948311: UPSTREAM: 102107: client-go: add retry logic for Watch and Stream 2021-06-02 14:47:01 UTC
Github openshift kubernetes pull 792 0 None open Bug 1948311: UPSTREAM: 102606: Bring back "client-go: add retry logic for Watch and Stream" and fix test flake 2021-06-08 14:00:59 UTC
Github openshift oauth-apiserver pull 50 0 None closed Bug 1948311: bump to kube 1.21.0 2021-04-30 14:08:04 UTC
Github openshift oauth-server pull 75 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:08:04 UTC
Github openshift openshift-apiserver pull 202 0 None closed Bug 1948311: bump to kube 1.21.0 and pick up the delegated AuthN fix 2021-05-13 09:35:50 UTC
Github openshift origin pull 26218 0 None closed bump(k8s) 2021-06-11 00:56:37 UTC
Github openshift service-ca-operator pull 157 0 None closed Bug 1948311: DelegatingAuthenticationOptions TokenReview request timeout 2021-04-30 14:08:05 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:59:18 UTC

Description Stephen Benjamin 2021-04-11 18:20:11 UTC
Especially on metal IPI CI, we're starting to see some OpenShift tests fail with "the server has received too many requests and has asked us to try again later." Here's a sample job [1]. It's starting to impact release-gating jobs, and as we turn on more tests it's becoming more common that we max out the number of watches and get test failures. The problem is not limited to metal IPI: on AWS we can see that the namespace controller ends up with hundreds of thousands of watches during a run.

Here are the stats for the metal IPI run in [2]:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
234938x              system:serviceaccount:kube-system:namespace-controller
109663x              system:serviceaccount:openshift-machine-api:cluster-baremetal-operator
72794x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
61638x               system:admin
52262x               system:kube-controller-manager

And here are the top resources being watched in the metal IPI run:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
144902x              v1/secrets
112169x              v1/configmaps
76924x               v1/serviceaccounts
57605x               authorization.k8s.io/v1/subjectaccessreviews
51910x               v1/pods


Here's a comparable run from AWS:

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop | head -n6
had 39622 line read failures
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
87128x               v1/configmaps
85711x               v1/secrets
72993x               v1/pods
65097x               v1/serviceaccounts
64961x               authorization.k8s.io/v1/subjectaccessreviews

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
320882x              system:serviceaccount:kube-system:namespace-controller
79642x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
56920x               system:kube-controller-manager
52288x               system:admin
25974x               system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1380440176282570752
[2] https://coreos.slack.com/archives/CB48XQ4KZ/p1617815405044500
[3] https://github.com/openshift/cluster-debug-tools
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1380363578401361920
[5] https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L416-L424

Comment 1 Lukasz Szaszkiewicz 2021-04-12 11:58:02 UTC
It turns out that setting a timeout on the HTTP client affects watch requests.
For example, with a 10-second client timeout, watch requests are re-established after exactly 10 seconds, even though the default request timeout for them is ~5 minutes.

This is because, when multiple timeouts are set, the stdlib applies the smaller one, rendering the others useless.
For more details see https://github.com/golang/go/blob/a937729c2c2f6950a32bc5cd0f5b88700882f078/src/net/http/client.go#L364
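
A minimal sketch of the effect, using plain net/http with a toy streaming handler standing in for a watch endpoint (names and values here are illustrative, not the client-go code): the client-level Timeout bounds the whole request, including reading the streamed body, so the "watch" is cut off after 2 seconds even though the server would keep streaming.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Toy "watch" endpoint: streams one line per second for a minute.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for i := 0; i < 60; i++ {
			fmt.Fprintf(w, "event %d\n", i)
			w.(http.Flusher).Flush()
			time.Sleep(time.Second)
		}
	}))
	defer srv.Close()

	// The problematic configuration: a client-wide timeout (rest.Config.Timeout
	// plays this role in client-go; here it is plain http.Client.Timeout).
	client := &http.Client{Timeout: 2 * time.Second}

	start := time.Now()
	resp, err := client.Get(srv.URL)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// Reading the stream fails after ~2s with a Client.Timeout error,
	// even though the server would happily stream for 60s.
	n, err := io.Copy(io.Discard, resp.Body)
	fmt.Printf("read %d bytes in %s, err: %v\n", n, time.Since(start).Round(time.Second), err)
}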

I've opened https://github.com/kubernetes/kubernetes/pull/101022 as a potential fix.

Comment 2 Stephen Benjamin 2021-04-12 22:25:12 UTC
Sorry, there was a typo in my original data-gathering commands (verb=watch instead of --verb=watch), so what you see is the unfiltered total number of requests per user. I am not sure whether the 300K namespace-controller requests were expected or not.

The high number of requests from cluster-baremetal-operator is a bug, which we're dealing with in https://bugzilla.redhat.com/show_bug.cgi?id=1948787.  I'll leave it up to you to decide what to do with this one.

Comment 3 Stephen Benjamin 2021-04-19 15:42:24 UTC
There still seems to be an issue somewhere else inside OpenShift. Metal jobs continue to fail quite often with various failure modes related to "too many requests".

For example, Prometheus errors were among the more common ones, with one of the prometheus-adapter replicas not coming up:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1384071524993994752
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383882320846524416
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383686488096509952

The prometheus pod just scrolls this at the end:

  2021-04-18T08:54:24.096320119Z level=error ts=2021-04-18T08:54:24.096Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"

This is hindering the baremetal IPI team's ability to land bug fixes due to high failure rates.


Note: audit logs are in the must-gather tarballs.

Comment 4 Stephen Benjamin 2021-04-19 16:40:58 UTC
Another manifestation is that Ingress is failing; the operator logs show:

2021-04-19T13:40:26.228875001Z E0419 13:40:26.228823       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1.Route: the server has received too many requests and has asked us to try again later (get routes.route.openshift.io)

And haproxy health checks are failing for on-premise platforms; the API server seems overloaded?

  2021-04-19T13:22:55.952879999Z time="2021-04-19T13:22:55Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)"

From: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1384125432177102848

Comment 5 Lukasz Szaszkiewicz 2021-04-20 07:13:26 UTC
I have prepared https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110 to reduce watch requests for the aggregated APIs.

Unfortunately, the number of watch requests is still high. I am going to investigate it further - https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110#issuecomment-822602299

Comment 6 Lukasz Szaszkiewicz 2021-04-20 08:46:10 UTC
It looks like my previous query was broken. The improvement seems significant: 26 watch requests from the kube-apiserver-operator after the fix vs 1248 before the fix.
I am going to identify the components that use delegated AuthN (the place that had the bug) and file BZs for them. I think I'm going to use this BZ as the top-level one.
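
For reference, a hedged sketch of the fix pattern being rolled out to those components (illustrative code and a hypothetical helper, not the actual operator or k8s.io/apiserver change): avoid a client-wide rest.Config.Timeout, which would also clamp the long-lived watches made from the same config, and bound only the short TokenReview call with a per-request context deadline.

package example

import (
	"context"
	"time"

	authnv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// reviewToken is a hypothetical helper showing the per-request timeout pattern.
func reviewToken(cfg *rest.Config, token string) (*authnv1.TokenReview, error) {
	// Problematic variant: cfg.Timeout = 10 * time.Second
	// That timeout applies to every request built from cfg, including watches,
	// which then get torn down and re-established every 10 seconds.

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	// Fix pattern: a deadline scoped to this single, short-lived request.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	return client.AuthenticationV1().TokenReviews().Create(ctx,
		&authnv1.TokenReview{Spec: authnv1.TokenReviewSpec{Token: token}},
		metav1.CreateOptions{})
}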

Comment 7 Arda Guclu 2021-04-20 16:20:46 UTC
*** Bug 1950789 has been marked as a duplicate of this bug. ***

Comment 8 Jan Chaloupka 2021-05-14 12:56:08 UTC
*** Bug 1958974 has been marked as a duplicate of this bug. ***

Comment 9 Maciej Szulik 2021-05-17 08:53:55 UTC
*** Bug 1960278 has been marked as a duplicate of this bug. ***

Comment 10 Lukasz Szaszkiewicz 2021-05-17 11:21:40 UTC
Stephen, do you still observe some jobs failing?

Comment 12 Lukasz Szaszkiewicz 2021-05-18 13:23:54 UTC
I took a look at the failing job.
Here are the requests that took more than 2 seconds but less than 2 minutes, obtained from the audit logs (mainly kubelets):

had 70436 line read failures
count: 4653, first: 2021-05-17T11:56:50+02:00, last: 2021-05-17T13:08:26+02:00, duration: 1h11m36.417024s
1045x                system:node:worker-0.ostest.test.metalkube.org
996x                 system:node:worker-1.ostest.test.metalkube.org
361x                 system:serviceaccount:openshift-monitoring:prometheus-k8s
250x                 system:node:master-1.ostest.test.metalkube.org
244x                 system:admin
192x                 system:kube-controller-manager
171x                 system:node:master-0.ostest.test.metalkube.org
135x                 system:apiserver
116x                 system:node:master-2.ostest.test.metalkube.org
65x                  system:kube-scheduler


Further filtering the requests for --user=system:node:worker-0.ostest.test.metalkube.org shows:

had 70436 line read failures
count: 1045, first: 2021-05-17T12:30:12+02:00, last: 2021-05-17T13:08:02+02:00, duration: 37m49.942062s
983x                 v1/secrets
62x                  v1/configmaps


I saved the summary lines for that user at https://gist.github.com/p0lyn0mial/8b8a9ce65a75caa209e9584caa5cc0aa
They look like "normal" requests made by the e2e tests.
In addition, the attached metrics don't reveal anything suspicious either.

In summary, it looks like the operators have been fixed and no longer storm the API.
We are going to merge two PRs that add retries on 429 responses, which should fix the failures you are observing.
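
For illustration, here is a hedged, hand-rolled sketch of the behaviour those PRs add on the client side (this is not the actual client-go change, and the function is hypothetical): when establishing a watch fails with HTTP 429, back off, honour any Retry-After hint, and try again instead of surfacing the error to the caller.

package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPodsWithRetry retries the initial watch call on 429 responses.
func watchPodsWithRetry(ctx context.Context, client kubernetes.Interface, ns string) (watch.Interface, error) {
	backoff := time.Second
	for {
		w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{})
		if err == nil {
			return w, nil
		}
		if !apierrors.IsTooManyRequests(err) {
			return nil, err // only 429s are retried here
		}
		// Honour the server's Retry-After suggestion when present.
		if seconds, ok := apierrors.SuggestsClientDelay(err); ok && seconds > 0 {
			backoff = time.Duration(seconds) * time.Second
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
}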

Comment 13 Steve Kuznetsov 2021-05-19 17:34:03 UTC
@Lukasz, when you manually added a link to `Link ID: Github openshift/kubernetes/pull/669/`, you broke automation with the trailing slash. Why did you choose to add this manually when the bot could have added it for you?

Comment 14 Lukasz Szaszkiewicz 2021-05-20 07:37:53 UTC
@Steve I'm terribly sorry. I don't know why I did it. Looking on the bright side, we have just uncovered a new bug that shouldn't be hard to fix.

Comment 15 Lukasz Szaszkiewicz 2021-06-08 08:42:36 UTC
@Stephen, the recent CI jobs show no trace of the issue. Please verify it and close the issue.

Comment 17 Xingxing Xia 2021-06-08 10:21:15 UTC
Is this bug a dup of bug 1947829?

Comment 18 Lukasz Szaszkiewicz 2021-06-08 11:07:20 UTC
Yes. Could you please close https://bugzilla.redhat.com/show_bug.cgi?id=1947829 and verify this BZ, as we will need it for the backport to 4.7?

Comment 19 Abu Kashem 2021-06-08 14:00:26 UTC
- The original PR was reverted due to a unit test flake: https://github.com/kubernetes/kubernetes/pull/102581
- We brought the PR back and fixed the test flake: https://github.com/kubernetes/kubernetes/pull/102606

Setting the BZ back to ASSIGNED so we can retest it.

Comment 21 Lukasz Szaszkiewicz 2021-06-09 08:25:58 UTC
*** Bug 1947829 has been marked as a duplicate of this bug. ***

Comment 22 Ke Wang 2021-06-09 15:38:43 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-09-023740   True        False         144m    Cluster version is 4.8.0-0.nightly-2021-06-09-023740

$ oc debug node/<master0>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master1>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master2>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

We no longer see errors matching 'Failed to watch.*the server has received too many requests and has asked us to try again later' in the master node log files, so moving the bug to VERIFIED.

Comment 23 Ke Wang 2021-06-10 03:59:08 UTC
Rolled out kube-apiserver and checked again:

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'

After that, once the kube-apiserver pods become ready, run the following script to check pod logs for the error:

$ cat ./check_specific_log.sh
# Assumption: T1 was set beforehand to the kube-apiserver rollout time, in the
# RFC3339 format accepted by --since-time; the exact value is not shown here.
for project in $(oc get project | grep -v NAME | awk '{print $1}')
do
    for pod_name in $(oc get pod -n $project 2> /dev/null | grep -v NAME | awk '{print $1}')
    do
        for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}")
        do
            # Grab lines logged since the rollout that contain the "too many requests" message.
            oc -n $project logs --since-time "$T1" -c $container $pod_name | grep -E "(09|10):[0-9]*.*the server has received too many requests and has asked us to try again later" > a.log
            N=$(wc -l < a.log)
            if [ $N -gt 0 ]; then
                echo "$project/$pod_name/$container:"
                echo "----- $N lines -----"
                echo
                head -n 2 a.log
                echo "..."
                tail -n 2 a.log
                echo
            fi
        done
    done
done

$ ./check_specific_log.sh
$ cat a.log
No results found, as expected.

Comment 24 Abu Kashem 2021-06-10 14:24:46 UTC
Resetting it to ASSIGNED because we need to pick this into o/o so CI tests do not flake.

Comment 25 Abu Kashem 2021-06-10 14:55:16 UTC
kewang,
It turns out we don't need the o/o PR, because o/o master already has the client-go Watch and Stream retry logic. Moving it back to VERIFIED.

Comment 28 errata-xmlrpc 2021-07-27 22:58:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

