Bug 1948311

Summary: Some jobs failing due to excessive watches: the server has received too many requests and has asked us to try again later
Product: OpenShift Container Platform
Reporter: Stephen Benjamin <stbenjam>
Component: kube-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.8
CC: akashem, aos-bugs, ccoleman, lshilin, mfojtik, pmuller, skuznets, spasquie, sttts, wking, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:58:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1952001, 1957127
Bug Blocks:

Description Stephen Benjamin 2021-04-11 18:20:11 UTC
Especially on metal IPI CI, we're starting to see some OpenShift tests fail with "the server has received too many requests and has asked us to try again later." Here's a sample job [1]. It's starting to impact release-gating jobs, and as we turn on more tests it's becoming more common that we max out the number of watches and get test failures. The problem is not limited to metal IPI: on AWS we can see the namespace controller ending up with hundreds of thousands of watches during a single run.

Here are the stats for the metal IPI run in [2]:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
234938x              system:serviceaccount:kube-system:namespace-controller
109663x              system:serviceaccount:openshift-machine-api:cluster-baremetal-operator
72794x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
61638x               system:admin
52262x               system:kube-controller-manager

And here are the top resources being watched from the metal IPI run:

$ ./kubectl-dev_tool audit -f '/tmp/must-gather-ipi-ci-op-4tdpwk2c-c170c-1380440176282570752/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop
count: 1058164, first: 2021-04-09T05:12:38-04:00, last: 2021-04-09T06:16:38-04:00, duration: 1h4m0.21978s
144902x              v1/secrets
112169x              v1/configmaps
76924x               v1/serviceaccounts
57605x               authorization.k8s.io/v1/subjectaccessreviews
51910x               v1/pods


Here's a comparable run from AWS:

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=resource -otop | head -n6
had 39622 line read failures
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
87128x               v1/configmaps
85711x               v1/secrets
72993x               v1/pods
65097x               v1/serviceaccounts
64961x               authorization.k8s.io/v1/subjectaccessreviews

$ ./kubectl-dev_tool audit -f '/tmp/audit-logs/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ef98264bbb397ec8f4e0902189294d0d0fd4ba21d669a4da3472f7832eef462/audit_logs/kube-apiserver' verb=watch --by=user -otop | head -n6
count: 956438, first: 2021-04-08T23:51:46-04:00, last: 2021-04-09T00:41:37-04:00, duration: 49m51.914093s
320882x              system:serviceaccount:kube-system:namespace-controller
79642x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
56920x               system:kube-controller-manager
52288x               system:admin
25974x               system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1380440176282570752
[2] https://coreos.slack.com/archives/CB48XQ4KZ/p1617815405044500
[3] https://github.com/openshift/cluster-debug-tools
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1380363578401361920
[5] https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L416-L424

Comment 1 Lukasz Szaszkiewicz 2021-04-12 11:58:02 UTC
It turns out that setting a timeout on the HTTP client affects watch requests.
For example, with a 10-second timeout, watch requests are re-established exactly after 10 seconds even though the default request timeout for them is ~5 minutes.

This is because when multiple timeouts are set, the stdlib applies the smallest one, making the others ineffective.
For more details see https://github.com/golang/go/blob/a937729c2c2f6950a32bc5cd0f5b88700882f078/src/net/http/client.go#L364
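
To illustrate (a minimal, self-contained Go sketch of the stdlib behaviour, not the client-go code itself): http.Client.Timeout covers the whole exchange, including reading the streamed response body, so a long-lived watch stream is cut off as soon as that timeout elapses, no matter how generous the per-request deadline is. The fake streaming endpoint and the 3-second timeout below are made up for the example.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Fake "watch" endpoint: streams one event per second until the client goes away.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		flusher := w.(http.Flusher)
		for i := 0; ; i++ {
			select {
			case <-r.Context().Done():
				return
			case <-time.After(1 * time.Second):
			}
			fmt.Fprintf(w, "event %d\n", i)
			flusher.Flush()
		}
	}))
	defer srv.Close()

	// The problematic pattern: a client-wide timeout. http.Client.Timeout covers
	// the whole request, including reading the streamed response body, so it
	// overrides any longer per-request deadline.
	client := &http.Client{Timeout: 3 * time.Second}

	resp, err := client.Get(srv.URL)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	buf := make([]byte, 64)
	for {
		n, err := resp.Body.Read(buf)
		fmt.Printf("read %q err=%v\n", buf[:n], err)
		if err != nil {
			// After ~3s the body read fails with a Client.Timeout error,
			// which is what forces the watch to be re-established early.
			return
		}
	}
}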

I've opened https://github.com/kubernetes/kubernetes/pull/101022 as a potential fix.

Comment 2 Stephen Benjamin 2021-04-12 22:25:12 UTC
Sorry, there was a typo in my original data-gathering commands (verb=watch instead of --verb=watch), so what you see is the unfiltered total number of requests per user. I am not sure whether the ~300K namespace-controller requests were expected or not.

The high number of requests from cluster-baremetal-operator is a bug, which we're dealing with in https://bugzilla.redhat.com/show_bug.cgi?id=1948787.  I'll leave it up to you to decide what to do with this one.

Comment 3 Stephen Benjamin 2021-04-19 15:42:24 UTC
There still seems to be an issue somewhere else inside OpenShift. Metal jobs continue to fail quite often with various failure modes related to "too many requests".

For example, Prometheus errors are among the more common ones, with one of the prometheus-adapter replicas not coming up:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1384071524993994752
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383882320846524416
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release[…]ter-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1383686488096509952

The prometheus pod just keeps scrolling this at the end:

  2021-04-18T08:54:24.096320119Z level=error ts=2021-04-18T08:54:24.096Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"

This is hindering the baremetal IPI team's ability to land bug fixes due to the high failure rates.


Note: audit logs are in the must-gather tarballs.

Comment 4 Stephen Benjamin 2021-04-19 16:40:58 UTC
Another manifestation is Ingress failing; the operator logs show:

2021-04-19T13:40:26.228875001Z E0419 13:40:26.228823       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1.Route: the server has received too many requests and has asked us to try again later (get routes.route.openshift.io)

And haproxy health checks are failing for on-premise platforms; the API server seems overloaded?

  2021-04-19T13:22:55.952879999Z time="2021-04-19T13:22:55Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)"

From: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1384125432177102848

Comment 5 Lukasz Szaszkiewicz 2021-04-20 07:13:26 UTC
I have prepared https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110 to reduce watch requests for the aggregated APIs.

Unfortunately, the number of watch requests is still high. I am going to investigate it further - https://github.com/openshift/cluster-kube-apiserver-operator/pull/1110#issuecomment-822602299

Comment 6 Lukasz Szaszkiewicz 2021-04-20 08:46:10 UTC
It looks like my previous query was broken. The improvement seems significant: 26 watch requests from the kube-apiserver-operator after the fix vs. 1248 before the fix.
I am going to identify the components that use delegated AuthN (this is the place that had the bug) and file BZs for them. I think I'm going to use this BZ as the top-level one.
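
For reference, a minimal sketch of the pattern such components should end up with (my own illustration under assumptions, not the operator or library code): one long-lived watch on the delegated-authentication ConfigMap through a shared informer, with all subsequent lookups served from the local cache instead of fresh API requests. The kubeconfig handling and the ConfigMap name are assumptions for the example.

package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig path comes from the environment for this sketch.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer factory scoped to kube-system: a single watch per
	// resource is opened and kept alive, instead of a new watch per lookup.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute, informers.WithNamespace("kube-system"))
	cmLister := factory.Core().V1().ConfigMaps().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Reads now hit the local cache; no extra API requests are made.
	// "extension-apiserver-authentication" is the ConfigMap delegated AuthN
	// consumers typically read (assumed here for illustration).
	cm, err := cmLister.ConfigMaps("kube-system").Get("extension-apiserver-authentication")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resourceVersion:", cm.ResourceVersion)
}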

Comment 7 Arda Guclu 2021-04-20 16:20:46 UTC
*** Bug 1950789 has been marked as a duplicate of this bug. ***

Comment 8 Jan Chaloupka 2021-05-14 12:56:08 UTC
*** Bug 1958974 has been marked as a duplicate of this bug. ***

Comment 9 Maciej Szulik 2021-05-17 08:53:55 UTC
*** Bug 1960278 has been marked as a duplicate of this bug. ***

Comment 10 Lukasz Szaszkiewicz 2021-05-17 11:21:40 UTC
Stephen, do you still observe some jobs failing?

Comment 12 Lukasz Szaszkiewicz 2021-05-18 13:23:54 UTC
I did take a look at the failing job.
Here are the requests that took more than 2 seconds but less than 2 minutes, obtained from the audit logs (mainly from kubelets):

had 70436 line read failures
count: 4653, first: 2021-05-17T11:56:50+02:00, last: 2021-05-17T13:08:26+02:00, duration: 1h11m36.417024s
1045x                system:node:worker-0.ostest.test.metalkube.org
996x                 system:node:worker-1.ostest.test.metalkube.org
361x                 system:serviceaccount:openshift-monitoring:prometheus-k8s
250x                 system:node:master-1.ostest.test.metalkube.org
244x                 system:admin
192x                 system:kube-controller-manager
171x                 system:node:master-0.ostest.test.metalkube.org
135x                 system:apiserver
116x                 system:node:master-2.ostest.test.metalkube.org
65x                  system:kube-scheduler


Further filtering the requests only for --user=system:node:worker-0.ostest.test.metalkube.org shows

had 70436 line read failures
count: 1045, first: 2021-05-17T12:30:12+02:00, last: 2021-05-17T13:08:02+02:00, duration: 37m49.942062s
983x                 v1/secrets
62x                  v1/configmaps


I saved the summary lines for that user at https://gist.github.com/p0lyn0mial/8b8a9ce65a75caa209e9584caa5cc0aa
It looks like "normal" requests made by the e2e tests.
In addition to that, the attached metrics also don't reveal anything suspicious.

In summary, it looks like the operators have been fixed and no longer storm the APIs.
We are going to merge two PRs that retry on 429, which should fix the failures you are observing.
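
For context, a minimal sketch of what "retry on 429" means on the client side (my own illustration, not the actual client-go change): when the server answers 429 Too Many Requests, wait for the interval suggested by the Retry-After header and retry instead of surfacing the error immediately. The helper name, retry cap, and example URL are assumptions.

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithRetry retries a GET when the server responds with 429,
// honouring the Retry-After header if present.
func getWithRetry(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		// Honour Retry-After if the server sent one; fall back to 1s.
		delay := 1 * time.Second
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, convErr := strconv.Atoi(s); convErr == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		time.Sleep(delay)
	}
}

func main() {
	// Hypothetical endpoint; in the real scenario this would be a request
	// against the kube-apiserver.
	resp, err := getWithRetry(http.DefaultClient, "https://example.com/healthz", 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}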

Comment 13 Steve Kuznetsov 2021-05-19 17:34:03 UTC
@Lukasz when you manually added a link to `Link ID: Github openshift/kubernetes/pull/669/` you broke automation with the trailing slash. Why did you choose to add this manually when the bot could add it for you?

Comment 14 Lukasz Szaszkiewicz 2021-05-20 07:37:53 UTC
@Steve I'm terribly sorry. I don't know why I did it. Looking on the bright side, we have just uncovered a new bug that shouldn't be hard to fix.

Comment 15 Lukasz Szaszkiewicz 2021-06-08 08:42:36 UTC
@Stephen, the recent CI jobs show no trace of the issue. Please verify and close the issue.

Comment 17 Xingxing Xia 2021-06-08 10:21:15 UTC
Is this bug a dup of bug 1947829?

Comment 18 Lukasz Szaszkiewicz 2021-06-08 11:07:20 UTC
Yes, could you please close https://bugzilla.redhat.com/show_bug.cgi?id=1947829 and verify this BZ instead, as we will need it for the backport to 4.7?

Comment 19 Abu Kashem 2021-06-08 14:00:26 UTC
- The original PR got reverted due to a unit test flake: https://github.com/kubernetes/kubernetes/pull/102581
- We brought the PR back and fixed the test flake: https://github.com/kubernetes/kubernetes/pull/102606

Setting the BZ to ASSIGNED so we can retest it.

Comment 21 Lukasz Szaszkiewicz 2021-06-09 08:25:58 UTC
*** Bug 1947829 has been marked as a duplicate of this bug. ***

Comment 22 Ke Wang 2021-06-09 15:38:43 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-09-023740   True        False         144m    Cluster version is 4.8.0-0.nightly-2021-06-09-023740

$ oc debug node/<master0>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master1>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

$ oc debug node/<master2>
...
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cd /var/log
sh-4.4# grep -nr "Failed to watch.*the server has received too many requests" | grep -v debug

We no longer see the error 'Failed to watch.*the server has received too many requests and has asked us to try again later' in the master node log files, so moving the bug to VERIFIED.

Comment 23 Ke Wang 2021-06-10 03:59:08 UTC
Rolled out kube-apiserver and checked again:

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'

After that, once the kube-apiserver pods become ready, run the following script to check the pod logs for the error:

$ cat ./check_specific_log.sh
# Expects $T1 to be set to an RFC 3339 timestamp (e.g. the time of the kube-apiserver rollout)
# so that only log lines written after the rollout are checked.
for project in $(oc get project | grep -v NAME | awk '{print $1}')
do
    for pod_name in $(oc get pod -n $project 2> /dev/null | grep -v NAME | awk '{print $1}')
    do
        for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}")
        do
            # Keep only lines (timestamped 09:xx or 10:xx) that contain the "too many requests" error.
            oc -n $project logs --since-time "$T1" -c $container $pod_name | grep -E "(09|10):[0-9]*.*the server has received too many requests and has asked us to try again later" > a.log
            N=$(cat a.log | wc -l)
            if [ $N -gt 0 ]; then
                echo "$project/$pod_name/$container:"
                echo "----- $N lines -----"
                echo
                head -n 2 a.log
                echo "..."
                tail -n 2 a.log
                echo
            fi
        done
    done
done

$ ./check_specific_log.sh
$ cat a.log
No results found, as expected.

Comment 24 Abu Kashem 2021-06-10 14:24:46 UTC
Resetting it to ASSIGNED because we need to pick the fix into o/o so CI tests do not flake.

Comment 25 Abu Kashem 2021-06-10 14:55:16 UTC
kewang,
it turns out we don't need the o/o PR, because o/o master already has the client-go Watch and Stream retry logic. Moving it back to VERIFIED.

Comment 28 errata-xmlrpc 2021-07-27 22:58:59 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438