Description of problem:
When kube-apiserver rolls out, many other component containers hit many "Failed to watch ***: the server has received too many requests and has asked us to try again later" errors.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-09-000946

How reproducible:
Always

Steps to Reproduce:
1.
$ export T1=`date -u '+%Y-%m-%dT%H:%M:%SZ'`
$ echo "$T1"
2021-04-09T09:33:34Z

Roll out kube-apiserver:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

2. After the kube-apiserver pods become ready, check:
ss/check_specific_log.sh |& tee c.log

Actual results:
Got:
openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver:
----- 1 lines -----
E0409 09:38:10.243724 1 reflector.go:138] k8s.io/client-go.4/tools/cache/reflector.go:167: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)

<snipped>

openshift-monitoring/prometheus-k8s-0/prometheus:
----- 331 lines -----
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
...
level=error ts=2021-04-09T09:38:30.082Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:30.397Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

openshift-monitoring/prometheus-k8s-1/prometheus:
----- 336 lines -----
level=error ts=2021-04-09T09:38:08.604Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-04-09T09:38:08.604Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
...
level=error ts=2021-04-09T09:38:30.473Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:30.612Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

<snipped>

Expected results:
This was not seen before; e.g. 4.7.0-0.nightly-2021-04-05-201242 does not reproduce it, so it seems to be a regression. There should be no such errors.

Additional info:
$ cat ss/check_specific_log.sh
for project in $(oc get project | grep -v NAME | awk '{print $1}')
do
  for pod_name in $(oc get pod -n $project 2> /dev/null | grep -v NAME | awk '{print $1}')
  do
    for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}")
    do
      oc -n $project logs --since-time "$T1" -c $container $pod_name | grep -E "(09|10):[0-9]*.*the server has received too many requests and has asked us to try again later" > a.log
      N=`cat a.log | wc -l`
      if [ $N -gt 0 ]; then
        echo "$project/$pod_name/$container:"
        echo "----- $N lines -----"
        echo
        head -n 2 a.log
        echo "..."
        tail -n 2 a.log
        echo
      fi
    done
  done
done
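For step 2 above, one way to confirm that the forced rollout has actually finished before scanning logs is to watch the operator status (a hedged sketch, not part of the original reproducer; the jsonpath fields assume the kubeapiserver operator CR exposes latestAvailableRevision and nodeStatuses as in current cluster-kube-apiserver-operator):

# Wait for the kube-apiserver ClusterOperator to stop progressing, then compare
# each node's current revision with the latest available revision.
oc wait --for=condition=Progressing=False clusteroperator/kube-apiserver --timeout=30m
oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
oc get kubeapiserver/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{"\n"}{end}'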
This is expected and not an error; it is due to priority and fairness now also covering watches (returning 429 HTTP responses).
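For reference, the relevant API Priority and Fairness objects and rejection counters can be inspected directly on the cluster (a minimal sketch using upstream APF resource and metric names; requires cluster-admin):

# Flow schemas and priority levels that classify incoming requests.
oc get flowschemas.flowcontrol.apiserver.k8s.io
oc get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
# Requests rejected with 429 by APF, broken down by flow schema / priority level.
oc get --raw /metrics | grep '^apiserver_flowcontrol_rejected_requests_total'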
It only happens during the kas rollout; after the rollout finishes, there are no new occurrences. Should the rollout handle it better?
xxia, yes, it is expected. On the other hand, if you see any watch request being throttled for any objects in `openshift-*` namespace, then it should be a bug.
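If it helps, the container headers printed by the comment 0 script make that easy to check against the collected output (a small sketch assuming the "project/pod/container:" header format in c.log):

# Print only the throttled containers that live in openshift-* namespaces.
grep -E '^openshift-[^/]+/[^/]+/[^/]+:$' c.log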
Comment 0 only listed 3 containers; actually many containers hit these errors. Below is the sorted count of "the server has received too many requests and has asked us to try again later" lines in the various containers:

336 lines: openshift-monitoring/prometheus-k8s-1/prometheus
331 lines: openshift-monitoring/prometheus-k8s-0/prometheus
59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator
51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator
44 lines: openshift-service-ca-operator/service-ca-operator-5867c6548f-4w8nt/service-ca-operator
39 lines: openshift-kube-apiserver-operator/kube-apiserver-operator-c98cfbb94-g6vsl/kube-apiserver-operator
31 lines: openshift-etcd-operator/etcd-operator-67764474db-ftbz8/etcd-operator
29 lines: openshift-controller-manager-operator/openshift-controller-manager-operator-c6d7bd565-j4mjn/openshift-controller-manager-operator
26 lines: openshift-kube-controller-manager-operator/kube-controller-manager-operator-5985c9666f-gd9pv/kube-controller-manager-operator
21 lines: openshift-console-operator/console-operator-649cfdc674-cdbnx/console-operator
20 lines: openshift-image-registry/cluster-image-registry-operator-6594d4c4d6-tp7vf/cluster-image-registry-operator
13 lines: openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator-67f67f4c7b-bg4j5/kube-storage-version-migrator-operator
13 lines: openshift-config-operator/openshift-config-operator-dc8fbc9d5-nb55l/openshift-config-operator
11 lines: openshift-dns-operator/dns-operator-7c5676c5d7-zjkwk/dns-operator
9 lines: openshift-network-operator/network-operator-54687bc749-hmw4k/network-operator
9 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machineset-controller
9 lines: openshift-cluster-storage-operator/csi-snapshot-controller-operator-5776df6f9f-nbdkn/csi-snapshot-controller-operator
8 lines: openshift-cluster-node-tuning-operator/tuned-nw775/tuned
8 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver
7 lines: openshift-machine-api/machine-api-operator-c9b896d5-w4xmz/machine-api-operator
7 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machine-controller
7 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver
6 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
6 lines: openshift-cluster-samples-operator/cluster-samples-operator-664ddc5bfd-cjq5b/cluster-samples-operator
6 lines: openshift-cluster-node-tuning-operator/tuned-txrsz/tuned
6 lines: openshift-cluster-node-tuning-operator/tuned-px9f5/tuned
6 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver-check-endpoints
5 lines: openshift-monitoring/prometheus-operator-8547b964f6-gjdg5/prometheus-operator
5 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/nodelink-controller
5 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
5 lines: openshift-authentication-operator/authentication-operator-df848bcb6-sqcmd/authentication-operator
4 lines: openshift-operator-lifecycle-manager/packageserver-56d8c4d89-ztks8/packageserver
4 lines: openshift-machine-config-operator/machine-config-controller-645f45745b-6lvsd/machine-config-controller
4 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-recovery-controller
4 lines: openshift-insights/insights-operator-64cc56b55-7sn5d/insights-operator
4 lines: openshift-authentication/oauth-openshift-688fbdbbb9-vk2kp/oauth-openshift
3 lines: openshift-oauth-apiserver/apiserver-ffd476c66-79bgn/oauth-apiserver
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-79-135.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-cert-syncer
3 lines: openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-8c99b949f-mfjcq/cluster-node-tuning-operator
3 lines: openshift-cluster-csi-drivers/aws-ebs-csi-driver-operator-66454d5dd5-wdmgb/aws-ebs-csi-driver-operator
2 lines: openshift-oauth-apiserver/apiserver-ffd476c66-dkz24/oauth-apiserver
2 lines: openshift-network-diagnostics/network-check-source-5585685c7c-f5k58/check-endpoints
2 lines: openshift-machine-api/cluster-autoscaler-operator-695748744b-lwxmd/cluster-autoscaler-operator
2 lines: openshift-cluster-node-tuning-operator/tuned-2p6mb/tuned
2 lines: openshift-authentication/oauth-openshift-688fbdbbb9-nsvh6/oauth-openshift
2 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver-check-endpoints
1 lines: openshift-service-ca/service-ca-6df7b46945-46fxp/service-ca-controller
1 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler
1 lines: openshift-authentication/oauth-openshift-688fbdbbb9-xbg6d/oauth-openshift
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver-check-endpoints
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver

http://file.rdu.redhat.com/~xxia/bug/1947829_c.log is the full output of the comment 0 script.
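For anyone re-running this, a sorted tally like the one above can be reproduced from c.log along these lines (a hedged sketch that assumes the header and "----- N lines -----" banner format produced by the comment 0 script):

# Pair each "project/pod/container:" header with its "----- N lines -----" banner
# and sort by count, highest first.
awk '/^[^\/ ]+\/[^\/ ]+\/[^\/ ]+:$/ { name=$0; sub(/:$/, "", name) }
     /^----- [0-9]+ lines -----$/   { printf "%s lines: %s\n", $2, name }' c.log | sort -rn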
Below is an excerpt of the top ones:

331 lines: openshift-monitoring/prometheus-k8s-0/prometheus:
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator:
E0409 09:38:08.696989 1 reflector.go:138] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator:
E0409 09:38:08.651854 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

Below is a grep of the *.go files referenced in the error messages, sorted by number of occurrences. I'm sure it is a bug somewhere in kas or these clients:

$ grep -o "[^ ]*go:.*.go" t.log | sed 's/:[0-9]*/:***/' | sort | uniq -c | sort -nr
3408 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go
2186 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go
446 reflector.go:***] k8s.io/client-go/informers/factory.go
404 reflector.go:***] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go
304 reflector.go:***] k8s.io/client-go.0-rc.0/tools/cache/reflector.go
176 reflector.go:***] k8s.io/client-go.0+incompatible/tools/cache/reflector.go
168 reflector.go:***] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go
90 reflector.go:***] k8s.io/client-go.0/tools/cache/reflector.go
85 reflector.go:***] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go
72 reflector.go:***] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go
69 reflector.go:***] k8s.io/client-go.4/tools/cache/reflector.go
69 reflector.go:***] k8s.io/client-go.1/tools/cache/reflector.go
66 reflector.go:***] k8s.io/client-go.5/tools/cache/reflector.go
30 reflector.go:***] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go
21 reflector.go:***] k8s.io/client-go.2/tools/cache/reflector.go
20 reflector.go:***] github.com/openshift/client-go/route/informers/externalversions/factory.go
11 reflector.go:***] k8s.io/kube-state-metrics/internal/store/builder.go
5 reflector.go:***] github.com/openshift/machine-api-operator/pkg/generated/informers/externalversions/factory.go
This should be a regression issue in 4.8, since we did not see it in 4.7.
Node link controller (runs in openshift-machine-api namespace, watches machines in openshift-machine-api) is hitting this: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-c4f9b7478-q2nrb_nodelink-controller.log
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
*** This bug has been marked as a duplicate of bug 1948311 ***