Bug 1947829
| Summary: | When kube-apiserver rolls out, many other component containers hit many "Failed to watch ***: the server has received too many requests and has asked us to try again later" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-apiserver | Assignee: | Abu Kashem <akashem> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ke Wang <kewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.8 | CC: | akashem, aos-bugs, juzhao, lszaszki, mfojtik, mgugino, xxia |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-09 08:25:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Xingxing Xia
2021-04-09 10:59:43 UTC
This is expected and not an error, but rather due to priority and fairness now also covering watches (with 429 HTTP responses).

It only happens during the kas rollout. After the rollout finishes, there are no new occurrences. Should the rollout handle this better?

xxia, yes, it is expected. On the other hand, if you see any watch request being throttled for any objects in an `openshift-*` namespace, then it should be a bug.
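For reference, one way to see whether API Priority and Fairness is rejecting requests, and under which priority level, is to look at the kube-apiserver flow-control metrics and the configured flow schemas. This is only a minimal sketch, assuming cluster-admin access and the default APF setup (the exact flow schema names vary by release):

$ oc get --raw /metrics | grep '^apiserver_flowcontrol_rejected_requests_total'
$ oc get flowschemas
$ oc get prioritylevelconfigurations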
Comment 0 only listed 3 containers, but actually many containers had these logs. Below is a sorted count of "the server has received too many requests and has asked us to try again later" lines in the various containers:

336 lines: openshift-monitoring/prometheus-k8s-1/prometheus
331 lines: openshift-monitoring/prometheus-k8s-0/prometheus
59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator
51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator
44 lines: openshift-service-ca-operator/service-ca-operator-5867c6548f-4w8nt/service-ca-operator
39 lines: openshift-kube-apiserver-operator/kube-apiserver-operator-c98cfbb94-g6vsl/kube-apiserver-operator
31 lines: openshift-etcd-operator/etcd-operator-67764474db-ftbz8/etcd-operator
29 lines: openshift-controller-manager-operator/openshift-controller-manager-operator-c6d7bd565-j4mjn/openshift-controller-manager-operator
26 lines: openshift-kube-controller-manager-operator/kube-controller-manager-operator-5985c9666f-gd9pv/kube-controller-manager-operator
21 lines: openshift-console-operator/console-operator-649cfdc674-cdbnx/console-operator
20 lines: openshift-image-registry/cluster-image-registry-operator-6594d4c4d6-tp7vf/cluster-image-registry-operator
13 lines: openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator-67f67f4c7b-bg4j5/kube-storage-version-migrator-operator
13 lines: openshift-config-operator/openshift-config-operator-dc8fbc9d5-nb55l/openshift-config-operator
11 lines: openshift-dns-operator/dns-operator-7c5676c5d7-zjkwk/dns-operator
9 lines: openshift-network-operator/network-operator-54687bc749-hmw4k/network-operator
9 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machineset-controller
9 lines: openshift-cluster-storage-operator/csi-snapshot-controller-operator-5776df6f9f-nbdkn/csi-snapshot-controller-operator
8 lines: openshift-cluster-node-tuning-operator/tuned-nw775/tuned
8 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver
7 lines: openshift-machine-api/machine-api-operator-c9b896d5-w4xmz/machine-api-operator
7 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machine-controller
7 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver
6 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
6 lines: openshift-cluster-samples-operator/cluster-samples-operator-664ddc5bfd-cjq5b/cluster-samples-operator
6 lines: openshift-cluster-node-tuning-operator/tuned-txrsz/tuned
6 lines: openshift-cluster-node-tuning-operator/tuned-px9f5/tuned
6 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver-check-endpoints
5 lines: openshift-monitoring/prometheus-operator-8547b964f6-gjdg5/prometheus-operator
5 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/nodelink-controller
5 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
5 lines: openshift-authentication-operator/authentication-operator-df848bcb6-sqcmd/authentication-operator
4 lines: openshift-operator-lifecycle-manager/packageserver-56d8c4d89-ztks8/packageserver
4 lines: openshift-machine-config-operator/machine-config-controller-645f45745b-6lvsd/machine-config-controller
4 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-recovery-controller
4 lines: openshift-insights/insights-operator-64cc56b55-7sn5d/insights-operator
4 lines: openshift-authentication/oauth-openshift-688fbdbbb9-vk2kp/oauth-openshift
3 lines: openshift-oauth-apiserver/apiserver-ffd476c66-79bgn/oauth-apiserver
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-79-135.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-cert-syncer
3 lines: openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-8c99b949f-mfjcq/cluster-node-tuning-operator
3 lines: openshift-cluster-csi-drivers/aws-ebs-csi-driver-operator-66454d5dd5-wdmgb/aws-ebs-csi-driver-operator
2 lines: openshift-oauth-apiserver/apiserver-ffd476c66-dkz24/oauth-apiserver
2 lines: openshift-network-diagnostics/network-check-source-5585685c7c-f5k58/check-endpoints
2 lines: openshift-machine-api/cluster-autoscaler-operator-695748744b-lwxmd/cluster-autoscaler-operator
2 lines: openshift-cluster-node-tuning-operator/tuned-2p6mb/tuned
2 lines: openshift-authentication/oauth-openshift-688fbdbbb9-nsvh6/oauth-openshift
2 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver-check-endpoints
1 lines: openshift-service-ca/service-ca-6df7b46945-46fxp/service-ca-controller
1 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler
1 lines: openshift-authentication/oauth-openshift-688fbdbbb9-xbg6d/oauth-openshift
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver-check-endpoints
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver

The full output of the comment 0 script is at http://file.rdu.redhat.com/~xxia/bug/1947829_c.log.
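A per-container tally like the one above can be reproduced with a single recursive grep, assuming the container logs have first been dumped to files (for example under a logs/<namespace>/<pod>/<container>.log layout; that layout is only an assumption for illustration):

$ grep -rc 'has received too many requests and has asked us to try again later' logs/ | grep -v ':0$' | sort -t: -k2 -nr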
Below are excerpts from the top ones:

331 lines: openshift-monitoring/prometheus-k8s-0/prometheus:
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator:
E0409 09:38:08.696989 1 reflector.go:138] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator:
E0409 09:38:08.651854 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

Below is a grep of the *.go callers named in the error messages, sorted by number of occurrences. I'm sure this is a bug somewhere, either in kas or in these clients:

$ grep -o "[^ ]*go:.*.go" t.log | sed 's/:[0-9]*/:***/' | sort | uniq -c | sort -nr
3408 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go
2186 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go
446 reflector.go:***] k8s.io/client-go/informers/factory.go
404 reflector.go:***] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go
304 reflector.go:***] k8s.io/client-go.0-rc.0/tools/cache/reflector.go
176 reflector.go:***] k8s.io/client-go.0+incompatible/tools/cache/reflector.go
168 reflector.go:***] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go
90 reflector.go:***] k8s.io/client-go.0/tools/cache/reflector.go
85 reflector.go:***] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go
72 reflector.go:***] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go
69 reflector.go:***] k8s.io/client-go.4/tools/cache/reflector.go
69 reflector.go:***] k8s.io/client-go.1/tools/cache/reflector.go
66 reflector.go:***] k8s.io/client-go.5/tools/cache/reflector.go
30 reflector.go:***] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go
21 reflector.go:***] k8s.io/client-go.2/tools/cache/reflector.go
20 reflector.go:***] github.com/openshift/client-go/route/informers/externalversions/factory.go
11 reflector.go:***] k8s.io/kube-state-metrics/internal/store/builder.go
5 reflector.go:***] github.com/openshift/machine-api-operator/pkg/generated/informers/externalversions/factory.go

This should be a regression in 4.8, since we did not see it in 4.7.

The nodelink controller (runs in the openshift-machine-api namespace, watches machines in openshift-machine-api) is hitting this: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-c4f9b7478-q2nrb_nodelink-controller.log
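As a complement to the per-caller breakdown above, the same aggregate log could also be tallied by watched resource type, giving a quick view of which object kinds the reflectors fail to watch. A sketch, assuming the reflector-style "Failed to watch *v1.Foo:" message format seen in these logs:

$ grep -o 'Failed to watch \*[^:]*' t.log | sort | uniq -c | sort -nr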
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

*** This bug has been marked as a duplicate of bug 1948311 ***