Description of problem:
When kube-apiserver rolls out, many other component containers hit many "Failed to watch ***: the server has received too many requests and has asked us to try again later" errors.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-09-000946

How reproducible:
Always

Steps to Reproduce:
1.
$ export T1=`date -u '+%Y-%m-%dT%H:%M:%SZ'`
$ echo "$T1"
2021-04-09T09:33:34Z

Roll out kube-apiserver:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

2. After the kube-apiserver pods become ready, check:
ss/check_specific_log.sh |& tee c.log

Actual results:
Got:
openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver:
----- 1 lines -----
E0409 09:38:10.243724 1 reflector.go:138] k8s.io/client-go.4/tools/cache/reflector.go:167: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)

<snipped>

openshift-monitoring/prometheus-k8s-0/prometheus:
----- 331 lines -----
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
...
level=error ts=2021-04-09T09:38:30.082Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:30.397Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

openshift-monitoring/prometheus-k8s-1/prometheus:
----- 336 lines -----
level=error ts=2021-04-09T09:38:08.604Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-04-09T09:38:08.604Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
...
level=error ts=2021-04-09T09:38:30.473Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-04-09T09:38:30.612Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

<snipped>

Expected results:
This was not seen before; e.g. 4.7.0-0.nightly-2021-04-05-201242 does not reproduce it, so it seems to be a regression. There should be no such errors.

Additional info:
$ cat ss/check_specific_log.sh
for project in $(oc get project | grep -v NAME | awk '{print $1}')
do
  for pod_name in $(oc get pod -n $project 2> /dev/null | grep -v NAME | awk '{print $1}')
  do
    for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}")
    do
      oc -n $project logs --since-time "$T1" -c $container $pod_name | grep -E "(09|10):[0-9]*.*the server has received too many requests and has asked us to try again later" > a.log
      N=`cat a.log | wc -l`
      if [ $N -gt 0 ]; then
        echo "$project/$pod_name/$container:"
        echo "----- $N lines -----"
        echo
        head -n 2 a.log
        echo "..."
        tail -n 2 a.log
        echo
      fi
    done
  done
done
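For step 2 above, one way to confirm that the forced rollout has actually finished before scanning logs is to watch the operator status (a hedged sketch, not part of the original reproducer; the jsonpath fields assume the kubeapiserver operator CR exposes latestAvailableRevision and nodeStatuses as in current cluster-kube-apiserver-operator):

# Wait for the kube-apiserver ClusterOperator to stop progressing, then compare
# each node's current revision with the latest available revision.
oc wait --for=condition=Progressing=False clusteroperator/kube-apiserver --timeout=30m
oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
oc get kubeapiserver/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{"\n"}{end}'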
This is expected and not an error; it is due to priority and fairness now also covering watches (returning 429 HTTP responses).
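For reference, the relevant API Priority and Fairness objects and rejection counters can be inspected directly on the cluster (a minimal sketch using upstream APF resource and metric names; requires cluster-admin):

# Flow schemas and priority levels that classify incoming requests.
oc get flowschemas.flowcontrol.apiserver.k8s.io
oc get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
# Requests rejected with 429 by APF, broken down by flow schema / priority level.
oc get --raw /metrics | grep '^apiserver_flowcontrol_rejected_requests_total'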
It only happens during the kas rollout; after the rollout finishes, there are no new occurrences. Should the rollout handle it better?
xxia, yes, it is expected. On the other hand, if you see any watch request being throttled for any objects in `openshift-*` namespace, then it should be a bug.
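If it helps, the container headers printed by the comment 0 script make that easy to check against the collected output (a small sketch assuming the "project/pod/container:" header format in c.log):

# Print only the throttled containers that live in openshift-* namespaces.
grep -E '^openshift-[^/]+/[^/]+/[^/]+:$' c.log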
Comment 0 only listed 3 containers; actually many containers hit these errors. Below is the sorted count of "the server has received too many requests and has asked us to try again later" lines in the various containers:

336 lines: openshift-monitoring/prometheus-k8s-1/prometheus
331 lines: openshift-monitoring/prometheus-k8s-0/prometheus
59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator
51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator
44 lines: openshift-service-ca-operator/service-ca-operator-5867c6548f-4w8nt/service-ca-operator
39 lines: openshift-kube-apiserver-operator/kube-apiserver-operator-c98cfbb94-g6vsl/kube-apiserver-operator
31 lines: openshift-etcd-operator/etcd-operator-67764474db-ftbz8/etcd-operator
29 lines: openshift-controller-manager-operator/openshift-controller-manager-operator-c6d7bd565-j4mjn/openshift-controller-manager-operator
26 lines: openshift-kube-controller-manager-operator/kube-controller-manager-operator-5985c9666f-gd9pv/kube-controller-manager-operator
21 lines: openshift-console-operator/console-operator-649cfdc674-cdbnx/console-operator
20 lines: openshift-image-registry/cluster-image-registry-operator-6594d4c4d6-tp7vf/cluster-image-registry-operator
13 lines: openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator-67f67f4c7b-bg4j5/kube-storage-version-migrator-operator
13 lines: openshift-config-operator/openshift-config-operator-dc8fbc9d5-nb55l/openshift-config-operator
11 lines: openshift-dns-operator/dns-operator-7c5676c5d7-zjkwk/dns-operator
9 lines: openshift-network-operator/network-operator-54687bc749-hmw4k/network-operator
9 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machineset-controller
9 lines: openshift-cluster-storage-operator/csi-snapshot-controller-operator-5776df6f9f-nbdkn/csi-snapshot-controller-operator
8 lines: openshift-cluster-node-tuning-operator/tuned-nw775/tuned
8 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver
7 lines: openshift-machine-api/machine-api-operator-c9b896d5-w4xmz/machine-api-operator
7 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/machine-controller
7 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver
6 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
6 lines: openshift-cluster-samples-operator/cluster-samples-operator-664ddc5bfd-cjq5b/cluster-samples-operator
6 lines: openshift-cluster-node-tuning-operator/tuned-txrsz/tuned
6 lines: openshift-cluster-node-tuning-operator/tuned-px9f5/tuned
6 lines: openshift-apiserver/apiserver-6bbb8b6b4d-nqhnf/openshift-apiserver-check-endpoints
5 lines: openshift-monitoring/prometheus-operator-8547b964f6-gjdg5/prometheus-operator
5 lines: openshift-machine-api/machine-api-controllers-85b876d7cf-gk4dm/nodelink-controller
5 lines: openshift-kube-apiserver/kube-apiserver-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-apiserver-check-endpoints
5 lines: openshift-authentication-operator/authentication-operator-df848bcb6-sqcmd/authentication-operator
4 lines: openshift-operator-lifecycle-manager/packageserver-56d8c4d89-ztks8/packageserver
4 lines: openshift-machine-config-operator/machine-config-controller-645f45745b-6lvsd/machine-config-controller
4 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-recovery-controller
4 lines: openshift-insights/insights-operator-64cc56b55-7sn5d/insights-operator
4 lines: openshift-authentication/oauth-openshift-688fbdbbb9-vk2kp/oauth-openshift
3 lines: openshift-oauth-apiserver/apiserver-ffd476c66-79bgn/oauth-apiserver
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-79-135.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-60-8.ap-northeast-1.compute.internal/kube-scheduler
3 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler-cert-syncer
3 lines: openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-8c99b949f-mfjcq/cluster-node-tuning-operator
3 lines: openshift-cluster-csi-drivers/aws-ebs-csi-driver-operator-66454d5dd5-wdmgb/aws-ebs-csi-driver-operator
2 lines: openshift-oauth-apiserver/apiserver-ffd476c66-dkz24/oauth-apiserver
2 lines: openshift-network-diagnostics/network-check-source-5585685c7c-f5k58/check-endpoints
2 lines: openshift-machine-api/cluster-autoscaler-operator-695748744b-lwxmd/cluster-autoscaler-operator
2 lines: openshift-cluster-node-tuning-operator/tuned-2p6mb/tuned
2 lines: openshift-authentication/oauth-openshift-688fbdbbb9-nsvh6/oauth-openshift
2 lines: openshift-apiserver/apiserver-6bbb8b6b4d-blctx/openshift-apiserver-check-endpoints
1 lines: openshift-service-ca/service-ca-6df7b46945-46fxp/service-ca-controller
1 lines: openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-49-193.ap-northeast-1.compute.internal/kube-scheduler
1 lines: openshift-authentication/oauth-openshift-688fbdbbb9-xbg6d/oauth-openshift
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver-check-endpoints
1 lines: openshift-apiserver/apiserver-6bbb8b6b4d-8n4mb/openshift-apiserver

http://file.rdu.redhat.com/~xxia/bug/1947829_c.log is the full output of the comment 0 script.
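For anyone re-running this, a sorted tally like the one above can be reproduced from c.log along these lines (a hedged sketch that assumes the header and "----- N lines -----" banner format produced by the comment 0 script):

# Pair each "project/pod/container:" header with its "----- N lines -----" banner
# and sort by count, highest first.
awk '/^[^\/ ]+\/[^\/ ]+\/[^\/ ]+:$/ { name=$0; sub(/:$/, "", name) }
     /^----- [0-9]+ lines -----$/   { printf "%s lines: %s\n", $2, name }' c.log | sort -rn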
Below is an excerpt of the top ones:

331 lines: openshift-monitoring/prometheus-k8s-0/prometheus:
level=error ts=2021-04-09T09:38:08.609Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

59 lines: openshift-cluster-storage-operator/cluster-storage-operator-cb7d49d6b-xszgb/cluster-storage-operator:
E0409 09:38:08.696989 1 reflector.go:138] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

51 lines: openshift-apiserver-operator/openshift-apiserver-operator-67986b6969-hstjc/openshift-apiserver-operator:
E0409 09:38:08.651854 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)

Below is a grep of the *.go files referenced in the error messages, sorted by number of occurrences. I'm sure it is a bug somewhere in kas or these clients:

$ grep -o "[^ ]*go:.*.go" t.log | sed 's/:[0-9]*/:***/' | sort | uniq -c | sort -nr
3408 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go
2186 caller=klog.go:*** component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go
446 reflector.go:***] k8s.io/client-go/informers/factory.go
404 reflector.go:***] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go
304 reflector.go:***] k8s.io/client-go.0-rc.0/tools/cache/reflector.go
176 reflector.go:***] k8s.io/client-go.0+incompatible/tools/cache/reflector.go
168 reflector.go:***] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go
90 reflector.go:***] k8s.io/client-go.0/tools/cache/reflector.go
85 reflector.go:***] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go
72 reflector.go:***] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go
69 reflector.go:***] k8s.io/client-go.4/tools/cache/reflector.go
69 reflector.go:***] k8s.io/client-go.1/tools/cache/reflector.go
66 reflector.go:***] k8s.io/client-go.5/tools/cache/reflector.go
30 reflector.go:***] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go
21 reflector.go:***] k8s.io/client-go.2/tools/cache/reflector.go
20 reflector.go:***] github.com/openshift/client-go/route/informers/externalversions/factory.go
11 reflector.go:***] k8s.io/kube-state-metrics/internal/store/builder.go
5 reflector.go:***] github.com/openshift/machine-api-operator/pkg/generated/informers/externalversions/factory.go
This should be a regression issue in 4.8, since we did not see it in 4.7.
Node link controller (runs in openshift-machine-api namespace, watches machines in openshift-machine-api) is hitting this: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-c4f9b7478-q2nrb_nodelink-controller.log
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
*** This bug has been marked as a duplicate of bug 1948311 ***