Description of problem:

In the past weeks we observed a number of CI job failures related to tests getting the error message "Too many WATCH requests, please try again later.". For example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4900/pull-ci-openshift-installer-master-e2e-metal-ipi/1389762405155737600

The problem appears more frequently on metal-ipi cluster profiles, but it is also observed in other profiles, such as gcp and azure.

In the most recent failures a pattern seems to be emerging where the prometheus-operator shows up as the top producer of watch requests, for example:

count: 42506, first: 2021-05-05T04:59:22+02:00, last: 2021-05-05T06:07:18+02:00, duration: 1h7m56.219595s
8990x system:serviceaccount:openshift-monitoring:prometheus-operator
3347x system:serviceaccount:openshift-monitoring:prometheus-k8s
2674x system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
2371x system:node:worker-0
2026x system:admin
1888x system:node:master-2
1886x system:kube-controller-manager
1720x system:node:worker-1
1626x system:node:master-0
1572x system:node:master-1

In addition, prometheus-operator remains a top producer of watch requests when narrowing down the audit logs to a window surrounding the test failure (+/- a couple of seconds):

count: 234, first: 2021-05-05T05:41:47+02:00, last: 2021-05-05T05:41:51+02:00, duration: 4.788038s
102x system:serviceaccount:openshift-monitoring:prometheus-operator
32x system:node:master-1
31x system:serviceaccount:openshift-monitoring:prometheus-k8s
30x system:node:worker-0
9x system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator
5x system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator
4x system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
4x e2e-test-image-change-build-trigger-tscqg-user
3x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
3x system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator

Version-Release number of selected component (if applicable):
4.8

How reproducible:
It's pretty frequent.

Steps to Reproduce:
1. Currently it happens on CI jobs running e2e tests.

Actual results:
Some e2e tests are failing with the message "Too many WATCH requests, please try again later."

Expected results:
No e2e test failures due to the number of watch requests.

Additional info:
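A minimal sketch, assuming JSON-lines kube-apiserver audit logs downloaded from the job artifacts, of how per-user watch counts like the ones above can be derived (the audit_logs/*.log path and the time window are illustrative, audit timestamps are in UTC):

# Count WATCH requests per user within a time window from JSON-lines audit logs.
jq -r 'select(.verb == "watch"
              and .requestReceivedTimestamp >= "2021-05-05T03:41:45Z"
              and .requestReceivedTimestamp <= "2021-05-05T03:41:52Z")
       | .user.username' audit_logs/*.log \
  | sort | uniq -c | sort -rn | head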
There were multiple fixes regarding the watcher logic in recent releases of upstream prometheus-operator, and we just bumped the downstream prometheus-operator fork to the latest version in https://github.com/openshift/prometheus-operator/pull/120. This issue needs another round of investigation to check whether the 0.47.1 release fixed it or not.
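Side note, a minimal sketch for that re-check: the prometheus-operator image running in a cluster can be read straight from the deployment, assuming the container is named prometheus-operator as in the downstream manifests.

oc -n openshift-monitoring get deploy prometheus-operator \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus-operator")].image}{"\n"}'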
It might also be connected to bug 1948311.
same bug as bug 1945808
*** Bug 1945808 has been marked as a duplicate of this bug. ***
The fix is in 4.8.0-0.nightly-2021-06-01-190440, but we can still see the "the server has received too many requests and has asked us to try again later" error.

# oc -n openshift-monitoring get prometheus -oyaml | grep -E "podMonitorNamespaceSelector|probeNamespaceSelector|ruleNamespaceSelector|serviceMonitorNamespaceSelector" -A2
    podMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    probeNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    ruleNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    serviceMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep "\- --namespaces\="
no result

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "the server has received too many requests and has asked us to try again later"
level=error ts=2021-06-02T00:49:56.108703953Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)"
level=error ts=2021-06-02T00:49:56.108729228Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)"
level=error ts=2021-06-02T00:49:56.108730691Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.StatefulSet: the server has received too many requests and has asked us to try again later (get statefulsets.apps)"

This bug should be fixed on the apiserver side, since the error can be found in many projects, see bug 1947829. The following loop prints the first two occurrences per container across all namespaces:

for project in $(oc get project | grep -v NAME | awk '{print $1}'); do
  echo "namespace is: " $project
  for pod_name in $(oc get pod -n $project | grep -v NAME | awk '{print $1}'); do
    echo "pod name is: " $pod_name
    for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do
      echo container: $container
      oc -n $project logs -c $container $pod_name | grep "the server has received too many requests and has asked us to try again later" | head -n 2
    done
  done
done
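For comparing between runs, a minimal sketch (not part of the verification above) to count these throttling errors in the prometheus-operator container, assuming the deployment and container names used in the commands above:

# Count client-side throttling errors reported by prometheus-operator.
oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator \
  | grep -c "the server has received too many requests and has asked us to try again later"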
Created attachment 1788584 [details] prometheus-operator container logs
> this bug should be fixed on the apiserver side, since the error can be found in many projects, see bug 1947829

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1947829#c1, it is a protection mechanism of the API server to throttle clients sending too many requests.

Regarding the prometheus operator, you can use kubectl-dev_tool [1] to check how many watch requests it issues. When I run the following command on the audit logs from the ci/prow/e2e-agnostic job of the pull request [2], I see that the number of requests has decreased a lot compared to what was reported initially (from 8990 requests in comment 0 to 288).

$ kubectl-dev_tool audit -f ./registry-build01-ci-openshift-org-ci-op-tnpzbdlc-stable-sha256-b26af26229ab63635a788809f1997f11bbe81d2a5a815203f7002c67711f68fb/audit_logs/ --user system:serviceaccount:openshift-monitoring:prometheus-operator --verb=watch -o top --by httpstatus
had 20045 line read failures
count: 288, first: 2021-05-31T09:52:09+02:00, last: 2021-05-31T10:39:42+02:00, duration: 47m32.750796s

Top 10 200 (of 252 total hits):
 34x [7m28.296162352s] [200-33] /api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=20715&timeoutSeconds=586&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 34x [7m33.706732294s] [200-33] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=20715&timeout=9m53s&timeoutSeconds=593&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 30x [5m33.9332754s] [200-29] /apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=13684&timeout=8m29s&timeoutSeconds=509&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 18x [5m17.430707666s] [200-17] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=13700&timeout=5m23s&timeoutSeconds=323&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 16x [5m18.936009s] [200-15] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=13683&timeout=8m27s&timeoutSeconds=507&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 16x [5m41.387553125s] [200-15] /apis/monitoring.coreos.com/v1/probes?allowWatchBookmarks=true&resourceVersion=13675&timeout=7m51s&timeoutSeconds=471&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 14x [6m16.147861571s] [200-13] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13681&timeout=6m1s&timeoutSeconds=361&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 14x [5m30.806984428s] [200-13] /apis/monitoring.coreos.com/v1/podmonitors?allowWatchBookmarks=true&resourceVersion=13685&timeout=5m47s&timeoutSeconds=347&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 14x [5m12.400839428s] [200-13] /apis/monitoring.coreos.com/v1alpha1/alertmanagerconfigs?allowWatchBookmarks=true&resourceVersion=13688&timeout=8m2s&timeoutSeconds=482&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 14x [5m31.143045714s] [200-13] /apis/monitoring.coreos.com/v1/servicemonitors?allowWatchBookmarks=true&resourceVersion=13688&timeout=7m43s&timeoutSeconds=463&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]

Top 10 429 (of 36 total hits):
 8x [ 262µs] [429-7] /api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&labelSelector=thanos-ruler-name&resourceVersion=28337&timeout=9m34s&timeoutSeconds=574&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 8x [ 282µs] [429-7] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=28402&timeout=7m2s&timeoutSeconds=422&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 8x [ 4.47925ms] [429-7] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/thanosrulers?allowWatchBookmarks=true&resourceVersion=28522&timeout=8m1s&timeoutSeconds=481&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 6x [ 283µs] [429-5] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=28553&timeout=5m26s&timeoutSeconds=326&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
 6x [ 287.333µs] [429-5] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=28294&timeout=8m0s&timeoutSeconds=480&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]

[1] https://github.com/openshift/cluster-debug-tools/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1399263464723058688/artifacts/e2e-agnostic/gather-audit-logs/artifacts/audit-logs.tar
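For a quick breakdown of throttled (429) versus successful watch requests, a minimal jq sketch over the same raw audit logs (the audit_logs/*.log path is illustrative, assuming JSON-lines files extracted from the audit-logs.tar above):

# Per-HTTP-status breakdown of prometheus-operator WATCH requests.
jq -r 'select(.verb == "watch"
              and .user.username == "system:serviceaccount:openshift-monitoring:prometheus-operator")
       | .responseStatus.code' audit_logs/*.log \
  | sort | uniq -c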
Based on Comment 10 and the attached log file, there are now 35 "the server has received too many requests and has asked us to try again later" errors in the prometheus-operator container, whereas in bug 1945808 the total number was 518. Since the number is clearly reduced, setting this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438