Bug 1957190
| Summary: | CI jobs failing due to too many watch requests (prometheus-operator) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Andrea Fasano <afasano> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | alegrand, anpicker, aos-bugs, deads, erooth, hongyli, juzhao, kakkoyun, lcosic, pkrupa, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:06:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | prometheus-operator container logs (attachment 1788584) | | |
Description
Andrea Fasano
2021-05-05 10:40:54 UTC
There were multiple fixes regarding the watcher logic in recent releases of upstream prometheus-operator, and we just bumped the downstream prometheus-operator fork to the latest version in https://github.com/openshift/prometheus-operator/pull/120. This issue needs another round of investigation to check whether the 0.47.1 release fixed it or not. It might also be connected to bug 1948311.

Same bug as bug 1945808.

*** Bug 1945808 has been marked as a duplicate of this bug. ***

The fix is in 4.8.0-0.nightly-2021-06-01-190440, but the "the server has received too many requests and has asked us to try again later" error can still be seen:

# oc -n openshift-monitoring get prometheus -oyaml | grep -E "podMonitorNamespaceSelector|probeNamespaceSelector|ruleNamespaceSelector|serviceMonitorNamespaceSelector" -A2
  podMonitorNamespaceSelector:
    matchLabels:
      openshift.io/cluster-monitoring: "true"
--
  probeNamespaceSelector:
    matchLabels:
      openshift.io/cluster-monitoring: "true"
--
  ruleNamespaceSelector:
    matchLabels:
      openshift.io/cluster-monitoring: "true"
--
  serviceMonitorNamespaceSelector:
    matchLabels:
      openshift.io/cluster-monitoring: "true"

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep "\- --namespaces\="
no result

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "the server has received too many requests and has asked us to try again later"
level=error ts=2021-06-02T00:49:56.108703953Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)"
level=error ts=2021-06-02T00:49:56.108729228Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)"
level=error ts=2021-06-02T00:49:56.108730691Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.StatefulSet: the server has received too many requests and has asked us to try again later (get statefulsets.apps)"

This bug should be fixed on the API server side, since the error can be found in many projects; see bug 1947829.

# for project in $(oc get project | grep -v NAME | awk '{print $1}'); do
    echo "namespace is: " $project
    for pod_name in $(oc get pod -n $project | grep -v NAME | awk '{print $1}'); do
      echo "pod name is: " $pod_name
      for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do
        echo container: $container
        oc -n $project logs -c $container $pod_name | grep "the server has received too many requests and has asked us to try again later" | head -n 2
      done
    done
  done

Created attachment 1788584 [details]
prometheus-operator container logs
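
For reference, the same log check can be reduced to a count of the throttling errors. This is a minimal sketch rather than a command from the report: it reuses the pod-name lookup from the commands above and assumes exactly one pod in openshift-monitoring matches "prometheus-operator".

# count the throttling errors in the prometheus-operator container logs
# (assumption: a single pod name matches "prometheus-operator")
oc -n openshift-monitoring logs \
  $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') \
  -c prometheus-operator \
  | grep -c "the server has received too many requests and has asked us to try again later"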
> This bug should be fixed on the API server side, since the error can be found in many projects; see bug 1947829.

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1947829#c1, it is a protection mechanism of the API server to throttle clients that send too many requests.

Regarding the prometheus operator, you can use kubectl-dev_tool [1] to check how many watch requests the prometheus operator issues. When I run the following command on the audit logs from the ci/prow/e2e-agnostic job of the pull request [2], I see that the number of requests has decreased a lot compared to what was reported initially (from 8990 requests in comment 0 down to 288).

$ kubectl-dev_tool audit -f ./registry-build01-ci-openshift-org-ci-op-tnpzbdlc-stable-sha256-b26af26229ab63635a788809f1997f11bbe81d2a5a815203f7002c67711f68fb/audit_logs/ --user system:serviceaccount:openshift-monitoring:prometheus-operator --verb=watch -o top --by httpstatus
had 20045 line read failures
count: 288, first: 2021-05-31T09:52:09+02:00, last: 2021-05-31T10:39:42+02:00, duration: 47m32.750796s

Top 10 200 (of 252 total hits):
  34x [7m28.296162352s] [200-33] /api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=20715&timeoutSeconds=586&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  34x [7m33.706732294s] [200-33] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=20715&timeout=9m53s&timeoutSeconds=593&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  30x [5m33.9332754s] [200-29] /apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=13684&timeout=8m29s&timeoutSeconds=509&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  18x [5m17.430707666s] [200-17] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=13700&timeout=5m23s&timeoutSeconds=323&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  16x [5m18.936009s] [200-15] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=13683&timeout=8m27s&timeoutSeconds=507&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  16x [5m41.387553125s] [200-15] /apis/monitoring.coreos.com/v1/probes?allowWatchBookmarks=true&resourceVersion=13675&timeout=7m51s&timeoutSeconds=471&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  14x [6m16.147861571s] [200-13] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13681&timeout=6m1s&timeoutSeconds=361&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  14x [5m30.806984428s] [200-13] /apis/monitoring.coreos.com/v1/podmonitors?allowWatchBookmarks=true&resourceVersion=13685&timeout=5m47s&timeoutSeconds=347&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  14x [5m12.400839428s] [200-13] /apis/monitoring.coreos.com/v1alpha1/alertmanagerconfigs?allowWatchBookmarks=true&resourceVersion=13688&timeout=8m2s&timeoutSeconds=482&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  14x [5m31.143045714s] [200-13] /apis/monitoring.coreos.com/v1/servicemonitors?allowWatchBookmarks=true&resourceVersion=13688&timeout=7m43s&timeoutSeconds=463&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]

Top 10 429 (of 36 total hits):
  8x [ 262µs] [429-7] /api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&labelSelector=thanos-ruler-name&resourceVersion=28337&timeout=9m34s&timeoutSeconds=574&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  8x [ 282µs] [429-7] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=28402&timeout=7m2s&timeoutSeconds=422&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  8x [ 4.47925ms] [429-7] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/thanosrulers?allowWatchBookmarks=true&resourceVersion=28522&timeout=8m1s&timeoutSeconds=481&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  6x [ 283µs] [429-5] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=28553&timeout=5m26s&timeoutSeconds=326&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
  6x [ 287.333µs] [429-5] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=28294&timeout=8m0s&timeoutSeconds=480&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]

[1] https://github.com/openshift/cluster-debug-tools/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1399263464723058688/artifacts/e2e-agnostic/gather-audit-logs/artifacts/audit-logs.tar

Based on comment 10 and the attached log file, there are now 35 "the server has received too many requests and has asked us to try again later" errors in the prometheus-operator container, whereas in bug 1945808 the total number was 518. The number is reduced, so setting this to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
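
As a complementary, server-side view of the throttling discussed above, the kube-apiserver also counts the requests it rejects with HTTP 429 in its metrics. A minimal sketch, not part of the original report: it assumes the logged-in user has permission to read the API server's /metrics endpoint and that the cluster exposes the standard apiserver_request_total metric.

# rough server-side check of requests the API server rejected with HTTP 429
# (assumption: sufficient RBAC to read the kube-apiserver /metrics endpoint)
oc get --raw /metrics | grep apiserver_request_total | grep 'code="429"' | head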