Bug 1957190 - CI jobs failing due to too many watch requests (prometheus-operator)
Summary: CI jobs failing due to too many watch requests (prometheus-operator)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1945808
Depends On:
Blocks:
 
Reported: 2021-05-05 10:40 UTC by Andrea Fasano
Modified: 2021-07-27 23:06 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:06:36 UTC
Target Upstream Version:


Attachments
prometheus-operator container logs (30.89 KB, text/plain)
2021-06-02 02:10 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1168 0 None open [WIP] Bug 1957190: use NamespaceSelector instead of explicit allow/deny list 2021-05-18 13:50:35 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:06:57 UTC

Internal Links: 1953264

Description Andrea Fasano 2021-05-05 10:40:54 UTC
Description of problem:

In the past weeks we have observed a number of CI job failures related to tests getting the error message "Too many WATCH requests, please try again later.". For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4900/pull-ci-openshift-installer-master-e2e-metal-ipi/1389762405155737600.
The problem seems to appear more frequently on metal-ipi cluster profiles, but it is also observed in other profiles, such as gcp and azure.
In the most recent failures a pattern seems to be emerging in which the prometheus-operator shows up as the top producer of watch requests, for example:

count: 42506, first: 2021-05-05T04:59:22+02:00, last: 2021-05-05T06:07:18+02:00, duration: 1h7m56.219595s
8990x                system:serviceaccount:openshift-monitoring:prometheus-operator
3347x                system:serviceaccount:openshift-monitoring:prometheus-k8s
2674x                system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
2371x                system:node:worker-0
2026x                system:admin
1888x                system:node:master-2
1886x                system:kube-controller-manager
1720x                system:node:worker-1
1626x                system:node:master-0
1572x                system:node:master-1

In addition, prometheus-operator remains the top producer of watch requests when narrowing the audit logs down to a window surrounding the test failure (+/- a couple of seconds):

count: 234, first: 2021-05-05T05:41:47+02:00, last: 2021-05-05T05:41:51+02:00, duration: 4.788038s
102x                 system:serviceaccount:openshift-monitoring:prometheus-operator
32x                  system:node:master-1
31x                  system:serviceaccount:openshift-monitoring:prometheus-k8s
30x                  system:node:worker-0
9x                   system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator
5x                   system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator
4x                   system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
4x                   e2e-test-image-change-build-trigger-tscqg-user
3x                   system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
3x                   system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator

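The per-user tallies above can be reproduced from raw audit logs with standard shell tools. The following is a minimal sketch, assuming JSON-lines audit events in the audit.k8s.io/v1 schema; the sample events stand in for a real audit log file:

```shell
# Count WATCH audit events per username (audit.k8s.io/v1 JSON lines).
# The sample events below are stand-ins for a real audit log.
cat > /tmp/audit-sample.log <<'EOF'
{"verb":"watch","user":{"username":"system:serviceaccount:openshift-monitoring:prometheus-operator"}}
{"verb":"watch","user":{"username":"system:serviceaccount:openshift-monitoring:prometheus-operator"}}
{"verb":"watch","user":{"username":"system:node:worker-0"}}
{"verb":"get","user":{"username":"system:admin"}}
EOF

# Keep only watch events, extract the username, and tally per user.
grep '"verb":"watch"' /tmp/audit-sample.log \
  | sed 's/.*"username":"\([^"]*\)".*/\1/' \
  | sort | uniq -c | sort -rn
```

On real CI artifacts, kubectl-dev_tool (used later in this bug) performs the same aggregation with proper JSON parsing.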
Version-Release number of selected component (if applicable):
4.8

How reproducible:
It's pretty frequent

Steps to Reproduce:
1. Currently it happens on CI jobs running e2e tests

Actual results:
Some e2e tests are failing with the message "Too many WATCH requests, please try again later."

Expected results:
No e2e test failures due to the number of watch requests

Additional info:

Comment 1 Pawel Krupa 2021-05-05 11:09:31 UTC
There were multiple fixes regarding watcher logic in recent releases of upstream prometheus-operator, and we just bumped the downstream prometheus-operator fork to the latest version in https://github.com/openshift/prometheus-operator/pull/120. This issue needs another round of investigation to check whether the 0.47.1 release fixed it or not.

Comment 2 Simon Pasquier 2021-05-06 11:57:27 UTC
It might also be connected to bug 1948311.

Comment 4 Junqi Zhao 2021-05-12 13:09:20 UTC
same bug as bug 1945808

Comment 5 Simon Pasquier 2021-05-21 16:05:45 UTC
*** Bug 1945808 has been marked as a duplicate of this bug. ***

Comment 8 Junqi Zhao 2021-06-02 02:08:51 UTC
The fix is in 4.8.0-0.nightly-2021-06-01-190440, but we can still see the "the server has received too many requests and has asked us to try again later" error:
# oc -n openshift-monitoring get prometheus -oyaml | grep -E "podMonitorNamespaceSelector|probeNamespaceSelector|ruleNamespaceSelector|serviceMonitorNamespaceSelector" -A2
    podMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    probeNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    ruleNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    serviceMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"


# oc -n openshift-monitoring get deploy  prometheus-operator -oyaml | grep "\- --namespaces\="
no result (the operator no longer sets an explicit --namespaces= allow list)

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "the server has received too many requests and has asked us to try again later"
level=error ts=2021-06-02T00:49:56.108703953Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)"
level=error ts=2021-06-02T00:49:56.108729228Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)"
level=error ts=2021-06-02T00:49:56.108730691Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.StatefulSet: the server has received too many requests and has asked us to try again later (get statefulsets.apps)"

this bug should be fixed on the API server side, since the error can be found in many projects; see bug 1947829
for project in $(oc get project | grep -v NAME | awk '{print $1}'); do
  echo "namespace is: $project"
  for pod_name in $(oc get pod -n $project | grep -v NAME | awk '{print $1}'); do
    echo "pod name is: $pod_name"
    for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do
      echo "container: $container"
      oc -n $project logs -c $container $pod_name \
        | grep "the server has received too many requests and has asked us to try again later" \
        | head -n 2
    done
  done
done
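The throttling-error count used for verification later in this bug can be reproduced with a plain grep -c over a saved container log. A minimal sketch, with abbreviated stand-in log lines in place of the real prometheus-operator log:

```shell
# Count API-server throttling errors in a saved operator log.
# The lines below are abbreviated stand-ins for the real container log.
cat > /tmp/prometheus-operator.log <<'EOF'
level=error msg="Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)"
level=info msg="sync prometheus"
level=error msg="Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)"
EOF

# -c prints the number of matching lines rather than the lines themselves.
grep -c "the server has received too many requests" /tmp/prometheus-operator.log
```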

Comment 9 Junqi Zhao 2021-06-02 02:10:50 UTC
Created attachment 1788584 [details]
prometheus-operator container logs

Comment 10 Simon Pasquier 2021-06-02 07:42:11 UTC
> this bug should be fixed on the API server side, since the error can be found in many projects; see bug 1947829

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1947829#c1, it is a protection from the API server to throttle clients sending too many requests.

Regarding the prometheus operator, you can use kubectl-dev_tool [1] to check how many watch requests are issued by the prometheus operator. When I run the following command on the audit logs from the ci/prow/e2e-agnostic job of the pull request [2], I see that the number of requests has decreased a lot compared to what was reported initially (from 8990 requests in comment 0 to 288).

$ kubectl-dev_tool audit -f ./registry-build01-ci-openshift-org-ci-op-tnpzbdlc-stable-sha256-b26af26229ab63635a788809f1997f11bbe81d2a5a815203f7002c67711f68fb/audit_logs/ --user system:serviceaccount:openshift-monitoring:prometheus-operator --verb=watch -o top --by httpstatus
had 20045 line read failures
count: 288, first: 2021-05-31T09:52:09+02:00, last: 2021-05-31T10:39:42+02:00, duration: 47m32.750796s

Top 10 200 (of 252 total hits):
     34x [7m28.296162352s] [200-33] /api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=20715&timeoutSeconds=586&watch=true                                                                         [system:serviceaccount:openshift-monitoring:prometheus-operator]
     34x [7m33.706732294s] [200-33] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=20715&timeout=9m53s&timeoutSeconds=593&watch=true                   [system:serviceaccount:openshift-monitoring:prometheus-operator]
     30x [5m33.9332754s] [200-29]   /apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=13684&timeout=8m29s&timeoutSeconds=509&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]
     18x [5m17.430707666s] [200-17] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=13700&timeout=5m23s&timeoutSeconds=323&watch=true  [system:serviceaccount:openshift-monitoring:prometheus-operator]
     16x [5m18.936009s] [200-15]    /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=13683&timeout=8m27s&timeoutSeconds=507&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
     16x [5m41.387553125s] [200-15] /apis/monitoring.coreos.com/v1/probes?allowWatchBookmarks=true&resourceVersion=13675&timeout=7m51s&timeoutSeconds=471&watch=true                                        [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [6m16.147861571s] [200-13] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13681&timeout=6m1s&timeoutSeconds=361&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m30.806984428s] [200-13] /apis/monitoring.coreos.com/v1/podmonitors?allowWatchBookmarks=true&resourceVersion=13685&timeout=5m47s&timeoutSeconds=347&watch=true                                   [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m12.400839428s] [200-13] /apis/monitoring.coreos.com/v1alpha1/alertmanagerconfigs?allowWatchBookmarks=true&resourceVersion=13688&timeout=8m2s&timeoutSeconds=482&watch=true                      [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m31.143045714s] [200-13] /apis/monitoring.coreos.com/v1/servicemonitors?allowWatchBookmarks=true&resourceVersion=13688&timeout=7m43s&timeoutSeconds=463&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]

Top 10 429 (of 36 total hits):
      8x [       262µs] [429-7] /api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&labelSelector=thanos-ruler-name&resourceVersion=28337&timeout=9m34s&timeoutSeconds=574&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
      8x [       282µs] [429-7] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=28402&timeout=7m2s&timeoutSeconds=422&watch=true                          [system:serviceaccount:openshift-monitoring:prometheus-operator]
      8x [   4.47925ms] [429-7] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/thanosrulers?allowWatchBookmarks=true&resourceVersion=28522&timeout=8m1s&timeoutSeconds=481&watch=true         [system:serviceaccount:openshift-monitoring:prometheus-operator]
      6x [       283µs] [429-5] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=28553&timeout=5m26s&timeoutSeconds=326&watch=true       [system:serviceaccount:openshift-monitoring:prometheus-operator]
      6x [   287.333µs] [429-5] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=28294&timeout=8m0s&timeoutSeconds=480&watch=true                                     [system:serviceaccount:openshift-monitoring:prometheus-operator]


[1] https://github.com/openshift/cluster-debug-tools/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1399263464723058688/artifacts/e2e-agnostic/gather-audit-logs/artifacts/audit-logs.tar
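The per-status breakdown above (-o top --by httpstatus) can be approximated from raw audit lines with a similar shell tally keyed on the responseStatus.code field. A sketch under the same JSON-lines assumption, with sample events standing in for real watch requests:

```shell
# Tally audit events by HTTP response status code (audit.k8s.io/v1 schema).
# Sample lines stand in for real prometheus-operator watch events.
cat > /tmp/audit-status.log <<'EOF'
{"verb":"watch","responseStatus":{"code":200}}
{"verb":"watch","responseStatus":{"code":200}}
{"verb":"watch","responseStatus":{"code":429}}
EOF

# Extract the status code from each event and count occurrences.
sed 's/.*"code":\([0-9]*\).*/\1/' /tmp/audit-status.log | sort | uniq -c
```

A rising share of 429s relative to 200s is the signal that the API server is throttling the client.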

Comment 11 Junqi Zhao 2021-06-02 08:04:27 UTC
Based on Comment 10 and the attached log file, there are now 35 "the server has received too many requests and has asked us to try again later" errors in the prometheus-operator container, whereas in bug 1945808 the total was 518. Since the number has been reduced, setting it to VERIFIED.

Comment 14 errata-xmlrpc 2021-07-27 23:06:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

