Bug 1957190

Summary: CI jobs failing due to too many watch requests (prometheus-operator)
Product: OpenShift Container Platform
Reporter: Andrea Fasano <afasano>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: urgent
Priority: unspecified
Version: 4.8
CC: alegrand, anpicker, aos-bugs, deads, erooth, hongyli, juzhao, kakkoyun, lcosic, pkrupa, spasquie
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 23:06:36 UTC

Attachments:
prometheus-operator container logs (flags: none)

Description Andrea Fasano 2021-05-05 10:40:54 UTC
Description of problem:

In the past weeks we have observed a number of CI job failures related to tests getting the error message "Too many WATCH requests, please try again later.". For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4900/pull-ci-openshift-installer-master-e2e-metal-ipi/1389762405155737600.
The problem seems to appear more frequently on metal-ipi cluster profiles, but it is also observed in other profiles, such as gcp and azure.
In the most recent failures a pattern seems to be emerging where prometheus-operator is the top producer of watch requests, for example:

count: 42506, first: 2021-05-05T04:59:22+02:00, last: 2021-05-05T06:07:18+02:00, duration: 1h7m56.219595s
8990x                system:serviceaccount:openshift-monitoring:prometheus-operator
3347x                system:serviceaccount:openshift-monitoring:prometheus-k8s
2674x                system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
2371x                system:node:worker-0
2026x                system:admin
1888x                system:node:master-2
1886x                system:kube-controller-manager
1720x                system:node:worker-1
1626x                system:node:master-0
1572x                system:node:master-1

In addition, prometheus-operator remains a top producer of watch requests when the audit logs are narrowed down to a window surrounding the test failure (+/- a couple of seconds):

count: 234, first: 2021-05-05T05:41:47+02:00, last: 2021-05-05T05:41:51+02:00, duration: 4.788038s
102x                 system:serviceaccount:openshift-monitoring:prometheus-operator
32x                  system:node:master-1
31x                  system:serviceaccount:openshift-monitoring:prometheus-k8s
30x                  system:node:worker-0
9x                   system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator
5x                   system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator
4x                   system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
4x                   e2e-test-image-change-build-trigger-tscqg-user
3x                   system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
3x                   system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator
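
A similar per-user breakdown can be derived directly from the raw audit logs with standard tools; a minimal sketch, assuming JSON-lines audit files where each event carries .verb and .user.username (the audit.log filename is a placeholder):

$ jq -r 'select(.verb == "watch") | .user.username' audit.log | sort | uniq -c | sort -rn | head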

Version-Release number of selected component (if applicable):
4.8

How reproducible:
It's pretty frequent

Steps to Reproduce:
1. Currently it happens on CI jobs running e2e tests

Actual results:
Some e2e tests are failing with the message "Too many WATCH requests, please try again later."

Expected results:
No e2e test failures due to the number of watch requests

Additional info:

Comment 1 Pawel Krupa 2021-05-05 11:09:31 UTC
There were multiple fixes regarding the watch logic in recent releases of upstream prometheus-operator and we just bumped the downstream prometheus-operator fork to the latest version in https://github.com/openshift/prometheus-operator/pull/120. This issue needs another round of investigation to check whether the 0.47.1 release fixed it or not.
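
For reference, the operator version actually running in a cluster can be checked from the deployment's image with something like the following (a sketch; the downstream image tag does not necessarily map directly onto the upstream 0.47.1 version string):

$ oc -n openshift-monitoring get deploy prometheus-operator -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus-operator")].image}'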

Comment 2 Simon Pasquier 2021-05-06 11:57:27 UTC
It might also be connected to bug 1948311.

Comment 4 Junqi Zhao 2021-05-12 13:09:20 UTC
same bug as bug 1945808

Comment 5 Simon Pasquier 2021-05-21 16:05:45 UTC
*** Bug 1945808 has been marked as a duplicate of this bug. ***

Comment 8 Junqi Zhao 2021-06-02 02:08:51 UTC
The fix is in 4.8.0-0.nightly-2021-06-01-190440, but we can still see the "the server has received too many requests and has asked us to try again later" error.
# oc -n openshift-monitoring get prometheus -oyaml | grep -E "podMonitorNamespaceSelector|probeNamespaceSelector|ruleNamespaceSelector|serviceMonitorNamespaceSelector" -A2
    podMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    probeNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    ruleNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
--
    serviceMonitorNamespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true"
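
These selectors mean that the operator only watches namespaces carrying the openshift.io/cluster-monitoring=true label; for illustration, the set of namespaces it selects can be listed with:

# oc get namespaces -l openshift.io/cluster-monitoring=true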


# oc -n openshift-monitoring get deploy  prometheus-operator -oyaml | grep "\- --namespaces\="
no result

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "the server has received too many requests and has asked us to try again later"
level=error ts=2021-06-02T00:49:56.108703953Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)"
level=error ts=2021-06-02T00:49:56.108729228Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)"
level=error ts=2021-06-02T00:49:56.108730691Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.StatefulSet: the server has received too many requests and has asked us to try again later (get statefulsets.apps)"

This bug should be fixed on the apiserver side, since the error can be found in many projects, see bug 1947829:
# for project in $(oc get project | grep -v NAME | awk '{print $1}'); do
    echo "namespace is: " $project
    for pod_name in $(oc get pod -n $project | grep -v NAME | awk '{print $1}'); do
      echo "pod name is: " $pod_name
      for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do
        echo container: $container
        oc -n $project logs -c $container $pod_name | grep "the server has received too many requests and has asked us to try again later" | head -n 2
      done
    done
  done

Comment 9 Junqi Zhao 2021-06-02 02:10:50 UTC
Created attachment 1788584 [details]
prometheus-operator container logs

Comment 10 Simon Pasquier 2021-06-02 07:42:11 UTC
> This bug should be fixed on the apiserver side, since the error can be found in many projects, see bug 1947829

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1947829#c1, this is a protection mechanism of the API server that throttles clients sending too many requests.
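
For reference, the throttled requests show up as HTTP 429 responses in the audit logs, so they can also be counted per user without the dev tool; a minimal sketch, assuming JSON-lines audit files (audit.log is a placeholder):

$ jq -r 'select(.responseStatus.code == 429) | .user.username' audit.log | sort | uniq -c | sort -rn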

Regarding the prometheus operator, you can use kubectl-dev_tool [1] to check how many watch requests it issues. When I run the following command on the audit logs from the ci/prow/e2e-agnostic job of pull request [2], I see that the number of requests has decreased a lot compared to what was reported initially (from 8990 requests in comment 0 to 288).

$ kubectl-dev_tool audit -f ./registry-build01-ci-openshift-org-ci-op-tnpzbdlc-stable-sha256-b26af26229ab63635a788809f1997f11bbe81d2a5a815203f7002c67711f68fb/audit_logs/ --user system:serviceaccount:openshift-monitoring:prometheus-operator --verb=watch -o top --by httpstatus
had 20045 line read failures
count: 288, first: 2021-05-31T09:52:09+02:00, last: 2021-05-31T10:39:42+02:00, duration: 47m32.750796s

Top 10 200 (of 252 total hits):
     34x [7m28.296162352s] [200-33] /api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=20715&timeoutSeconds=586&watch=true                                                                         [system:serviceaccount:openshift-monitoring:prometheus-operator]
     34x [7m33.706732294s] [200-33] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=20715&timeout=9m53s&timeoutSeconds=593&watch=true                   [system:serviceaccount:openshift-monitoring:prometheus-operator]
     30x [5m33.9332754s] [200-29]   /apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=13684&timeout=8m29s&timeoutSeconds=509&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]
     18x [5m17.430707666s] [200-17] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=13700&timeout=5m23s&timeoutSeconds=323&watch=true  [system:serviceaccount:openshift-monitoring:prometheus-operator]
     16x [5m18.936009s] [200-15]    /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=13683&timeout=8m27s&timeoutSeconds=507&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
     16x [5m41.387553125s] [200-15] /apis/monitoring.coreos.com/v1/probes?allowWatchBookmarks=true&resourceVersion=13675&timeout=7m51s&timeoutSeconds=471&watch=true                                        [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [6m16.147861571s] [200-13] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13681&timeout=6m1s&timeoutSeconds=361&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m30.806984428s] [200-13] /apis/monitoring.coreos.com/v1/podmonitors?allowWatchBookmarks=true&resourceVersion=13685&timeout=5m47s&timeoutSeconds=347&watch=true                                   [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m12.400839428s] [200-13] /apis/monitoring.coreos.com/v1alpha1/alertmanagerconfigs?allowWatchBookmarks=true&resourceVersion=13688&timeout=8m2s&timeoutSeconds=482&watch=true                      [system:serviceaccount:openshift-monitoring:prometheus-operator]
     14x [5m31.143045714s] [200-13] /apis/monitoring.coreos.com/v1/servicemonitors?allowWatchBookmarks=true&resourceVersion=13688&timeout=7m43s&timeoutSeconds=463&watch=true                               [system:serviceaccount:openshift-monitoring:prometheus-operator]

Top 10 429 (of 36 total hits):
      8x [       262µs] [429-7] /api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&labelSelector=thanos-ruler-name&resourceVersion=28337&timeout=9m34s&timeoutSeconds=574&watch=true [system:serviceaccount:openshift-monitoring:prometheus-operator]
      8x [       282µs] [429-7] /apis/apps/v1/namespaces/openshift-monitoring/statefulsets?allowWatchBookmarks=true&resourceVersion=28402&timeout=7m2s&timeoutSeconds=422&watch=true                          [system:serviceaccount:openshift-monitoring:prometheus-operator]
      8x [   4.47925ms] [429-7] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/thanosrulers?allowWatchBookmarks=true&resourceVersion=28522&timeout=8m1s&timeoutSeconds=481&watch=true         [system:serviceaccount:openshift-monitoring:prometheus-operator]
      6x [       283µs] [429-5] /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/alertmanagers?allowWatchBookmarks=true&resourceVersion=28553&timeout=5m26s&timeoutSeconds=326&watch=true       [system:serviceaccount:openshift-monitoring:prometheus-operator]
      6x [   287.333µs] [429-5] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=28294&timeout=8m0s&timeoutSeconds=480&watch=true                                     [system:serviceaccount:openshift-monitoring:prometheus-operator]


[1] https://github.com/openshift/cluster-debug-tools/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1399263464723058688/artifacts/e2e-agnostic/gather-audit-logs/artifacts/audit-logs.tar

Comment 11 Junqi Zhao 2021-06-02 08:04:27 UTC
Based on Comment 10 and the attached log file, there are now 35 "the server has received too many requests and has asked us to try again later" errors in the prometheus-operator container, whereas in bug 1945808 the total number was 518. The number is reduced, so setting this to VERIFIED.
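
For reference, the count can be reproduced against the running container with something like the following (assuming the same log message string as above):

# oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator | grep -c "the server has received too many requests and has asked us to try again later"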

Comment 14 errata-xmlrpc 2021-07-27 23:06:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438