Bug 2056450

Summary: cluster-monitoring-operator experiencing very short watch requests on single-node clusters
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: kube-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED WONTFIX
QA Contact: Ke Wang <kewang>
Severity: unspecified
Docs Contact:
Priority: high
Version: 4.11
CC: aos-bugs, mfojtik, rfreiman, scuppett, xxia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-16 11:53:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Omer Tuchfeld 2022-02-21 09:38:26 UTC
Description of problem:
cluster-monitoring-operator is experiencing very short watch requests on single-node clusters. This leads to a higher watch request count, which in turn causes single-node CI test failures because the watch count exceeds the allowed test threshold.

From audit logs:

22:04:21 [ WATCH][       737µs] [200] /api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=11283&timeoutSeconds=306&watch=true                               [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:04:22 [ WATCH][     4.727ms] [200] /api/v1/namespaces/openshift-config-managed/configmaps?allowWatchBookmarks=true&resourceVersion=11283&timeoutSeconds=446&watch=true                       [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:04:22 [ WATCH][     1.859ms] [200] /api/v1/namespaces/openshift-monitoring/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=11260&timeoutSeconds=501&watch=true               [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:04:22 [ WATCH][     4.286ms] [200] /apis/certificates.k8s.io/v1/certificatesigningrequests?allowWatchBookmarks=true&resourceVersion=11280&timeout=6m35s&timeoutSeconds=395&watch=true        [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:04:24 [ WATCH][       770µs] [200] /api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&resourceVersion=11416&timeoutSeconds=487&watch=true                                    [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   180.855ms] [200] /api/v1/namespaces/openshift-monitoring/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=13067&timeoutSeconds=378&watch=true               [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   160.594ms] [403] /api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=13073&timeoutSeconds=429&watch=true                               [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   161.931ms] [200] /api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&resourceVersion=13263&timeoutSeconds=556&watch=true                           [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   143.324ms] [403] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13266&timeoutSeconds=571&watch=true                              [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   165.666ms] [403] /apis/config.openshift.io/v1/apiservers?fieldSelector=metadata.name%3Dcluster&watch=true                                                                  [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   160.978ms] [403] /api/v1/namespaces/openshift-monitoring/secrets?allowWatchBookmarks=true&resourceVersion=13266&timeout=8m17s&timeoutSeconds=497&watch=true                [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   166.247ms] [403] /apis/certificates.k8s.io/v1/certificatesigningrequests?allowWatchBookmarks=true&resourceVersion=13054&timeout=7m31s&timeoutSeconds=451&watch=true        [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   154.448ms] [403] /api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&resourceVersion=13075&timeoutSeconds=404&watch=true                                    [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   210.732ms] [200] /api/v1/namespaces/openshift-user-workload-monitoring/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=12994&timeoutSeconds=306&watch=true [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
22:07:45 [ WATCH][   214.544ms] [200] /api/v1/namespaces/openshift-user-workload-monitoring/configmaps?allowWatchBookmarks=true&resourceVersion=13073&timeoutSeconds=551&watch=true             [system:serviceaccount:openshift-monitoring:cluster-monitoring-operator]
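
Each of these entries is a watch opened by the operator's client-go informers. A healthy watch should stay open for roughly the requested timeoutSeconds (several minutes), whereas the durations above are in the sub-second range, so the connection is torn down and immediately re-established, which inflates the request count. Below is a minimal, illustrative Go sketch of such a watch; the namespace, resource version, and timeout values are examples, not taken from the operator's code:

// Illustrative only: opens a watch on ConfigMaps in openshift-config with an
// explicit server-side timeout, similar to the requests visible in the audit
// log above. Namespace, resource version, and timeout are example values.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	timeout := int64(300) // a healthy watch should stay open for roughly this long
	w, err := client.CoreV1().ConfigMaps("openshift-config").Watch(context.TODO(), metav1.ListOptions{
		Watch:               true,
		AllowWatchBookmarks: true,
		TimeoutSeconds:      &timeout,
		ResourceVersion:     "0", // example; informers pass the last observed resourceVersion
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	start := time.Now()
	for range w.ResultChan() {
		// events would be handled here
	}
	// When the channel closes after only a few milliseconds (as in the audit log),
	// the client immediately re-establishes the watch, inflating the request count.
	fmt.Printf("watch lasted %v\n", time.Since(start))
}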


Version-Release number of selected component (if applicable):
4.11

How reproducible:
Happens frequently in single-node CI:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1490861976048373760
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1489537207965323264 
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1489476692840812544 
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1488157843289804800 
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1488097891372240896 

Steps to Reproduce:
1. Run conformance tests on a single-node cluster
2. Observe the watch request count exceeding the test threshold (not on every run); see the sketch after these steps for one way to tally the count from the audit log
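
The threshold check itself lives in the openshift/origin test suite (see the PR linked under "Additional info"). As a rough, unofficial way to observe the count yourself, the kube-apiserver audit log (JSON lines) can be tallied per user, counting each watch once at the ResponseComplete stage:

// Rough illustration (not the actual origin test): count completed WATCH
// requests per user from a kube-apiserver audit log in JSON-lines format.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type auditEvent struct {
	Stage string `json:"stage"`
	Verb  string `json:"verb"`
	User  struct {
		Username string `json:"username"`
	} `json:"user"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: watchcount <audit.log>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1]) // e.g. a kube-apiserver audit log gathered from the node
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // audit lines can be large
	for scanner.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue // skip malformed lines
		}
		// Count each watch once, when the response completes.
		if ev.Verb == "watch" && ev.Stage == "ResponseComplete" {
			counts[ev.User.Username]++
		}
	}
	for user, n := range counts {
		fmt.Printf("%6d %s\n", n, user)
	}
}

When the problem reproduces, the cluster-monitoring-operator service account would be expected to show a noticeably higher count than other clients.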

Actual results:
A large number of very short watch requests, many completing within milliseconds rather than staying open for minutes

Expected results:
Watch requests should remain open for roughly their requested timeout instead of returning almost immediately

Additional info:
PR to increase threshold until this is resolved: https://github.com/openshift/origin/pull/26685

JIRA ticket requesting review for said PR with some discussion about the issue:
https://issues.redhat.com/browse/MON-2239

This causes about 44% of single-node CI runs to fail.

Comment 1 Omer Tuchfeld 2022-02-21 12:59:15 UTC
Once this BZ is resolved, you may want to revisit the threshold increase and lower it back to stricter values:
https://github.com/openshift/origin/pull/26685

Comment 4 Omer Tuchfeld 2022-03-02 13:34:53 UTC
Following up on the first comment [1]: you may also want to adjust the threshold in the backport [2] once the fix for this issue is backported to 4.10.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2056450#c1

[2] https://github.com/openshift/origin/pull/26872

Comment 7 Michal Fojtik 2023-01-16 11:53:58 UTC
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have decided to close it.
If you think this is a mistake, or this bug has a higher priority or severity than currently set, please feel free to reopen it and tell us why. We will move every reopened bug to https://issues.redhat.com.

Thank you for your patience and understanding.