Bug 2059845
Summary: | Test 'operators should not create watch channels very often' fails | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Itzik Brown <itbrown>
Component: | Monitoring | Assignee: | Jan Chaloupka <jchaloup>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, dgoodwin, erooth, jchaloup, spasquie
Target Milestone: | --- | |
Target Release: | 4.10.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-05-02 18:38:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2060406 | |
Bug Blocks: | | |
Description
Itzik Brown
2022-03-02 07:52:50 UTC
I wonder if the increase in the WATCH requests is limited to DFG-osasinfra-shiftstack_ci-ocp_testing or if it can be observed in other jobs/topologies. Simon, have there been any recent changes in the monitoring operator which might introduce this increase?

@Jan I don't recall anything off the top of my head. Now looking at the upper bound value for OpenStack, it is significantly different from the other platforms:

* OpenStack is 41.0 [1]
* AWS is 124.0 [2]
* Azure is 99.0 [3]

Overall, I feel that the test is very brittle and it isn't clear to me who owns it (in particular, when it is OK to bump the limits and who is responsible for doing so).

[1] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L265
[2] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L119
[3] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L149

Everybody and nobody. The test is there to protect against exponential growth. The usual rule of thumb was to bump the limit when the increase is not huge, or when new code adding more WATCH requests got merged. Given the reported job runs on OpenStack, we might as well just bump the upper bound.

@jchaloup, shall I assign this bug to you as well, since you are the assignee of https://bugzilla.redhat.com/show_bug.cgi?id=2060406 ?
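To make the arithmetic in these failure reports concrete, here is a minimal, self-contained Go sketch of the kind of comparison the test performs. The real check lives in test/extended/apiserver/api_requests.go and gathers its counts from the running cluster; the function and names below are hypothetical illustrations, not the actual implementation:

```go
package main

import "fmt"

// checkWatchRequests compares each operator's observed WATCH request count
// against its per-platform upper bound and returns one failure message per
// operator that exceeds it, in the format quoted in the failures below.
func checkWatchRequests(counts, upperBounds map[string]int64) []string {
	var failures []string
	for operator, count := range counts {
		bound, ok := upperBounds[operator]
		if !ok || bound == 0 {
			continue // no bound defined for this operator on this platform
		}
		if count > bound {
			ratio := float64(count) / float64(bound)
			failures = append(failures, fmt.Sprintf(
				"Operator %q produces more watch requests than expected: watchrequestcount=%d, upperbound=%d, ratio=%v",
				operator, count, bound, ratio))
		}
	}
	return failures
}

func main() {
	// Reproduces the arithmetic of one AWS single-node failure quoted
	// below: 127 watch requests against an upper bound of 96.
	for _, f := range checkWatchRequests(
		map[string]int64{"cluster-monitoring-operator": 127},
		map[string]int64{"cluster-monitoring-operator": 96},
	) {
		fmt.Println(f)
	}
}
```

Running this prints the same message format as the quoted failures, with ratio=1.3229166666666667 for 127/96.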
Fix for OpenStack is fine; we also need to change the bounds for GCP and AWS single node. Examples:

1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517872400643919872 :

    [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
    Run #0: Failed    4s
    { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
        <[]string | len:1, cap:1>: [
            "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667",
        ]

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517739083189719040 :

    [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
    Run #0: Failed    4s
    { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
        <[]string | len:1, cap:1>: [
            "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333",
        ]

code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L178

We also need to change the AWS single-node cluster:

3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3109/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node/1517598134392328192 :

    [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]    2s
    { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
        <[]string | len:2, cap:2>: [
            "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667",
            "Operator \"marketplace-operator\" produces more watch requests than expected: watchrequestcount=29, upperbound=28, ratio=1.0357142857142858",
        ]
    to be empty}

code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L299

I think we should make the same change to 4.10 as to 4.11 (see the PR in bug 2060406). There is no need to change vSphere; 4.10 vSphere clusters show no issue now.
Checking https://search.ci.openshift.org/?search=operators+should+not+create+watch+channels+very+often&maxAge=336h&context=1&type=junit&name=4.10.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job :

AWS single node cluster: only 3 occurrences in the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node):
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=103, upperbound=96, ratio=1.0729166666666667 (1518692139616178176, Apr 26)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667 (1517598134392328192, Apr 23)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=104, upperbound=96, ratio=1.0833333333333333 (1516326082691731456, Apr 20)

GCP: only 4 occurrences in the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade):
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=109, upperbound=96, ratio=1.1354166666666667 (1518660295168364544, Apr 25)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667 (1517872400643919872, Apr 24)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333 (1517739083189719040, Apr 23)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667 (1514310871843606528, Apr 14)

OpenStack (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ccm-install):
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=105, upperbound=96, ratio=1.09375 (Apr 25)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=87, upperbound=82, ratio=1.0609756097560976 (Apr 19)
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=90, upperbound=82, ratio=1.0975609756097562 (Apr 15)

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-serial:
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=114, upperbound=96, ratio=1.1875
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=102, upperbound=96, ratio=1.0625

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere:
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=89, upperbound=84, ratio=1.0595238095238095
- Operator "dns-operator" produces more watch requests than expected: watchrequestcount=105, upperbound=98, ratio=1.0714285714285714

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade:
- Operator "cluster-monitoring-operator" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667

===========================

AWS single node -> 104
GCP -> 110

The rest is too rare to fix.

Based on Comment 15 and Comment 16, since the case would pass on most platforms in 4.10, setting to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.12 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1601

We're having to repeat this process again for cluster-monitoring-operator. See https://github.com/openshift/origin/pull/27281 if you are interested.
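For illustration of what the agreed bump could amount to: the code links in this thread (#L178 for GCP, #L299 for AWS single node) point at separate per-platform tables, so each platform's bound is adjusted independently. The sketch below is hypothetical; it includes only the operators and numbers quoted in this bug, with the agreed values applied (AWS single node: 96 -> 104, GCP: 96 -> 110), whereas the real tables in api_requests.go cover many more operators:

```go
package main

import "fmt"

// watchRequestUpperBounds returns a hypothetical per-platform bound table.
// Only the operators and values quoted in this bug are shown; the numbers
// reflect the bump agreed above.
func watchRequestUpperBounds(platform string, singleNode bool) map[string]int64 {
	switch {
	case platform == "aws" && singleNode:
		return map[string]int64{
			"cluster-monitoring-operator": 104, // was 96
			"marketplace-operator":        28,
		}
	case platform == "gcp":
		return map[string]int64{
			"cluster-monitoring-operator": 110, // was 96
		}
	default:
		// Other platforms were judged "too rare to fix" in this round.
		return nil
	}
}

func main() {
	bounds := watchRequestUpperBounds("aws", true)
	fmt.Println(bounds["cluster-monitoring-operator"]) // 104
}
```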