Description of problem: We have the following conformance test fails for us: "[sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]" The error: "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145", Log: [BeforeEach] [Top Level] github.com/openshift/origin/test/extended/util/framework.go:1489 [BeforeEach] [Top Level] github.com/openshift/origin/test/extended/util/framework.go:1489 [BeforeEach] [Top Level] github.com/openshift/origin/test/extended/util/test.go:61 [BeforeEach] [sig-arch][Late] github.com/openshift/origin/test/extended/util/client.go:140 STEP: Creating a kubernetes client [It] operators should not create watch channels very often [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/apiserver/api_requests.go:93 Feb 25 11:08:59.756: INFO: operator=ingress-operator, watchrequestcount=478, upperbound=626, ratio=0.7635782747603834 Feb 25 11:08:59.756: INFO: operator=authentication-operator, watchrequestcount=437, upperbound=618, ratio=0.7071197411003236 Feb 25 11:08:59.756: INFO: operator=kube-apiserver-operator, watchrequestcount=372, upperbound=456, ratio=0.8157894736842105 Feb 25 11:08:59.756: INFO: operator=openshift-apiserver-operator, watchrequestcount=331, upperbound=496, ratio=0.6673387096774194 Feb 25 11:08:59.756: INFO: operator=cluster-storage-operator, watchrequestcount=280, upperbound=378, ratio=0.7407407407407407 Feb 25 11:08:59.756: INFO: operator=openshift-controller-manager-operator, watchrequestcount=257, upperbound=340, ratio=0.7558823529411764 Feb 25 11:08:59.756: INFO: operator=kube-controller-manager-operator, watchrequestcount=233, upperbound=320, ratio=0.728125 Feb 25 11:08:59.756: INFO: operator=openshift-kube-scheduler-operator, watchrequestcount=220, upperbound=288, ratio=0.7638888888888888 Feb 25 11:08:59.756: INFO: operator=etcd-operator, watchrequestcount=203, upperbound=286, ratio=0.7097902097902098 Feb 25 11:08:59.756: INFO: operator=console-operator, watchrequestcount=185, upperbound=218, ratio=0.8486238532110092 Feb 25 11:08:59.756: INFO: Operator manila-csi-driver-operator not found in upper bounds for OpenStack Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:manila-csi-driver-operator, watchrequestcount=180 Feb 25 11:08:59.756: INFO: operator=cluster-image-registry-operator, watchrequestcount=155, upperbound=224, ratio=0.6919642857142857 Feb 25 11:08:59.756: INFO: operator=prometheus-operator, watchrequestcount=152, upperbound=236, ratio=0.6440677966101694 Feb 25 11:08:59.756: INFO: operator=service-ca-operator, watchrequestcount=151, upperbound=200, ratio=0.755 Feb 25 11:08:59.756: INFO: Operator openstack-cinder-csi-driver-operator not found in upper bounds for OpenStack Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:openstack-cinder-csi-driver-operator, watchrequestcount=149 Feb 25 11:08:59.756: INFO: operator=cluster-monitoring-operator, watchrequestcount=93, upperbound=82, ratio=1.1341463414634145 Feb 25 11:08:59.756: INFO: Operator "cluster-monitoring-operator" produces more watch requests than expected Feb 25 11:08:59.756: INFO: operator=csi-snapshot-controller-operator, watchrequestcount=83, upperbound=120, ratio=0.6916666666666667 Feb 25 11:08:59.756: INFO: operator=openshift-config-operator, watchrequestcount=70, upperbound=96, ratio=0.7291666666666666 Feb 25 11:08:59.756: INFO: operator=machine-api-operator, watchrequestcount=64, upperbound=96, ratio=0.6666666666666666 Feb 25 11:08:59.756: INFO: operator=dns-operator, watchrequestcount=63, upperbound=98, ratio=0.6428571428571429 Feb 25 11:08:59.756: INFO: operator=cluster-autoscaler-operator, watchrequestcount=58, upperbound=106, ratio=0.5471698113207547 Feb 25 11:08:59.756: INFO: operator=cluster-node-tuning-operator, watchrequestcount=56, upperbound=88, ratio=0.6363636363636364 Feb 25 11:08:59.756: INFO: operator=cloud-credential-operator, watchrequestcount=55, upperbound=80, ratio=0.6875 Feb 25 11:08:59.756: INFO: operator=kube-storage-version-migrator-operator, watchrequestcount=53, upperbound=76, ratio=0.6973684210526315 Feb 25 11:08:59.756: INFO: operator=cluster-baremetal-operator, watchrequestcount=40, upperbound=84, ratio=0.47619047619047616 Feb 25 11:08:59.756: INFO: operator=cluster-samples-operator, watchrequestcount=25, upperbound=52, ratio=0.4807692307692308 Feb 25 11:08:59.756: INFO: operator=marketplace-operator, watchrequestcount=24, upperbound=38, ratio=0.631578947368421 [AfterEach] [sig-arch][Late] github.com/openshift/origin/test/extended/util/client.go:138 [AfterEach] [sig-arch][Late] github.com/openshift/origin/test/extended/util/client.go:139 fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected <[]string | len:1, cap:1>: [ "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145", ] to be empty Version-Release number of selected component (if applicable): 4.10.0-0.nightly-2022-02-24-141514 OpenshiftSDN How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
I wonder if the increase in the WATCH requests is limited to DFG-osasinfra-shiftstack_ci-ocp_testing or if it can be observed in other jobs/topologies. Simon, has there been any recent changes in the monitoring operator which might introduce this increase?
@Jan I don't recall anything on top of my head. Now looking at the upper bound value for OpenStack, it is significantly different from the other platforms: * OpenStack is 41.0 [1] * AWS is 124.0 [2] * Azure is 99.0 [3] Overall, I feel that the test is very brittle and it isn't clear to me who owns the test (in particular, when is it ok to bump the limits and who's responsible for it). [1] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L265 [2] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L119 [3] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L149
Everybody and nobody. The test is there to protect from exponential growths. Usually the rule of thumb was to bump it in case the increase is not huge or a new code adding more WATCH requests got merged. Given the reported job runs on OpenStack, we might as well just bump the upper bound.
@jchaloup , shall I assign this bug as well to you since you are the assignee of https://bugzilla.redhat.com/show_bug.cgi?id=2060406 ?
fix for openstack is fine, we also need to change for GCP/AWS single node, exmaple 1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517872400643919872 : [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel] Run #0: Failed expand_less 4s { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected <[]string | len:1, cap:1>: [ "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667", ] 2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517739083189719040 : [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel] Run #0: Failed expand_less 4s { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected <[]string | len:1, cap:1>: [ "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333", ] code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L178 need to change for AWS single node cluster 3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3109/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node/1517598134392328192 : [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel] 2s { fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected <[]string | len:2, cap:2>: [ "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667", "Operator \"marketplace-operator\" produces more watch requests than expected: watchrequestcount=29, upperbound=28, ratio=1.0357142857142858", ] to be empty} code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L299 I think we should do the same change to 4.10 as 4.11, see the PR in bug 2060406, no need to change for VSphere, no issue for 4.10 VSphere cluster now.
Checking https://search.ci.openshift.org/?search=operators+should+not+create+watch+channels+very+often&maxAge=336h&context=1&type=junit&name=4.10.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job: AWS single node cluster: there are only 3 occurrences for the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node): - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=103, upperbound=96, ratio=1.0729166666666667 (1518692139616178176, Apr 26) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667 (1517598134392328192, Apr 23) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=104, upperbound=96, ratio=1.0833333333333333 (1516326082691731456, Apr 20) GCP: only 4 occurrences for the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade): - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=109, upperbound=96, ratio=1.1354166666666667 (1518660295168364544, Apr 25) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667 (1517872400643919872, Apr 24) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333 (1517739083189719040, Apr 23) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667 (1514310871843606528, Apr 14) OpenStack (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ccm-install): - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=105, upperbound=96, ratio=1.09375 (Apr 25) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=87, upperbound=82, ratio=1.0609756097560976 (Apr 19) - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=90, upperbound=82, ratio=1.0975609756097562 (Apr 15) https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-serial: - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=114, upperbound=96, ratio=1.1875 - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=102, upperbound=96, ratio=1.0625 https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere: - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=89, upperbound=84, ratio=1.0595238095238095 - Operator \"dns-operator\" produces more watch requests than expected: watchrequestcount=105, upperbound=98, ratio=1.0714285714285714 https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade: - Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667 =========================== AWS single node -> 104 GCP -> 110 The rest is too rare to fix.
based on Comment 15 and 16, since the case would pass for most of platforms in 4.10, set to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.12 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1601
We've having to repeat this process again for cluster-monitoring-operator. See: https://github.com/openshift/origin/pull/27281 if you are interested.