Bug 2059845 - Test 'operators should not create watch channels very often' fails
Summary: Test 'operators should not create watch channels very often' fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.10.z
Assignee: Jan Chaloupka
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2060406
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-03-02 07:52 UTC by Itzik Brown
Modified: 2022-06-27 17:20 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-02 18:38:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 26879 0 None open bug 2059845: operators should not create watch channels very often: bump OpenStack monitoring operator 2022-04-20 06:25:02 UTC
Github openshift origin pull 27055 0 None open bug 2059845: [test]: bump upper bounds for AWS single node and GCP 2022-04-26 08:53:25 UTC
Red Hat Product Errata RHBA-2022:1601 0 None None None 2022-05-02 18:39:10 UTC

Description Itzik Brown 2022-03-02 07:52:50 UTC
Description of problem:
We have the following conformance test fails for us:
"[sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]"


The error:
"Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145",


Log:
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1489
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1489
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Creating a kubernetes client
[It] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:93
Feb 25 11:08:59.756: INFO: operator=ingress-operator, watchrequestcount=478, upperbound=626, ratio=0.7635782747603834
Feb 25 11:08:59.756: INFO: operator=authentication-operator, watchrequestcount=437, upperbound=618, ratio=0.7071197411003236
Feb 25 11:08:59.756: INFO: operator=kube-apiserver-operator, watchrequestcount=372, upperbound=456, ratio=0.8157894736842105
Feb 25 11:08:59.756: INFO: operator=openshift-apiserver-operator, watchrequestcount=331, upperbound=496, ratio=0.6673387096774194
Feb 25 11:08:59.756: INFO: operator=cluster-storage-operator, watchrequestcount=280, upperbound=378, ratio=0.7407407407407407
Feb 25 11:08:59.756: INFO: operator=openshift-controller-manager-operator, watchrequestcount=257, upperbound=340, ratio=0.7558823529411764
Feb 25 11:08:59.756: INFO: operator=kube-controller-manager-operator, watchrequestcount=233, upperbound=320, ratio=0.728125
Feb 25 11:08:59.756: INFO: operator=openshift-kube-scheduler-operator, watchrequestcount=220, upperbound=288, ratio=0.7638888888888888
Feb 25 11:08:59.756: INFO: operator=etcd-operator, watchrequestcount=203, upperbound=286, ratio=0.7097902097902098
Feb 25 11:08:59.756: INFO: operator=console-operator, watchrequestcount=185, upperbound=218, ratio=0.8486238532110092
Feb 25 11:08:59.756: INFO: Operator manila-csi-driver-operator not found in upper bounds for OpenStack
Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:manila-csi-driver-operator, watchrequestcount=180
Feb 25 11:08:59.756: INFO: operator=cluster-image-registry-operator, watchrequestcount=155, upperbound=224, ratio=0.6919642857142857
Feb 25 11:08:59.756: INFO: operator=prometheus-operator, watchrequestcount=152, upperbound=236, ratio=0.6440677966101694
Feb 25 11:08:59.756: INFO: operator=service-ca-operator, watchrequestcount=151, upperbound=200, ratio=0.755
Feb 25 11:08:59.756: INFO: Operator openstack-cinder-csi-driver-operator not found in upper bounds for OpenStack
Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:openstack-cinder-csi-driver-operator, watchrequestcount=149
Feb 25 11:08:59.756: INFO: operator=cluster-monitoring-operator, watchrequestcount=93, upperbound=82, ratio=1.1341463414634145
Feb 25 11:08:59.756: INFO: Operator "cluster-monitoring-operator" produces more watch requests than expected
Feb 25 11:08:59.756: INFO: operator=csi-snapshot-controller-operator, watchrequestcount=83, upperbound=120, ratio=0.6916666666666667
Feb 25 11:08:59.756: INFO: operator=openshift-config-operator, watchrequestcount=70, upperbound=96, ratio=0.7291666666666666
Feb 25 11:08:59.756: INFO: operator=machine-api-operator, watchrequestcount=64, upperbound=96, ratio=0.6666666666666666
Feb 25 11:08:59.756: INFO: operator=dns-operator, watchrequestcount=63, upperbound=98, ratio=0.6428571428571429
Feb 25 11:08:59.756: INFO: operator=cluster-autoscaler-operator, watchrequestcount=58, upperbound=106, ratio=0.5471698113207547
Feb 25 11:08:59.756: INFO: operator=cluster-node-tuning-operator, watchrequestcount=56, upperbound=88, ratio=0.6363636363636364
Feb 25 11:08:59.756: INFO: operator=cloud-credential-operator, watchrequestcount=55, upperbound=80, ratio=0.6875
Feb 25 11:08:59.756: INFO: operator=kube-storage-version-migrator-operator, watchrequestcount=53, upperbound=76, ratio=0.6973684210526315
Feb 25 11:08:59.756: INFO: operator=cluster-baremetal-operator, watchrequestcount=40, upperbound=84, ratio=0.47619047619047616
Feb 25 11:08:59.756: INFO: operator=cluster-samples-operator, watchrequestcount=25, upperbound=52, ratio=0.4807692307692308
Feb 25 11:08:59.756: INFO: operator=marketplace-operator, watchrequestcount=24, upperbound=38, ratio=0.631578947368421
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:138
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:139
fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145",
    ]
to be empty

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-02-24-141514
OpenshiftSDN

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Jan Chaloupka 2022-03-02 10:04:24 UTC
I wonder if the increase in the WATCH requests is limited to DFG-osasinfra-shiftstack_ci-ocp_testing or if it can be observed in other jobs/topologies.

Simon, has there been any recent changes in the monitoring operator which might introduce this increase?

Comment 3 Simon Pasquier 2022-03-03 11:06:32 UTC
@Jan I don't recall anything on top of my head. 
Now looking at the upper bound value for OpenStack, it is significantly different from the other platforms:
* OpenStack is 41.0 [1]
* AWS is 124.0 [2]
* Azure is 99.0 [3]

Overall, I feel that the test is very brittle and it isn't clear to me who owns the test (in particular, when is it ok to bump the limits and who's responsible for it).

[1] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L265
[2] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L119
[3] https://github.com/openshift/origin/blob/625733dd1ce7ebf40c3dd0abd693f7bb54f2d580/test/extended/apiserver/api_requests.go#L149

Comment 4 Jan Chaloupka 2022-03-03 12:17:23 UTC
Everybody and nobody. The test is there to protect from exponential growths. Usually the rule of thumb was to bump it in case the increase is not huge or a new code adding more WATCH requests got merged.

Given the reported job runs on OpenStack, we might as well just bump the upper bound.

Comment 6 Sunil Thaha 2022-03-21 05:20:41 UTC
@jchaloup , shall I assign this bug as well to you since you are the assignee of https://bugzilla.redhat.com/show_bug.cgi?id=2060406 ?

Comment 15 Junqi Zhao 2022-04-24 03:43:14 UTC
fix for openstack is fine, we also need to change for GCP/AWS single node, exmaple

1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517872400643919872
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
Run #0: Failed expand_less 	4s
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667",
    ]

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1517739083189719040
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
Run #0: Failed expand_less 	4s
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333",
    ]


code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L178

need to change for AWS single node cluster
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3109/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node/1517598134392328192
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
 	2s
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:2, cap:2>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667",
        "Operator \"marketplace-operator\" produces more watch requests than expected: watchrequestcount=29, upperbound=28, ratio=1.0357142857142858",
    ]
to be empty}

code: https://github.com/openshift/origin/blob/release-4.10/test/extended/apiserver/api_requests.go#L299

I think we should do the same change to 4.10 as 4.11, see the PR in bug 2060406, no need to change for VSphere, no issue for 4.10 VSphere cluster now.

Comment 16 Jan Chaloupka 2022-04-26 08:50:04 UTC
Checking https://search.ci.openshift.org/?search=operators+should+not+create+watch+channels+very+often&maxAge=336h&context=1&type=junit&name=4.10.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job:

AWS single node cluster: there are only 3 occurrences for the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node):
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=103, upperbound=96, ratio=1.0729166666666667 (1518692139616178176, Apr 26)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=127, upperbound=96, ratio=1.3229166666666667 (1517598134392328192, Apr 23)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=104, upperbound=96, ratio=1.0833333333333333 (1516326082691731456, Apr 20)

GCP: only 4 occurrences for the last 14 days (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade):
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=109, upperbound=96, ratio=1.1354166666666667 (1518660295168364544, Apr 25)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667  (1517872400643919872, Apr 24)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333 (1517739083189719040, Apr 23)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667 (1514310871843606528, Apr 14)

OpenStack (https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ccm-install):
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=105, upperbound=96, ratio=1.09375 (Apr 25)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=87, upperbound=82, ratio=1.0609756097560976 (Apr 19)
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=90, upperbound=82, ratio=1.0975609756097562 (Apr 15)

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-serial:
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=114, upperbound=96, ratio=1.1875
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=102, upperbound=96, ratio=1.0625

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere:
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=89, upperbound=84, ratio=1.0595238095238095
- Operator \"dns-operator\" produces more watch requests than expected: watchrequestcount=105, upperbound=98, ratio=1.0714285714285714

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade:
- Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=100, upperbound=96, ratio=1.0416666666666667

===========================
AWS single node -> 104
GCP -> 110

The rest is too rare to fix.

Comment 21 Junqi Zhao 2022-04-28 02:25:56 UTC
based on Comment 15 and 16, since the case would pass for most of platforms in 4.10, set to VERIFIED

Comment 23 errata-xmlrpc 2022-05-02 18:38:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1601

Comment 24 Devan Goodwin 2022-06-27 17:20:39 UTC
We've having to repeat this process again for cluster-monitoring-operator. See: https://github.com/openshift/origin/pull/27281 if you are interested.


Note You need to log in before you can comment on or make changes to this bug.