Bug 2060406

Summary: Test 'operators should not create watch channels very often' fails
Product: OpenShift Container Platform Reporter: Jan Chaloupka <jchaloup>
Component: MonitoringAssignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: low    
Version: 4.11CC: amuller, anpicker, aos-bugs, juzhao, spasquie, sthaha
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:52:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2059845    
Attachments:
Description Flags
periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-techpreview-serial CMO
none
periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance CMO
none
GCP CMO api WATCH request counts none

Description Jan Chaloupka 2022-03-03 13:19:21 UTC
This bug was initially created as a copy of Bug #2059845

I am copying this bug because: 



Description of problem:
We have the following conformance test fails for us:
"[sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]"


The error:
"Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145",


Log:
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1489
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1489
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Creating a kubernetes client
[It] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:93
Feb 25 11:08:59.756: INFO: operator=ingress-operator, watchrequestcount=478, upperbound=626, ratio=0.7635782747603834
Feb 25 11:08:59.756: INFO: operator=authentication-operator, watchrequestcount=437, upperbound=618, ratio=0.7071197411003236
Feb 25 11:08:59.756: INFO: operator=kube-apiserver-operator, watchrequestcount=372, upperbound=456, ratio=0.8157894736842105
Feb 25 11:08:59.756: INFO: operator=openshift-apiserver-operator, watchrequestcount=331, upperbound=496, ratio=0.6673387096774194
Feb 25 11:08:59.756: INFO: operator=cluster-storage-operator, watchrequestcount=280, upperbound=378, ratio=0.7407407407407407
Feb 25 11:08:59.756: INFO: operator=openshift-controller-manager-operator, watchrequestcount=257, upperbound=340, ratio=0.7558823529411764
Feb 25 11:08:59.756: INFO: operator=kube-controller-manager-operator, watchrequestcount=233, upperbound=320, ratio=0.728125
Feb 25 11:08:59.756: INFO: operator=openshift-kube-scheduler-operator, watchrequestcount=220, upperbound=288, ratio=0.7638888888888888
Feb 25 11:08:59.756: INFO: operator=etcd-operator, watchrequestcount=203, upperbound=286, ratio=0.7097902097902098
Feb 25 11:08:59.756: INFO: operator=console-operator, watchrequestcount=185, upperbound=218, ratio=0.8486238532110092
Feb 25 11:08:59.756: INFO: Operator manila-csi-driver-operator not found in upper bounds for OpenStack
Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:manila-csi-driver-operator, watchrequestcount=180
Feb 25 11:08:59.756: INFO: operator=cluster-image-registry-operator, watchrequestcount=155, upperbound=224, ratio=0.6919642857142857
Feb 25 11:08:59.756: INFO: operator=prometheus-operator, watchrequestcount=152, upperbound=236, ratio=0.6440677966101694
Feb 25 11:08:59.756: INFO: operator=service-ca-operator, watchrequestcount=151, upperbound=200, ratio=0.755
Feb 25 11:08:59.756: INFO: Operator openstack-cinder-csi-driver-operator not found in upper bounds for OpenStack
Feb 25 11:08:59.756: INFO: operator=system:serviceaccount:openshift-cluster-csi-drivers:openstack-cinder-csi-driver-operator, watchrequestcount=149
Feb 25 11:08:59.756: INFO: operator=cluster-monitoring-operator, watchrequestcount=93, upperbound=82, ratio=1.1341463414634145
Feb 25 11:08:59.756: INFO: Operator "cluster-monitoring-operator" produces more watch requests than expected
Feb 25 11:08:59.756: INFO: operator=csi-snapshot-controller-operator, watchrequestcount=83, upperbound=120, ratio=0.6916666666666667
Feb 25 11:08:59.756: INFO: operator=openshift-config-operator, watchrequestcount=70, upperbound=96, ratio=0.7291666666666666
Feb 25 11:08:59.756: INFO: operator=machine-api-operator, watchrequestcount=64, upperbound=96, ratio=0.6666666666666666
Feb 25 11:08:59.756: INFO: operator=dns-operator, watchrequestcount=63, upperbound=98, ratio=0.6428571428571429
Feb 25 11:08:59.756: INFO: operator=cluster-autoscaler-operator, watchrequestcount=58, upperbound=106, ratio=0.5471698113207547
Feb 25 11:08:59.756: INFO: operator=cluster-node-tuning-operator, watchrequestcount=56, upperbound=88, ratio=0.6363636363636364
Feb 25 11:08:59.756: INFO: operator=cloud-credential-operator, watchrequestcount=55, upperbound=80, ratio=0.6875
Feb 25 11:08:59.756: INFO: operator=kube-storage-version-migrator-operator, watchrequestcount=53, upperbound=76, ratio=0.6973684210526315
Feb 25 11:08:59.756: INFO: operator=cluster-baremetal-operator, watchrequestcount=40, upperbound=84, ratio=0.47619047619047616
Feb 25 11:08:59.756: INFO: operator=cluster-samples-operator, watchrequestcount=25, upperbound=52, ratio=0.4807692307692308
Feb 25 11:08:59.756: INFO: operator=marketplace-operator, watchrequestcount=24, upperbound=38, ratio=0.631578947368421
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:138
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:139
fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=93, upperbound=82, ratio=1.1341463414634145",
    ]
to be empty

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-02-24-141514
OpenshiftSDN

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2022-03-09 04:44:16 UTC
searched with
https://search.ci.openshift.org/?search=operators+should+not+create+watch+channels+very+often&maxAge=48h&context=1&type=bug%2Bjunit&name=4.11.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

still have issue with 
1. VSphere Platform
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-techpreview-serial/1501277588121718784
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel] expand_less 	1s
fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=87, upperbound=84, ratio=1.0357142857142858",

code: https://github.com/openshift/origin/blob/master/test/extended/apiserver/api_requests.go#L235

2. single node AWS cluster
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance/1501289618778099712
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel] expand_less
Run #0: Failed expand_less 	2s
fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=115, upperbound=96, ratio=1.1979166666666667",
    ]

code: https://github.com/openshift/origin/blob/master/test/extended/apiserver/api_requests.go#L299

Comment 4 Jan Chaloupka 2022-04-01 10:51:58 UTC
Created attachment 1869931 [details]
periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-techpreview-serial CMO

Comment 5 Jan Chaloupka 2022-04-01 10:53:41 UTC
Created attachment 1869932 [details]
periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance CMO

Comment 6 Jan Chaloupka 2022-04-01 11:56:46 UTC
# periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance sequence

69 68 63 61 62 64 78 60 69 53 66 52 73 64 66 57 61 59 64 68 65 67 52 58 60 68 57 64 57 57 71 65 86

Giving 64 as the median

# periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-techpreview-serial sequence

101 111 94 86 101 90 109 77 115 138 105 91 102 104 93 86 90 78 65 92 88 110 110 84 100 113 89 100 99 95 99 87 107 86 89 71 108 90 124 95 62 93 96 90 94 90 87 88 80 96 85 88 105 109 107 88 85 91 130 98 102 81 93 94

There are some spikes but mostly holding around 93/94 based on the median.

Comment 10 Junqi Zhao 2022-04-08 08:33:53 UTC
searched with
https://search.ci.openshift.org/?search=operators+should+not+create+watch+channels+very+often&maxAge=48h&context=1&type=bug%2Bjunit&name=4.11.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

we also need to update the upper bound for GCP platform
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade/1512265217537478656
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:449]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=97, upperbound=96, ratio=1.0104166666666667",
    ]

code: https://github.com/openshift/origin/blob/master/test/extended/apiserver/api_requests.go#L179

Comment 11 Jan Chaloupka 2022-04-11 16:00:30 UTC
Created attachment 1871841 [details]
GCP CMO api WATCH request counts

Comment 14 Junqi Zhao 2022-04-14 03:18:25 UTC
we should also bump the apirequests upper bounds for aws single node, most of the time 64 is fine, sometimes would be failed, example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance/1513613641847410688
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
Run #0: Failed expand_less 	2s
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:449]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=131, upperbound=128, ratio=1.0234375",
    ]
code: https://github.com/openshift/origin/blob/master/test/extended/apiserver/api_requests.go#L300

same for openstack
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1505593005082415104
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=107, upperbound=96, ratio=1.1145833333333333",
    ]
code: https://github.com/openshift/origin/blob/master/test/extended/apiserver/api_requests.go#L266

GCP/VSphere are fine, no issues now

Comment 16 Jan Chaloupka 2022-04-14 09:00:22 UTC
This could go on forever. Let's keep it as it is change the upper bounds in another iteration.

Comment 17 Jan Chaloupka 2022-04-14 09:02:40 UTC
Having flakes from time to time is acceptable. In some case the sudden increases are expected when many tests fail. Retrying and waiting can lead to higher number of watch requests due to tests/operators running longer. As long as the flakes does not appear often/regularly, it is acceptable to ignore occasional blips.

Comment 18 Junqi Zhao 2022-04-14 10:48:40 UTC
thanks, based on Comment 14 - 17, since most of time there is not issue with the upper bound, and checked for aws SNO run:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance

for case "openshift-tests.[sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]", 
1 in 23 case failures is related to cluster-monitoring-operator

for openstack ccm, the failure ratio is 6/9, other configurations on openstack does not have such high failed ratio.
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install
6 out of 9 failures are related to cluster-monitoring-operator
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1504505512811565056
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=104, upperbound=96, ratio=1.0833333333333333",
    ]

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1505230613387415552
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=98, upperbound=96, ratio=1.0208333333333333",
    ]


3. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1505593005082415104
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=107, upperbound=96, ratio=1.1145833333333333",
    ]

4. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1508130335605395456
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=102, upperbound=96, ratio=1.0625",
    ]

5. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1510305447666520064
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=101, upperbound=96, ratio=1.0520833333333333",
    ]


6. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install/1511392966680252416
: [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:448]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=98, upperbound=96, ratio=1.0208333333333333",
    ]

Comment 23 errata-xmlrpc 2022-08-10 10:52:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069