Bug 2057079

Summary: [cluster-csi-snapshot-controller-operator] CI failure: events should not repeat pathologically
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Operators
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.11
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Reporter: Fabio Bertinatto <fbertina>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
Docs Contact:
CC: aos-bugs, dgoodwin, wking
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2061343
Environment:
Last Closed: 2022-08-10 10:50:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2061343, 2062197

Description Fabio Bertinatto 2022-02-22 17:26:46 UTC
In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1494174529876922368

[sig-arch] events should not repeat pathologically (0s)
2 events happened too frequently

event happened 30 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("CSISnapshotWebhookControllerProgressing: 1 out of 2 pods running")
event happened 31 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("All is well")
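
For context on the test itself: it tallies how many times each event fired during the run and fails when any event recurs past a fixed threshold. The sketch below illustrates that counting logic in Go; the Event struct, the findPathological helper, and the threshold of 20 are illustrative assumptions for this sketch, not the actual origin test code.

package main

import "fmt"

// Event captures the fields the check cares about: where the event
// occurred, why it fired, and how many times it was observed.
type Event struct {
	Namespace string
	Reason    string
	Message   string
	Count     int
}

// pathologicalThreshold is an assumed cutoff for this sketch; the real
// test applies its own limit on how often an event may repeat.
const pathologicalThreshold = 20

// findPathological returns one failure line per event that repeated
// more often than the threshold allows.
func findPathological(events []Event) []string {
	var failures []string
	for _, e := range events {
		if e.Count > pathologicalThreshold {
			failures = append(failures, fmt.Sprintf(
				"event happened %d times, something is wrong: ns/%s - reason/%s %s",
				e.Count, e.Namespace, e.Reason, e.Message))
		}
	}
	return failures
}

func main() {
	// The flapping Progressing condition from this job produces two such
	// events, one for each direction of the False<->True transition.
	events := []Event{
		{
			Namespace: "openshift-cluster-storage-operator",
			Reason:    "OperatorStatusChanged",
			Message:   `Progressing changed from False to True ("CSISnapshotWebhookControllerProgressing: 1 out of 2 pods running")`,
			Count:     30,
		},
		{
			Namespace: "openshift-cluster-storage-operator",
			Reason:    "OperatorStatusChanged",
			Message:   `Progressing changed from True to False ("All is well")`,
			Count:     31,
		},
	}
	for _, f := range findPathological(events) {
		fmt.Println(f)
	}
}

Fed the two counts from this job, the sketch prints failure lines matching the ones quoted above.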

Comment 2 Wei Duan 2022-03-04 03:11:54 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1499508868990898176
This job ran after the PR merged and still shows the error; I will keep monitoring.

Comment 4 Fabio Bertinatto 2022-03-07 13:01:59 UTC
I've seen two failures in release-openshift-okd-installer-e2e-aws-upgrade, but they seem to be caused by something else (the cluster seems unhealthy):
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1500773368465461248
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1500672696625664000

Comment 5 Wei Duan 2022-03-09 09:48:03 UTC
Did not see such a failure in https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade after that; updating status to VERIFIED.

Comment 8 errata-xmlrpc 2022-08-10 10:50:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069