Bug 1948090

Summary:	Storage should not set Available=False APIServices_Error AWSEBSCSIDriverOperatorCRAvailable on update
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Storage	Assignee:	Fabio Bertinatto <fbertina>
Storage sub component:	Operators	QA Contact:	Wei Duan <wduan>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aos-bugs, jsafrane, wduan
Version:	4.8	Keywords:	Upgrades
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-10-18 17:29:50 UTC	Type:	Bug

Description W. Trevor King 2021-04-10 01:07:19 UTC

From CI runs like [1]:

  : [bz-Storage] clusteroperator/storage should not change condition/Available
    Run #0: Failed	0s
    1 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 13:21:35.308 - 41s   E clusteroperator/storage condition/Available status/False reason/AWSEBSCSIDriverOperatorCRAvailable:
    AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/storage+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 76% of failures match = 76% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
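
As a rough cross-check of the figures above, the "impact" column appears to be the failure rate multiplied by the match rate (for example, 80% failed with 50% of failures matching gives 40% impact). The Python sketch below parses lines in that output format and recomputes the figure. `parse_impact` and `estimated_impact` are hypothetical helper names, and the formula is inferred from the rows shown here rather than taken from the search tool itself, which may round differently.

```python
import re

# Matches one summary line of the search output shown above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match "
    r"= (?P<impact>\d+)% impact$"
)


def parse_impact(line):
    """Return the fields of one 'failures match' line, or None if it does not parse."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "job": m["job"],
        "runs": int(m["runs"]),
        "failed": int(m["failed"]),
        "match": int(m["match"]),
        "impact": int(m["impact"]),
    }


def estimated_impact(failed_pct, match_pct):
    """Assumed formula: share of all runs that failed AND matched the search."""
    return failed_pct * match_pct / 100


if __name__ == "__main__":
    sample = (
        "periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all)"
        " - 10 runs, 80% failed, 50% of failures match = 40% impact"
    )
    row = parse_impact(sample)
    print(row["job"], estimated_impact(row["failed"], row["match"]), row["impact"])
```

Comparing the recomputed value with the reported one is only a sanity check on how to read the column; the reported number remains the authoritative one.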

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152

Comment 1 Fabio Bertinatto 2021-05-07 08:29:01 UTC

Apparently clusteroperator/storage changes the condition at 13:21:

Apr 09 13:21:35.308 - 41s   E clusteroperator/storage condition/Available status/False reason/AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

And the new operator starts at 13:22:

I0409 13:22:00.161393       1 builder.go:240] aws-ebs-csi-driver-operator version v0.0.0-unknown-695b8fc

This means that the *previous* storage operator is going Available=False.
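
The timing argument above can be checked mechanically. The Python sketch below parses the two timestamps quoted in this comment and computes the gap between them. It assumes both timestamps are in the same timezone and fall in 2021 (neither log format carries a year), and the helper names `parse_event_time` and `parse_klog_time` are hypothetical.

```python
from datetime import datetime

YEAR = 2021  # assumption: neither log format includes the year


def parse_event_time(stamp):
    """Parse a monitor event timestamp such as 'Apr 09 13:21:35.308'."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S.%f")


def parse_klog_time(header):
    """Parse a klog line prefix such as 'I0409 13:22:00.161393'."""
    # Drop the one-letter severity, leaving 'mmdd hh:mm:ss.uuuuuu'.
    return datetime.strptime(f"{YEAR}{header[1:]}", "%Y%m%d %H:%M:%S.%f")


if __name__ == "__main__":
    condition_changed = parse_event_time("Apr 09 13:21:35.308")
    new_operator_started = parse_klog_time("I0409 13:22:00.161393")
    gap = new_operator_started - condition_changed
    # A positive gap means the condition changed before the new operator
    # logged its version line.
    print(f"condition changed {gap.total_seconds():.3f}s before the new operator started")
```

A positive gap supports the reading that the outgoing operator set Available=False, under the assumption that the version line is among the first things the new operator logs.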

Comment 3 Fabio Bertinatto 2021-06-04 16:16:33 UTC

What's missing:

1. Review and merge PR https://github.com/openshift/cluster-storage-operator/pull/173
2. Backport the following PR to other CSI operators: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/122/files

Comment 4 Scott Dodson 2021-07-14 18:04:29 UTC

This really should've been a 4.8.0 blocker but that intent was never conveyed to assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of the negative signal that we generate during upgrades by operators going degraded during normal operations.

Comment 5 Fabio Bertinatto 2021-08-20 14:46:39 UTC

Moving manually to MODIFIED. oVirt is the only patch not merged yet, and it might be covered in another BZ.

Comment 7 Wei Duan 2021-09-03 01:08:36 UTC

Verified: passing in recent 4.9 CI runs.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/storage+should+not+change+condition/Available' | grep 'failures match' | sort | grep 4.9 | wc -l
0

Comment 10 errata-xmlrpc 2021-10-18 17:29:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759