Bug 1948090

Summary: Storage should not set Available=False APIServices_Error AWSEBSCSIDriverOperatorCRAvailable on update
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: StorageAssignee: Fabio Bertinatto <fbertina>
Storage sub component: Operators QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, jsafrane, wduan
Version: 4.8Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:29:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2021-04-10 01:07:19 UTC
From CI runs like [1]:

  : [bz-Storage] clusteroperator/storage should not change condition/Available
    Run #0: Failed	0s
    1 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 13:21:35.308 - 41s   E clusteroperator/storage condition/Available status/False reason/AWSEBSCSIDriverOperatorCRAvailable:
    AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/storage+should+not+change+condi
tion/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 76% of failures match = 76% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152

Comment 1 Fabio Bertinatto 2021-05-07 08:29:01 UTC
Apparently clusteroperator/storage changes the condition at 13:21:

Apr 09 13:21:35.308 - 41s   E clusteroperator/storage condition/Available status/False reason/AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

And the new operator starts at 13:22:

I0409 13:22:00.161393       1 builder.go:240] aws-ebs-csi-driver-operator version v0.0.0-unknown-695b8fc

This means that the *previous* storage operator is going Available=False.

Comment 3 Fabio Bertinatto 2021-06-04 16:16:33 UTC
What's missing:

1. Review and merge PR https://github.com/openshift/cluster-storage-operator/pull/173
2. Backport the following PR to other CSI operators: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/122/files

Comment 4 Scott Dodson 2021-07-14 18:04:29 UTC
This really should've been a 4.8.0 blocker but that intent was never conferred to assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of negative signal that we generate during upgrades by operators going degraded during normal operations.

Comment 5 Fabio Bertinatto 2021-08-20 14:46:39 UTC
Moving manually to MODIFIED. oVirt is the only patch not merged yet, and it might be covered in other BZ.

Comment 7 Wei Duan 2021-09-03 01:08:36 UTC
Verified pass in recent ci in 4.9.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/storage+should+not+change+condition/Available' | grep 'failures match' | sort | grep 4.9 | wc -l
0

Comment 10 errata-xmlrpc 2021-10-18 17:29:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759