Bug 1940940

Summary: csi-snapshot-controller goes unavailable when machines are added or removed from the cluster
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Kubernetes External Components
Reporter: Clayton Coleman <ccoleman>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: aos-bugs, fbertina, jsafrane, wking
Version: 4.8
Keywords: Upgrades
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Bug Blocks: 1973686
Last Closed: 2021-07-27 22:54:33 UTC

Description Clayton Coleman 2021-03-19 15:22:40 UTC
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial/1372797703225872384

Serial runs create and remove nodes gracefully. Operators must not go unavailable when that happens:

clusteroperator/csi-snapshot-controller should not change condition/Available 

csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.004517398 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:11:43.016807852 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.132808339 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.399030955 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.509765359 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.541133107 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.575805876 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.607094 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.641166131 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.668652789 +0000 UTC -- All is well
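
For reference, the Available condition that flaps above can be inspected directly on a live cluster; a minimal sketch, assuming the jq tool is available on the workstation:

# Sketch: show the csi-snapshot-controller ClusterOperator's Available condition.
$ oc get clusteroperator csi-snapshot-controller -o json | jq '.status.conditions[] | select(.type=="Available")'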

Adding and removing nodes should not make the operator unavailable; if it does, the operator has not configured itself so that graceful movement of the webhook pod is zero-disruption, which is a bug. The operator should ensure that during graceful machine shutdown (drain, etc.) all of its components remain available, by choosing the appropriate configuration for its dependencies.

Severity is high because this is normal behavior of the platform and the operator violates that constraint.
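
For illustration, a zero-disruption setup for the webhook generally means more than one replica, a rolling-update strategy that never drops below one available pod, and a PodDisruptionBudget. A minimal sketch of how one might check those settings on a live cluster; the openshift-cluster-storage-operator namespace and the csi-snapshot-webhook Deployment name are assumptions, not taken from this report:

# Sketch: check replica count and rollout strategy of the webhook Deployment.
# Namespace and Deployment name are assumed, not confirmed by this bug.
$ oc -n openshift-cluster-storage-operator get deployment csi-snapshot-webhook -o jsonpath='{.spec.replicas} {.spec.strategy}{"\n"}'
# Check whether a PodDisruptionBudget protects the webhook pods during drain.
$ oc -n openshift-cluster-storage-operator get poddisruptionbudget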

Comment 1 W. Trevor King 2021-04-10 00:49:37 UTC
Pretty much all the update jobs too:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/csi-snapshot-controller+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 69% of failures match = 69% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 71% of failures match = 71% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 70% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact


The test-case is new in 4.8, which is at least part of why earlier versions don't show up in that query.

Comment 3 Fabio Bertinatto 2021-06-04 14:30:33 UTC
Moving back to assigned because I still see some failures.

Comment 4 Fabio Bertinatto 2021-06-04 15:35:40 UTC
I just verified the last failing job run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728

The job is not using a bundle with the updated operator. It will probably take a while for the changes to propagate.
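
A quick way to check whether a given release payload already carries the updated operator; the image name below is an assumption, and <release-pullspec> is a placeholder for the payload used by the job:

# Sketch: print the operator image referenced by a release payload.
# <release-pullspec> is a placeholder; the image name is assumed.
$ oc adm release info <release-pullspec> --image-for=cluster-csi-snapshot-controller-operator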

Moving again to QA.

Comment 10 Wei Duan 2021-06-18 02:22:18 UTC
Did not see the failure in 4.8 non-single-node CI.
Marked as verified per the discussion, and I agree with @wking that we may need to take some action for the single-node case.
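
As a rough way to reproduce this check, a variant of the query from comment 1 could be used; the 48-hour window, the name filter, and the single-node exclusion below are assumptions, not the exact query that was run:

# Sketch: look for recent 4.8 CI failures of this test, excluding single-node jobs.
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&name=4.8&type=junit&search=clusteroperator/csi-snapshot-controller+should+not+change+condition/Available' | grep 'failures match' | grep -v single-node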

Comment 11 Fabio Bertinatto 2021-06-18 13:32:45 UTC
(In reply to Wei Duan from comment #10)
> Did not see the failure in 4.8 non-single-node CI.
> Marked as verified per the discussion, and I agree with @wking that we may
> need to take some action for the single-node case.

Created bug 1973686 to address that.

Comment 14 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438