Bug 1940940 - csi-snapshot-controller goes unavailable when machines are added to or removed from the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks: 1973686
Reported: 2021-03-19 15:22 UTC by Clayton Coleman
Modified: 2021-07-27 22:55 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1973686
Environment:
Last Closed: 2021-07-27 22:54:33 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-csi-snapshot-controller-operator pull 88 | 0 | None | open | Bug 1940940: Deploy multiple operand replicas | 2021-06-02 13:30:02 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:55:01 UTC

Description Clayton Coleman 2021-03-19 15:22:40 UTC
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial/1372797703225872384

Serial runs create and remove nodes gracefully. Operators must not go unavailable when that happens:

clusteroperator/csi-snapshot-controller should not change condition/Available 

csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.004517398 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:11:43.016807852 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.132808339 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.399030955 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.509765359 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.541133107 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.575805876 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.607094 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.641166131 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.668652789 +0000 UTC -- All is well
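To see how long each outage in a log like the one above lasted, the transition lines can be paired up. The following is a minimal sketch, not part of the CI tooling: the function names and the regular expression are assumptions written against the line format shown in this report, and timestamps are truncated from nanoseconds to microseconds.

```python
import re
from datetime import datetime, timezone

# Assumed line format, taken from the transitions quoted in this report:
# "<operator> was Available=<old>, but became Available=<new> at <timestamp> +0000 UTC -- <reason>"
LINE_RE = re.compile(
    r"^(?P<op>\S+) was Available=(?P<old>true|false), but became "
    r"Available=(?P<new>true|false) at "
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})(?:\.(?P<frac>\d+))? "
    r"\+0000 UTC -- (?P<reason>.*)$"
)


def parse_transition(line):
    """Return (timestamp, became_available, reason), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The log carries up to nanoseconds; datetime keeps microseconds, so truncate.
    frac = (m.group("frac") or "0")[:6].ljust(6, "0")
    ts = datetime.strptime(
        f"{m.group('date')} {m.group('time')}.{frac}", "%Y-%m-%d %H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return ts, m.group("new") == "true", m.group("reason")


def unavailable_windows(lines):
    """Pair each Available=false transition with the next Available=true transition."""
    windows = []
    down_since = None
    for line in lines:
        parsed = parse_transition(line)
        if parsed is None:
            continue
        ts, available, _reason = parsed
        if not available and down_since is None:
            down_since = ts
        elif available and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
    return windows


# Two consecutive lines copied from the report above.
SAMPLE = [
    "csi-snapshot-controller was Available=true, but became Available=false at "
    "2021-03-19 07:11:43.132808339 +0000 UTC -- CSISnapshotWebhookControllerAvailable: "
    "Waiting for a validating webhook Deployment pod to start",
    "csi-snapshot-controller was Available=false, but became Available=true at "
    "2021-03-19 07:12:00.399030955 +0000 UTC -- All is well",
]

if __name__ == "__main__":
    for start, end in unavailable_windows(SAMPLE):
        print(f"unavailable for {(end - start).total_seconds():.3f}s starting {start.isoformat()}")
```

Run over the full list, this separates the sub-second flaps from the one window of roughly 17 seconds between 07:11:43 and 07:12:00.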

Adding and removing nodes should not make the operator unavailable. It only does so when the operator has not configured itself so that graceful movement of the webhook pod is zero-disruption (which is a bug). The operator should ensure that, during graceful machine shutdown (drain, etc.), all its components remain available by choosing the appropriate configuration for its dependencies.
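One common way to keep a webhook available through a node drain is to run more than one replica, spread the replicas across nodes, and cap voluntary disruptions. The sketch below builds such settings as plain Python dictionaries. It is illustrative only: the linked pull request is titled "Deploy multiple operand replicas", but the label, the object name, and the exact fields chosen here are assumptions, not the operator's actual manifests.

```python
import json

# Hypothetical label selecting the webhook pods; not taken from the bug report.
WEBHOOK_LABELS = {"app": "csi-snapshot-webhook"}


def webhook_deployment_patch(replicas=2):
    """Deployment fields intended to keep at least one webhook pod running during a drain."""
    return {
        "spec": {
            "replicas": replicas,
            "strategy": {
                "type": "RollingUpdate",
                # Never take down more than one pod at a time during a rollout.
                "rollingUpdate": {"maxUnavailable": 1},
            },
            "template": {
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Place replicas on different nodes so that draining
                            # one node cannot evict every replica at once.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": WEBHOOK_LABELS},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    }
                }
            },
        }
    }


def webhook_pdb(name="csi-snapshot-webhook-pdb"):
    """PodDisruptionBudget limiting evictions to one webhook pod at a time (name is hypothetical)."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "maxUnavailable": 1,
            "selector": {"matchLabels": WEBHOOK_LABELS},
        },
    }


if __name__ == "__main__":
    print(json.dumps(webhook_deployment_patch(), indent=2))
    print(json.dumps(webhook_pdb(), indent=2))
```

A required anti-affinity rule on `kubernetes.io/hostname` cannot be satisfied by a second replica on a cluster with only one schedulable node, so a single-node topology would need different settings.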

Severity is high because this is normal platform behavior and the operator violates the platform's constraints.

Comment 1 W. Trevor King 2021-04-10 00:49:37 UTC
Pretty much all the update jobs too:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/csi-snapshot-controller+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 69% of failures match = 69% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 71% of failures match = 71% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 70% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact


The test-case is new in 4.8, which is at least part of why earlier versions don't show up in that query.
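The "impact" figure in the listing above appears to be the share of all runs that both failed and matched the search, that is, the failure rate multiplied by the match rate. A small sketch of that arithmetic, with a hypothetical function name and rounding to whole percent as an assumption:

```python
def impact_percent(failed_percent, match_percent):
    """Share of all runs that failed AND matched the search, in whole percent."""
    return round(failed_percent * match_percent / 100)


if __name__ == "__main__":
    # Figures taken from the nightly-4.8 rows in the listing above.
    print(impact_percent(80, 50))   # e2e-aws-upgrade row
    print(impact_percent(50, 60))   # e2e-metal-ipi-upgrade row
```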

Comment 3 Fabio Bertinatto 2021-06-04 14:30:33 UTC
Moving back to assigned because I still see some failures.

Comment 4 Fabio Bertinatto 2021-06-04 15:35:40 UTC
I just checked the last failing job run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728

The job is not using a bundle with the updated operator. Perhaps it'll take a while until the changes are propagated.

Moving again to QA.

Comment 10 Wei Duan 2021-06-18 02:22:18 UTC
Did not see the failure on 4.8 non-single-node CI.
Marked as verified according to the discussion. I agree with @wking that we may need to take some action for the single-node case.

Comment 11 Fabio Bertinatto 2021-06-18 13:32:45 UTC
(In reply to Wei Duan from comment #10)
> Did not see the failure on 4.8 non-single-node CI.
> Marked as verified according to the discussion. I agree with @wking that
> we may need to take some action for the single-node case.

Created bug 1973686 to address that.

Comment 14 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

