1940940 – csi-snapshot-controller goes unavailable when machines are added removed to cluster

Bug 1940940 - csi-snapshot-controller goes unavailable when machines are added removed to cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Fabio Bertinatto
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1973686

Reported:	2021-03-19 15:22 UTC by Clayton Coleman
Modified:	2021-07-27 22:55 UTC
CC List:	4 users
Fixed In Version:
Doc Text:
Clone Of:
Clones:	1973686
Environment:
Last Closed:	2021-07-27 22:54:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-csi-snapshot-controller-operator pull 88	0	None	open	Bug 1940940: Deploy multiple operand replicas	2021-06-02 13:30:02 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:55:01 UTC

Description Clayton Coleman 2021-03-19 15:22:40 UTC

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial/1372797703225872384

Serial runs create and remove nodes gracefully. Operators must not go unavailable when that happens:

clusteroperator/csi-snapshot-controller should not change condition/Available 

csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.004517398 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:11:43.016807852 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.132808339 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.399030955 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.509765359 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.541133107 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.575805876 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.607094 +0000 UTC -- All is well
csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:12:00.641166131 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start
csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.668652789 +0000 UTC -- All is well
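To judge how long the operator was actually unavailable, the transitions above can be paired up into windows. The following is a minimal sketch, not part of the CI tooling: the function names (`parse_timestamp`, `unavailable_windows`) are hypothetical, the line format is assumed to match the monitor output quoted above, and timestamps are truncated to microseconds because Python's `%f` does not accept nanoseconds.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the monitor output quoted in this report.
LINE_RE = re.compile(
    r"was Available=(?P<old>true|false), but became Available=(?P<new>true|false) "
    r"at (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?) \+0000 UTC"
)


def parse_timestamp(text):
    """Parse a UTC timestamp, truncating the fraction to microseconds."""
    if "." in text:
        whole, frac = text.split(".")
        return datetime.strptime(f"{whole}.{frac[:6]}", "%Y-%m-%d %H:%M:%S.%f")
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def unavailable_windows(lines):
    """Return one timedelta per Available=false -> Available=true pair."""
    windows = []
    down_since = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a condition-transition line
        ts = parse_timestamp(match["ts"])
        if match["new"] == "false":
            down_since = ts
        elif down_since is not None:
            windows.append(ts - down_since)
            down_since = None
    return windows


# First four transitions from the log above.
SAMPLE = [
    "csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.004517398 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start",
    "csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:11:43.016807852 +0000 UTC -- All is well",
    "csi-snapshot-controller was Available=true, but became Available=false at 2021-03-19 07:11:43.132808339 +0000 UTC -- CSISnapshotWebhookControllerAvailable: Waiting for a validating webhook Deployment pod to start",
    "csi-snapshot-controller was Available=false, but became Available=true at 2021-03-19 07:12:00.399030955 +0000 UTC -- All is well",
]

if __name__ == "__main__":
    for window in unavailable_windows(SAMPLE):
        print(f"unavailable for {window.total_seconds():.6f}s")
```

Most of the windows in this run are a few tens of milliseconds, but the second one lasts about 17 seconds.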

Adding and removing nodes should not make the operator unavailable. If it does, the operator has not configured itself so that graceful movement of the webhook pod is zero-disruption, which is a bug. The operator should ensure that all of its components remain available during graceful machine shutdown (drain, etc.) by choosing the appropriate configuration for its dependencies.

Severity is high because adding and removing nodes is normal platform behavior and the operator violates the availability constraint.
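The availability requirement described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the operator's code: the function names and node names are hypothetical, an evicted pod is assumed to count as not ready until it is rescheduled, and the idea that replicas should sit on different nodes is an assumption of this sketch. The linked pull request is titled "Deploy multiple operand replicas"; the rest of the configuration is not described in this report.

```python
def ready_replicas_during_drain(pod_nodes, drained_node):
    """Count replicas that stay ready while `drained_node` is drained.

    pod_nodes holds one node name per replica. Simplification: every pod
    on the drained node is treated as not ready until it is rescheduled.
    """
    return sum(1 for node in pod_nodes if node != drained_node)


def stays_available(pod_nodes, min_ready=1):
    """True if draining any single node leaves at least `min_ready` replicas."""
    if len(pod_nodes) < min_ready:
        return False
    return all(
        ready_replicas_during_drain(pod_nodes, node) >= min_ready
        for node in set(pod_nodes)
    )


if __name__ == "__main__":
    # One replica: draining its node leaves nothing ready.
    print(stays_available(["node-a"]))            # False
    # Two replicas on different nodes: one always survives a single drain.
    print(stays_available(["node-a", "node-b"]))  # True
    # Two replicas on the same node: both are evicted together.
    print(stays_available(["node-a", "node-a"]))  # False
```

In this model a single-replica webhook Deployment always loses availability when its node is drained, which matches the Available=false transitions reported above.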

Comment 1 W. Trevor King 2021-04-10 00:49:37 UTC

Pretty much all the update jobs too:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/csi-snapshot-controller+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 69% of failures match = 69% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 71% of failures match = 71% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 70% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact


The test-case is new in 4.8, which is at least part of why earlier versions don't show up in that query.
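In the rows above, the "impact" figure appears to be the failure rate multiplied by the share of failures that match the search (for example, 80% failed with 50% matching gives 40% impact). The sketch below parses one row and recomputes that product. The relationship is inferred from the rows rather than from the search tool's documentation, the function names are hypothetical, and because the displayed percentages are already rounded, a recomputed value could differ by a point on other data.

```python
import re

# Assumed row shape, taken from the query output quoted in this comment.
ROW_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_row(line):
    """Split one result row into its job name and integer fields."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    row = match.groupdict()
    return {
        "job": row["job"],
        "runs": int(row["runs"]),
        "failed_pct": int(row["failed"]),
        "match_pct": int(row["match"]),
        "impact_pct": int(row["impact"]),
    }


def expected_impact(failed_pct, match_pct):
    """Recompute impact as failure rate times match rate, in percent."""
    return round(failed_pct * match_pct / 100)


if __name__ == "__main__":
    row = parse_row(
        "periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all)"
        " - 10 runs, 80% failed, 50% of failures match = 40% impact"
    )
    print(row["impact_pct"], expected_impact(row["failed_pct"], row["match_pct"]))
```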

Comment 3 Fabio Bertinatto 2021-06-04 14:30:33 UTC

Moving back to assigned because I still see some failures.

Comment 4 Fabio Bertinatto 2021-06-04 15:35:40 UTC

I just checked the last failing job run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728

The job is not using a bundle with the updated operator. Perhaps it'll take a while until the changes are propagated.

Moving again to QA.

Comment 10 Wei Duan 2021-06-18 02:22:18 UTC

Did not see the failure on 4.8 non-single-node CI.
Marked as verified per the discussion. I agree with @wking that we may need to take some action for the single-node case.

Comment 11 Fabio Bertinatto 2021-06-18 13:32:45 UTC

(In reply to Wei Duan from comment #10)
> Did not see the failure on 4.8 non-single-node ci. 
> Marked as verified according to the discussion, and agree with @wking that
> maybe need take some action for the single-node case.

Created bug 1973686 to address that.

Comment 14 errata-xmlrpc 2021-07-27 22:54:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
