Description of problem:
Upgrade from a 4.7 nightly -> 4.8.0-fc.8 works. Then downgrading back to the 4.7 nightly fails.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-06-17-173140

How reproducible:
Always (2 out of 2 tries)

Steps to Reproduce:
1. Install IPI on GCP
2. Upgrade to 4.8.0-fc.8 - works
3. Downgrade back to the 4.7 nightly - fails (see the command sketch at the end of this report)

OpenShift release version:
4.7.0-0.nightly-2021-06-17-173140

Cluster Platform:
GCP

Actual results:
$ ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.7.0-0.nightly-2021-06-17-173140: an unknown error has occurred: MultipleErrors

$ ./oc get co
NAME                                      VERSION                            AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     155m
baremetal                                 4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
cloud-credential                          4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
cluster-autoscaler                        4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
config-operator                           4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
console                                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
csi-snapshot-controller                   4.7.0-0.nightly-2021-06-17-173140  True       False        True      27h
dns                                       4.8.0-0.nightly-2021-06-18-055840  True       False        False     25h
etcd                                      4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
image-registry                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
ingress                                   4.7.0-0.nightly-2021-06-17-173140  True       False        True      23h
insights                                  4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-apiserver                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-controller-manager                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-scheduler                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-storage-version-migrator             4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h
machine-api                               4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
machine-approver                          4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
machine-config                            4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
marketplace                               4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
monitoring                                4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
network                                   4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
node-tuning                               4.8.0-0.nightly-2021-06-18-055840  True       False        False     25h
openshift-apiserver                       4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h
openshift-controller-manager              4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
openshift-samples                         4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
operator-lifecycle-manager                4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
operator-lifecycle-manager-catalog        4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
operator-lifecycle-manager-packageserver  4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
service-ca                                4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
storage                                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h

Expected results:
Downgrade succeeds. While downgrade may not be officially supported, it has been working for the last few releases.

Impact of the problem:
Downgrade fails.

Additional info:
must-gather shows:

When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 9d668a63-310a-45b1-b5f6-0af9fe23caab

ClusterVersion: Updating to "4.7.0-0.nightly-2021-06-17-173140" from "4.8.0-0.nightly-2021-06-18-055840" for 4 hours: Unable to apply 4.7.0-0.nightly-2021-06-17-173140: an unknown error has occurred: MultipleErrors

ClusterOperators:
	clusteroperator/csi-snapshot-controller is degraded because CSISnapshotStaticResourceControllerDegraded: "csi_controller_deployment_pdb.yaml" (string): the server could not find the requested resource
	CSISnapshotStaticResourceControllerDegraded: "webhook_deployment_pdb.yaml" (string): the server could not find the requested resource
	CSISnapshotStaticResourceControllerDegraded:
	clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report. You may also mark the bug private if you wish. **

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):
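For reference, a downgrade like the one in step 3 is normally forced with an explicit update to the older release image, roughly as sketched below. The pull spec is a placeholder for the 4.7 nightly release image, not necessarily the exact image used here:

# Force an explicit update to the (non-recommended) older release image
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-17-173140 \
    --allow-explicit-upgrade --force

# Watch the rollout
$ oc adm upgrade
$ oc get clusterversion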
Previous downgrade bug https://bugzilla.redhat.com/show_bug.cgi?id=1971087
The must-gather is bigger than allowed for an attachment, but it is available to share. Please let me know with whom I should share it.
My apologies if this isn't assigned correctly - please reassign if needed.
You reached the right team. I am sorry, but we need a must-gather from the failed cluster in this case. Various teams have their own favorite way to provide huge logs, usually some local NFS + HTTP server (like "scratch" on http://wiki.brq.redhat.com/BrnoMountPoints, but that's half a globe away from you). Ask around your office / team. In the worst case, Google Drive works for big files too.
Shared the must-gather with you, @jsafrane.
@jsafrane Just confirming that you can access the must-gather for this BZ. Thanks.
I uploaded the must-gather to https://download.eng.brq.redhat.com/scratch/jsafrane/BZ1973983.zip (I may delete it in the future without notice).
To give some background: the CSISnapshotStaticResourceControllerDegraded condition is generated by a controller that was introduced in OCP 4.8. That controller simply creates PodDisruptionBudgets using the policy/v1 API version, which is first served by Kubernetes 1.21 (the version that ships with OCP 4.8); a 4.7 API server rejects those manifests with "the server could not find the requested resource". So what happened was: the condition was set to True by the 4.8 controller, the cluster was then downgraded to 4.7, and the condition was never cleaned up, because the controller that would clear it doesn't exist in 4.7. One possible fix would be to add code to 4.7 that cleans up the condition, but that isn't a reasonable approach: we would have to do the same for every new condition introduced in every OCP release. Since this is a downgrade, which is not officially supported, I'd recommend deleting the csi-snapshot-controller ClusterOperator CR and letting CVO recreate it; the new CR won't have that condition set (see the sketch below). We could document this workaround for users who want to downgrade from 4.8 to 4.7. Moving to the docs team.
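A minimal sketch of the suggested workaround, assuming cluster-admin access on the downgraded 4.7 cluster (the commands below are illustrative, not taken from the must-gather):

# Inspect the stale Degraded condition inherited from the 4.8-only controller
$ oc get clusteroperator csi-snapshot-controller -o yaml

# Delete the ClusterOperator CR; CVO recreates it, and the 4.7 operator then
# only reports the conditions it knows about
$ oc delete clusteroperator csi-snapshot-controller

# Confirm the recreated CR no longer reports DEGRADED=True
$ oc get clusteroperator csi-snapshot-controller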
@yanyang fyi.
@jhou This looks like a storage-related BZ. Could you please help reassign the QE contact away from Xiaoli? If you are not the right person to do the reassignment, please redirect this to whoever is. Thanks.