Description of problem:
Upgrade from a 4.7 nightly -> 4.8.0-fc.8 works. Then downgrading back to the 4.7 nightly fails.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-06-17-173140

How reproducible:
Always (2 out of 2 tries)

Steps to Reproduce:
1. Install IPI on GCP
2. Upgrade to 4.8.0-fc.8 - works
3. Downgrade back to the 4.7 nightly - fails (see the command sketch at the end of this report)

OpenShift release version:
4.7.0-0.nightly-2021-06-17-173140

Cluster Platform:
GCP

Actual results:
$ ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.7.0-0.nightly-2021-06-17-173140: an unknown error has occurred: MultipleErrors

$ ./oc get co
NAME                                      VERSION                            AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     155m
baremetal                                 4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
cloud-credential                          4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
cluster-autoscaler                        4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
config-operator                           4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
console                                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
csi-snapshot-controller                   4.7.0-0.nightly-2021-06-17-173140  True       False        True      27h
dns                                       4.8.0-0.nightly-2021-06-18-055840  True       False        False     25h
etcd                                      4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
image-registry                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
ingress                                   4.7.0-0.nightly-2021-06-17-173140  True       False        True      23h
insights                                  4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-apiserver                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-controller-manager                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-scheduler                            4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
kube-storage-version-migrator             4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h
machine-api                               4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
machine-approver                          4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
machine-config                            4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
marketplace                               4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
monitoring                                4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
network                                   4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
node-tuning                               4.8.0-0.nightly-2021-06-18-055840  True       False        False     25h
openshift-apiserver                       4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h
openshift-controller-manager              4.7.0-0.nightly-2021-06-17-173140  True       False        False     27h
openshift-samples                         4.7.0-0.nightly-2021-06-17-173140  True       False        False     23h
operator-lifecycle-manager                4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
operator-lifecycle-manager-catalog        4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
operator-lifecycle-manager-packageserver  4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
service-ca                                4.8.0-0.nightly-2021-06-18-055840  True       False        False     27h
storage                                   4.7.0-0.nightly-2021-06-17-173140  True       False        False     24h

Expected results:
Downgrade succeeds. While downgrade may not be officially supported, it has been working for the last few releases.

Impact of the problem:
Downgrade fails.

Additional info:
must-gather shows:

When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 9d668a63-310a-45b1-b5f6-0af9fe23caab

ClusterVersion: Updating to "4.7.0-0.nightly-2021-06-17-173140" from "4.8.0-0.nightly-2021-06-18-055840" for 4 hours: Unable to apply 4.7.0-0.nightly-2021-06-17-173140: an unknown error has occurred: MultipleErrors

ClusterOperators:
	clusteroperator/csi-snapshot-controller is degraded because CSISnapshotStaticResourceControllerDegraded: "csi_controller_deployment_pdb.yaml" (string): the server could not find the requested resource
	CSISnapshotStaticResourceControllerDegraded: "webhook_deployment_pdb.yaml" (string): the server could not find the requested resource
	CSISnapshotStaticResourceControllerDegraded:
	clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report. You may also mark the bug private if you wish. **

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):
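For reference, a downgrade like the one in step 3 is normally forced with an explicit update to the older release image, roughly as sketched below. The pull spec is a placeholder for the 4.7 nightly release image, not necessarily the exact image used here:

# Force an explicit update to the (non-recommended) older release image
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-17-173140 \
    --allow-explicit-upgrade --force

# Watch the rollout
$ oc adm upgrade
$ oc get clusterversion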
Previous downgrade bug https://bugzilla.redhat.com/show_bug.cgi?id=1971087
The must-gather is bigger than allowed for an attachment, but it is available to share. Please let me know with whom I should share it.
My apologies if this isn't assigned correctly - please reassign if needed.
You reached the right team. I am sorry, but we need a must-gather from the failed cluster in this case. Various teams have their own favorite way to provide huge logs, usually some local NFS + HTTP server (like "scratch" on http://wiki.brq.redhat.com/BrnoMountPoints, but that's half a globe away from you). Ask around your office / team. In the worst case, Google Drive works for big files too.
Shared the must-gather with you, @jsafrane.
@jsafrane Just confirming that you can access the must-gather for this BZ. Thanks.
I uploaded the must-gather to https://download.eng.brq.redhat.com/scratch/jsafrane/BZ1973983.zip (I may delete it in the future without notice).
To give some background: the CSISnapshotStaticResourceControllerDegraded condition is generated by a controller that was introduced in OCP 4.8. That controller simply creates PodDisruptionBudgets using the policy/v1 API version, which is first served by Kubernetes 1.21 (the version that ships with OCP 4.8); a 4.7 API server rejects those manifests with "the server could not find the requested resource". So what happened was: the condition was set to True by the 4.8 controller, the cluster was then downgraded to 4.7, and the condition was never cleaned up, because the controller that would clear it doesn't exist in 4.7. One possible fix would be to add code to 4.7 that cleans up the condition, but that isn't a reasonable approach: we would have to do the same for every new condition introduced in every OCP release. Since this is a downgrade, which is not officially supported, I'd recommend deleting the csi-snapshot-controller ClusterOperator CR and letting CVO recreate it; the new CR won't have that condition set (see the sketch below). We could document this workaround for users who want to downgrade from 4.8 to 4.7. Moving to the docs team.
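A minimal sketch of the suggested workaround, assuming cluster-admin access on the downgraded 4.7 cluster (the commands below are illustrative, not taken from the must-gather):

# Inspect the stale Degraded condition inherited from the 4.8-only controller
$ oc get clusteroperator csi-snapshot-controller -o yaml

# Delete the ClusterOperator CR; CVO recreates it, and the 4.7 operator then
# only reports the conditions it knows about
$ oc delete clusteroperator csi-snapshot-controller

# Confirm the recreated CR no longer reports DEGRADED=True
$ oc get clusteroperator csi-snapshot-controller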
@yanyang fyi.
@jhou This looks like a storage-related BZ. Could you please help reassign the QE contact away from Xiaoli? If you are not the right person to do the reassignment, please redirect this to whoever is. Thanks.