Side effect of bug 1835869, which was backported to 4.3.z as bug 1860100, landing in 4.3.22: because of this line [1], when the storage operator fails CheckAlphaSnapshot, it sets Upgradeable=False (appropriate) but also bails out of the Reconcile method, abandoning all subsequent reconciliation steps. This can cause 4.3.(<32) -> 4.3.32 or 4.3.(<32) -> 4.3.33 updates to hang waiting on the storage operator. The storage ClusterOperator conditions look like:

$ tar -xOz config/clusteroperator/storage <insights.tar.gz | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-08-27T12:58:28Z Available True AsExpected -
2020-06-29T11:42:14Z Degraded False AsExpected -
2020-08-24T12:39:22Z Upgradeable False AsExpected Unable to update cluster as v1alpha1 version of volumesnapshots.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.iois detected. Remove these CRDs to allow the upgrade to proceed.
2020-08-27T12:58:28Z Progressing False AsExpected -

(Side note: AsExpected is not a particularly useful reason for Upgradeable=False. Possibly pivot to 'V1Alpha1CRDs' or some such.)
ClusterVersion looks like:

$ tar -xOz config/version <insights.tar.gz | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-06-29T11:50:54Z Available True - Done applying 4.3.31
2020-08-24T12:54:50Z Failing True ClusterOperatorNotAvailable Cluster operator storage is still updating
2020-08-24T12:26:51Z Progressing True ClusterOperatorNotAvailable Unable to apply 4.3.33: the cluster operator storage has not yet successfully rolled out
2020-08-02T06:08:46Z RetrievedUpdates True - -
2020-08-24T12:41:34Z Upgradeable False AsExpected Cluster operator storage cannot be upgraded: Unable to update cluster as v1alpha1 version of volumesnapshots.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.iois detected. Remove these CRDs to allow the upgrade to proceed.

The cluster-version operator is blocking on the storage operator, because the storage operator is not bumping its versions[name=operator]:

$ tar -xOz config/clusteroperator/storage <insights.tar.gz | jq -r '.status.versions[] | .version + " " + .name'
4.3.31 operator

To fix, set Upgradeable=False when appropriate to warn the cluster-version operator off of future minor-version bumps, but continue to perform the rest of your reconciliation tasks, because the currently-reconciling version still needs to be applied.

Setting 'high' severity, because stuck updates are bad, but checking Insights/Telemetry doesn't turn up too many clusters stuck like this. If you have a way we can detect (in Telemetry/Insights) clusters on earlier 4.3 which have not yet updated to 4.3.32 or later and which may be vulnerable, that would help us decide whether it was appropriate to block edges into the affected 4.3 releases.

[1]: https://github.com/openshift/cluster-storage-operator/pull/63/files#diff-97e82b0d41902abcb7a788ea37573993R177
We disabled the snapshot CO, then uninstalled the v1beta1 VolumeSnapshot* CRDs and installed the v1alpha1 VolumeSnapshot* CRDs. But the upgrade did not start.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-25-204643   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-08-25-204643

For clusterversion version:

  "spec": {
    "channel": "stable-4.6",
    "clusterID": "35c2b12d-7440-4c99-ac50-60f3aff0a059",
    "desiredUpdate": {
      "force": false,
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
      "version": "4.6.0-0.nightly-2020-08-25-234625"
    },
    "upstream": "https://openshift-release.svc.ci.openshift.org/graph"
  },
  "status": {
    "availableUpdates": [
      {
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
        "version": "4.6.0-0.nightly-2020-08-25-234625"
      }
    ],
    "conditions": [
      {
        "lastTransitionTime": "2020-08-28T03:33:44Z",
        "message": "Done applying 4.6.0-0.nightly-2020-08-25-204643",
        "status": "True",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2020-08-28T05:57:35Z",
        "status": "False",
        "type": "Failing"
      },
      {
        "lastTransitionTime": "2020-08-28T05:59:20Z",
        "message": "Cluster version is 4.6.0-0.nightly-2020-08-25-204643",
        "status": "False",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2020-08-28T14:06:11Z",
        "status": "True",
        "type": "RetrievedUpdates"
      },
      {
        "lastTransitionTime": "2020-08-28T06:11:26Z",
        "message": "Cluster operator storage cannot be upgraded between minor versions: SnapshotCRDControllerUpgradeable: Unable to update cluster as v1alpha1 version of VolumeSnapshot, VolumeSnapshotContent is detected. Remove these CRDs to allow the upgrade to proceed.",
        "reason": "SnapshotCRDController_AlphaDetected",
        "status": "False",
        "type": "Upgradeable"
      }
    ],

From the status, we only see that SnapshotCRDController_AlphaDetected made Upgradeable False.
I tried to remove the v1alpha1 VolumeSnapshot* CRDs, but only the VolumeSnapshotClass CRD was removed; the VolumeSnapshot and VolumeSnapshotContent CRDs cannot be deleted. Still working on it. But anyway, in my opinion, we also hit this issue on 4.6.
Looks like my previous conclusion is not correct. After I managed to delete all the v1alpha1 VolumeSnapshot* CRDs, the upgrade still did not start. I pasted the clusterversion here; maybe we need to check with the upgrade team.

$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-25-204643   True        False         28h     Cluster version is 4.6.0-0.nightly-2020-08-25-204643

[wduan@MINT kubernetes-1.16]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-08-28T03:00:49Z"
  generation: 18
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:channel: {}
        f:clusterID: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-08-28T03:00:49Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:desiredUpdate:
          .: {}
          f:force: {}
          f:image: {}
          f:version: {}
    manager: oc
    operation: Update
    time: "2020-08-29T05:04:46Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:upstream: {}
    manager: kubectl-patch
    operation: Update
    time: "2020-08-29T08:43:35Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:availableUpdates: {}
        f:conditions: {}
        f:desired:
          .: {}
          f:image: {}
          f:version: {}
        f:history: {}
        f:observedGeneration: {}
        f:versionHash: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-08-29T09:45:06Z"
  name: version
  resourceVersion: "1325193"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 80d09500-7c10-46a0-aa74-8cbd2edec283
spec:
  channel: stable-4.6
  clusterID: 35c2b12d-7440-4c99-ac50-60f3aff0a059
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
    version: 4.6.0-0.nightly-2020-08-25-234625
  upstream: https://openshift-release.svc.ci.openshift.org/graph
status:
  availableUpdates:
  - image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
    version: 4.6.0-0.nightly-2020-08-25-234625
  conditions:
  - lastTransitionTime: "2020-08-28T03:33:44Z"
    message: Done applying 4.6.0-0.nightly-2020-08-25-204643
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-28T05:57:35Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-08-28T05:59:20Z"
    message: Cluster version is 4.6.0-0.nightly-2020-08-25-204643
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-08-28T14:06:11Z"
    status: "True"
    type: RetrievedUpdates
  desired:
    image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
    version: 4.6.0-0.nightly-2020-08-25-204643
  history:
  - completionTime: "2020-08-28T03:33:44Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
    startedTime: "2020-08-28T03:00:49Z"
    state: Completed
    verified: false
    version: 4.6.0-0.nightly-2020-08-25-204643
  observedGeneration: 7
  versionHash: VdMtCylIGgw=
I uploaded the must-gather on http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.8306168597559322770_0829.tar.gz
> ...upgrade still did not start

I've spun this off into bug 1873900. It seems orthogonal to this bug's storage issue.
I tried another upgrade; this time the upgrade was triggered but blocked on the csi-snapshot-controller CO. Which means this will still block the upgrade.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-26-202109   True        True          47m     Unable to apply 4.6.0-0.nightly-2020-08-27-005538: the cluster operator csi-snapshot-controller has not yet successfully rolled out

$ oc get co csi-snapshot-controller -ojson | jq .status.conditions
[
  {
    "lastTransitionTime": "2020-08-31T08:14:41Z",
    "message": "Degraded: failed to sync CRDs: cluster-csi-snapshot-controller-operator does not support v1alpha1 version of snapshot CRDs volumesnapshots.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io installed by user or 3rd party controller",
    "reason": "_AlphaCRDsExist",
    "status": "True",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-08-31T00:50:50Z",
    "reason": "AsExpected",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-08-31T00:50:50Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2020-08-31T00:44:49Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Upgradeable"
  }
]
> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was not blocked due to the presence of the v1alpha1 CRDs.

I think that means "this storage-operator bug can be VERIFIED on 4.6, and we may need a new bug for the snapshot controller".
First, let me correct my typo; fortunately it did not mislead you.

> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was not blocked due to the presence of the v1alpha1 CRDs.

Should be:

> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was blocked due to the presence of the v1alpha1 CRDs.

@Huffman, as I asked in comment 10, I'd like to confirm whether the 'storage-operator' upgrade alone is enough for "VERIFIED". We also tried several scenarios for 'csi-snapshot-controller' to see what happened. Actually, in case B, in which the 'csisnapshotcontrollers' CRD and the 'csi-snapshot-controller' CO were deleted, is that case more similar to the upgrade from 4.3 to 4.4?

> In other words, I think we need to ensure that this doesn't block upgrades in the storage operator in 4.3, as the CSI Snapshot Controller Operator shouldn't ever encounter this state (being present alongside the v1alpha1 CRDs).

I agree with your concern; it's OK for me to mark this BZ "VERIFIED" and switch to testing/verifying the upgrades from 4.3 -> 4.3 and 4.3 -> 4.4.
Thanks a lot for the explanation. Marking it as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196