Bug 1873299 - Storage operator stops reconciling when going Upgradeable=False on v1alpha1 CRDs
Summary: Storage operator stops reconciling when going Upgradeable=False on v1alpha1 CRDs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Christian Huffman
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks: 1874873
 
Reported: 2020-08-27 19:11 UTC by W. Trevor King
Modified: 2020-10-27 16:35 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If v1alpha1 VolumeSnapshot CRDs were detected, no further reconcile actions were taken. Consequence: The Cluster Storage Operator could not perform z-stream upgrades if these CRDs were ever detected on the cluster. Fix: Moved the v1alpha1 CRD check to later in the Reconcile loop. Result: Z-stream upgrades now complete successfully, and v1alpha1 CRDs are detected without issue.
Clone Of:
: 1874873
Environment:
Last Closed: 2020-10-27 16:35:33 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:35:46 UTC

Description W. Trevor King 2020-08-27 19:11:33 UTC
Side effect of bug 1835869, which was backported to 4.3.z as bug 1860100, landing in 4.3.22: because of this line [1], when the storage operator fails CheckAlphaSnapshot, it sets Upgradeable=False (appropriate) but also bails out of the Reconcile method, abandoning all subsequent reconciliation steps.  This can cause 4.3.(<32) -> 4.3.32 or 4.3.(<32) -> 4.3.33 updates to hang waiting on the storage operator.  The status ClusterOperator conditions look like:

$ tar -xOz config/clusteroperator/storage <insights.tar.gz | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-08-27T12:58:28Z Available True AsExpected -
2020-06-29T11:42:14Z Degraded False AsExpected -
2020-08-24T12:39:22Z Upgradeable False AsExpected Unable to update cluster as v1alpha1 version of volumesnapshots.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.iois detected. Remove these CRDs to allow the upgrade to proceed.
2020-08-27T12:58:28Z Progressing False AsExpected -

(Side note: AsExpected is not a particularly useful reason for Upgradeable=False; possibly pivot to 'V1Alpha1CRDs' or some such.)  ClusterVersion looks like:

$ tar -xOz config/version <insights.tar.gz | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-06-29T11:50:54Z Available True - Done applying 4.3.31
2020-08-24T12:54:50Z Failing True ClusterOperatorNotAvailable Cluster operator storage is still updating
2020-08-24T12:26:51Z Progressing True ClusterOperatorNotAvailable Unable to apply 4.3.33: the cluster operator storage has not yet successfully rolled out
2020-08-02T06:08:46Z RetrievedUpdates True - -
2020-08-24T12:41:34Z Upgradeable False AsExpected Cluster operator storage cannot be upgraded: Unable to update cluster as v1alpha1 version of volumesnapshots.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.iois detected. Remove these CRDs to allow the upgrade to proceed.

The cluster-version operator is blocking on the storage operator, because the storage operator is not bumping its versions[name=operator]:

$ tar -xOz config/clusteroperator/storage <insights.tar.gz | jq -r '.status.versions[] | .version + " " + .name'
4.3.31 operator

To fix, set Upgradeable=False when appropriate to warn the cluster-version operator off of future minor-version bumps, but continue to perform the rest of your reconciliation tasks because the currently-reconciling version still needs to be applied.
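The intended control flow can be sketched as follows (hypothetical Go, not the actual cluster-storage-operator source; the step names are made up for illustration):

```go
package main

import "fmt"

// Sketch only of the fix described above: detecting v1alpha1 snapshot CRDs
// should flip Upgradeable to False but must NOT short-circuit Reconcile,
// otherwise the operator never bumps versions[name=operator] and z-stream
// updates hang on it.
type reconcileResult struct {
	Upgradeable bool
	StepsRun    []string
}

func reconcileOnce(alphaCRDsDetected bool) reconcileResult {
	r := reconcileResult{Upgradeable: true}

	// Buggy pre-fix pattern:
	//   if alphaCRDsDetected { r.Upgradeable = false; return r } // abandons sync
	if alphaCRDsDetected {
		r.Upgradeable = false // warn the CVO off future minor-version bumps...
	}

	// ...but still apply the currently-reconciling version.
	for _, step := range []string{"syncManifests", "syncStatus", "setOperatorVersion"} {
		r.StepsRun = append(r.StepsRun, step)
	}
	return r
}

func main() {
	fmt.Printf("%+v\n", reconcileOnce(true))
}
```

With this ordering, a z-stream update completes (all sync steps run, including the version bump) while the CVO still sees Upgradeable=False and holds back minor-version updates.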

Setting 'high' severity, because stuck updates are bad, but checking Insights/Telemetry doesn't turn up too many clusters stuck like this.  If you have a way we can detect (in Telemetry/Insights) clusters on earlier 4.3 which have not yet updated to 4.3.32 or later and which may be vulnerable, that would help us decide whether it is appropriate to block edges into the affected 4.3 releases.

[1]: https://github.com/openshift/cluster-storage-operator/pull/63/files#diff-97e82b0d41902abcb7a788ea37573993R177

Comment 5 Wei Duan 2020-08-29 09:09:28 UTC
We disabled the snapshot co, then uninstalled the v1beta1 VolumeSnapshot* CRDs and installed the v1alpha1 VolumeSnapshot* CRDs, but the upgrade did not start.


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-25-204643   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-08-25-204643


For clusterversion version:
    "spec": {
        "channel": "stable-4.6",
        "clusterID": "35c2b12d-7440-4c99-ac50-60f3aff0a059",
        "desiredUpdate": {
            "force": false,
            "image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
            "version": "4.6.0-0.nightly-2020-08-25-234625"
        },
        "upstream": "https://openshift-release.svc.ci.openshift.org/graph"
    },
    "status": {
        "availableUpdates": [
            {
                "image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
                "version": "4.6.0-0.nightly-2020-08-25-234625"
            }
        ],

        "conditions": [
            {
                "lastTransitionTime": "2020-08-28T03:33:44Z",
                "message": "Done applying 4.6.0-0.nightly-2020-08-25-204643",
                "status": "True",
                "type": "Available"
            },
            {
                "lastTransitionTime": "2020-08-28T05:57:35Z",
                "status": "False",
                "type": "Failing"
            },
            {
                "lastTransitionTime": "2020-08-28T05:59:20Z",
                "message": "Cluster version is 4.6.0-0.nightly-2020-08-25-204643",
                "status": "False",
                "type": "Progressing"
            },
            {
                "lastTransitionTime": "2020-08-28T14:06:11Z",
                "status": "True",
                "type": "RetrievedUpdates"
            },
            {
                "lastTransitionTime": "2020-08-28T06:11:26Z",
                "message": "Cluster operator storage cannot be upgraded between minor versions: SnapshotCRDControllerUpgradeable: Unable to update cluster as v1alpha1 version of VolumeSnapshot, VolumeSnapshotContent is detected. Remove these CRDs to allow the upgrade to proceed.",
                "reason": "SnapshotCRDController_AlphaDetected",
                "status": "False",
                "type": "Upgradeable"
            }
        ],


From the status, only SnapshotCRDController_AlphaDetected set Upgradeable to False.
I tried to remove the v1alpha1 VolumeSnapshot* CRDs, but only the VolumeSnapshotClass CRD was removed; the VolumeSnapshot and VolumeSnapshotContent CRDs could not be deleted. Still working on it.
In any case, in my opinion we also hit this issue on 4.6.
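(A CRD that hangs in deletion commonly still carries finalizers; clearing them with a JSON merge patch is one way to force it through. Whether finalizers were the cause here is an assumption, and the CRD name below is just one of the three as an example. A minimal sketch of building that patch body:)

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildFinalizerClearPatch returns a JSON merge-patch body that empties
// metadata.finalizers. Assuming finalizers are what blocks deletion, it
// could be applied with, e.g.:
//   oc patch crd volumesnapshots.snapshot.storage.k8s.io --type merge -p '<body>'
func buildFinalizerClearPatch() (string, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": []string{},
		},
	}
	b, err := json.Marshal(patch)
	return string(b), err
}

func main() {
	body, err := buildFinalizerClearPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // {"metadata":{"finalizers":[]}}
}
```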

Comment 6 Wei Duan 2020-08-29 10:24:37 UTC
Looks like my previous conclusion was not correct.
After I managed to delete all the v1alpha1 VolumeSnapshot* CRDs, the upgrade still did not start. I pasted the clusterversion here; we may need to check with the upgrade team.

$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-25-204643   True        False         28h     Cluster version is 4.6.0-0.nightly-2020-08-25-204643

[wduan@MINT kubernetes-1.16]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-08-28T03:00:49Z"
  generation: 18
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:channel: {}
        f:clusterID: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-08-28T03:00:49Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:desiredUpdate:
          .: {}
          f:force: {}
          f:image: {}
          f:version: {}
    manager: oc
    operation: Update
    time: "2020-08-29T05:04:46Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:upstream: {}
    manager: kubectl-patch
    operation: Update
    time: "2020-08-29T08:43:35Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:availableUpdates: {}
        f:conditions: {}
        f:desired:
          .: {}
          f:image: {}
          f:version: {}
        f:history: {}
        f:observedGeneration: {}
        f:versionHash: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-08-29T09:45:06Z"
  name: version
  resourceVersion: "1325193"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 80d09500-7c10-46a0-aa74-8cbd2edec283
spec:
  channel: stable-4.6
  clusterID: 35c2b12d-7440-4c99-ac50-60f3aff0a059
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
    version: 4.6.0-0.nightly-2020-08-25-234625
  upstream: https://openshift-release.svc.ci.openshift.org/graph
status:
  availableUpdates:
  - image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
    version: 4.6.0-0.nightly-2020-08-25-234625
  conditions:
  - lastTransitionTime: "2020-08-28T03:33:44Z"
    message: Done applying 4.6.0-0.nightly-2020-08-25-204643
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-28T05:57:35Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-08-28T05:59:20Z"
    message: Cluster version is 4.6.0-0.nightly-2020-08-25-204643
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-08-28T14:06:11Z"
    status: "True"
    type: RetrievedUpdates
  desired:
    image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
    version: 4.6.0-0.nightly-2020-08-25-204643
  history:
  - completionTime: "2020-08-28T03:33:44Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
    startedTime: "2020-08-28T03:00:49Z"
    state: Completed
    verified: false
    version: 4.6.0-0.nightly-2020-08-25-204643
  observedGeneration: 7
  versionHash: VdMtCylIGgw=

Comment 8 W. Trevor King 2020-08-31 04:50:54 UTC
> ...upgrade still  did not start

I've spun this off into bug 1873900.  It seems orthogonal to this bug's storage issue.

Comment 9 Wei Duan 2020-08-31 09:19:23 UTC
I tried another upgrade; this time the upgrade was triggered but blocked on the csi-snapshot-controller co, which means this will still block the upgrade.

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-26-202109   True        True          47m     Unable to apply 4.6.0-0.nightly-2020-08-27-005538: the cluster operator csi-snapshot-controller has not yet successfully rolled out

$ oc get co csi-snapshot-controller -ojson | jq .status.conditions
[
  {
    "lastTransitionTime": "2020-08-31T08:14:41Z",
    "message": "Degraded: failed to sync CRDs: cluster-csi-snapshot-controller-operator does not support v1alpha1 version of snapshot CRDs volumesnapshots.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io installed by user or 3rd party controller",
    "reason": "_AlphaCRDsExist",
    "status": "True",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-08-31T00:50:50Z",
    "reason": "AsExpected",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-08-31T00:50:50Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2020-08-31T00:44:49Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Upgradeable"
  }
]

Comment 11 W. Trevor King 2020-08-31 19:10:01 UTC
> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was not blocked due to the presence of the v1alpha1 CRDs.

I think that means "this storage-operator bug can be VERIFIED on 4.6, and we may need a new bug for the snapshot controller".

Comment 14 Wei Duan 2020-09-01 00:50:43 UTC
First, let me correct my typo; fortunately it did not mislead you.
> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was not blocked due to the presence of the v1alpha1 CRDs.
Should be 
  In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller‘ was blocked due to the presence of the v1alpha1 CRDs.


@Huffman, as I asked in comment 10, I'd like to confirm whether a successful 'storage-operator' upgrade is enough for "VERIFIED". We also tried several scenarios for 'csi-snapshot-controller' to see what happened. In case B, where the 'csisnapshotcontrollers' CRD and the 'csi-snapshot-controller' co were deleted, is that case more similar to the upgrade from 4.3 to 4.4?


> In other words, I think we need to ensure that this doesn't block upgrades in the storage operator in 4.3, as the CSI Snapshot Controller Operator shouldn't ever encounter this state (being present alongside the v1alpha1 CRDs).
I agree with your concern; it's OK for me to mark this BZ "VERIFIED" and switch to testing/verifying the upgrade from 4.3 -> 4.3 and 4.3 -> 4.4.

Comment 16 Wei Duan 2020-09-02 00:09:45 UTC
Thanks a lot for the explanation. Marking it as VERIFIED.

Comment 19 errata-xmlrpc 2020-10-27 16:35:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

