Bug 1873299
| Summary: | Storage operator stops reconciling when going Upgradeable=False on v1alpha1 CRDs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Storage | Assignee: | Christian Huffman <chuffman> |
| Storage sub component: | Operators | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, chuffman, lmohanty, sdodson |
| Version: | 4.3.z | Keywords: | Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1874873 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:35:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1874873 | | |

Doc Text:

Cause: If v1alpha1 VolumeSnapshot CRDs were detected, no further reconcile actions were taken.
Consequence: The Cluster Storage Operator could not perform z-stream upgrades if these CRDs were ever detected on the cluster.
Fix: Moved the v1alpha1 CRD check to later in the Reconcile loop.
Result: Z-stream upgrades now complete successfully, and v1alpha1 CRDs are detected without issue.
Description
W. Trevor King
2020-08-27 19:11:33 UTC
We disabled the snapshot cluster operator, then uninstalled the v1beta1 VolumeSnapshot* CRDs and installed the v1alpha1 VolumeSnapshot* CRDs, but the upgrade did not start.
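For reference, the CRD swap described above looks roughly like this. This is a sketch rather than the exact commands used: the step of disabling the snapshot cluster operator is omitted (the mechanism is not recorded here), and the v1alpha1 manifest name is a hypothetical placeholder since the actual file is not captured in this bug.

$ oc delete crd volumesnapshots.snapshot.storage.k8s.io \
    volumesnapshotcontents.snapshot.storage.k8s.io \
    volumesnapshotclasses.snapshot.storage.k8s.io
$ oc apply -f v1alpha1-volumesnapshot-crds.yaml   # hypothetical manifest with the v1alpha1 CRD definitions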
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-08-25-204643 True False 27h Cluster version is 4.6.0-0.nightly-2020-08-25-204643
For clusterversion version:
"spec": {
"channel": "stable-4.6",
"clusterID": "35c2b12d-7440-4c99-ac50-60f3aff0a059",
"desiredUpdate": {
"force": false,
"image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
"version": "4.6.0-0.nightly-2020-08-25-234625"
},
"upstream": "https://openshift-release.svc.ci.openshift.org/graph"
},
"status": {
"availableUpdates": [
{
"image": "registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed",
"version": "4.6.0-0.nightly-2020-08-25-234625"
}
],
"conditions": [
{
"lastTransitionTime": "2020-08-28T03:33:44Z",
"message": "Done applying 4.6.0-0.nightly-2020-08-25-204643",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2020-08-28T05:57:35Z",
"status": "False",
"type": "Failing"
},
{
"lastTransitionTime": "2020-08-28T05:59:20Z",
"message": "Cluster version is 4.6.0-0.nightly-2020-08-25-204643",
"status": "False",
"type": "Progressing"
},
{
"lastTransitionTime": "2020-08-28T14:06:11Z",
"status": "True",
"type": "RetrievedUpdates"
},
{
"lastTransitionTime": "2020-08-28T06:11:26Z",
"message": "Cluster operator storage cannot be upgraded between minor versions: SnapshotCRDControllerUpgradeable: Unable to update cluster as v1alpha1 version of VolumeSnapshot, VolumeSnapshotContent is detected. Remove these CRDs to allow the upgrade to proceed.",
"reason": "SnapshotCRDController_AlphaDetected",
"status": "False",
"type": "Upgradeable"
}
],
From the status, only the SnapshotCRDController_AlphaDetected reason is setting Upgradeable to False.
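To see the corresponding condition on the storage cluster operator itself, a command along these lines works (illustrative, not taken from the original report):

$ oc get clusteroperator storage -o json \
    | jq '.status.conditions[] | select(.type=="Upgradeable")'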
I tried to remove the v1alpha1 VolumeSnapshot* CRDs, but only the VolumeSnapshotClass CRD was removed; the VolumeSnapshot and VolumeSnapshotContent CRDs could not be deleted. Still working on it.
In any case, in my opinion we also hit this issue on 4.6.
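The report doesn't capture why the VolumeSnapshot and VolumeSnapshotContent CRDs were stuck; a common cause is leftover snapshot objects whose finalizers are never cleared once the controller is gone. A hedged sketch of how that is usually inspected and cleared follows (the object name is a placeholder, and stripping finalizers bypasses normal cleanup, so use with care):

$ oc get volumesnapshotcontents.snapshot.storage.k8s.io
$ oc get volumesnapshots.snapshot.storage.k8s.io -A
$ oc patch volumesnapshotcontents.snapshot.storage.k8s.io <name> \
    --type merge -p '{"metadata":{"finalizers":null}}'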
It looks like my previous conclusion was not correct.
After I managed to delete all the v1alpha1 VolumeSnapshot* CRDs, the upgrade still did not start. I pasted the clusterversion output here; this may need to be checked with the upgrade team.
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-08-25-204643 True False 28h Cluster version is 4.6.0-0.nightly-2020-08-25-204643
[wduan@MINT kubernetes-1.16]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: "2020-08-28T03:00:49Z"
generation: 18
managedFields:
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
.: {}
f:channel: {}
f:clusterID: {}
manager: cluster-bootstrap
operation: Update
time: "2020-08-28T03:00:49Z"
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:desiredUpdate:
.: {}
f:force: {}
f:image: {}
f:version: {}
manager: oc
operation: Update
time: "2020-08-29T05:04:46Z"
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:upstream: {}
manager: kubectl-patch
operation: Update
time: "2020-08-29T08:43:35Z"
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:availableUpdates: {}
f:conditions: {}
f:desired:
.: {}
f:image: {}
f:version: {}
f:history: {}
f:observedGeneration: {}
f:versionHash: {}
manager: cluster-version-operator
operation: Update
time: "2020-08-29T09:45:06Z"
name: version
resourceVersion: "1325193"
selfLink: /apis/config.openshift.io/v1/clusterversions/version
uid: 80d09500-7c10-46a0-aa74-8cbd2edec283
spec:
channel: stable-4.6
clusterID: 35c2b12d-7440-4c99-ac50-60f3aff0a059
desiredUpdate:
force: false
image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
version: 4.6.0-0.nightly-2020-08-25-234625
upstream: https://openshift-release.svc.ci.openshift.org/graph
status:
availableUpdates:
- image: registry.svc.ci.openshift.org/ocp/release@sha256:c4059816df4d67ff5dc2356dd4d278833d1e57282e16b9fef59554936e5562ed
version: 4.6.0-0.nightly-2020-08-25-234625
conditions:
- lastTransitionTime: "2020-08-28T03:33:44Z"
message: Done applying 4.6.0-0.nightly-2020-08-25-204643
status: "True"
type: Available
- lastTransitionTime: "2020-08-28T05:57:35Z"
status: "False"
type: Failing
- lastTransitionTime: "2020-08-28T05:59:20Z"
message: Cluster version is 4.6.0-0.nightly-2020-08-25-204643
status: "False"
type: Progressing
- lastTransitionTime: "2020-08-28T14:06:11Z"
status: "True"
type: RetrievedUpdates
desired:
image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
version: 4.6.0-0.nightly-2020-08-25-204643
history:
- completionTime: "2020-08-28T03:33:44Z"
image: registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7
startedTime: "2020-08-28T03:00:49Z"
state: Completed
verified: false
version: 4.6.0-0.nightly-2020-08-25-204643
observedGeneration: 7
versionHash: VdMtCylIGgw=
I uploaded the must-gather to http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.8306168597559322770_0829.tar.gz

> ...upgrade still did not start

I've spun this off into bug 1873900. It seems orthogonal to this bug's storage issue.

I tried another upgrade. This time the upgrade was triggered, but it was blocked by the csi-snapshot-controller cluster operator, which means this will still block the upgrade.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-08-26-202109 True True 47m Unable to apply 4.6.0-0.nightly-2020-08-27-005538: the cluster operator csi-snapshot-controller has not yet successfully rolled out
$ oc get co csi-snapshot-controller -ojson | jq .status.conditions
[
{
"lastTransitionTime": "2020-08-31T08:14:41Z",
"message": "Degraded: failed to sync CRDs: cluster-csi-snapshot-controller-operator does not support v1alpha1 version of snapshot CRDs volumesnapshots.snapshot.storage.k8s.io, volumesnapshotcontents.snapshot.storage.k8s.io, volumesnapshotclasses.snapshot.storage.k8s.io installed by user or 3rd party controller",
"reason": "_AlphaCRDsExist",
"status": "True",
"type": "Degraded"
},
{
"lastTransitionTime": "2020-08-31T00:50:50Z",
"reason": "AsExpected",
"status": "False",
"type": "Progressing"
},
{
"lastTransitionTime": "2020-08-31T00:50:50Z",
"reason": "AsExpected",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2020-08-31T00:44:49Z",
"reason": "AsExpected",
"status": "True",
"type": "Upgradeable"
}
]
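For the comparison discussed below, a small loop like this (illustrative, not from the original report) pulls just the relevant conditions from both operators:

$ for co in storage csi-snapshot-controller; do
    echo "== ${co}"
    oc get clusteroperator "${co}" -o json \
      | jq '[.status.conditions[] | select(.type=="Degraded" or .type=="Upgradeable")]'
  done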
> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller' was not blocked due to the presence of the v1alpha1 CRDs.

I think that means this storage-operator bug can be VERIFIED on 4.6, and we may need a new bug for the snapshot controller.

First, let me correct my typo; fortunately it did not mislead you.

> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller' was not blocked due to the presence of the v1alpha1 CRDs.

should read:

> In both cases, the 'storage' co was successfully upgraded; however, the 'csi-snapshot-controller' was blocked due to the presence of the v1alpha1 CRDs.

@Huffman, as I asked in comment 10, I'd like to confirm whether the 'storage' operator upgrading successfully is enough for VERIFIED. We also tried several scenarios for the 'csi-snapshot-controller' co to see what happened. In case B, where the 'csisnapshotcontrollers' CRD and the 'csi-snapshot-controller' co were deleted, is that case closer to the upgrade from 4.3 to 4.4?

> In other words, I think we need to ensure that this doesn't block upgrades in the storage operator in 4.3, as the CSI Snapshot Controller Operator shouldn't ever encounter this state (being present alongside the v1alpha1 CRDs).

I agree with your concern; it's OK for me to mark this BZ VERIFIED and switch to testing/verifying the upgrades from 4.3 -> 4.3 and 4.3 -> 4.4.

Thanks a lot for the explanation. Marking it as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196