Description of problem:
OpenShift Virtualization 2.5 introduced a new version of its hyperconvergeds.hco.kubevirt.io API, named v1beta1, deprecating v1alpha1. OpenShift Virtualization 4.8 completely drops v1alpha1.

The issue is that:
- the CRD's status.storedVersions lists every version that has ever been a stored version (so in our case also v1alpha1, if CNV was initially installed at 2.4 time)
- once the migration is complete, removing unused storage versions from a CRD's .status.storedVersions is a manual process.

HCO is currently correctly upgrading its CR to v1beta1 but, after that, it is not removing v1alpha1 from .status.storedVersions on the CRD. This prevents CNV from being upgraded to 4.8.0, which drops v1alpha1.

See https://evancordell.com/posts/kube-apis-crds/ for more details.

Version-Release number of selected component (if applicable):
OpenShift Virtualization 2.6.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenShift Virtualization starting from 2.4.0 and try to upgrade up to 4.8.0
2. Alternative way to quickly get into this state, run:
   a. $ oc proxy
   b. $ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]}]'
      to forcefully inject "v1alpha1" under /status/storedVersions on the hyperconvergeds.hco.kubevirt.io CRD.

Actual results:
OLM refuses to upgrade CNV with:
risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

Expected results:
OpenShift Virtualization can be correctly upgraded from 2.6.z to 4.8.0.

Additional info:
This is not going to affect users who deployed OpenShift Virtualization starting after 2.5.0.
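The accumulation described above can be illustrated with a toy model (a minimal sketch, not the apiserver's actual implementation): every version that has ever been the storage version stays in the list until it is manually pruned.

```python
# Toy model of how a CRD's .status.storedVersions accumulates entries.
# This is an illustration only; the real list is maintained by the apiserver.
stored_versions = []

def set_storage_version(version):
    """Record a version as stored the first time it becomes the storage version."""
    if version not in stored_versions:
        stored_versions.append(version)

set_storage_version("v1alpha1")  # initial CNV 2.4 install
set_storage_version("v1beta1")   # CNV 2.5 upgrade switches storage to v1beta1
assert stored_versions == ["v1alpha1", "v1beta1"]  # v1alpha1 lingers

# Manual prune, required before a CRD that drops v1alpha1 can be applied:
stored_versions = [v for v in stored_versions if v != "v1alpha1"]
assert stored_versions == ["v1beta1"]
```

This is why simply upgrading the CR is not enough: the stale entry has to be removed from status explicitly.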
Sorry, wrong copy and paste; the right sequence for the workaround is:

Workaround:

# 1. run oc proxy
$ oc proxy &

# 2. remove v1alpha1 from .status.storedVersions on the HCO CRD
$ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1"]}]'

# 3. move the 4.8.0 install plan from Failed to Installing to restart the upgrade process that got stuck in the middle
$ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/operators.coreos.com/v1alpha1/namespaces/openshift-cnv/installplans/$(oc get installplan -n openshift-cnv | grep kubevirt-hyperconverged-operator.v4.8.0 | cut -d' ' -f1)/status --data '[{"op": "remove", "path": "/status/conditions"},{"op": "remove", "path": "/status/message"},{"op": "replace", "path": "/status/phase", "value": "Installing"}]'

# 4. now oc proxy can be killed, and the upgrade should continue as expected
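For anyone unfamiliar with JSON Patch, here is a minimal local sketch of what the two curl patches do, applied to plain dicts instead of live API objects (the field values are illustrative, not a cluster dump):

```python
def apply_patch(doc, ops):
    """Tiny subset of RFC 6902: 'replace' and 'remove' ops on /status paths."""
    for op in ops:
        keys = op["path"].strip("/").split("/")
        parent = doc
        for k in keys[:-1]:
            parent = parent[k]
        if op["op"] == "remove":
            del parent[keys[-1]]
        elif op["op"] == "replace":
            parent[keys[-1]] = op["value"]
    return doc

# Patch 1: drop v1alpha1 from the CRD's stored versions.
crd = {"status": {"storedVersions": ["v1beta1", "v1alpha1"]}}
apply_patch(crd, [{"op": "replace", "path": "/status/storedVersions",
                   "value": ["v1beta1"]}])
assert crd["status"]["storedVersions"] == ["v1beta1"]

# Patch 2: clear the failure and move the InstallPlan back to Installing
# so OLM retries the upgrade.
ip = {"status": {"conditions": [{"type": "Installed", "status": "False"}],
                 "message": "risk of data loss ...",
                 "phase": "Failed"}}
apply_patch(ip, [
    {"op": "remove", "path": "/status/conditions"},
    {"op": "remove", "path": "/status/message"},
    {"op": "replace", "path": "/status/phase", "value": "Installing"},
])
assert ip["status"] == {"phase": "Installing"}
```

The real patches go through the apiserver's /status subresource via oc proxy, which is why plain `oc patch` on the main resource is not enough here.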
The QE team can easily reproduce the issue and test the workaround proposed at https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4

How to:
1. Deploy CNV 2.6.5 on OCP 4.8, creating a subscription with something like:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  channel: stable
  installPlanApproval: Manual
  name: kubevirt-hyperconverged
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: kubevirt-hyperconverged-operator.v2.6.5

2. Approve the install plan for v2.6.5
3. Create a CR for HCO
4. Wait for a successfully deployed environment
5. Run:
$ oc proxy &
$ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]}]'
to replicate the condition that causes the bug.
6. Approve the installPlan for 4.8.0
7. Watch the installPlan for 4.8.0 until you see something like:

status:
  bundleLookups:
  - catalogSourceRef:
      name: redhat-operators
      namespace: openshift-marketplace
    identifier: kubevirt-hyperconverged-operator.v4.8.0
    ...
    replaces: kubevirt-hyperconverged-operator.v2.6.5
  catalogSources: []
  conditions:
  - lastTransitionTime: "2021-07-29T08:10:25Z"
    lastUpdateTime: "2021-07-29T08:10:25Z"
    message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD
      removes version v1alpha1 that is listed as a stored version on the existing CRD'
    reason: InstallComponentFailed
    status: "False"
    type: Installed
  message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD
    removes version v1alpha1 that is listed as a stored version on the existing CRD'
  phase: Failed

8. You have now fully reproduced the issue and should try applying the workaround at https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4
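For automated checks, the stuck state in step 7 can be detected from the InstallPlan status. A small sketch (assuming the status has already been parsed into a dict, e.g. from `oc get installplan ... -o json`; the helper name is hypothetical):

```python
def is_stuck_on_stored_versions(status):
    """Return True if an InstallPlan failed with the stored-versions data-loss error."""
    if status.get("phase") != "Failed":
        return False
    return any(
        cond.get("type") == "Installed"
        and cond.get("status") == "False"
        and "risk of data loss" in cond.get("message", "")
        for cond in status.get("conditions", [])
    )

# Condensed from the status pasted above:
sample = {
    "phase": "Failed",
    "conditions": [{
        "type": "Installed",
        "status": "False",
        "reason": "InstallComponentFailed",
        "message": "risk of data loss updating hyperconvergeds.hco.kubevirt.io: "
                   "new CRD removes version v1alpha1 that is listed as a stored "
                   "version on the existing CRD",
    }],
}
assert is_stuck_on_stored_versions(sample)
assert not is_stuck_on_stored_versions({"phase": "Installing"})
```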
1) With the steps documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c5, I was able to reproduce the issue on bm01-cnvqe2-rdu2.

Note: the install plan showed the following, as mentioned in the steps to reproduce:
=======
conditions:
- lastTransitionTime: "2021-07-30T18:39:32Z"
  lastUpdateTime: "2021-07-30T18:39:32Z"
  message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD
    removes version v1alpha1 that is listed as a stored version on the existing CRD'
  reason: InstallComponentFailed
  status: "False"
  type: Installed
message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD
  removes version v1alpha1 that is listed as a stored version on the existing CRD'
phase: Failed
plan:
=========

2) With the steps documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4, I saw the install plan move from Failed to Installing.

3) Subsequently the CSV went to Installing state:
==========
cnv-qe-jenkins@cnv-qe-infra-01:~$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.5   OpenShift Virtualization   2.6.5     kubevirt-hyperconverged-operator.v2.6.4   Replacing
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.5   Installing
cnv-qe-jenkins@cnv-qe-infra-01:~$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.5   Installing
cnv-qe-jenkins@cnv-qe-infra-01:~$
==========

4) HCO CR version went to `version: v4.8.0`
=========
lastHeartbeatTime: "2021-07-30T18:59:35Z"
lastTransitionTime: "2021-07-30T18:53:44Z"
message: Reconcile completed successfully
reason: ReconcileCompleted
status: "True"
type: Available
=========

5) All CNV pods are in Running state:
=========
cnv-qe-jenkins@cnv-qe-infra-01:~$ kubectl get pods -n openshift-cnv | grep -vi running
I0730 19:16:51.690607 1999146 request.go:668] Waited for 1.002933685s due to client-side throttling, not priority and fairness, request: GET:https://api.bm01-cnvqe2-rdu2.cnvqe2.lab.eng.rdu2.redhat.com:6443/apis/autoscaling.openshift.io/v1beta1?timeout=32s
NAME   READY   STATUS   RESTARTS   AGE
cnv-qe-jenkins@cnv-qe-infra-01:~$
==========

6) Validated deployments.
Debarati, can you please approve my docs PR to add this workaround to the 4.8 Known Issues? PR: https://github.com/openshift/openshift-docs/pull/35054 Preview build: https://deploy-preview-35054--osdocs.netlify.app/openshift-enterprise/latest/virt/virt-4-8-release-notes.html#virt-4-8-known-issues Thank you!
Moving to 2.6.6 because the fix entered that release due to a respin caused by another issue.
Followed the verification instructions for the fix as follows:

1. Deploy 2.6.5 (startingCSV: kubevirt-hyperconverged-operator.v2.6.5, installPlanApproval: Manual)

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get ip -n openshift-cnv
I0802 15:55:44.393999 82211 request.go:655] Throttling request took 1.081175725s, request: GET:https://api.iuo-dbn-265.cnv-qe.rhcloud.com:6443/apis/autoscaling.openshift.io/v1beta1?timeout=32s
NAME            CSV                                       APPROVAL   APPROVED
install-xmz2x   kubevirt-hyperconverged-operator.v2.6.5   Manual     true

2. Inject the issue:

oc proxy &
sleep 3
curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[ {"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]} ]'
kill $(ps -C "oc proxy" -o pid=)

3. Dump the HCO CRD, ensuring that v1alpha1 is there:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
    name: v1alpha1
  - v1alpha1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

4. Upgrade to 2.6.6:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.5   OpenShift Virtualization   2.6.5     kubevirt-hyperconverged-operator.v2.6.4   Replacing
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   InstallReady
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get csv -n openshift-cnv
I0802 16:26:43.865092 82583 request.go:655] Throttling request took 1.085912897s, request: GET:https://api.iuo-dbn-265.cnv-qe.rhcloud.com:6443/apis/console.openshift.io/v1?timeout=32s
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   Succeeded
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

5. Dump the HCO CRD, ensuring that v1alpha1 got removed:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
    name: v1alpha1
      'v1alpha1' was dropped from status.storedVersions
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1beta1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

6. Upgrade to 4.8.0: it should be smooth, without the need for any workaround.
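Steps 3 and 5 can also be scripted instead of eyeballing grep output. A minimal sketch operating on a CRD dumped with `oc get crd ... -o json` (the sample JSON is a trimmed, hypothetical dump, not taken from the cluster above):

```python
import json

def stored_versions(crd_json):
    """Extract .status.storedVersions from a JSON dump of a CRD."""
    return json.loads(crd_json).get("status", {}).get("storedVersions", [])

# Before the fix, after injecting the issue:
before = '{"status": {"storedVersions": ["v1beta1", "v1alpha1"]}}'
assert "v1alpha1" in stored_versions(before)

# After a successful 2.6.6 upgrade:
after = '{"status": {"storedVersions": ["v1beta1"]}}'
assert stored_versions(after) == ["v1beta1"]
assert "v1alpha1" not in stored_versions(after)
```

In practice the input string would come from `oc get crd hyperconvergeds.hco.kubevirt.io -o json`.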
Now, having upgraded OCP to 4.8, updating CNV to 4.8:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   Replacing
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.6   InstallReady
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.6   Succeeded
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

Checked that the hyperconverged CR got updated:

versions:
- name: operator
  version: v4.8.0

The HCO CRD got v1alpha1 dropped:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

All CNV pods were in Running state:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get pods -n openshift-cnv | grep -vi Running
NAME   READY   STATUS   RESTARTS   AGE
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$

Validated deployments as well. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3119