1986989 – OpenShift Virtualization 2.6.z cannot be upgraded to 4.8.0 initially deployed starting with <= 4.8

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Installation
Sub Component:
Version:	2.6.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	2.6.6
Assignee:	Simone Tiraboschi
QA Contact:	Debarati Basu-Nag
Docs Contact:	Pan Ousley
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-28 16:27 UTC by Simone Tiraboschi
Modified:	2021-08-10 17:33 UTC
CC List:	4 users
Fixed In Version:	hco-bundle-registry-container-v2.6.6-67
Doc Type:	Known Issue
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt hyperconverged-cluster-operator pull 1457	None	open	Update .status.storedVersions on the CRD on upgrades	2021-08-02 12:17:20 UTC
Github	kubevirt hyperconverged-cluster-operator pull 1458	None	None	None	2021-08-02 12:17:21 UTC
Github	kubevirt hyperconverged-cluster-operator pull 1459	None	None	None	2021-08-02 12:17:22 UTC
Red Hat Product Errata	RHSA-2021:3119	None	None	None	2021-08-10 17:33:48 UTC

Description Simone Tiraboschi 2021-07-28 16:27:50 UTC

Description of problem:
OpenShift Virtualization 2.5 introduced a new version of its hyperconvergeds.hco.kubevirt.io API named v1beta1 deprecating v1alpha1.

OpenShift Virtualization 4.8 completely drops v1alpha1.

The issue is that:
- A CRD's .status.storedVersions lists every version that has ever been a stored version (so in our case it also lists v1alpha1 if CNV was initially installed at 2.4 time).
- Once existing objects have been migrated to the new storage version, removing the unused versions from the CRD's .status.storedVersions is a manual process.

HCO currently upgrades its CR to v1beta1 correctly but, after that, it does not remove v1alpha1 from .status.storedVersions on the CRD.
This prevents CNV from being upgraded to 4.8.0, which drops v1alpha1.

See: https://evancordell.com/posts/kube-apis-crds/ for more details

Version-Release number of selected component (if applicable):
OpenShift Virtualization 2.6.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenShift Virtualization starting from 2.4.0 and try to upgrade up to 4.8.0
2. Alternatively, to get into this state quickly, run:
a. $ oc proxy
b. $ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]}]'

to forcefully inject "v1alpha1" under /status/storedVersions on hyperconvergeds.hco.kubevirt.io CRD.

Actual results:
the OLM refuses to upgrade CNV with:
risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

Expected results:
OpenShift Virtualization can be correctly upgraded from 2.6.z to 4.8.0.

Additional info:
This does not affect anyone whose initial OpenShift Virtualization deployment was at 2.5.0 or later.
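
To tell whether a given cluster is affected, it may be enough to look at what the CRD currently reports under .status.storedVersions. The following is a minimal sketch, not something taken from the fix itself: the helper name lists_v1alpha1 is hypothetical, and it assumes the versions are passed in as a space-separated list, which is what a jsonpath expression ending in [*] prints.

```shell
# Hypothetical helper (not part of HCO or OLM): succeeds when the
# space-separated list of stored versions still contains v1alpha1.
lists_v1alpha1() {
  local version
  # $1 is intentionally unquoted so that it splits into one word per version
  for version in $1; do
    if [ "$version" = "v1alpha1" ]; then
      return 0
    fi
  done
  return 1
}

# Possible usage against a live cluster (requires a logged-in oc):
#   stored=$(oc get crd hyperconvergeds.hco.kubevirt.io -o jsonpath='{.status.storedVersions[*]}')
#   if lists_v1alpha1 "$stored"; then echo "v1alpha1 is still a stored version"; fi
```

If the helper succeeds, the upgrade to 4.8.0 should be expected to hit the "risk of data loss" error reported under Actual results.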

Comment 4 Simone Tiraboschi 2021-07-29 08:47:11 UTC

Sorry, wrong copy and paste;
the right sequence for the workaround is:

Workaround:

 # run oc proxy in the background
 $ oc proxy &

 # remove v1alpha1 from .status.storedVersions on HCO CRD
 $ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1"]}]'

 # move 4.8.0 install plan from Failed to Installing to restart the upgrade process that got stuck in the middle
 $ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/operators.coreos.com/v1alpha1/namespaces/openshift-cnv/installplans/$(oc get installplan -n openshift-cnv | grep kubevirt-hyperconverged-operator.v4.8.0 | cut -d' ' -f1)/status --data '[{"op": "remove", "path": "/status/conditions"},{"op": "remove", "path": "/status/message"},{"op": "replace", "path": "/status/phase", "value": "Installing"}]'

 # now oc proxy can be killed, and the upgrade should continue as expected
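
The CRD patch step of this workaround could also be wrapped so that the proxy is stopped automatically. This is only a sketch under a few assumptions: the function names stored_versions_patch and patch_hco_crd_stored_versions are hypothetical, the 3 second pause is a guess at how long oc proxy needs to start listening, and curl's --fail flag is an addition so that an HTTP error shows up in the exit status. It does not cover the install plan step, which still has to be run separately.

```shell
# Hypothetical helper: print a JSON patch that replaces
# /status/storedVersions with the versions given as arguments.
stored_versions_patch() {
  local values="" version
  for version in "$@"; do
    # append a comma separator only when values is already non-empty
    values="${values:+$values, }\"$version\""
  done
  printf '[{"op": "replace", "path": "/status/storedVersions", "value":[%s]}]\n' "$values"
}

# Hypothetical wrapper: start oc proxy, patch the HCO CRD status, stop the proxy.
patch_hco_crd_stored_versions() {
  local proxy_pid rc
  oc proxy &      # serves the API on localhost:8001 by default
  proxy_pid=$!
  sleep 3         # assumption: enough time for the proxy to start listening
  curl --fail --header "Content-Type: application/json-patch+json" --request PATCH \
    "http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status" \
    --data "$(stored_versions_patch "$@")"
  rc=$?
  kill "$proxy_pid"
  return "$rc"
}

# Possible usage, keeping only v1beta1 as a stored version:
#   patch_hco_crd_stored_versions v1beta1
```

Calling stored_versions_patch with "v1beta1 v1alpha1" instead would build the payload used to inject the problem, so the same helper can serve both the reproducer and the workaround.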

Comment 5 Simone Tiraboschi 2021-07-29 08:56:31 UTC

QE team can easily reproduce the issue and test the workaround proposed at https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4

how to:

1. Deploy CNV 2.6.5 on OCP 4.8 creating a subscription with something like:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  channel: stable
  installPlanApproval: Manual
  name: kubevirt-hyperconverged
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: kubevirt-hyperconverged-operator.v2.6.5


2. Approve the install plan for v2.6.5

3. Create a CR for HCO

4. Wait for a successfully deployed environment

5. Run:
  $ oc proxy &
  $ curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]}]'

to replicate the condition that causes the bug.

6. Approve the installPlan for 4.8.0

7. Watch the installPlan for 4.8.0 until you see something like:
status:
  bundleLookups:
  - catalogSourceRef:
      name: redhat-operators
      namespace: openshift-marketplace
    identifier: kubevirt-hyperconverged-operator.v4.8.0
    ...
    replaces: kubevirt-hyperconverged-operator.v2.6.5
  catalogSources: []
  conditions:
  - lastTransitionTime: "2021-07-29T08:10:25Z"
    lastUpdateTime: "2021-07-29T08:10:25Z"
    message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD'
    reason: InstallComponentFailed
    status: "False"
    type: Installed
  message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD'
  phase: Failed

8. You have now fully reproduced the issue and can try applying the workaround at https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4
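
Instead of watching the install plan by hand in step 7, its phase could be polled. A rough sketch follows; the function name wait_for_installplan_phase, its argument order, and the default of 60 attempts at 10 second intervals are all assumptions made for illustration, and the install plan name has to be looked up first (for example with oc get installplan -n openshift-cnv).

```shell
# Hypothetical helper: poll an install plan in openshift-cnv until
# .status.phase equals the wanted value, or give up after N attempts.
# Arguments: <installplan-name> <wanted-phase> [attempts] [interval-seconds]
wait_for_installplan_phase() {
  local name="$1" wanted="$2" attempts="${3:-60}" interval="${4:-10}"
  local phase i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(oc get installplan "$name" -n openshift-cnv -o jsonpath='{.status.phase}')
    if [ "$phase" = "$wanted" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Possible usage: wait until the 4.8.0 install plan reports Failed
#   wait_for_installplan_phase <installplan-name> Failed
```

The same helper could be reused after the workaround to wait for the phase to move away from Failed, for example by waiting for Installing.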

Comment 6 Debarati Basu-Nag 2021-07-30 19:17:59 UTC

1) With steps documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c5, I was able to reproduce the issue on bm01-cnvqe2-rdu2. 
Note: Install plan showed the following as mentioned in the steps to reproduce:
=======
 conditions:
  - lastTransitionTime: "2021-07-30T18:39:32Z"
    lastUpdateTime: "2021-07-30T18:39:32Z"
    message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD
      removes version v1alpha1 that is listed as a stored version on the existing
      CRD'
    reason: InstallComponentFailed
    status: "False"
    type: Installed
  message: 'risk of data loss updating hyperconvergeds.hco.kubevirt.io: new CRD removes
    version v1alpha1 that is listed as a stored version on the existing CRD'
  phase: Failed
  plan:
=========
2) With steps documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1986989#c4, I saw install plan was moved from Failed to Installing.
3) Subsequently csv went to installing state:
==========
cnv-qe-jenkins@cnv-qe-infra-01:~$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.5   OpenShift Virtualization   2.6.5     kubevirt-hyperconverged-operator.v2.6.4   Replacing
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.5   Installing
cnv-qe-jenkins@cnv-qe-infra-01:~$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.5   Installing
cnv-qe-jenkins@cnv-qe-infra-01:~$ 
==========
4) HCO CR version went to `version: v4.8.0`
=========
lastHeartbeatTime: "2021-07-30T18:59:35Z"
      lastTransitionTime: "2021-07-30T18:53:44Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
=========
5) All cnv pods are in running state
=========
cnv-qe-jenkins@cnv-qe-infra-01:~$ kubectl get pods -n openshift-cnv | grep -vi running
I0730 19:16:51.690607 1999146 request.go:668] Waited for 1.002933685s due to client-side throttling, not priority and fairness, request: GET:https://api.bm01-cnvqe2-rdu2.cnvqe2.lab.eng.rdu2.redhat.com:6443/apis/autoscaling.openshift.io/v1beta1?timeout=32s
NAME                                                  READY   STATUS    RESTARTS   AGE
cnv-qe-jenkins@cnv-qe-infra-01:~$
==========
6) Validated deployments.

Comment 7 Pan Ousley 2021-08-02 13:11:12 UTC

Debarati, can you please approve my docs PR to add this workaround to the 4.8 Known Issues? 

PR: https://github.com/openshift/openshift-docs/pull/35054
Preview build: https://deploy-preview-35054--osdocs.netlify.app/openshift-enterprise/latest/virt/virt-4-8-release-notes.html#virt-4-8-known-issues

Thank you!

Comment 8 Simone Tiraboschi 2021-08-02 13:41:57 UTC

Moving to 2.6.6 because it entered that release due to a respin caused by another issue.

Comment 9 Debarati Basu-Nag 2021-08-03 13:09:50 UTC

Followed the verification instruction for the fix as follows:
1. Deploy 2.6.5 (startingCSV: kubevirt-hyperconverged-operator.v2.6.5, installPlanApproval: Manual)

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get ip -n openshift-cnv
I0802 15:55:44.393999   82211 request.go:655] Throttling request took 1.081175725s, request: GET:https://api.iuo-dbn-265.cnv-qe.rhcloud.com:6443/apis/autoscaling.openshift.io/v1beta1?timeout=32s
NAME            CSV                                       APPROVAL   APPROVED
install-xmz2x   kubevirt-hyperconverged-operator.v2.6.5   Manual     true
2. Inject the issue:
oc proxy &
sleep 3
curl --header "Content-Type: application/json-patch+json" --request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/hyperconvergeds.hco.kubevirt.io/status --data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1beta1", "v1alpha1"]}]'
kill $(ps -C "oc proxy" -o pid=)

3. Dump HCO CRD ensuring that v1alpha1 is there

 [cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
    name: v1alpha1
  - v1alpha1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
4. Upgrade to 2.6.6

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.5   OpenShift Virtualization   2.6.5     kubevirt-hyperconverged-operator.v2.6.4   Replacing
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   InstallReady
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 


[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get csv -n openshift-cnv
I0802 16:26:43.865092   82583 request.go:655] Throttling request took 1.085912897s, request: GET:https://api.iuo-dbn-265.cnv-qe.rhcloud.com:6443/apis/console.openshift.io/v1?timeout=32s
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   Succeeded
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
5. Dump HCO CRD ensuring that v1alpha1 got removed

 [cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
    name: v1alpha1
'v1alpha1' was dropped from status.storedVersions

    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1beta1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
6. Upgrade to 4.8.0: it should be smooth without the need for any workaround.
Upgraded OCP to 4.8, then updated CNV to 4.8:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.6   OpenShift Virtualization   2.6.6     kubevirt-hyperconverged-operator.v2.6.5   Replacing
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.6   InstallReady
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.6   Succeeded
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
Checked hyperconverged CR got updated:

  versions:
    - name: operator
      version: v4.8.0
HCO CRD got v1alpha1 dropped:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ oc get crd hyperconvergeds.hco.kubevirt.io -o yaml | grep v1alpha1
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
All cnv pods were in running state:

[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ kubectl get pods -n openshift-cnv | grep -vi Running
NAME                                                  READY   STATUS    RESTARTS   AGE
[cnv-qe-jenkins@iuo-dbn-265-8gqzr-executor ~]$ 
Validated deployments as well.

Marking as verified.

Comment 14 errata-xmlrpc 2021-08-10 17:33:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119
