Bug 2037307

Summary: [CNV-4.8.4] VMI is in LiveMigrate loop when Upgrading Cluster from 2.6.7/4.7.32 to OCP 4.8.13
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 4.8.4
Reporter: Kedar Bidarkar <kbidarka>
Assignee: Jed Lejosne <jlejosne>
QA Contact: Kedar Bidarkar <kbidarka>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aamirian, cnv-qe-bugs, sgott, ycui
Target Milestone: ---
Target Release: 4.8.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: virt-operator-container-v4.8.4-12 hco-bundle-registry-container-v4.8.4-44
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-20 17:21:23 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2014509

Description Kedar Bidarkar 2022-01-05 12:19:56 UTC
This bug was initially created as a copy of Bug #2008511

I am copying this bug because: 



Description of problem:

I have been upgrading a cluster from OCP/CNV 4.6/2.5.8 to OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 while verifying another bug.

CNV_SOURCE=osbs; the 2.5.8 to 2.6.7 upgrade path was 2.5.8 -> 2.6.4 -> 2.6.5 -> 2.6.6 -> 2.6.7.
After 2.6.7, the cluster status was good.


When upgrading from OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 (that is, CNV 2.6.7 running on OCP 4.8.13), we see the VMI stuck in a LiveMigrate loop.

From the pastebin below, the upgrade appears complete, but the worker nodes have not yet been updated to 4.8 images.


The worker nodes are stuck because the vm2-ocs-rhel84 VM is in a constant LiveMigrate loop.


I expect that stopping/starting or restarting the VM would let the worker nodes continue with the upgrade, but the VMI should have live-migrated successfully on its own. A sketch of how to inspect and break the loop is shown below.
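
For reference, a minimal sketch of how such a loop could be inspected and broken (not part of the original report; the VM name and the "default" namespace are assumptions based on the listings later in this bug):

$ oc get vmim -n default
# A loop shows VirtualMachineInstanceMigration objects cycling through
# Running/Failed for the same VMI.
$ oc delete vmim <migration-name> -n default
# Deleting the stuck VMIM aborts the in-flight migration.
$ virtctl restart vm2-ocs-rhel84 -n default
# Restarting the VM frees the node so the machine-config drain can proceed.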

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Upgrade is stuck due to a constant VMI live-migrate loop.

Expected results:
The upgrade is not stuck; the VMI live-migrates successfully instead of looping.

Additional info:
Running must-gather shows:

ClusterOperators:
	clusteroperator/machine-config is not available (Cluster not available for 4.8.13) because Failed to resync 4.8.13 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
	clusteroperator/operator-lifecycle-manager is not upgradeable because ClusterServiceVersions blocking cluster upgrade: openshift-local-storage/local-storage-operator.4.6.0-202109030220 is incompatible with OpenShift minor versions greater than 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible with OpenShift minor versions greater than 4.8


1) The local-storage operator was not updated; it remains the version installed with 4.6/2.5.8.
2) OCS was not updated; it remains the version installed with 4.6/2.5.8.
The operator versions can be confirmed as shown below.
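
A quick way to confirm which operator versions are still at the old level (the namespaces here are the usual install locations for these operators and are an assumption, not taken from the must-gather):

$ oc get csv -n openshift-local-storage   # local-storage operator version
$ oc get csv -n openshift-storage         # OCS operator version
$ oc get csv -n openshift-cnv             # CNV/HCO version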

Comment 3 Kedar Bidarkar 2022-01-07 16:43:13 UTC
Summary: LiveMigration was successful directly from virt-launcher version v2.6.4 (to be precise, v2.6.3-2) to v4.8.4-12.

VERIFIED with: virt-operator-container-v4.8.4-12 hco-bundle-registry-container-v4.8.4-44

[kbidarka@localhost migration]$ oc get pods
NAME                                 READY   STATUS      RESTARTS   AGE
virt-launcher-vm-rhel84-ocs-8nqps    0/1     Completed   0          90m
virt-launcher-vm-rhel84-ocs-pxp5b    1/1     Running     0          7m23s
virt-launcher-vm2-rhel84-ocs-9xldq   1/1     Running     0          9m41s
virt-launcher-vm2-rhel84-ocs-f7pd7   0/1     Error       0          90m
[kbidarka@localhost migration]$ oc get vmi 
NAME             AGE   PHASE     IP             NODENAME
vm-rhel84-ocs    21h   Running   10.xxx.yy.zzz   virt05m-kswp6-worker-0-k2xtk
vm2-rhel84-ocs   21h   Running   10.xxx.yy.zzz  virt05m-kswp6-worker-0-k2xtk
[kbidarka@localhost migration]$ virtctl console vm-rhel84-ocs
Successfully connected to vm-rhel84-ocs console. The escape sequence is ^]

[cloud-user@vm-rhel84-ocs ~]$ uptime
 11:22:46 up 21:15,  1 user,  load average: 0.00, 0.00, 0.00
[cloud-user@vm-rhel84-ocs ~]$ [kbidarka@localhost migration]$ 
[kbidarka@localhost migration]$ virtctl console vm2-rhel84-ocs
Successfully connected to vm2-rhel84-ocs console. The escape sequence is ^]

[cloud-user@vm2-rhel84-ocs ~]$ uptime
 11:22:57 up 21:15,  1 user,  load average: 0.00, 0.00, 0.00
[cloud-user@vm2-rhel84-ocs ~]$ [kbidarka@localhost migration]$ 
[kbidarka@localhost migration]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.25    True        False         19h     Cluster version is 4.8.25
[kbidarka@localhost migration]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.4   OpenShift Virtualization   4.8.4     kubevirt-hyperconverged-operator.v2.6.4   Succeeded

Direct CNV upgrade from v2.6.4  to v4.8.4
--------------------------------------------
[kbidarka@localhost ocp-cnv-scripts]$ oc get ip -A
NAMESPACE                 NAME            CSV                                         APPROVAL    APPROVED
openshift-cnv             install-fw9n5   kubevirt-hyperconverged-operator.v2.6.4     Manual      true
openshift-cnv             install-v42hc   kubevirt-hyperconverged-operator.v4.8.4     Manual      true
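
With Manual approval, each upgrade step only proceeds once its InstallPlan is approved. A sketch using the InstallPlan name from the listing above:

$ oc -n openshift-cnv patch installplan install-v42hc --type merge -p '{"spec":{"approved":true}}'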


---

Source Virt-Launcher Pod:
]$ ./fetch_pod_version_info_icsp.sh virt-launcher-vm-rhel84-ocs-8nqps default
Output: container-native-virtualization/virt-launcher/images/v2.6.3-2

Target Virt-Launcher Pod:
]$ ./fetch_pod_version_info_icsp.sh virt-launcher-vm-rhel84-ocs-pxp5b default
Output: container-native-virtualization/virt-launcher/images/v4.8.4-12
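
The fetch_pod_version_info_icsp.sh script is not attached to this bug; presumably it resolves the (ICSP-mirrored) image each container of the pod runs, which can also be read directly from the pod spec, e.g.:

$ oc get pod virt-launcher-vm-rhel84-ocs-pxp5b -n default -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'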

---

[kbidarka@localhost migration]$ oc logs -f virt-launcher-vm2-rhel84-ocs-f7pd7 | grep '\"level\":\"error\"'
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2745","subcomponent":"libvirt","thread":"42","timestamp":"2022-01-07T14:52:15.221000Z"}
{"component":"virt-launcher","level":"error","msg":"Error when closing file","pos":"os_helper.go:34","reason":"invalid argument","timestamp":"2022-01-07T14:52:16.550089Z"}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetect:415","subcomponent":"libvirt","thread":"30","timestamp":"2022-01-07T14:52:17.836000Z"}
{"component":"virt-launcher","level":"error","msg":"failed to get domain metadata","pos":"libvirt_helper.go:137","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:17.939299Z"}
{"component":"virt-launcher","level":"error","msg":"Could not fetch the Domain specification.","pos":"client.go:231","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:17.939365Z"}
{"component":"virt-launcher","level":"error","msg":"failed to get domain metadata","pos":"libvirt_helper.go:137","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.478106Z"}
{"component":"virt-launcher","level":"error","msg":"Could not fetch the Domain specification.","pos":"client.go:231","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.478207Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"failed to get domain job info","name":"vm2-rhel84-ocs","namespace":"default","pos":"manager.go:742","reason":"virError(Code=55, Domain=20, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-01-07T16:13:06.657070Z","uid":"b9fe7292-8841-4396-8e76-17f7636d9a0a"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:478","timestamp":"2022-01-07T16:13:07.906406Z"}
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:522","reason":"wait: no child processes","timestamp":"2022-01-07T16:13:07.912884Z"}



[kbidarka@localhost migration]$ oc logs -f virt-launcher-vm-rhel84-ocs-8nqps | grep '\"level\":\"error\"'
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2745","subcomponent":"libvirt","thread":"42","timestamp":"2022-01-07T14:52:15.734000Z"}
{"component":"virt-launcher","level":"error","msg":"Error when closing file","pos":"os_helper.go:34","reason":"invalid argument","timestamp":"2022-01-07T14:52:17.242507Z"}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetect:415","subcomponent":"libvirt","thread":"28","timestamp":"2022-01-07T14:52:18.312000Z"}
{"component":"virt-launcher","level":"error","msg":"failed to get domain metadata","pos":"libvirt_helper.go:137","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.410666Z"}
{"component":"virt-launcher","level":"error","msg":"Could not fetch the Domain specification.","pos":"client.go:231","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.410741Z"}
{"component":"virt-launcher","level":"error","msg":"failed to get domain metadata","pos":"libvirt_helper.go:137","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.981304Z"}
{"component":"virt-launcher","level":"error","msg":"Could not fetch the Domain specification.","pos":"client.go:231","reason":"virError(Code=80, Domain=20, Message='metadata not found: Requested metadata element is not present')","timestamp":"2022-01-07T14:52:18.981376Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"failed to get domain job info","name":"vm-rhel84-ocs","namespace":"default","pos":"manager.go:742","reason":"virError(Code=55, Domain=20, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-01-07T16:15:18.317062Z","uid":"af7fc901-7852-490f-bd60-8532dc7068d1"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:478","timestamp":"2022-01-07T16:15:19.550850Z"}
{"component":"virt-launcher","level":"error","msg":"Failed to reap process -1","pos":"virt-launcher.go:487","reason":"no child processes","timestamp":"2022-01-07T16:15:19.555211Z"}

Comment 9 errata-xmlrpc 2022-01-20 17:21:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.8.4 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0213