Bug 2013494

Summary: [CNV-2.6.8] VMI is in LiveMigrate loop when Upgrading Cluster from 2.6.7/4.7.32 to OCP 4.8.13
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 2.6.7
Target Release: 2.6.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Kedar Bidarkar <kbidarka>
Assignee: Jed Lejosne <jlejosne>
QA Contact: Israel Pinto <ipinto>
CC: cnv-qe-bugs, dvossel, fdeutsch, ipinto, jlejosne, lpivarc, rmohr, sgott, stirabos, vromanso, zpeng
Fixed In Version: virt-operator-container-v2.6.8-5 hco-bundle-registry-container-v2.6.8-22
Clone Of: 2008511
Last Closed: 2021-11-17 18:40:02 UTC
Bug Depends On: 2008511
Bug Blocks: 2010742

Comment 1 Kedar Bidarkar 2021-10-13 04:28:18 UTC
During the OCP upgrade from 4.7.33 to 4.8.14 (CNV 4.8.z is not involved yet):
-----------------------------------------------------------------------------

Source VMI pod version: container-native-virtualization/virt-launcher/images/v2.5.8-3 (yes, virt-launcher was still using 2.5.8-3 while on CNV 2.6.7/4.7.33)
Target VMI pod version: container-native-virtualization/virt-operator/images/v2.6.7-8

---

NOTE: I paid close attention to the VMI pod versions during this upgrade.
The issue below is seen when the VMI pod live-migrates/upgrades from 2.5.8-3 to 2.6.7-8 (during the OCP 4.8.14 upgrade):


{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"vm3-ocs-rhel84","namespace":"default","pos":"manager.go:565","reason":"virError(Code=9, Domain=10, Message='operation failed: migration of disk vdb failed: Source and target image have different sizes')","timestamp":"2021-10-12T19:36:46.340948Z","uid":"7c294dfc-89d6-4b1a-a60d-70e390efa0da"}

Comment 4 Jed Lejosne 2021-10-28 20:45:42 UTC
This new failure is caused by virt-chroot not working properly in CNV 2.6.
More specifically, the --user option fails for every user, even root, saying the user doesn't exist.
I have not figured out why that happens.

However, after talking to Roman, we figured out that virt-chroot was not needed in the codepath involved in the issue, and in fact just made the code unnecessarily complicated.
So I pushed a fix to KubeVirt main and backported it to release-0.36 (linked above), which should fix the issue (by not using virt-chroot anymore).

It is worth noting that another (unrelated) function, GetImageInfo(), uses `virt-chroot --user`, and in that case the use of virt-chroot makes sense.
I assume that function does not work in CNV 2.6 either, but I'm not sure what its impact is.
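
For context, a hypothetical before/after sketch of the kind of simplification described (not the actual KubeVirt patch; the function names, the qemu user, the example path, and the exact virt-chroot argument order are assumptions for illustration; only the --user flag itself comes from this bug):

// Hypothetical before/after sketch of dropping the virt-chroot wrapper
// from a code path that only needs to inspect an image. Names, the
// "qemu" user, and the virt-chroot argument order are assumed.
package main

import (
	"fmt"
	"os/exec"
)

// Before (assumed shape): route the command through virt-chroot with
// --user, which in CNV 2.6 rejected every user name, even root.
func imageInfoViaChroot(path string) ([]byte, error) {
	return exec.Command("virt-chroot", "--user", "qemu",
		"exec", "--", "qemu-img", "info", "--output=json", path).Output()
}

// After (assumed shape): the launcher already runs with the right
// identity and mount namespace, so invoke qemu-img directly and keep
// the code path simpler.
func imageInfoDirect(path string) ([]byte, error) {
	return exec.Command("qemu-img", "info", "--output=json", path).Output()
}

func main() {
	out, err := imageInfoDirect("/tmp/disk.img") // hypothetical path
	fmt.Println(string(out), err)
}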

Comment 5 Roman Mohr 2021-10-29 12:43:04 UTC
(In reply to Jed Lejosne from comment #4)
> This new failure is caused by virt-chroot not working properly in CNV 2.6.
> More specifically, the --user option fails for every user, even root,
> saying the user doesn't exist.
> I have not figured out why that happens.
> 
> However, after talking to Roman, we figured out that virt-chroot was not
> needed in the codepath involved in the issue, and in fact just made the code
> unnecessarily complicated.
> So I pushed a fix to KubeVirt main and backported it to release-0.36 (linked
> above), which should fix the issue (by not using virt-chroot anymore).
> 
> It is worth noting that another (unrelated) function, GetImageInfo(), uses
> `virt-chroot --user`, and in that case the use of virt-chroot makes sense.
> I assume that function does not work in CNV 2.6 either, but I'm not sure
> what its impact is.

This is not an issue. It is only called at startup, where the new launcher image
is already in use. (There is a small race window where new handlers can get old launcher pods; that should normally be compatible too, but it is not worth fixing here and it would only cause a transient error.)

Comment 8 zhe peng 2021-11-03 09:50:44 UTC
Verified with build: v2.6.8-22

Summary:
Start with a VM on 2.5.8 (CNV 2.5.8, OCP 4.6).
Do the OCP upgrade to 4.7, apply the 2.6.8 ICSP immediately, and start the CNV upgrade with the following scenarios:

Scenario 1:
Upgrade CNV from 2.5.8 to 2.6.4
Migrate VM in CNV 2.6.4: LiveMigration - PASSED
    virt-launcher version: 2.6.4/2.6.3-2
Continue the upgrade to CNV 2.6.8
virt-launcher upgrade from 2.6.4/2.6.3-2 to 2.6.8-5 - PASSED

Scenario 2:
Upgrade CNV from 2.5.8 to 2.6.5
Migrate VM in CNV 2.6.5: LiveMigration - PASSED
    virt-launcher version: 2.6.5-2
Continue the upgrade to CNV 2.6.8
virt-launcher upgrade from 2.6.5-2 to 2.6.8-5 - PASSED

Scenario 3:
Upgrade CNV from 2.5.8 to 2.6.6
Migrate VM in CNV 2.6.6: LiveMigration - PASSED
    virt-launcher version: 2.6.6-7
Continue the upgrade to CNV 2.6.8
virt-launcher upgrade from 2.6.6-7 to 2.6.8-5 - PASSED

Scenario 4:
Upgrade CNV from 2.5.8 to 2.6.7
Migrate VM in CNV 2.6.7: LiveMigration - FAILED (https://bugzilla.redhat.com/show_bug.cgi?id=2019705)
    virt-launcher upgrade from 2.5.8 to 2.6.7:
        Source virt-launcher pod (2.5.8) remains in Running state.
        Target virt-launcher pod (2.6.7) enters Completed state.
        VMIM object shows Status: Failed.
Continue the upgrade to CNV 2.6.8
virt-launcher upgrade from 2.5.8 to 2.6.8-5 - PASSED

Scenario 5:
Upgrade CNV from 2.5.8 to 2.6.8
Was tested as part of Scenario 4 itself.
As shown above, the virt-launcher upgrade from 2.5.8 to 2.6.8-5 - PASSED

Moving this to VERIFIED.

Comment 9 Kedar Bidarkar 2021-11-08 10:38:04 UTC
LiveMigration of a VMI with the following scenario: PASSED

Source virt-launcher pod: v2.6.7
Target virt-launcher pod: v2.6.8-5

Comment 15 errata-xmlrpc 2021-11-17 18:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.8 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4725