Bug 2063531
| Field | Value |
|---|---|
| Summary | Warm migrations from RHV may fail during cutover step on convert image to kubevirt |
| Product | Migration Toolkit for Virtualization |
| Component | General |
| Version | 2.3.0 |
| Status | CLOSED MIGRATED |
| Type | Bug |
| Severity | high |
| Priority | high |
| Reporter | Tzahi Ashkenazi <tashkena> |
| Assignee | Arik <ahadas> |
| QA Contact | Ilanit Stein <istein> |
| Docs Contact | Richard Hoch <rhoch> |
| CC | ahadas, istein, jortel, marnold, mlehrer, slucidi |
| Flags | istein: needinfo+ |
| Target Milestone | --- |
| Target Release | Future |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Clones | 2069330 (view as bug list) |
| Bug Blocks | 2069330 |
| Last Closed | 2023-07-11 08:39:27 UTC |
Description - Tzahi Ashkenazi - 2022-03-13 13:23:09 UTC
Do you have the guest conversion pod logs from any of the other VMs? I see you included one in a pastebin, but it would be good to have the rest for comparison. Also, have you tried migrating those specific VMs individually? That would help determine whether this is a consistent problem or a transient one.

Sam,
We'll add the conversion logs for the "passing" VMs shortly. A VM that fails when run as part of a 20-VM migration plan passes when run alone in its own migration plan, as stated below in point 3.
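As an aside, a minimal sketch of how the conversion pod logs could be collected for comparison; the namespace and the pod selection are assumptions, not details taken from this bug:

```
# Hypothetical sketch: dump the logs of every pod in the MTV namespace so the
# conversion logs of passing and failing VMs can be compared side by side.
NAMESPACE=openshift-mtv   # assumed namespace

for pod in $(oc -n "$NAMESPACE" get pods -o name); do
  # --all-containers captures every container in the pod, not just the main one
  oc -n "$NAMESPACE" logs "$pod" --all-containers > "${pod#pod/}.log" 2>&1 || true
done
```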
Adding here further test results reported by Tzahi:
1. First cycle, 20 VMs: 3 VMs failed. The errors in the events of the failed pods (from `oc describe pod`; see the command sketch after this list) were "MountVolume.SetUp failed for volume "libvirt-domain-xml"". The VMs that failed have around 4-12 snapshots.
2. Second cycle, 20 VMs: 2 VMs failed. No errors in the events on the pods, but the pods that failed show the same errors in their logs. The VMs that failed have 13 snapshots each.
3. Single-VM warm migration (one of the VMs that failed in the first cycle): completed successfully.
4. The controller pod has 10 restarts in total (on the main container) - bug 2063789:

       NAME                                   READY   STATUS    RESTARTS       AGE
       forklift-controller-854bbdd985-cfnkj   2/2     Running   10 (16h ago)   3d22h

5. In the 20-VM cycle that was running last night (19:00), the max connections again did not go above 3 (needs to be checked again live) - bug 2061345 seems to recur sometimes.
6. I ran the 20 VMs again to check the max connections, and it is OK: the first host has 10 and the second host has 10. Not sure what the issue was in last night's cycle (not related to the max-connections BZ, mtv-32).
7. VM auto-rhv-red-iscsi-warm-mig-50gb-70usage-vm-111111 (which failed in the first cycle) has 20 snapshots; one snapshot, named "tzahi", was created manually and completed successfully (to confirm there is no problem on the RHV side).
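A rough sketch of the checks behind points 1 and 4 above; the namespace and the pod name are placeholders, not values from this bug:

```
NAMESPACE=openshift-mtv               # assumed MTV namespace
FAILED_POD="<name-of-a-failed-pod>"   # placeholder

# Point 1: show the Events section of a failed pod, where the
# "MountVolume.SetUp failed for volume libvirt-domain-xml" message shows up.
oc -n "$NAMESPACE" describe pod "$FAILED_POD" | sed -n '/^Events:/,$p'

# Point 4: check the restart count of the forklift-controller pod.
oc -n "$NAMESPACE" get pods | grep forklift-controller
```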
Richard W.M. Jones:
"> 9. [ 19.097467] XFS (dm-1): Metadata corruption detected at xfs_buf_ioend+0x189/0x630 [xfs], xfs_inode block 0x526c0 xfs_inode_buf_verify
> 10. [ 19.098749] XFS (dm-1): Unmount and run xfs_repair

The filesystem could genuinely be corrupt, or possibly something went wrong during the copying / convergence part of the warm conversion which corrupted the filesystem. You could see if it's the first one by running 'xfs_repair -m /dev/sda1' inside the guest before conversion (note that the -m flag makes this non-destructive, it'll just tell you if there are errors without modifying anything).

On the more general point, I wasn't aware we were doing warm conversions (yet) with Kubevirt. Is this using Kubevirt & https://github.com/konveyor/forklift-controller or is there another code base involved here? I'm trying to find and fix all uses of virt-v2v at the moment ..."

istein: Since a VM that fails to migrate in a group of 20 VMs passes when migrated alone, this rules out that the _source_ disk filesystem is corrupted. Matthew, what are the next steps?

I would like to look at an importer log and a disk image from a failure. I am trying to reproduce it now, but if anyone else can do it faster, that would be helpful.

I finished today 20 cycles of warm migration using RHV as a provider with two RHV hosts and 20 VMs in total (10 main cycles + 10 "restart plans" for the VMs that failed).

Test summary:
* The failed VMs are 20% of the total.
* The error from the UI is new (not the original error this BZ was opened for): "Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk. Snapshot is currently being created for"
* The original error from this BZ is now present on the failed pod, in the events section of `oc describe pod/$pod_name`: "MountVolume.SetUp failed for volume "libvirt-domain-xml" : object "openshift-mtv"/"mtv-api-tests-22-27-03-07-51-42-f4d-plan-cbe80cf0-75c7-4d7bdhzd" not registered"
* Another message from the same command that may give more information about these errors: Container image "registry.redhat.io/migration-toolkit-virtualization/mtv-virt-v2v-rhel8@sha256:46b940d6ac5d8bee9d729e288f6511ca91007a1935a0214c31427de96f6a605e" already present on machine
* Most of the errors during warm migration seem to occur while the cutover is in progress (in the stage "Convert image to kubevirt").
* The full cycle results can be found here: https://docs.google.com/spreadsheets/d/1WqGPFVURjOxAs8IdvOuRRYy0D7hnh0gPDgUhi9aQKCM/edit#gid=0
* Log samples from 3 failed plans can be found here: https://drive.google.com/drive/folders/1tKid9sXJOLnfAS4IASd2WgSNSgf_Iz1g

Adjusting the title of the bug to reflect the updates mentioned in Comment 8.

We no longer have the "Convert image to kubevirt" phase when importing from RHV.

We've noticed such failures with MTV 2.4 as well; they are supposed to be handled in https://issues.redhat.com/browse/MTV-456.
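For completeness, a minimal sketch of the non-destructive filesystem check that Richard W.M. Jones suggested earlier in the thread; the device path comes from his comment and may differ per guest. Note that in current xfsprogs the no-modify (report-only) flag is spelled -n, while -m limits memory usage:

```
# Run inside the guest before conversion, against an unmounted filesystem
# (xfs_repair refuses to run on a mounted one).
# -n = no-modify mode: report corruption without writing to the device.
xfs_repair -n /dev/sda1
```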