Bug 2063531
| Field | Value |
|---|---|
| Summary | Warm migrations from RHV may fail during cutover step on convert image to kubevirt |
| Product | Migration Toolkit for Virtualization |
| Component | General |
| Version | 2.3.0 |
| Status | CLOSED MIGRATED |
| Type | Bug |
| Severity | high |
| Priority | high |
| Reporter | Tzahi Ashkenazi <tashkena> |
| Assignee | Arik <ahadas> |
| QA Contact | Ilanit Stein <istein> |
| Docs Contact | Richard Hoch <rhoch> |
| CC | ahadas, istein, jortel, marnold, mlehrer, slucidi |
| Flags | istein: needinfo+ |
| Target Milestone | --- |
| Target Release | Future |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Clones | 2069330 (view as bug list) |
| Bug Blocks | 2069330 |
| Last Closed | 2023-07-11 08:39:27 UTC |
Description - Tzahi Ashkenazi - 2022-03-13 13:23:09 UTC
Do you have the guest conversion pod logs from any of the other VMs? I see you included one in a pastebin, but it would be good to have the rest for comparison. Also, have you tried migrating those specific VMs individually? That would help determine whether this is a consistent problem or a transient one.

Sam,
We'll add the conversion logs for the "passing" VMs shortly. A VM that fails when run as part of a 20-VM migration plan passes when run alone in its own migration plan, as stated below in point 3.
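As an aside, a minimal sketch of how the conversion pod logs could be collected for comparison; the namespace and the pod selection are assumptions, not details taken from this bug:

```
# Hypothetical sketch: dump the logs of every pod in the MTV namespace so the
# conversion logs of passing and failing VMs can be compared side by side.
NAMESPACE=openshift-mtv   # assumed namespace

for pod in $(oc -n "$NAMESPACE" get pods -o name); do
  # --all-containers captures every container in the pod, not just the main one
  oc -n "$NAMESPACE" logs "$pod" --all-containers > "${pod#pod/}.log" 2>&1 || true
done
```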
Adding here further test results reported by Tzahi:
1. First cycle, 20 VMs: 3 VMs failed. The errors in the events of the failed pods (from `oc describe pod`; see the command sketch after this list) were "MountVolume.SetUp failed for volume "libvirt-domain-xml"". The VMs that failed have around 4-12 snapshots.
2. Second cycle, 20 VMs: 2 VMs failed. No errors in the events on the pods, but the pods that failed show the same errors in their logs. The VMs that failed have 13 snapshots each.
3. Single-VM warm migration (one of the VMs that failed in the first cycle): completed successfully.
4. The controller pod has 10 restarts in total (on the main container) - bug 2063789:

       NAME                                   READY   STATUS    RESTARTS       AGE
       forklift-controller-854bbdd985-cfnkj   2/2     Running   10 (16h ago)   3d22h

5. In the 20-VM cycle that was running last night (19:00), the max connections again did not go above 3 (needs to be checked again live) - bug 2061345 seems to recur sometimes.
6. I ran the 20 VMs again to check the max connections, and it is OK: the first host has 10 and the second host has 10. Not sure what the issue was in last night's cycle (not related to the max-connections BZ, mtv-32).
7. VM auto-rhv-red-iscsi-warm-mig-50gb-70usage-vm-111111 (which failed in the first cycle) has 20 snapshots; one snapshot, named "tzahi", was created manually and completed successfully (to confirm there is no problem on the RHV side).
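A rough sketch of the checks behind points 1 and 4 above; the namespace and the pod name are placeholders, not values from this bug:

```
NAMESPACE=openshift-mtv               # assumed MTV namespace
FAILED_POD="<name-of-a-failed-pod>"   # placeholder

# Point 1: show the Events section of a failed pod, where the
# "MountVolume.SetUp failed for volume libvirt-domain-xml" message shows up.
oc -n "$NAMESPACE" describe pod "$FAILED_POD" | sed -n '/^Events:/,$p'

# Point 4: check the restart count of the forklift-controller pod.
oc -n "$NAMESPACE" get pods | grep forklift-controller
```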
Richard W.M. Jones:
"> 9. [ 19.097467] XFS (dm-1): Metadata corruption detected at xfs_buf_ioend+0x189/0x630 [xfs], xfs_inode block 0x526c0 xfs_inode_buf_verify
> 10. [ 19.098749] XFS (dm-1): Unmount and run xfs_repair

The filesystem could genuinely be corrupt, or possibly something went wrong during the copying / convergence part of the warm conversion which corrupted the filesystem. You could see if it's the first one by running 'xfs_repair -m /dev/sda1' inside the guest before conversion (note that the -m flag makes this non-destructive, it'll just tell you if there are errors without modifying anything).

On the more general point, I wasn't aware we were doing warm conversions (yet) with Kubevirt. Is this using Kubevirt & https://github.com/konveyor/forklift-controller or is there another code base involved here? I'm trying to find and fix all uses of virt-v2v at the moment ..."

istein: Since a VM that fails to migrate in a group of 20 VMs passes when migrated alone, this rules out that the _source_ disk filesystem is corrupted. Matthew, what are the next steps?

I would like to look at an importer log and a disk image from a failure. I am trying to reproduce it now, but if anyone else can do it faster, that would be helpful.

I finished today 20 cycles of warm migration using RHV as a provider with two RHV hosts and 20 VMs in total (10 main cycles + 10 "restart plans" for the VMs that failed).

Test summary:
* The failed VMs are 20% of the total.
* The error from the UI is new (not the original error this BZ was opened for): "Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk. Snapshot is currently being created for"
* The original error from this BZ is now present on the failed pod, in the events section of `oc describe pod/$pod_name`: "MountVolume.SetUp failed for volume "libvirt-domain-xml" : object "openshift-mtv"/"mtv-api-tests-22-27-03-07-51-42-f4d-plan-cbe80cf0-75c7-4d7bdhzd" not registered"
* Another message from the same command that may give more information about these errors: Container image "registry.redhat.io/migration-toolkit-virtualization/mtv-virt-v2v-rhel8@sha256:46b940d6ac5d8bee9d729e288f6511ca91007a1935a0214c31427de96f6a605e" already present on machine
* Most of the errors during warm migration seem to occur while the cutover is in progress (in the stage "Convert image to kubevirt").
* The full cycle results can be found here: https://docs.google.com/spreadsheets/d/1WqGPFVURjOxAs8IdvOuRRYy0D7hnh0gPDgUhi9aQKCM/edit#gid=0
* Log samples from 3 failed plans can be found here: https://drive.google.com/drive/folders/1tKid9sXJOLnfAS4IASd2WgSNSgf_Iz1g

Adjusting the title of the bug to reflect the updates mentioned in Comment 8.

We no longer have the "Convert image to kubevirt" phase when importing from RHV.

We've noticed such failures with MTV 2.4 as well; they are supposed to be handled in https://issues.redhat.com/browse/MTV-456.
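For completeness, a minimal sketch of the non-destructive filesystem check that Richard W.M. Jones suggested earlier in the thread; the device path comes from his comment and may differ per guest. Note that in current xfsprogs the no-modify (report-only) flag is spelled -n, while -m limits memory usage:

```
# Run inside the guest before conversion, against an unmounted filesystem
# (xfs_repair refuses to run on a mounted one).
# -n = no-modify mode: report corruption without writing to the device.
xfs_repair -n /dev/sda1
```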