Bug 1680361
Summary: [v2v][RHV][Scale] v2v Migration to RHV failed on timed out waiting for transfer to finalize
Product: Red Hat Enterprise Linux 7
Component: libguestfs
Version: 7.6
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Ilanit Stein <istein>
Assignee: Richard W.M. Jones <rjones>
QA Contact: Ilanit Stein <istein>
Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
CC: dagur, derez, dmetzger, fdupont, jomurphy, jsuchane, juzhou, michal.skrivanek, mtessun, mxie, mzhan, nsoffer, ptoscano, pvauter, rjones, tnisan, tzheng, xiaodwan, ytale, zili
Target Milestone: rc
Keywords: Regression, TestBlocker, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard: v2v
Fixed In Version: libguestfs-1.40.2-7.el7
Cloned to: 1750742, 1753121 (view as bug list)
Bug Blocks: 1683443, 1750742, 1753121
Cloudforms Team: RHEVM
Type: Bug
Last Closed: 2020-03-31 19:55:04 UTC
Description
Ilanit Stein
2019-02-24 10:40:52 UTC
Created attachment 1538121 [details]
v2v-import.log
As a comparison, I tested CFME-5.10.0.33 with RHV-4.3.1.1-0.1.el7 at small scale. That migration passed, so the problem does not appear to reproduce on RHV-4.3 at small scale.

This might be related to bug 1668720 - [RHV] CFME fail to refresh\discover RHV-4.3. Note, however, that although bug 1668720 also exists on CFME-5.10.0.33, v2v migration from VMware to RHV was successful on a small-scale RHV-4.3.

I checked several RHV-4.3 setups (with and without OVN) that are not Hosted Engine, and these DO NOT have the refresh failure. The problem seems to be specific to the RHV-4.3 Hosted-Engine configuration.

I managed to reproduce this bug using the same scale RHV and destination host.

Created attachment 1542913 [details]
v2v-import-2nd-try.log
Created attachment 1542914 [details]
v2v-import-wrapper-2nd-try.log
Created attachment 1542915 [details]
vdsm-2nd-try.log
I managed to reproduce this bug on the same RHV and cluster, but on another host.

The timeout happens in this code:

https://github.com/libguestfs/libguestfs/blob/89b5dabf8d1797e3875d949b6e2a903a5703be5c/v2v/rhv-upload-plugin.py#L518-L533

We wait up to 5 minutes for the "transfer" to "finalize". I'm not too familiar with the oVirt code, so I don't know what it's actually doing here. Michal: do you know who we could ask about this issue?

Ilanit, I suppose we also need engine.log for a start.
Tal, can anyone look at the disk upload finalization?

Created attachment 1543220 [details]
engine.log
(In reply to Richard W.M. Jones from comment #11)
> The timeout happens in this code:
>
> https://github.com/libguestfs/libguestfs/blob/89b5dabf8d1797e3875d949b6e2a903a5703be5c/v2v/rhv-upload-plugin.py#L518-L533

    transfer_service.finalize()

    # Wait until the transfer disk job is completed, since only then we
    # can be sure the disk is unlocked.  As this code is not very clear,
    # what's happening is that we are waiting for the transfer object to
    # cease to exist, which falls through to the exception case and then
    # we can continue.
    endt = time.time() + timeout
    try:
        while True:
            time.sleep(1)
            tmp = transfer_service.get()
            if time.time() > endt:
                raise RuntimeError("timed out waiting for transfer "
                                   "to finalize")
    except sdk.NotFoundError:
        pass

This does not look right. I think you should check the transfer.phase.

In the sdk examples we call this without waiting or checking the state (this looks wrong):
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

Finalizing an upload validates that the disk properties match the RHV metadata, deactivates the volume, and removes the ticket from the imageio server.

ImageTransfer phases are documented here:
http://ovirt.github.io/ovirt-engine-sdk/master/types.m.html#ovirtsdk4.types.ImageTransferPhase

Unfortunately there is no documentation, only the phase names. Based on common sense, I think the expected phases are:

    FINALIZING_SUCCESS -> FINISHED_SUCCESS

If finalizing failed (e.g. the uploaded image format is invalid), I think we should get:

    FINALIZING_SUCCESS -> FINISHED_FAILURE

Any other phase does not make sense and should probably be treated as an error.

Daniel, can you confirm that these are the expected phases, and give an estimate of the maximum time a client should wait for finalizing?

Ilanit, the vdsm.log from attachment 1542915 [details] seems to be the wrong log. Maybe it is from the wrong host, or maybe the right log was rotated.

In v2v-import-20190311T113412-22520.log from attachment 1542915 [details] we can see the transfer id:

    transfer.id = '19a4662a-a635-4b4a-afd8-e103cdb6780c'

grep 19a4662a-a635-4b4a-afd8-e103cdb6780c vdsm.log.1: (nothing)

Please use

    xzgrep 19a4662a-a635-4b4a-afd8-e103cdb6780c /var/log/vdsm/vdsm.log*

to find the right log.

However, for this case we don't need the vdsm log, because we can see very clearly in the engine log (attachment 1543220 [details]) that the transfer was successful.

1. Finalizing started:

2019-03-11 11:42:03,491Z INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-65) [b27de17c-3a7c-4ac7-942e-de22900e0e33] Updating image transfer 19a4662a-a635-4b4a-afd8-e103cdb6780c (image a6bf5c5a-6c2b-4f48-9112-e02d28f125fb) phase to Finalizing Success

2. Engine completed verification of the uploaded image:

2019-03-11 11:42:18,475Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VerifyUntrustedVolumeVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [c59ada05-1b39-40ca-848b-9857c096acfc] FINISH, VerifyUntrustedVolumeVDSCommand, return: StatusReturn:{status='Status [code=0, message=Done]'}, log id: 33d3c110

3. Engine marked the image as valid:

2019-03-11 11:42:19,266Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeLegalityVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [c59ada05-1b39-40ca-848b-9857c096acfc] FINISH, SetVolumeLegalityVDSCommand, return: , log id: 5d22296b
4. Upload became successful:

2019-03-11 11:42:19,298Z INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [c59ada05-1b39-40ca-848b-9857c096acfc] Updating image transfer 19a4662a-a635-4b4a-afd8-e103cdb6780c (image a6bf5c5a-6c2b-4f48-9112-e02d28f125fb) phase to Finished Success

Not sure why we see this log again about 20 seconds later...

2019-03-11 11:42:32,580Z INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [c59ada05-1b39-40ca-848b-9857c096acfc] Updating image transfer 19a4662a-a635-4b4a-afd8-e103cdb6780c (image a6bf5c5a-6c2b-4f48-9112-e02d28f125fb) phase to Finished Success

And again...

2019-03-11 11:42:32,908Z INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [c59ada05-1b39-40ca-848b-9857c096acfc] Successfully transferred disk '00000000-0000-0000-0000-000000000000' (command id '19a4662a-a635-4b4a-afd8-e103cdb6780c')

More than 5 minutes after finalize was called, the status changed to Cancelled. This looks like a bug.

2019-03-11 11:47:06,280Z INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-65) [12088a1d-a4ad-4f30-8928-42e04611893f] Updating image transfer 19a4662a-a635-4b4a-afd8-e103cdb6780c (image a6bf5c5a-6c2b-4f48-9112-e02d28f125fb) phase to Cancelled

But this shows that if the rhv-upload plugin had been checking the transfer phase, it would have succeeded.

Looking at the API model:
http://ovirt.github.io/ovirt-engine-api-model/master/#services/image_transfer

We document the expected phases clearly:

    When finishing the transfer, the user should call finalize. This will make
    the final adjustments and verifications for finishing the transfer process.
    For example: transfer_service.finalize()
    In case of an error, the transfer's phase will be changed to
    finished_failure, and the disk's status will be changed to Illegal.
    Otherwise it will be changed to finished_success, and the disk will be
    ready to be used. In both cases, the transfer entity will be removed
    shortly after.

But there is no example code.

So I think we have several issues:
- v2v: fix waiting after finalize
- ovirt-engine: ensure that the phase does not change after FINISHED_SUCCESS
- docs: add example code for waiting for finalize
- sdk/examples: wait for finalize in upload_*.py, download_*.py

Richard, Daniel, what do you think?

Example code is always good. Can you comment on what specifically we're doing wrong here?
https://github.com/libguestfs/libguestfs/blob/89b5dabf8d1797e3875d949b6e2a903a5703be5c/v2v/rhv-upload-plugin.py#L518-L533

> And again...
>
> 2019-03-11 11:42:32,908Z INFO
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (EE-ManagedThreadFactory-engineScheduled-Thread-78)
> [c59ada05-1b39-40ca-848b-9857c096acfc] Successfully transferred disk
> '00000000-0000-0000-0000-000000000000' (command id
> '19a4662a-a635-4b4a-afd8-e103cdb6780c')

I think the problem here is that the transfer wasn't cleared from the db, which is an issue we've encountered in older versions (the 'Successfully transferred disk' message should contain the disk's GUID, not an empty one, '00000000-0000-0000-0000-000000000000'). Is it reproducible in recent builds?

Anyway, I think that fixing the waiting after finalize in v2v could indeed solve the issue.
I suggest polling the disk's status instead, as we do in upload_disk.py (line 161) [*]. On status 'OK', we can continue the operation (a sketch of this approach follows the attachment below).

Since finalize executes teardown of the image in vdsm, I'm not sure what the timeout estimate should be, but I guess a couple of minutes should be good enough (or maybe a few minutes to keep it safe on a scale environment).

[*] https://github.com/oVirt/ovirt-engine-sdk/blob/fd728f8286c57967c88e275edd68643a1f71c173/sdk/examples/upload_disk.py

> This is more than 5 minutes after finalize was called - the status changed
> to Cancelled.
> This looks like a bug.

Created attachment 1543593 [details]
vdsm-2nd-try.log
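For illustration only, here is a minimal sketch of the disk-status polling suggested above, modeled on the upload_disk.py example. It assumes an ovirtsdk4 connection and the disk id already known to the plugin; the helper name and parameters are hypothetical, not part of the actual plugin:

    import time
    import ovirtsdk4.types as types

    # Hypothetical helper: after transfer_service.finalize(), wait until the
    # uploaded disk leaves the locked state before continuing.
    def wait_for_disk_ok(connection, disk_id, timeout):
        disk_service = (connection.system_service()
                        .disks_service()
                        .disk_service(disk_id))
        deadline = time.time() + timeout
        while True:
            time.sleep(1)
            disk = disk_service.get()
            if disk.status == types.DiskStatus.OK:
                return  # disk is unlocked, safe to continue
            if time.time() > deadline:
                raise RuntimeError("timed out waiting for disk to become unlocked")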
(In reply to Daniel Erez from comment #17)
> I suggest polling the disk's status instead, as we do in upload_disk.py
> (line 161) [*]
> On status 'OK', we can continue the operation.

But if we poll the disk, how do we detect an error in finalize?

I think it makes sense to poll the transfer.phase, waiting for FINISHED_SUCCESS and failing on FINISHED_FAILURE. I hope that when transfer.phase == FINISHED_SUCCESS the disk is already unlocked and we don't need to wait for it.

OK, can someone suggest a change to the Python code to make this work? I've no idea about how any of this stuff works.

(In reply to Richard W.M. Jones from comment #16)
> Example code is always good. Can you comment on what specifically we're
> doing wrong here?
> https://github.com/libguestfs/libguestfs/blob/89b5dabf8d1797e3875d949b6e2a903a5703be5c/v2v/rhv-upload-plugin.py#L518-L533

I think we need to do (untested):

    start = time.time()

    while True:
        time.sleep(1)

        transfer = transfer_service.get()

        if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            debug("finalized after %s seconds", time.time() - start)
            break

        if transfer.phase == types.ImageTransferPhase.FINALIZING_SUCCESS:
            if time.time() > start + timeout:
                raise RuntimeError("timed out waiting for transfer "
                                   "to finalize")
            continue

        raise RuntimeError("Unexpected transfer phase while finalizing "
                           "upload %r" % transfer.phase)

(In reply to Michal Skrivanek from comment #12)
> Ilanit, I suppose we also need engine.log for a start.
> Tal, can anyone look at the disk upload finalization?

Daniel, can you please have a look?

(In reply to Nir Soffer from comment #21)
> I think we need to do (untested):
>
>     start = time.time()
>
>     while True:
>         time.sleep(1)
>
>         transfer = transfer_service.get()

'transfer' could be None at this stage (in case the transfer has already been completed by the time we reach here), so we just need to add something like this:

    if transfer == None:
        disk = disk_service.get()
        if disk.status == types.DiskStatus.OK:
            continue

Makes sense?

>         if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
>             debug("finalized after %s seconds", time.time() - start)
>             break
>
>         if transfer.phase == types.ImageTransferPhase.FINALIZING_SUCCESS:
>             if time.time() > start + timeout:
>                 raise RuntimeError("timed out waiting for transfer "
>                                    "to finalize")
>             continue
>
>         raise RuntimeError("Unexpected transfer phase while finalizing "
>                            "upload %r" % transfer.phase)

Sent the suggested fix to v2v (not verified):
https://www.redhat.com/archives/libguestfs/2019-March/msg00044.html
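For readability, here is an untested consolidation of the two suggestions above (the phase-polling loop plus the handling of a transfer that has already been removed). It is only a sketch of the approach discussed in this thread, not the patch that was actually posted or committed; it assumes the plugin's existing transfer_service, disk_service and timeout variables:

    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    transfer_service.finalize()

    start = time.time()
    while True:
        time.sleep(1)
        try:
            transfer = transfer_service.get()
        except sdk.NotFoundError:
            transfer = None

        if transfer is None:
            # Transfer entity already removed by engine; check the disk instead.
            disk = disk_service.get()
            if disk.status == types.DiskStatus.OK:
                break
        elif transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            break
        elif transfer.phase == types.ImageTransferPhase.FINISHED_FAILURE:
            raise RuntimeError("transfer failed to finalize")
        elif transfer.phase != types.ImageTransferPhase.FINALIZING_SUCCESS:
            raise RuntimeError("unexpected transfer phase while finalizing "
                               "upload: %r" % transfer.phase)

        if time.time() > start + timeout:
            raise RuntimeError("timed out waiting for transfer to finalize")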
Retested with these versions:

ovirt-engine-4.3.2-0.1.el7.noarch
vdsm-4.30.11-1.el7ev.x86_64
ovirt-ansible-v2v-conversion-host-1.9.2-1.el7ev.noarch
virt-v2v-1.38.2-12.29.lp.el7ev.x86_64
CFME-5.10.1.2

Got the same failure in v2v-import.log:

nbdkit: python[1]: error: /var/tmp/v2v.3jibRB/rhv-upload-plugin.py: close: error: ['Traceback (most recent call last):\n', '  File "/var/tmp/v2v.3jibRB/rhv-upload-plugin.py", line 558, in close\n', 'RuntimeError: timed out waiting for transfer to finalize\n']

On the RHV conversion host, running

    $ yum downgrade libguestfs libguestfs-tools-c virt-v2v python-libguestfs

brings in the interim version 1.40.2-5.el7.1.bz1680361.v3.1.x86_64 (based on Derez's fix), using this repo:
http://brew-task-repos.usersys.redhat.com/repos/scratch/rjones/libguestfs/1.40.2/5.el7.1.bz1680361.v3.1/

The error 'RuntimeError: timed out waiting for transfer to finalize\n' no longer appeared. I managed to run a v2v migration, VMware->RHV, ISCSI->FC, VDDK, for a single VM with a 20GB disk successfully.

However, for single/20 VMs with 100GB disks, I encountered these 2 new bugs:

Bug 1740098 - [v2v][Scale][RHV] Single VM migration failed, but related virt-v2v error is not logged.
Bug 1740021 - [v2v][Scale][RHV] 20 VMs migration fail on "timed out waiting for disk to become unlocked"

Upstream in eeabb3fdc7756887b53106f455a7b54309130637, virt-v2v >= 1.41.5 and >= 1.40.3. Will be able to verify when RHEL-7.8 is released to QE.

Hi Ming,

I verified the fix (added manually to my RHV host) on the exact same large-scale RHV. After several runs, I can say that the issue does not appear. Based on that, I suggest moving the bug to VERIFIED.

Correction to comment #50: as I tested the fix on RHEL-7.7 and not RHEL-7.8, the RHEL-7.7 bug 1753121 can be moved to VERIFIED, and not this one.

Verified the bug with these builds:

virt-v2v-1.40.2-8.el7.x86_64
libguestfs-1.40.2-8.el7.x86_64
libvirt-4.5.0-26.el7.x86_64
qemu-kvm-rhev-2.12.0-35.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
VMware-vix-disklib-6.7.3-14389676.x86_64.tar.gz
RHV: 4.3.6.5-0.1.el7
host kernel: 3.10.0-1062.1.1.el7.x86_64
nbdkit-1.8.0-1.el7.x86_64

Steps:

Scenario 1:
1.1 Convert a rhel7 guest with a disk larger than 100GB from VMware to RHV on an RHV node with virt-v2v:

# virt-v2v -ic esx://root.75.219/?no_verify=1 rhv-4.3-150G -it vddk -io vddk-libdir=/home/vmware-vix-disklib-distrib -io vddk-thumbprint=89:89:AE:6D:DE:B2:89:01:B0:CF:FC:D3:C8:A8:1A:5E:14:E4:B4:54 -o rhv-upload -oo rhv-cafile=/root/ca.pem -oo rhv-direct -oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api -op /tmp/rhvpasswd -os fc_data -b ovirtmgmt -oo rhv-cluster=FC --password-file /tmp/esxpw -of raw -oa preallocated
Exception AttributeError: "'module' object has no attribute 'dump_plugin'" in <module 'threading' from '/usr/lib64/python2.7/threading.pyc'> ignored
[   1.2] Opening the source -i libvirt -ic esx://root.75.219/?no_verify=1 rhv-4.3-150G -it vddk -io vddk-libdir=/home/vmware-vix-disklib-distrib -io vddk-thumbprint=89:89:AE:6D:DE:B2:89:01:B0:CF:FC:D3:C8:A8:1A:5E:14:E4:B4:54
[   2.7] Creating an overlay to protect the source from being modified
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
[   4.1] Opening the overlay
libvirt needs authentication to connect to libvirt URI qemu:///system (see also: http://libvirt.org/auth.html http://libvirt.org/uri.html)
Please enter your authentication name: test
Please enter your password:
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
[ 138.9] Inspecting the overlay
[ 288.2] Checking for sufficient free disk space in the guest
[ 288.2] Estimating space required on target for each disk
[ 288.2] Converting Red Hat Enterprise Linux Server 7.6 (Maipo) to run on KVM
virt-v2v: This guest has virtio drivers installed.
[ 468.2] Mapping filesystem data to avoid copying unused and blank areas
[ 469.1] Closing the overlay
[ 470.6] Assigning disks to buses
[ 470.6] Checking if the guest needs BIOS or UEFI to boot
[ 470.6] Initializing the target -o rhv-upload -oa preallocated -oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api -op /tmp/rhvpasswd -os fc_data
[ 472.2] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.uZk0MN/nbdkit0.sock", "file.export": "/" } (raw)
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
 (100.00/100%)
[5289.3] Creating output metadata
[5301.4] Finishing off

1.2 Power on the guest; the guest checkpoints pass.

Scenario 2:
2.1 Import a guest with a disk larger than 100GB from VMware in the RHV GUI: click the import option on the Virtual Machine interface -> enter the VMware host info to load the guest -> select the guest to import.
2.2 The import finishes without error; power on the guest and the guest checkpoints pass.

Scenario 3:
3.1 Convert a guest with a disk larger than 100GB from VMware to RHV on a standalone v2v conversion server with virt-v2v:

# virt-v2v -ic esx://root.196.89/?no_verify=1 debian-120G -it vddk -io vddk-libdir=/home/vmware-vix-disklib-distrib -io vddk-thumbprint=E2:D1:AC:55:C8:9A:64:AA:5E:11:54:8C:D8:6E:60:54:D1:EF:18:B0 -o rhv-upload -oo rhv-cafile=/root/ca.pem -oo rhv-direct -oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api -op /tmp/rhvpasswd -os iscsi_data -b ovirtmgmt -oo rhv-cluster=ISCSI --password-file /tmp/esxpw -of raw -oa preallocated
Exception AttributeError: "'module' object has no attribute 'dump_plugin'" in <module 'threading' from '/usr/lib64/python2.7/threading.pyc'> ignored
[   0.5] Opening the source -i libvirt -ic esx://root.196.89/?no_verify=1 debian-120G -it vddk -io vddk-libdir=/home/vmware-vix-disklib-distrib -io vddk-thumbprint=E2:D1:AC:55:C8:9A:64:AA:5E:11:54:8C:D8:6E:60:54:D1:EF:18:B0
[   1.9] Creating an overlay to protect the source from being modified
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
[   5.3] Opening the overlay
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
[  30.2] Inspecting the overlay
[  33.8] Checking for sufficient free disk space in the guest
[  33.8] Estimating space required on target for each disk
[  33.8] Converting 9.4 to run on KVM
virt-v2v: warning: could not determine a way to update the configuration of Grub2
virt-v2v: This guest has virtio drivers installed.
[  47.8] Mapping filesystem data to avoid copying unused and blank areas
[  49.3] Closing the overlay
[  49.5] Assigning disks to buses
[  49.5] Checking if the guest needs BIOS or UEFI to boot
[  49.5] Initializing the target -o rhv-upload -oa preallocated -oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api -op /tmp/rhvpasswd -os iscsi_data
[  51.0] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.4C6XqY/nbdkit0.sock", "file.export": "/" } (raw)
nbdkit: error: VixDiskLib: VixDiskLibIsLegacyConnParams: the instance of VixDiskLibConnectParams is NOT allocated by VixDiskLib_AllocateConnectParams. The new features in 6.7 or later are not supported.
 (100.00/100%)
[3599.9] Creating output metadata
[3609.1] Finishing off

3.2 Power on the guest; the guest checkpoints pass.

Verified the same scenarios as in comment 53 again with the builds below; the results passed, so the bug moves from ON_QA to VERIFIED:

virt-v2v-1.40.2-8.el7.x86_64
libguestfs-1.40.2-8.el7.x86_64
libvirt-4.5.0-27.el7.x86_64
qemu-kvm-rhev-2.12.0-37.el7.x86_64
nbdkit-1.8.0-2.el7.x86_64
VMware-vix-disklib-6.7.3-14389676.x86_64.tar.gz
RHV: 4.3.6.5-0.1.el7
host kernel: 3.10.0-1062.1.1.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1082