Bug 1849861
Summary: [RFE][v2v] [upload/download disk/CBT] Failed to attach disk to the VM - disk is OK but image transfer still holds a lock on the disk
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.4.0
Status: CLOSED DEFERRED
Severity: high
Priority: high
Reporter: Xiaodai Wang <xiaodwan>
Assignee: Pavel Bar <pbar>
QA Contact: Xiaodai Wang <xiaodwan>
CC: aefrat, bzlotnik, eshenitz, juzhou, lsvaty, mavital, mxie, mzhan, nsoffer, pbar, ptoscano, rjones, tnisan, tyan, tzheng, Yury.Panchenko
Keywords: Automation, FutureFeature, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Cloned to: 1976020 (view as bug list)
Last Closed: 2021-07-11 11:50:03 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 1976020, 1976024
Description (Xiaodai Wang, 2020-06-23 03:30:49 UTC)
Here is the discussion: http://post-office.corp.redhat.com/archives/v2v-devel/2020-June/msg00035.html

Just to make the command more clear:

/bin/virt-v2v \
-i ova \
-ic qemu:///system \
-o rhv-upload \
-os nfs_data \
-of raw \
-b ovirtmgmt \
-n ovirtmgmt
/var/lib/libvirt/images/v2v/ova-images/case84967/invalid_line_in_manifest.ova \
-on ova_vm_toT \
-oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api \
-op /tmp/rhv_upload_passwd_file \
-oo rhv-cafile=/tmp/rhv_upload_ca.pem \
-oo rhv-cluster=NFS \
-oo rhv-direct \
-v \
-x

The command looks broken, see the -n argument.

Why do you use ova? Are you testing ova support?

Can you reproduce using a local disk?

virt-v2v \
-i disk disk.img \
-o rhv-upload \
-oc https://my.engine/ovirt-engine/api \
-op password-file \
-on v2v \
-os my-domain \
-of raw \
-oa sparse \
-oo rhv-cafile=/etc/pki/vdsm/certs/cacert.pem \
-oo rhv-cluster=my-cluster \
-oo rhv-direct=true

The issue described sounds like an engine bug, so it should be reproducible using a local disk.

(In reply to Nir Soffer from comment #3)
> Just to make the command more clear:
>
> [...]
>
> The command looks broken, see the -n argument.

It's not a broken argument; '-n' is a duplicate of '-b'.

> Why do you use ova? Are you testing ova support?

Yes, we also have OVA test cases for v2v.

> Can you reproduce using a local disk?
> [...]
>
> The issue described sounds like an engine bug, so it should be reproducible
> using a local disk.

Yes, I reproduced it with a local disk. Steps:

1) Convert the OVA file to a qcow2 file:

qemu-img 'create' '-q' '-f' 'qcow2' '-b' 'json:{ "file": { "driver": "raw", "offset": 6656, "size": 4063232, "file": { "driver": "file", "filename": "/var/lib/libvirt/images/v2v/ova-images/case84967/invalid_line_in_manifest.ova" } } }' '-o' 'compat=1.1,backing_fmt=vmdk' '/var/tmp/v2v.qcow2'

2) export LIBGUESTFS_BACKEND=direct

3) Run the v2v command (I copied '/var/tmp/v2v.qcow2' to '/root/' in case it gets deleted by mistake):

# virt-v2v -i disk -ic qemu:///system -o rhv-upload -os nfs_data -of raw -b ovirtmgmt -n ovirtmgmt /root/v2v.qcow2 -on testvm -oc https://hp-dl360eg8-03.lab.eng.pek2.redhat.com/ovirt-engine/api -op /root/rhv_upload_passwd_file -oo rhv-cafile=/root/rhv_upload_ca.pem -oo rhv-cluster=NFS -oo rhv-direct -v -x

I'm going to upload the v2v and ovirt-engine logs, and also the local qcow2 disk if you'd like to reproduce it in your environment.

(In reply to Tal Nisan from comment #6)
> Richard, any insights about what is the issue here?

From the virt-v2v point of view the conversion is completely successful in both cases. So the possibilities can only be:

* It's a RHV engine problem of some kind.
* virt-v2v's -o rhv-upload mode should wait a bit longer, or check the disk status in a different way from how we're doing it at the moment, to give the disk time to "become available" (whatever that means).
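The disk-status wait being discussed boils down to a poll-with-timeout loop. A minimal, self-contained sketch of that logic (a hypothetical wait_for_status helper, not the actual rhv-upload-plugin.py code, which uses the oVirt SDK's types.DiskStatus values):

```python
import time

def wait_for_status(get_status, want="ok", timeout=60.0, interval=1.0):
    """Poll get_status() until it returns `want`.

    Mirrors the shape of the rhv-upload wait loop: the disk moves
    "locked" -> "ok" on success, while "illegal" means the upload
    failed, so we fail fast instead of looping until the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == want:
            return status
        if status == "illegal":
            raise RuntimeError("disk is in ILLEGAL state")
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "timed out waiting for disk to become %r" % want)
        time.sleep(interval)
```

Note that, as the rest of this bug shows, even a correct wait on the disk status is not sufficient here, because the engine reports the disk as OK before releasing its in-memory lock.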
For reference, here's what we're doing now to wait for the disk to appear:

https://github.com/libguestfs/virt-v2v/blob/c1caf7132000a4560c3e20c2753978e8dd10036a/v2v/rhv-upload-plugin.py#L593

and the general VM creation code here, which runs after the above:

https://github.com/libguestfs/virt-v2v/blob/master/v2v/rhv-upload-createvm.py

To add to that, in both cases in this bug we print this message:

2020-06-21 23:35:41,381 process L0458 DEBUG| [stderr] finalized after 1.02907133102417 seconds

finalized after 1.0110740661621094 seconds

which, given the code above, seems to mean that the disk status did change to types.DiskStatus.OK.

(In reply to Richard W.M. Jones from comment #7)
> (In reply to Tal Nisan from comment #6)
> > Richard, any insights about what is the issue here?
>
> From the virt-v2v point of view the conversion is completely successful in
> both cases.
>
> So the possibilities can only be:
>
> * It's a RHV engine problem of some kind.

Very likely. I have seen cases in the past where finalizing the transfer just hangs forever. This is not reproducible, and the next transfer succeeds.

> * virt-v2v's -o rhv-upload mode should wait a bit longer or
> check the disk status in a different way from how we're doing
> it at the moment to give time for the disk to "become available"
> (whatever that means). For reference here's what we're doing now
> to wait for the disk to appear:
>
> https://github.com/libguestfs/virt-v2v/blob/c1caf7132000a4560c3e20c2753978e8dd10036a/v2v/rhv-upload-plugin.py#L593

This should be the correct way to wait for the disk, but I don't know the engine internals.

> and the general VM creation code here which runs after the above:
>
> https://github.com/libguestfs/virt-v2v/blob/master/v2v/rhv-upload-createvm.py

This does not attach the disk but creates a VM from an OVF; maybe there is an engine bug handling this OVF, or some issue with the OVF.

(In reply to Richard W.M.
Jones from comment #8)
> To add that in both cases in this bug we print this message:
>
> 2020-06-21 23:35:41,381 process L0458 DEBUG| [stderr] finalized
> after 1.02907133102417 seconds
>
> finalized after 1.0110740661621094 seconds
>
> which given the code above seems to mean that the disk status did
> change to types.DiskStatus.OK.

This is not the same message as in:
https://github.com/libguestfs/virt-v2v/blob/c1caf7132000a4560c3e20c2753978e8dd10036a/v2v/rhv-upload-plugin.py#L593

The relevant logs from finalize_transfer are:

    transfer %s finalized in %.3f seconds

on success, and in case of timeout (60 seconds?) we raise:

    RuntimeError("timed out waiting for transfer %s to finalize")

Can you point me to the virt-v2v code used here?

Oh, you're right (and I can't read ...). I note also that this is the old version of virt-v2v (1.40.2; the new version is 1.42.0). This message changed in the following upstream commit:

https://github.com/libguestfs/virt-v2v/commit/75fcf46379ec8cad31684f3ca3dd24caa7aaaa46

which means (as expected) we're not carrying all of the patches needed in this branch. Is there a chance to move to RHEL AV 8.3.0?

(In reply to Richard W.M. Jones from comment #11)
> Oh you're right (and I can't read ...). [...]
>
> which means (as expected) we're not carrying all of the patches
> needed in this branch.

I think the issue with the old code was not handling the ILLEGAL state, which could end in an endless loop instead of a fast failure; otherwise we check for the LOCKED and OK states, so based on the logs we detected that the disk was OK.

Looking in the engine log:

1.
Transfer takes an exclusive lock on the disk:

2020-06-23 18:18:27,005+08 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-1340) [539a982e-8973-4783-a56e-298622c811ec] Lock Acquired to object 'EngineLock:{exclusiveLocks='[700b798a-4434-44c7-b500-a03ff9363faf=DISK]', sharedLocks='[]'}'

2. Transfer was finalized:

2020-06-23 18:18:29,638+08 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-1340) [1ff2ecec-eb40-46f3-9868-bf903f8f9db8] Updating image transfer 9996fe27-2881-41a0-99c1-80d0005ed3ea (image 700b798a-4434-44c7-b500-a03ff9363faf) phase to Finalizing Success

3. Starting to import the VM from configuration:

2020-06-23 18:18:33,098+08 INFO [org.ovirt.engine.core.bll.exportimport.ImportVmFromConfigurationCommand] (default task-1340) [b313fbb8-c1c7-45c1-b919-2da431bb94ca] Lock Acquired to object 'EngineLock:{exclusiveLocks='[a65fddb4-48e1-46e1-a080-074d156cab8b=VM, testvm=VM_NAME]', sharedLocks='[a65fddb4-48e1-46e1-a080-074d156cab8b=REMOTE_VM]'}'

4. Import VM command fails to acquire a lock on the disk:

2020-06-23 18:18:33,349+08 INFO [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-1340) [57cc33d7] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[a65fddb4-48e1-46e1-a080-074d156cab8b=VM_DISK_BOOT, 700b798a-4434-44c7-b500-a03ff9363faf=DISK]', sharedLocks=''}'

2020-06-23 18:18:33,349+08 WARN [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-1340) [57cc33d7] Validation of action 'AttachDiskToVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__ATTACH_ACTION_TO,VAR__TYPE__DISK,ACTION_TYPE_FAILED_DISK_IS_LOCKED

5.
Transfer marked as finished:

2020-06-23 18:18:35,322+08 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [539a982e-8973-4783-a56e-298622c811ec] Updating image transfer 9996fe27-2881-41a0-99c1-80d0005ed3ea (image 700b798a-4434-44c7-b500-a03ff9363faf) phase to Finished Success

6. Transfer releases the exclusive lock on the disk:

2020-06-23 18:18:35,398+08 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [539a982e-8973-4783-a56e-298622c811ec] Lock freed to object 'EngineLock:{exclusiveLocks='[700b798a-4434-44c7-b500-a03ff9363faf=DISK]', sharedLocks='[]'}'

We have two ways to lock disks:

- set imagestatus to LOCKED (2) in the database
- take in-memory locks (e.g. EngineLock:{exclusiveLocks='[700b798a-4434-44c7-b500-a03ff9363faf=DISK]')

When starting a transfer, the transfer command takes an exclusive lock on the disk and changes the disk status to LOCKED. When ending the transfer, the transfer command releases the exclusive lock and changes the disk status back to OK. Based on the logs, we change the disk to OK before releasing the exclusive lock, so when v2v continues the import, the disk is still locked. This is not the first time we have had such issues.

Moving to RHV and changing severity to high since this breaks basic functionality.

(In reply to Nir Soffer from comment #13)
> Moving to RHV and changing severity to high since this breaks basic
> functionality.

Tal, this is a severe bug that will affect customers as well; please retarget to 4.4.1 (if possible) or at least 4.4.2.
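Given the ordering above (the disk status flips to OK before the in-memory EngineLock is released), one client-side workaround is to retry the attach while the engine still reports the disk as locked. A sketch only, with a hypothetical DiskLockedError standing in for the engine's ACTION_TYPE_FAILED_DISK_IS_LOCKED validation failure:

```python
import time

class DiskLockedError(Exception):
    """Stand-in for the engine's ACTION_TYPE_FAILED_DISK_IS_LOCKED fault."""

def retry_while_locked(action, attempts=5, delay=1.0):
    """Run `action`, retrying while the engine's in-memory disk lock
    is still held.

    The race window seen in the logs is short (the transfer frees the
    lock a couple of seconds after the disk becomes OK), so a handful
    of short retries is enough to ride it out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except DiskLockedError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

This only papers over the race on the client side; the proper fix is for the engine to release the exclusive lock before (or atomically with) marking the disk OK.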
Encountered a similar issue when downloading a disk with the help of python3-ovirt-engine-sdk4: when using the SDK to download a qcow2 disk, the transfer times out while finalizing, with the following error from the SDK [1], the following error from the engine [2], and the following error from the vdsm host [3].

[1]:

2020-07-05 10:37:18,495 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py download fac3911d-95b6-41be-a538-dad345691502 --password-file /root/pass.txt --engine-url https://hosted-engine-06.lab.eng.tlv2.redhat.com --username admin@internal --cafile /root/ca.crt --backup-dir /root/full_backup/ --backup-uuid 44ae937c-8c30-41a1-bb68-d4daf3177d7e

2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] OUT:
[   0.0 ] Downloading VM fac3911d-95b6-41be-a538-dad345691502 disks
[   0.2 ] Creating image transfer for disk 2d075957-7481-4af3-9920-e63d7759af83
[   2.9 ] Image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 is ready
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s
Formatting '/root/full_backup/2d075957-7481-4af3-9920-e63d7759af83.202007051036.full.qcow2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[  99.89% ] 1022.88 MiB, 0.22 seconds, 4.56 GiB/s
[  99.92% ] 1023.19 MiB, 0.33 seconds, 3.00 GiB/s
[  99.99% ] 1023.88 MiB, 0.44 seconds, 2.29 GiB/s
[ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s
[   3.3 ] Finalizing image transfer

2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] ERR:
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 476, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 173, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 244, in cmd_download
    connection, args.backup_uuid, args,
    incremental=args.incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 385, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 430, in download_disk
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 296, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 to finalize

---------------------------------------------------------------------------------------------

[2]:

2020-07-05 10:37:20,801+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] START, GetImageTicketVDSCommand(HostName = host_mixed_1, GetImageTicketVDSCommandParameters:{hostId='9fb37721-708b-4327-9148-d3f302e51436', ticketId='0287b970-6b73-4b9e-b683-47c1f44dbe86', timeout='null'}), log id: 382fd9d6

2020-07-05 10:37:20,807+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@dc80e36, log id: 382fd9d6

2020-07-05 10:37:20,825+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.
2020-07-05 10:37:20,826+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] Updating image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 (image 2d075957-7481-4af3-9920-e63d7759af83) phase to Cancelled

---------------------------------------------------------------------------------------------

[3]:

2020-07-05 10:37:19,686+0300 INFO (jsonrpc/7) [api.virt] FINISH stop_backup return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:10.46.16.252,60486, flow_id=backups_syncAction_9850ddba-f8b8-4b8, vmId=fac3911d-95b6-41be-a538-dad345691502 (api:54)

2020-07-05 10:37:19,686+0300 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.stop_backup succeeded in 0.01 seconds (__init__:312)

2020-07-05 10:37:19,694+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sda was canceled (vm:5306)

2020-07-05 10:37:19,710+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdb was canceled (vm:5306)

2020-07-05 10:37:19,732+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sdb was canceled (vm:5306)

2020-07-05 10:37:19,783+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdc was canceled (vm:5306)

*** Bug 1913577 has been marked as a duplicate of this bug. ***

I recently opened a new bug [1] which is apparently a duplicate of this one. Just to mention that the issue Nir is explaining in his comments might affect not only v2v, but also any automated tests which involve uploading or downloading disks, CBT test cases (as they also involve downloading and uploading), and other manual efforts which involve the usage of the SDK.
[1]: https://bugzilla.redhat.com/1923178

*** Bug 1923178 has been marked as a duplicate of this bug. ***

(In reply to Ilan Zuckerman from comment #17)
> I recently opened a new bug [1] which is apparently a duplicate of this one.
> [...]
>
> [1]:
> https://bugzilla.redhat.com/1923178

Just making this clearer: this issue (reported in bug 1923178) was first seen in rhv-4.4.5-4, and was reproduced once more (by automated regression runs) in rhv-4.4.5-5. I noticed this only on NFS/gluster flavors.

Adding additional details just in case (might be helpful) - there was a PR that was closed:
https://github.com/oVirt/ovirt-ansible-collection/pull/215

(In reply to Pavel Bar from comment #21)
> Adding additional details just in case (might be helpful) - there was a PR
> that was closed:
> https://github.com/oVirt/ovirt-ansible-collection/pull/215

The PR was closed because it waits for the fix in the engine code instead of fixing it in the Ansible collection.

How is this modified without updating the image transfer example scripts to use this change?

ovirt-engine's side will be fixed and verified in bug 1923178.

virt-v2v will be fixed and verified in bug 1976020 and bug 1976024.

Removing stale needinfos.

Pavel, the doc text is not relevant to this bug because virt-v2v does not use the backup APIs. Since the bug was closed, I think we can remove the doc text.
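For reference, the finalize step that times out in the SDK download failure above is essentially a poll on the transfer phase after asking the engine to finalize. A simplified, self-contained sketch of that shape (the real imagetransfer.finalize_transfer helper in the SDK examples also inspects the disk status; the phase names here are illustrative strings, not SDK types):

```python
import time

def finalize_transfer(finalize, get_phase, timeout=60.0, interval=1.0):
    """Ask the engine to finalize a transfer, then poll its phase.

    Returns the elapsed time on success; raises RuntimeError if the
    transfer fails, is cancelled (as in engine log [2] above), or does
    not reach finished_success within `timeout` -- the same shape of
    failure as "Timed out waiting for transfer ... to finalize".
    """
    start = time.monotonic()
    finalize()
    while time.monotonic() - start < timeout:
        phase = get_phase()
        if phase == "finished_success":
            return time.monotonic() - start
        if phase in ("finished_failure", "cancelled"):
            raise RuntimeError("transfer failed: %s" % phase)
        time.sleep(interval)
    raise RuntimeError("timed out waiting for transfer to finalize")
```

In this bug the engine cancelled the transfer ("timeout due to transfer inactivity") while the client was still polling, so the client-side loop ended in the timeout branch.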