Description of problem:
In a v2v VDDK migration of 20 VMs, each with a 107 GB disk (70% usage), from VMware to RHV, all 20 VMs failed to migrate with the same error in the v2v import log:

nbdkit: error: /var/tmp/v2v.qgJ9ib/rhv-upload-plugin.py: open: error: ['Traceback (most recent call last):\n', '  File "/var/tmp/v2v.qgJ9ib/rhv-upload-plugin.py", line 192, in open\n    raise RuntimeError("timed out waiting for disk to become unlocked")\n', 'RuntimeError: timed out waiting for disk to become unlocked\n']

Some of the VMs failed after 25 minutes, and some after 35 minutes.
On the CFME UI, the migrations appeared as Cancelled.

Environment details:
- 20 VMs, residing 10:10 on 2 VMware hosts.
- 2 conversion hosts.
- iSCSI (source) to FC (destination) storage.
- Host max concurrent migrations was set to 10 for both conversion hosts, but the VMs were actually distributed 12 & 8 between the 2 conversion hosts.

Version-Release number of selected component (if applicable):
CFME-5.11.0.18.20190806180636_1dd6378
RHV-4.3.5.3-0.1.el7
RHV hosts (2, serving as conversion hosts):
* Special build of libguestfs, libguestfs-tools-c, virt-v2v, python-libguestfs: 1.40.2-5.el7.1.bz1680361.v3.1.x86_64
* OS Version: RHEL - 7.7 - 9.el7
* OS Description: Red Hat Enterprise Linux Server 7.7 Beta (Maipo)
* Kernel Version: 3.10.0 - 957.21.3.el7.x86_64
* KVM Version: 2.12.0 - 33.el7
* LIBVIRT Version: libvirt-4.5.0-23.el7
* VDSM Version: vdsm-4.30.19-1.el7ev

How reproducible:
Not sure. So far it has happened once.

Additional info:
nsoffer: "This means engine could not prepare the disks within the timeout. Please file ovirt-engine bug to investigate this, and include full logs from engine, vdsm, and ovirt-imageio-daemon on both hosts."
All logs (evm.log, engine.log, vdsm.log, ovirt-imageio-daemon.log, v2v import logs) can be found here: https://drive.google.com/drive/u/0/folders/1O9R-83Ur042TCvbcbTnEXK_9qY8G33O6
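For context, the error above comes from a wait loop in rhv-upload-plugin.py that polls the newly created disk until the engine reports it as unlocked. The following is only a minimal illustrative sketch of that kind of loop using the oVirt Python SDK (ovirtsdk4); the timeout constant and function name are assumptions, not the plugin's actual code:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Illustrative timeout; the value used by the real plugin may differ.
DISK_UNLOCK_TIMEOUT = 300  # seconds

def wait_for_disk_unlock(connection, disk_id):
    # Poll the disk status until the engine reports it OK (unlocked).
    disk_service = connection.system_service().disks_service().disk_service(disk_id)
    deadline = time.monotonic() + DISK_UNLOCK_TIMEOUT
    while True:
        disk = disk_service.get()
        if disk.status == types.DiskStatus.OK:
            return disk
        if time.monotonic() > deadline:
            # This is the condition reported in the traceback above.
            raise RuntimeError("timed out waiting for disk to become unlocked")
        time.sleep(5)

If the engine leaves the disks in the LOCKED state past the deadline, every import process hits this timeout, which matches all 20 VMs failing at the same point.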
Note for QE: 1.40.2-5.el7.1.bz1680361.v3.1.x86_64 is a special build that I did for Ilanit containing a possible fix for bz1680361. However, it is otherwise identical to libguestfs-1.40.2-5.el7.
This bug repeated on the same environment mentioned in the bug description, for 20 VMs with a 20 GB disk each. All 20 VMs failed, with the following last messages in the v2v import log (same for all VMs):

nbdkit: debug: python: open readonly=0
disk.id = '7d28c8c2-d0d5-4c59-a556-9f62a806b2ac'
nbdkit: error: /var/tmp/v2v.RgGTXL/rhv-upload-plugin.py: open: error: ['Traceback (most recent call last):\n', '  File "/var/tmp/v2v.RgGTXL/rhv-upload-plugin.py", line 192, in open\n    raise RuntimeError("timed out waiting for disk to become unlocked")\n', 'RuntimeError: timed out waiting for disk to become unlocked\n']
qemu-img: Could not open 'json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.3kXuXT/nbdkit0.sock", "file.export": "/" }': Failed to read initial magic: Unexpected end-of-file before all bytes were read
nbdkit: vddk[3]: debug: client sent disconnect command, closing connection
nbdkit: vddk[3]: debug: close
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Close (handle)
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Close: Close disk.
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Disconnect (connection)
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Disconnect: Disconnect.
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_FreeConnectParams: Free connection parameters.
virt-v2v: error: qemu-img command failed, see earlier errors
rm -rf '/var/tmp/vddk.PnxvqF'
rm -rf '/var/tmp/v2v.rNr9Xc'
rm -rf '/var/tmp/v2v.RgGTXL'
rm -rf '/var/tmp/v2v.9rvGXk'
rm -rf '/var/tmp/rhvupload.3kXuXT'
rm -rf '/var/tmp/null.DAckXs'
libguestfs: trace: close
libguestfs: closing guestfs handle 0x564e98b36700 (state 0)
nbdkit: nbdkit: debug: debug: python: unloadvddk: unload
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.
nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
Tal, would you please consider Daniel's request in comment #13?
This bug does not reproduce on a small-scale RHV environment. I'll try to reproduce it on the large-scale RHV environment where it was found, once it's free to use.
I tried to reproduce it on a large-scale RHV environment (RHV-4.3.6, with CFME-5.11.0.24); however, it has not reproduced so far. If/once I see it again, I'll update this bug.
This bug seems to reproduce on a small-scale RHV environment. Logs and details on the migration time & SPM host can be found here:
https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

Note that this was reproduced while the patch
https://github.com/libguestfs/virt-v2v/commit/2aa78ade2d48e926b7b04050338ebd8a0c5e3f05
was added manually to the host, but it is not clear whether this reproduction is related to that patch.
(In reply to Ilanit Stein from comment #17)
> Logs and details on the migration time & SPM host can be found here:
> https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

Adding details hidden in the logs:
- The migration on the hosts started Nov 27 2019 @14:26 and ended @14:40.
- Host f02-h23-000-r620.rdu2.scalelab.redhat.com was the SPM during the migration time.

There are no errors in the vdsm log on the SPM. We have 2 unrelated warnings about a pointless extend request; this log issue is fixed in 4.4.

# xzgrep WARN /var/log/vdsm/vdsm.log.1.xz
2019-11-27 15:16:44,592+0000 WARN  (tasks/7) [storage.LVM] Command with specific filter failed, retrying with a wider filter, cmd=['/sbin/lvm', 'lvextend', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59683167$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '--size', '1m', '4d72b4af-1ff4-41ba-90ee-44a748cc853a/7ecc8e10-fbc0-4150-85d5-732521cf9189'] rc=5 err=['  New size (1 extents) matches existing size (1 extents).'] (lvm:339)
2019-11-27 15:16:46,582+0000 WARN  (tasks/2) [storage.LVM] Command with specific filter failed, retrying with a wider filter, cmd=['/sbin/lvm', 'lvextend', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59683167$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '--size', '1m', '4d72b4af-1ff4-41ba-90ee-44a748cc853a/e38541d6-838c-4f03-803f-c31120aba24a'] rc=5 err=['  New size (1 extents) matches existing size (1 extents).'] (lvm:339)

On the other host there are 2 errors before the migration time; they seem unrelated:

# xzgrep ERROR /var/log/vdsm/vdsm.log.1.xz
2019-11-27 14:17:15,456+0000 ERROR (vm/24727bf9) [virt.vm] (vmId='24727bf9-ae73-48b7-8779-09b0c48d25dd') The vm start process failed (vm:933)
2019-11-27 14:17:32,838+0000 ERROR (jsonrpc/1) [jsonrpc.JsonRpcServer] Internal server error (__init__:350)

There are also a lot of guest agent warnings from the same time, which are not related and are probably just log noise.

With no issue on the hosts, this must be an engine issue. Let's have a storage engine developer look at the logs.

Please attach the logs to the bug to increase the chance that someone will look at it quickly.

> Note that this was reproduced while the patch
> https://github.com/libguestfs/virt-v2v/commit/2aa78ade2d48e926b7b04050338ebd8a0c5e3f05
> was added manually to the host, but it is not clear whether this
> reproduction is related to that patch.

The patch cannot be related to this issue; it affects only the qemu command line, which did not run because the disk never became ready and the import was aborted.

Do we have a stress test creating 20, 50 or 100 disks at the same time?

I don't think this issue is related to virt-v2v, and it should be easy to reproduce using a simple script with the python SDK.
Created attachment 1640236 [details] vdsm log from the SPM
Created attachment 1640237 [details] vdsm log from the other host
I attached the vdsm logs covering the timeframe of the import. We are still missing the engine log for the same time window, and the import logs mentioned in comment 17.
ALL the logs are here:
https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

The migration on the hosts started Nov 27 2019 @14:26 and ended @14:40.
There is a time difference between the RHV engine and the RHV hosts: the engine is 5 hours behind, so in the engine log the migration started @09:26 and ended @09:40 on Nov 27 2019.
Host f02-h23-000-r620.rdu2.scalelab.redhat.com was the SPM during the migration time.

Attaching the engine.log here for convenience.
Created attachment 1640301 [details] engine.log
Daniel, Mordechai, by Nir's analysis above, this bug seems to be related to concurrent disk creation on RHV, and not necessarily to virt-v2v. Can you please answer Nir's question in comment #18 regarding the RHV stress tests?
Nir, we have such tests, and high disk-creation rates on hosts are common in population scenarios, but the focus is on the throughput of VM creation, not on the number of active vdsm storage tasks per host. We can adapt this to provide this type of test if desired.

Regarding your question: is the focus here on seeing 2 hosts with pending disk-creation tasks of 25, 50, or 100, where the current total of storage tasks that can be processed is 20 and the rest are queued, or is the assumption that in the 20, 50, 100 disk-creation tests multiple hosts (more than 2) are available?
(In reply to mlehrer from comment #25)
> Nir, we have such tests, and high disk-creation rates on hosts are common in
> population scenarios, but the focus is on the throughput of VM creation, not
> on the number of active vdsm storage tasks per host.

This is not the same use case: VMs take a long time to start, so the actual disk creation rate is very low.

> Regarding your question: is the focus here on seeing 2 hosts with pending
> disk-creation tasks of 25, 50, or 100, where the current total of storage
> tasks that can be processed is 20 and the rest are queued, or is the
> assumption that in the 20, 50, 100 disk-creation tests multiple hosts (more
> than 2) are available?

Creating disks is an SPM operation, so it does not matter how many hosts you have. The purpose is to make sure the engine can deal with a large number of concurrent disk creations, which is the case with virt-v2v. When you import 20 VMs, we have 20 import processes creating disks in parallel.

The test can use N threads, each connecting to the engine and creating one disk. If we have a problem in the engine, such a script will reproduce it easily. It can also be used by a developer to reproduce locally.
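To make this concrete, here is a minimal sketch of such a stress script using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, storage domain name, disk size, and thread count are placeholders to be adapted to the target environment:

import threading
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

ENGINE_URL = "https://engine.example.com/ovirt-engine/api"  # placeholder
USERNAME = "admin@internal"                                 # placeholder
PASSWORD = "secret"                                         # placeholder
STORAGE_DOMAIN = "fc_domain"                                # placeholder
NUM_THREADS = 20                                            # try 20, 50, 100

def create_disk(index):
    # Each thread opens its own connection, like the separate virt-v2v processes do.
    connection = sdk.Connection(
        url=ENGINE_URL, username=USERNAME, password=PASSWORD, insecure=True)
    try:
        disks_service = connection.system_service().disks_service()
        disk = disks_service.add(
            types.Disk(
                name="stress_disk_%d" % index,
                format=types.DiskFormat.RAW,
                provisioned_size=10 * 1024**3,  # 10 GiB per disk for the test
                sparse=False,                   # preallocated, as in the failed imports
                storage_domains=[types.StorageDomain(name=STORAGE_DOMAIN)],
            ))
        print("created disk %s (id=%s)" % (disk.name, disk.id))
    finally:
        connection.close()

threads = [threading.Thread(target=create_disk, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each thread could additionally poll the disk status (as in the wait-loop sketch in the description) to detect the "timed out waiting for disk to become unlocked" condition directly.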
Tal, please assign someone else to check the engine issue.
In case this bug of concurrent RHV disk creation is related to the disk type that was used, adding the disk details for the 20 migrated VMs:

* It was iSCSI (on the VMware source side) -> FC (on the RHV destination side).

<actual_size>107374182400</actual_size>
<alias>v2v_migration_vm_1-000</alias>
<backup>none</backup>
<content_type>data</content_type>
<format>raw</format>
<image_id>79cc1cfd-2be4-4c11-a526-0d85f261e652</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>107374182400</provisioned_size>
<shareable>false</shareable>
<sparse>false</sparse>
<status>ok</status>
<storage_type>image</storage_type>
<total_size>107374182400</total_size>
<wipe_after_delete>true</wipe_after_delete>
<disk_profile href="/ovirt-engine/api/diskprofiles/1377568f-553d-4f34-8969-e08cf66ab7f9" id="1377568f-553d-4f34-8969-e08cf66ab7f9"/>
<quota href="/ovirt-engine/api/datacenters/417c106c-e826-468c-80de-f76c20eca560/quotas/6d618418-363f-4bab-ac9f-cc006cd59268" id="6d618418-363f-4bab-ac9f-cc006cd59268"/>
<storage_domains>
  <storage_domain href="/ovirt-engine/api/storagedomains/4d72b4af-1ff4-41ba-90ee-44a748cc853a" id="4d72b4af-1ff4-41ba-90ee-44a748cc853a"/>
</storage_domains>
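For reference, reproducing the same disk type with the Python SDK would use parameters roughly like the following sketch (assuming ovirtsdk4; the storage domain ID is taken from the XML above):

import ovirtsdk4.types as types

# Sketch mapping the XML above to SDK parameters: raw, preallocated
# (sparse=False), 100 GiB, wipe_after_delete enabled, on the FC storage domain.
disk = types.Disk(
    alias="v2v_migration_vm_1-000",
    format=types.DiskFormat.RAW,
    provisioned_size=107374182400,
    sparse=False,
    wipe_after_delete=True,
    storage_domains=[
        types.StorageDomain(id="4d72b4af-1ff4-41ba-90ee-44a748cc853a"),
    ],
)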
Disk create "timed out" not reproduced.

Storage domain type: iSCSI

Test description:
- 40 concurrent creations of a 1G thin disk
- No delay between creation iterations
- Total run: 48 iterations ==> 1920 disks

Tested on:
rhv-release-4.3.6-7-001.noarch
redhat-release-server-7.7-10.el7.x86_64
libvirt-4.5.0-23.el7_7.1.x86_64
vdsm-4.30.34-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.5.x86_64
This does not seem like a regression to me; since it has low reproducibility, I'm removing the Regression flag to unblock 4.4 Beta. If you believe this bug is indeed a regression on a previously tested flow, please re-add it.
@Brett, we can close this bug, as this is a common scale use case; if it repeats in our coming CFME-5.11.7 tests, we'll certainly report it.