Description of problem:
In a v2v VDDK migration of 20 VMs, each with a 107 GB disk (70% usage), from VMware to RHV, all 20 VMs failed to migrate with the same error in the v2v import log:

nbdkit: error: /var/tmp/v2v.qgJ9ib/rhv-upload-plugin.py: open: error: ['Traceback (most recent call last):\n', '  File "/var/tmp/v2v.qgJ9ib/rhv-upload-plugin.py", line 192, in open\n    raise RuntimeError("timed out waiting for disk to become unlocked")\n', 'RuntimeError: timed out waiting for disk to become unlocked\n']

Some of the VMs failed after 25 minutes, and some after 35 minutes.
On the CFME UI, the migrations appeared as Cancelled.

Environment details:
- 20 VMs, residing 10:10 on 2 VMware hosts.
- 2 conversion hosts.
- iSCSI (source) to FC (destination) storage.
- Host max concurrent migrations was set to 10 for both conversion hosts, but the VMs were actually distributed 12 & 8 between the 2 conversion hosts.

Version-Release number of selected component (if applicable):
CFME-5.11.0.18.20190806180636_1dd6378
RHV-4.3.5.3-0.1.el7
RHV hosts (2, serving as conversion hosts):
* Special build of libguestfs, libguestfs-tools-c, virt-v2v, python-libguestfs: 1.40.2-5.el7.1.bz1680361.v3.1.x86_64
* OS Version: RHEL - 7.7 - 9.el7
* OS Description: Red Hat Enterprise Linux Server 7.7 Beta (Maipo)
* Kernel Version: 3.10.0 - 957.21.3.el7.x86_64
* KVM Version: 2.12.0 - 33.el7
* LIBVIRT Version: libvirt-4.5.0-23.el7
* VDSM Version: vdsm-4.30.19-1.el7ev

How reproducible:
Not sure. So far it has happened once.

Additional info:
nsoffer: "This means engine could not prepare the disks within the timeout. Please file ovirt-engine bug to investigate this, and include full logs from engine, vdsm, and ovirt-imageio-daemon on both hosts."
All logs (evm.log, engine.log, vdsm.log, ovirt-imageio-daemon.log, v2v import logs) can be found here: https://drive.google.com/drive/u/0/folders/1O9R-83Ur042TCvbcbTnEXK_9qY8G33O6
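For context, the error above comes from a wait loop in rhv-upload-plugin.py that polls the newly created disk until the engine reports it as unlocked. The following is only a minimal illustrative sketch of that kind of loop using the oVirt Python SDK (ovirtsdk4); the timeout constant and function name are assumptions, not the plugin's actual code:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Illustrative timeout; the value used by the real plugin may differ.
DISK_UNLOCK_TIMEOUT = 300  # seconds

def wait_for_disk_unlock(connection, disk_id):
    # Poll the disk status until the engine reports it OK (unlocked).
    disk_service = connection.system_service().disks_service().disk_service(disk_id)
    deadline = time.monotonic() + DISK_UNLOCK_TIMEOUT
    while True:
        disk = disk_service.get()
        if disk.status == types.DiskStatus.OK:
            return disk
        if time.monotonic() > deadline:
            # This is the condition reported in the traceback above.
            raise RuntimeError("timed out waiting for disk to become unlocked")
        time.sleep(5)

If the engine leaves the disks in the LOCKED state past the deadline, every import process hits this timeout, which matches all 20 VMs failing at the same point.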
Note for QE: 1.40.2-5.el7.1.bz1680361.v3.1.x86_64 is a special build that I did for Ilanit containing a possible fix for bz1680361. However, it is otherwise identical to libguestfs-1.40.2-5.el7.
This bug repeated on the same environment mentioned in the bug description, for 20 VMs with a 20 GB disk each. All 20 VMs failed, with the following last messages in the v2v import log (same for all VMs):

nbdkit: debug: python: open readonly=0
disk.id = '7d28c8c2-d0d5-4c59-a556-9f62a806b2ac'
nbdkit: error: /var/tmp/v2v.RgGTXL/rhv-upload-plugin.py: open: error: ['Traceback (most recent call last):\n', '  File "/var/tmp/v2v.RgGTXL/rhv-upload-plugin.py", line 192, in open\n    raise RuntimeError("timed out waiting for disk to become unlocked")\n', 'RuntimeError: timed out waiting for disk to become unlocked\n']
qemu-img: Could not open 'json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.3kXuXT/nbdkit0.sock", "file.export": "/" }': Failed to read initial magic: Unexpected end-of-file before all bytes were read
nbdkit: vddk[3]: debug: client sent disconnect command, closing connection
nbdkit: vddk[3]: debug: close
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Close (handle)
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Close: Close disk.
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Disconnect (connection)
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Disconnect: Disconnect.
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_FreeConnectParams: Free connection parameters.
virt-v2v: error: qemu-img command failed, see earlier errors
rm -rf '/var/tmp/vddk.PnxvqF'
rm -rf '/var/tmp/v2v.rNr9Xc'
rm -rf '/var/tmp/v2v.RgGTXL'
rm -rf '/var/tmp/v2v.9rvGXk'
rm -rf '/var/tmp/rhvupload.3kXuXT'
rm -rf '/var/tmp/null.DAckXs'
libguestfs: trace: close
libguestfs: closing guestfs handle 0x564e98b36700 (state 0)
nbdkit: nbdkit: debug: debug: python: unloadvddk: unload
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.
nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
Tal, would you please consider Daniel's request in comment #13?
This bug does not reproduce on a small-scale RHV environment. I'll try to reproduce it on the large-scale RHV environment where it was found, once it's free to use.
I tried to reproduce it on a large-scale RHV environment (RHV-4.3.6, with CFME-5.11.0.24); however, it has not reproduced so far. If/once I see it again, I'll update this bug.
This bug seems to reproduce on a small-scale RHV environment. Logs and details on the migration time & SPM host can be found here:
https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

Note that this was reproduced while the patch
https://github.com/libguestfs/virt-v2v/commit/2aa78ade2d48e926b7b04050338ebd8a0c5e3f05
was added manually to the host, but it is not clear whether this reproduction is related to that patch.
(In reply to Ilanit Stein from comment #17)
> Logs and details on the migration time & SPM host can be found here:
> https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

Adding details hidden in the logs:
- The migration on the hosts started Nov 27 2019 @14:26 and ended @14:40.
- Host f02-h23-000-r620.rdu2.scalelab.redhat.com was the SPM during the migration time.

There are no errors in the vdsm log on the SPM. We have 2 unrelated warnings about a pointless extend request; this log issue is fixed in 4.4.

# xzgrep WARN /var/log/vdsm/vdsm.log.1.xz
2019-11-27 15:16:44,592+0000 WARN  (tasks/7) [storage.LVM] Command with specific filter failed, retrying with a wider filter, cmd=['/sbin/lvm', 'lvextend', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59683167$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '--size', '1m', '4d72b4af-1ff4-41ba-90ee-44a748cc853a/7ecc8e10-fbc0-4150-85d5-732521cf9189'] rc=5 err=['  New size (1 extents) matches existing size (1 extents).'] (lvm:339)
2019-11-27 15:16:46,582+0000 WARN  (tasks/2) [storage.LVM] Command with specific filter failed, retrying with a wider filter, cmd=['/sbin/lvm', 'lvextend', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59683167$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '--size', '1m', '4d72b4af-1ff4-41ba-90ee-44a748cc853a/e38541d6-838c-4f03-803f-c31120aba24a'] rc=5 err=['  New size (1 extents) matches existing size (1 extents).'] (lvm:339)

On the other host there are 2 errors before the migration time; they seem unrelated:

# xzgrep ERROR /var/log/vdsm/vdsm.log.1.xz
2019-11-27 14:17:15,456+0000 ERROR (vm/24727bf9) [virt.vm] (vmId='24727bf9-ae73-48b7-8779-09b0c48d25dd') The vm start process failed (vm:933)
2019-11-27 14:17:32,838+0000 ERROR (jsonrpc/1) [jsonrpc.JsonRpcServer] Internal server error (__init__:350)

There are also a lot of guest agent warnings from the same time, which are not related and are probably just log noise.

With no issue on the hosts, this must be an engine issue. Let's have a storage engine developer look at the logs.

Please attach the logs to the bug to increase the chance that someone will look at it quickly.

> Note that this was reproduced while the patch
> https://github.com/libguestfs/virt-v2v/commit/2aa78ade2d48e926b7b04050338ebd8a0c5e3f05
> was added manually to the host, but it is not clear whether this
> reproduction is related to that patch.

The patch cannot be related to this issue; it affects only the qemu command line, which did not run because the disk never became ready and the import was aborted.

Do we have a stress test creating 20, 50 or 100 disks at the same time?

I don't think this issue is related to virt-v2v, and it should be easy to reproduce using a simple script with the python SDK.
Created attachment 1640236 [details] vdsm log from the SPM
Created attachment 1640237 [details] vdsm log from the other host
I attached the vdsm logs covering the timeframe of the import. We are still missing the engine log for the same time window, and the import logs mentioned in comment 17.
ALL the logs are here:
https://drive.google.com/drive/u/0/folders/19n9BROT-yyFB-XGgUdlDjU8PIWTfcyl1

The migration on the hosts started Nov 27 2019 @14:26 and ended @14:40.
There is a time difference between the RHV engine and the RHV hosts: the engine is 5 hours behind, so in the engine log the migration started @09:26 and ended @09:40 on Nov 27 2019.
Host f02-h23-000-r620.rdu2.scalelab.redhat.com was the SPM during the migration time.

Attaching the engine.log here for convenience.
Created attachment 1640301 [details] engine.log
Daniel, Mordechai, by Nir's analysis above, this bug seems to be related to concurrent disk creation on RHV, and not necessarily to virt-v2v. Can you please answer Nir's question in comment #18 regarding the RHV stress tests?
Nir, we have such tests, and high disk-creation rates on hosts are common in population scenarios, but the focus is on the throughput of VM creation, not on the number of active vdsm storage tasks per host. We can adapt this to provide this type of test if desired.

Regarding your question: is the focus here on seeing 2 hosts with pending disk-creation tasks of 25, 50, or 100, where the current total of storage tasks that can be processed is 20 and the rest are queued, or is the assumption that in the 20, 50, 100 disk-creation tests multiple hosts (more than 2) are available?
(In reply to mlehrer from comment #25)
> Nir, we have such tests, and high disk-creation rates on hosts are common in
> population scenarios, but the focus is on the throughput of VM creation, not
> on the number of active vdsm storage tasks per host.

This is not the same use case: VMs take a long time to start, so the actual disk creation rate is very low.

> Regarding your question: is the focus here on seeing 2 hosts with pending
> disk-creation tasks of 25, 50, or 100, where the current total of storage
> tasks that can be processed is 20 and the rest are queued, or is the
> assumption that in the 20, 50, 100 disk-creation tests multiple hosts (more
> than 2) are available?

Creating disks is an SPM operation, so it does not matter how many hosts you have. The purpose is to make sure the engine can deal with a large number of concurrent disk creations, which is the case with virt-v2v. When you import 20 VMs, we have 20 import processes creating disks in parallel.

The test can use N threads, each connecting to the engine and creating one disk. If we have a problem in the engine, such a script will reproduce it easily. It can also be used by a developer to reproduce locally.
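To make this concrete, here is a minimal sketch of such a stress script using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, storage domain name, disk size, and thread count are placeholders to be adapted to the target environment:

import threading
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

ENGINE_URL = "https://engine.example.com/ovirt-engine/api"  # placeholder
USERNAME = "admin@internal"                                 # placeholder
PASSWORD = "secret"                                         # placeholder
STORAGE_DOMAIN = "fc_domain"                                # placeholder
NUM_THREADS = 20                                            # try 20, 50, 100

def create_disk(index):
    # Each thread opens its own connection, like the separate virt-v2v processes do.
    connection = sdk.Connection(
        url=ENGINE_URL, username=USERNAME, password=PASSWORD, insecure=True)
    try:
        disks_service = connection.system_service().disks_service()
        disk = disks_service.add(
            types.Disk(
                name="stress_disk_%d" % index,
                format=types.DiskFormat.RAW,
                provisioned_size=10 * 1024**3,  # 10 GiB per disk for the test
                sparse=False,                   # preallocated, as in the failed imports
                storage_domains=[types.StorageDomain(name=STORAGE_DOMAIN)],
            ))
        print("created disk %s (id=%s)" % (disk.name, disk.id))
    finally:
        connection.close()

threads = [threading.Thread(target=create_disk, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each thread could additionally poll the disk status (as in the wait-loop sketch in the description) to detect the "timed out waiting for disk to become unlocked" condition directly.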
Tal, please assign someone else to check the engine issue.
In case this bug of concurrent RHV disk creation is related to the disk type that was used, adding the disk details for the 20 migrated VMs:

* It was iSCSI (on the VMware source side) -> FC (on the RHV destination side).

<actual_size>107374182400</actual_size>
<alias>v2v_migration_vm_1-000</alias>
<backup>none</backup>
<content_type>data</content_type>
<format>raw</format>
<image_id>79cc1cfd-2be4-4c11-a526-0d85f261e652</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>107374182400</provisioned_size>
<shareable>false</shareable>
<sparse>false</sparse>
<status>ok</status>
<storage_type>image</storage_type>
<total_size>107374182400</total_size>
<wipe_after_delete>true</wipe_after_delete>
<disk_profile href="/ovirt-engine/api/diskprofiles/1377568f-553d-4f34-8969-e08cf66ab7f9" id="1377568f-553d-4f34-8969-e08cf66ab7f9"/>
<quota href="/ovirt-engine/api/datacenters/417c106c-e826-468c-80de-f76c20eca560/quotas/6d618418-363f-4bab-ac9f-cc006cd59268" id="6d618418-363f-4bab-ac9f-cc006cd59268"/>
<storage_domains>
  <storage_domain href="/ovirt-engine/api/storagedomains/4d72b4af-1ff4-41ba-90ee-44a748cc853a" id="4d72b4af-1ff4-41ba-90ee-44a748cc853a"/>
</storage_domains>
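For reference, reproducing the same disk type with the Python SDK would use parameters roughly like the following sketch (assuming ovirtsdk4; the storage domain ID is taken from the XML above):

import ovirtsdk4.types as types

# Sketch mapping the XML above to SDK parameters: raw, preallocated
# (sparse=False), 100 GiB, wipe_after_delete enabled, on the FC storage domain.
disk = types.Disk(
    alias="v2v_migration_vm_1-000",
    format=types.DiskFormat.RAW,
    provisioned_size=107374182400,
    sparse=False,
    wipe_after_delete=True,
    storage_domains=[
        types.StorageDomain(id="4d72b4af-1ff4-41ba-90ee-44a748cc853a"),
    ],
)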
Disk create "timed out" not reproduced.

Storage domain type: iSCSI

Test description:
- 40 concurrent creations of a 1G thin disk
- No delay between creation iterations
- Total run: 48 iterations ==> 1920 disks

Tested on:
rhv-release-4.3.6-7-001.noarch
redhat-release-server-7.7-10.el7.x86_64
libvirt-4.5.0-23.el7_7.1.x86_64
vdsm-4.30.34-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.5.x86_64
This does not seem like a regression to me; since it has low reproducibility, I'm removing the Regression flag to unblock 4.4 Beta. If you believe this bug is indeed a regression on a previously tested flow, please re-add it.
@Brett, we can close this bug, as this is a common scale use case; if it repeats in our coming CFME-5.11.7 tests, we'll certainly report it.