Bug 1126450 - [Scale] - remove vms running too long due vdsm stuck on state finish
Summary: [Scale] - remove vms running too long due vdsm stuck on state finish
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.1-1
Hardware: x86_64
OS: Linux
Target Milestone: ovirt-3.6.3
Target Release: 3.6.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
Whiteboard: storage
Depends On:
Reported: 2014-08-04 13:13 UTC by Eldad Marciano
Modified: 2016-03-10 06:23 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2015-04-19 15:44:15 UTC
oVirt Team: Storage
amureini: Triaged+

Attachments
vdsm.zip (450.85 KB, application/zip)
2014-08-04 13:13 UTC, Eldad Marciano

Description Eldad Marciano 2014-08-04 13:13:06 UTC
Created attachment 923871

Description of problem:
Slowness was observed in the remove-VM action: the task stays stuck in the 'finish' state for about 70 sec [1].

I checked several things to rule out slowness or latency on the NFS storage side [2].

setup distribution:
-up to 6500 vms
-2 storage domains
-37 hosts
-NFS storage.

vms disk template:
-thin provision.
-postzero false.

-On the SPM side, the call '11:05:05,267::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage' appears to have come in after 2 min.
 According to the logs, the removal takes ~70 sec (the task is stuck in the 'finish' state for that whole period, which means the actual removal is very quick):
 Thread-138::INFO::2014-08-04 11:03:49,011::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='68957f61-33a4-47ea-9b7d-4e0a84639841', spUUID='5d43076e-f0b2-48ca-9984-c6788e9adb31', imgUUID='d5f9fa80-ec63-4c49-86cc-840d212a2ae5', postZero='false', force='false')
 Thread-138::INFO::2014-08-04 11:05:04,726::logUtils::47::dispatcher::(wrapper) Run and protect: deleteImage, Return response: None
 About one minute later the SPM aborts the task because the files do not exist:
 Thread-138::INFO::2014-08-04 11:06:13,665::task::1168::TaskManager.Task::(prepare) Task=`b9e8b131-057a-405e-9247-606293cbd8bb`::aborting: Task is aborted: 'Image does not exist in domain' - code 268
 That explains why the engine waits so long for locks.

-Looking at vdsClient -s 0 getAllTasks, the 'deleteImage' task is stuck in state 'finish' for ~70 sec, as described in the log.
-It looks like the removal is always done by the same thread, 'Thread-138'. Does the remove action use multiple threads?
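
The intervals between the log lines quoted above can be checked with a short script. This is only a sketch: it assumes the timestamp is always the third '::'-separated field of a vdsm log line, which holds for the three lines in this report but is not confirmed for the whole log. The deleteImage arguments and the task ID are abbreviated with "..." here for readability.

```python
from datetime import datetime

# The three vdsm log lines quoted in this report (arguments abbreviated).
CALL = ("Thread-138::INFO::2014-08-04 11:03:49,011::logUtils::44::dispatcher::"
        "(wrapper) Run and protect: deleteImage(...)")
RETURN = ("Thread-138::INFO::2014-08-04 11:05:04,726::logUtils::47::dispatcher::"
          "(wrapper) Run and protect: deleteImage, Return response: None")
ABORT = ("Thread-138::INFO::2014-08-04 11:06:13,665::task::1168::"
         "TaskManager.Task::(prepare) Task=`...`::aborting: Task is aborted")


def log_timestamp(line):
    """Parse the timestamp, assumed to be the third '::'-separated field."""
    stamp = line.split("::")[2]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(first, second):
    """Elapsed seconds from the first log line to the second."""
    return (log_timestamp(second) - log_timestamp(first)).total_seconds()


call_to_return = seconds_between(CALL, RETURN)
return_to_abort = seconds_between(RETURN, ABORT)
print(f"deleteImage call to return: {call_to_return:.3f} s")   # 75.715 s
print(f"return to task abort:       {return_to_abort:.3f} s")  # 68.939 s
```

Both intervals are on the order of 70 sec, which matches the delay seen in getAllTasks.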

Running rm -dfr from the SPM host on 'master/vms/<vm>/<vm>.ovf' and 'images/<image>/*' (a disk of 20 GB) was very quick, less than ~2 sec.
After everything had been cleaned from the mount (where no latency was found),
I tried to remove the VM from the engine. This action should be very quick now, since it has "no files to remove from the mount".
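
For a repeatable baseline of the direct removal time, something like the following could be used instead of timing rm by hand. It is a minimal sketch: timed_rmtree is a hypothetical helper, not part of vdsm, and the demo runs on a throwaway temporary directory. On the SPM host the path would be the image directory under the NFS mount.

```python
import shutil
import tempfile
import time
from pathlib import Path


def timed_rmtree(path):
    """Remove a directory tree and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    shutil.rmtree(path)
    return time.monotonic() - start


# Demo on a temporary directory holding one small file.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "disk.img").write_bytes(b"\0" * 1024)

elapsed = timed_rmtree(demo_dir)
print(f"removed {demo_dir} in {elapsed:.3f} s")
```

Comparing this number with the deleteImage task duration shows how much of the time is spent outside the actual file removal.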

Both machines were running well; no overload was found.

This issue can probably be reproduced for VM creation too (since that action also runs slowly).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Use a setup with up to ~6000 VMs
2. Remove VMs

Actual results:
Removing VMs takes more than 2 min each; when VMs are removed in parallel the problem is much more severe.

Expected results:
Removing 20 GB of files directly from the NFS mount takes less than 2 sec, so removing a VM should take a similar time, 2-3 sec.

Additional info:

Logs attached.

Comment 1 Allon Mureinik 2014-08-06 10:28:10 UTC
First order of business - see what consumes the time there - whether it's in the storage subsystem or the tasks infra.

Comment 2 Allon Mureinik 2014-09-16 16:01:33 UTC
(In reply to Allon Mureinik from comment #1)
> First order of business - see what consumes the time there - whether it's in
> the storage subsystem or the tasks infra.
We should revisit in 3.6.0 after the "tasks" overhaul.

Comment 3 Allon Mureinik 2015-04-19 15:44:15 UTC
Closing old bugs, as per Itamar's guidelines.
If you think this bug is worth fixing, please feel free to reopen.
