Bug 1126450

Summary: [Scale] - removing VMs takes too long because the vdsm task is stuck in the 'finished' state
Product: Red Hat Enterprise Virtualization Manager Reporter: Eldad Marciano <emarcian>
Component: vdsm    Assignee: Nir Soffer <nsoffer>
Status: CLOSED WONTFIX QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.4.1-1    CC: amureini, bazulay, ecohen, gklein, iheim, lpeer, michal.skrivanek, oourfali, scohen, yeylon
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-04-19 15:44:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm.zip none

Description Eldad Marciano 2014-08-04 13:13:06 UTC
Created attachment 923871 [details]
vdsm.zip

Description of problem:
Slowness was discovered in the remove VM action: the task stays stuck in the 'finished' state for roughly 70 seconds [1].

I ran a few checks to rule out slowness or latency on the NFS storage side [2].

setup distribution:
-up to 6500 vms
-2 storage domains
-37 hosts
-NFS storage.

vm disk template:
-thin provisioned.
-postzero false.
-20GB

[1].
-At the SPM side it looks like the call logged at '11:05:05,267::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage' came in after about 2 minutes.
 By the logs the removal takes ~70 sec; the task is stuck in the 'finished' state for that whole period, which means the actual removal itself is very quick:
 Thread-138::INFO::2014-08-04 11:03:49,011::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='68957f61-33a4-47ea-9b7d-4e0a84639841', spUUID='5d43076e-f0b2-48ca-9984-c6788e9adb31', imgUUID='d5f9fa80-ec63-4c49-86cc-840d212a2ae5', postZero='false', force='false')
 Thread-138::INFO::2014-08-04 11:05:04,726::logUtils::47::dispatcher::(wrapper) Run and protect: deleteImage, Return response: None
 About one more minute later the SPM aborts the action because the files no longer exist:
 Thread-138::INFO::2014-08-04 11:06:13,665::task::1168::TaskManager.Task::(prepare) Task=`b9e8b131-057a-405e-9247-606293cbd8bb`::aborting: Task is aborted: 'Image does not exist in domain' - code 268
 That explains why the engine waits for locks for so long (see the log-parsing sketch below).
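
As a rough way to confirm where the ~70 seconds go, here is a minimal Python sketch (hypothetical, not part of vdsm) that pairs each 'Run and protect: deleteImage(' request line with its 'Return response' line per thread in vdsm.log and prints the elapsed time. It assumes the exact log format quoted above; the log path is a placeholder.

    # delete_image_timing.py - hypothetical helper, not shipped with vdsm.
    # Pairs deleteImage request/return log lines per thread and prints elapsed seconds.
    import re
    from datetime import datetime

    LOG = "/var/log/vdsm/vdsm.log"   # placeholder path
    TS_FMT = "%Y-%m-%d %H:%M:%S,%f"
    LINE = re.compile(r"^(Thread-\d+)::INFO::([\d-]+ [\d:,]+)::"
                      r".*Run and protect: deleteImage(\(|, Return)")

    pending = {}  # thread name -> request timestamp
    with open(LOG) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            thread = m.group(1)
            ts = datetime.strptime(m.group(2), TS_FMT)
            if m.group(3) == "(":                  # request line
                pending[thread] = ts
            elif thread in pending:                # matching return line
                delta = (ts - pending.pop(thread)).total_seconds()
                print("%s: deleteImage took %.1f sec" % (thread, delta))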

-By looking at 'vdsClient -s 0 getAllTasks', the 'deleteImage' task is stuck in the 'finished' state for ~70 sec, as described in the log (a small polling sketch follows this list).
-It looks like the removal is always done by the same thread ('Thread-138'); is the removal action supposed to use multiple threads?
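
A simple way to watch how long a task sits in the 'finished' state is to poll getAllTasks and timestamp the output. The sketch below is only illustrative: it shells out to the same vdsClient command quoted above and prints the raw output without parsing it.

    # poll_tasks.py - illustrative only; prints raw getAllTasks output with a
    # timestamp every few seconds so the time spent in 'finished' can be eyeballed.
    import subprocess
    import time
    from datetime import datetime

    while True:
        out = subprocess.check_output(["vdsClient", "-s", "0", "getAllTasks"])
        print("==== %s ====" % datetime.now().isoformat())
        print(out.decode("utf-8", "replace"))
        time.sleep(5)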


[2].
Running 'rm -dfr' from the SPM host on 'master/vms/<vm>/<vm>.ovf' and 'images/<image>/*' (a 20GB disk) was very quick, less than ~2 sec, so no latency was found on the mount itself.
After everything was cleaned from the mount, I tried to remove the VM from the engine; this action should now be very quick since there are "no files to remove from the mount" (a timing sketch for the raw unlink check follows below).
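
For completeness, here is a minimal sketch of that raw-latency check, under the assumption that a sparse 20GB file behaves like a thin-provisioned disk for unlink purposes; the mount path is a placeholder.

    # nfs_unlink_timing.py - hypothetical check; the mount path is a placeholder.
    import os
    import time

    path = "/path/to/nfs/mount/test-unlink-20g"   # placeholder NFS-backed path
    with open(path, "wb") as f:
        f.truncate(20 * 1024 ** 3)                # sparse 20GB file

    start = time.time()
    os.remove(path)
    print("unlink took %.2f sec" % (time.time() - start))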
 

Both machines are running well; no overload was found.

This issue is probably reproducible for VM creation too (that action also runs slowly).

Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Create up to ~6000 VMs
2. Remove VMs (an illustrative timing sketch follows this list)
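
One way to collect per-VM removal timings from the engine side is sketched below, assuming the oVirt Python SDK v3 (ovirtsdk); the URL, credentials and VM name pattern are placeholders, and this is not the exact procedure used for this report.

    # remove_vms_timing.py - illustrative sketch, assuming oVirt Python SDK v3.
    # URL, credentials and the VM name pattern are placeholders.
    import time
    from ovirtsdk.api import API

    api = API(url="https://engine.example.com/api",
              username="admin@internal", password="password", insecure=True)

    for vm in api.vms.list(query="name=scale_vm_*"):   # placeholder name pattern
        start = time.time()
        vm.delete()                 # removal is asynchronous on the engine side
        print("requested removal of %s after %.1f sec"
              % (vm.get_name(), time.time() - start))

    api.disconnect()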

Actual results:
Removing VMs takes more than 2 minutes each; in parallel it is much more critical.

Expected results:
Removing 20GB of files directly from the NFS mount takes less than 2 sec, so removing a VM should take a similar 2-3 sec.

Additional info:

logs attached.

Comment 1 Allon Mureinik 2014-08-06 10:28:10 UTC
First order of business - see what consumes the time there - whether it's in the storage subsystem or the tasks infra.

Comment 2 Allon Mureinik 2014-09-16 16:01:33 UTC
(In reply to Allon Mureinik from comment #1)
> First order of business - see what consumes the time there - whether it's in
> the storage subsystem or the tasks infra.
We should revisit in 3.6.0 after the "tasks" overhaul.

Comment 3 Allon Mureinik 2015-04-19 15:44:15 UTC
Closing old bugs, as per Itamar's guidelines.
If you think this bug is worth fixing, please feel free to reopen.