1126450 – [Scale] - remove vms running too long due vdsm stuck on state finish

Summary: [Scale] - remove vms running too long due vdsm stuck on state finish

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.4.1-1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Nir Soffer
QA Contact:	Aharon Canan
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:

Reported:	2014-08-04 13:13 UTC by Eldad Marciano
Modified:	2022-07-13 07:45 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-04-19 15:44:15 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments
vdsm.zip (450.85 KB, application/zip) 2014-08-04 13:13 UTC, Eldad Marciano	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-47612	0	None	None	None	2022-07-13 07:45:03 UTC

Description Eldad Marciano 2014-08-04 13:13:06 UTC

Created attachment 923871
vdsm.zip

Description of problem:
Slowness was discovered in the remove-VMs action: the task stays stuck in the 'finish' state for ~70 sec [1].

I tested a few things to rule out slowness or latency on the NFS storage [2].

setup distribution:
-up to 6500 vms
-2 storage domains
-37 hosts
-NFS storage.

vms disk template:
-thin provision.
-postzero false.
-20GB

[1].
-On the SPM side, it looks like the call '11:05:05,267::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage' came in after 2 min.
 According to the logs, the removal takes ~70 sec (the task is stuck in the 'finish' state for this period, which means the actual removal is very quick):
 Thread-138::INFO::2014-08-04 11:03:49,011::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='68957f61-33a4-47ea-9b7d-4e0a84639841', spUUID='5d43076e-f0b2-48ca-9984-c6788e9adb31', imgUUID='d5f9fa80-ec63-4c49-86cc-840d212a2ae5', postZero='false', force='false')
 Thread-138::INFO::2014-08-04 11:05:04,726::logUtils::47::dispatcher::(wrapper) Run and protect: deleteImage, Return response: None
 After one more minute, the SPM aborts the action because the files do not exist:
 Thread-138::INFO::2014-08-04 11:06:13,665::task::1168::TaskManager.Task::(prepare) Task=`b9e8b131-057a-405e-9247-606293cbd8bb`::aborting: Task is aborted: 'Image does not exist in domain' - code 268
 That explains why the engine waits so long for locks.

-In the output of vdsClient -s 0 getAllTasks, the 'deleteImage' task stays in the 'finish' state for ~70 sec, as described in the log.
-It looks like the removal is always done by the same thread, 'Thread-138'. Does the remove action use multiple threads?
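
The intervals quoted above can be checked directly from the log timestamps. The following is a minimal sketch in Python, assuming the log line layout shown in the excerpts above (timestamp with a comma before the milliseconds). The helper names parse_ts and elapsed_seconds are hypothetical and are not part of vdsm, and the deleteImage arguments are abbreviated in the sample strings.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the log excerpts,
# e.g. "2014-08-04 11:03:49,011" (comma before the milliseconds).
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_ts(line):
    """Return the first timestamp found in a log line as a datetime."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()


# Sample lines taken from the report; call arguments are abbreviated here.
call = ("Thread-138::INFO::2014-08-04 11:03:49,011::logUtils::44::dispatcher::"
        "(wrapper) Run and protect: deleteImage(...)")
ret = ("Thread-138::INFO::2014-08-04 11:05:04,726::logUtils::47::dispatcher::"
       "(wrapper) Run and protect: deleteImage, Return response: None")
abort = ("Thread-138::INFO::2014-08-04 11:06:13,665::task::1168::"
         "TaskManager.Task::(prepare) Task=`b9e8b131-057a-405e-9247-606293cbd8bb`"
         "::aborting: Task is aborted: 'Image does not exist in domain' - code 268")

print("call -> return: %.3f s" % elapsed_seconds(call, ret))
print("return -> abort: %.3f s" % elapsed_seconds(ret, abort))
```

For the three lines quoted in this report, this gives about 75.7 sec between the call and its return, and about 68.9 sec between the return and the abort.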


[2].
Running rm -dfr from the SPM host on 'master/vms/<vm>/<vm>.ovf' and 'images/<image>/*' (a 20 GB disk) was very quick, less than ~2 sec.
After everything was cleaned from the mount (no latency was found there),
I tried to remove the VM from the engine. This action should be very quick now, since there are no files left to remove from the mount.
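
The manual timing check in [2] can also be repeated in scripted form. This is a rough sketch only: a local scratch directory stands in for the image directory on the NFS mount, and timed_rmtree and the scratch layout are hypothetical names used for illustration, not anything from vdsm.

```python
import os
import shutil
import tempfile
import time


def timed_rmtree(path):
    """Remove a directory tree and return the wall-clock seconds it took."""
    start = time.monotonic()
    shutil.rmtree(path)
    return time.monotonic() - start


# Build a small scratch tree so the sketch is runnable anywhere.
# On a real setup, the path would point at a directory under the NFS mount.
scratch = tempfile.mkdtemp(prefix="rm-timing-")
for i in range(10):
    with open(os.path.join(scratch, "file-%d" % i), "wb") as f:
        f.write(b"\0" * 4096)

seconds = timed_rmtree(scratch)
print("removed %s in %.3f s" % (scratch, seconds))
```

Comparing this number with the task duration seen in the vdsm log helps separate storage latency from time spent elsewhere.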
 

Both machines are running well; no overload was found.

This issue probably reproduces for VM creation too (since that action also runs slowly).

Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Have up to ~6000 VMs in the setup
2. Remove VMs

Actual results:
Removing VMs takes more than 2 min each; in parallel it is much more critical.

Expected results:
Removing 20 GB of files directly from the NFS mount takes less than 2 sec; removing VMs should take a similar time, 2-3 sec.

Additional info:

Logs attached.

Comment 1 Allon Mureinik 2014-08-06 10:28:10 UTC

First order of business - see what consumes the time there - whether it's in the storage subsystem or the tasks infra.

Comment 2 Allon Mureinik 2014-09-16 16:01:33 UTC

(In reply to Allon Mureinik from comment #1)
> First order of business - see what consumes the time there - whether it's in
> the storage subsystem or the tasks infra.
We should revisit in 3.6.0 after the "tasks" rehaul.

Comment 3 Allon Mureinik 2015-04-19 15:44:15 UTC

Closing old bugs, as per Itamar's guidelines.
If you think this bug is worth fixing, please feel free to reopen.
