Bug 1303578 - [scale] GetDeviceListVDSCommand takes 6-10s on average for large storage domains
Summary: [scale] GetDeviceListVDSCommand takes 6-10s on average for large storage domains
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.2.5
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ovirt-4.0.5
: 4.0.5
Assignee: Fred Rolland
QA Contact: Eldad Marciano
URL:
Whiteboard:
: 1303583 (view as bug list)
Depends On:
Blocks: 1303583
TreeView+ depends on / blocked
 
Reported: 2016-02-01 11:06 UTC by mlehrer
Modified: 2016-11-24 09:40 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-24 09:40:42 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: planning_ack+
amureini: devel_ack+
gklein: testing_ack+


Attachments (Terms of Use)
ovirt log showing GetDeviceList execution start/stop time (924.76 KB, text/x-vhdl)
2016-02-01 11:06 UTC, mlehrer
no flags Details
Profiling of GetDeviceList (54.67 KB, application/octet-stream)
2016-06-16 12:53 UTC, Fred Rolland
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 59865 0 master MERGED storage: Remove unneeded multipath call 2016-09-06 13:45:13 UTC
oVirt gerrit 63427 0 ovirt-4.0 MERGED storage: Remove unneeded multipath call 2016-09-12 06:42:31 UTC

Description mlehrer 2016-02-01 11:06:35 UTC
Created attachment 1120068 [details]
ovirt log showing GetDeviceList execution start/stop time

Description of problem:

GetDeviceListVDSCommand takes 6-10s per execution.

In certain admin flows, like editing an existing storage domain, this call maybe executed several times sequentially leading to long waits for the appearance of the dialog box in the ovirt-engine UI and a sluggish user experience.

Required for this enviroment:

100 luns exposed to each host.
Data Domains of multiple luns per individual ISCSI data storage domain.
eg (7 luns 1 SD, 100 total luns exposed to host)


Version-Release number of selected component (if applicable):


vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch

ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
libgovirt-0.3.2-1.el6.x86_64
rhevm-setup-plugin-ovirt-engine-common-3.6.2.5-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.2.5-0.1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch


How reproducible:
Very

Steps to Reproduce:
1. Create ISCSI data domain of multi-luns, ensuring at least 100 luns are exposed to the host.
2. Create data domain
3. Click "edit domain"
4. See ovirt-engine.log for GetDeviceListVDSCommand execution time.

Actual results:

See attached ovirt.log

Expected results:

Faster response time, or UI specific message warning that number of exposed luns may require additional time to perform action.


Additional info:

Comment 1 Sandro Bonazzola 2016-05-02 10:04:39 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 2 Yaniv Lavi 2016-05-23 13:19:17 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 3 Yaniv Lavi 2016-05-23 13:23:09 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 5 Fred Rolland 2016-06-16 12:49:00 UTC
From my testing on 4.0, the command takes less than 4 sec with 100 LUNs exposed to host:

Engine :
2016-06-16 11:45:07,436 START, GetDeviceListVDSCommand
2016-06-16 11:45:10,931 FINISH, GetDeviceListVDSCommand
3495 ms

VDSM:
jsonrpc.Executor/4::INFO::2016-06-16 11:45:07,452::logUtils::49::dispatcher::(wrapper) Run and protect: getDeviceList
jsonrpc.Executor/4::INFO::2016-06-16 11:45:10,683::logUtils::52::dispatcher::(wrapper) Run and protect: getDeviceList, Return response: 
3231 ms

The verb has 3 main parts:
1 - Refresh Storage (50 % of time) ( rescan iscsi/fc, multipath, resize devices)
2 - Reload PVs (25 % of time)
3 - Getting dev and path info (25 % of time)

From the profiling, the only thing that maybe considered is to investigate if the multipath call is needed (900msec).

I am not sure that we should invest in improving this verb, as it is not called frequently.
Fixing the UI bug, will have much more impact than any other improvement in this flow (https://bugzilla.redhat.com/show_bug.cgi?id=1259353)

Detailed profiling :
time filename:lineno(function)
3.178 |hsm.py:1948(HSM._getDeviceList)
1.546 |--> sdc.py:80(StorageDomainCache.refreshStorage)
1.489     |--> multipath.py:65(rescan)
0.481         |--> iscsi.py:431(rescan)**** iscsi scan cmd ****
0.076         |--> storage/hba.py:53(rescan) **** fc scan cmd ****
0.912         |--> multipath.py:88(multipath)**** 'multipath' cmd ****
0.056     |--> multipath.py:92(resize_devices)
0.756 |--> lvm.py:843(getAllPVs)
0.756     |--> lvm.py:317(LVMCache._reloadpvs)
0.755         |--> lvm.py:290(LVMCache.cmd)   ****'pvs' cmd ****
0.871 |--> multipath.py:213(pathListIter)
0.029     |-->devicemapper.py:176(getPathsStatus) **** 'dmsetup status' cmd****

Comment 6 Fred Rolland 2016-06-16 12:53:52 UTC
Created attachment 1168725 [details]
Profiling of GetDeviceList

In order to see the profiling use :
contrib/profile-stats -r 500 -s cumu device.prof | less

In order to create the file, add the following annotation to the method :

@profile('/run/vdsm/device.prof', clock='wall', threads=False)
    def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):

Comment 7 Fred Rolland 2016-06-28 10:58:56 UTC
*** Bug 1303583 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Kaul 2016-07-15 19:30:59 UTC
I wonder if the multipath issue will be worse if there are more paths per LUN (how many did you have)? For example, with 4 paths per LUN (a reasonable value for performance and high availability).

But agree the UI bug is more critical to look at.
Defer this to 4.1?

Comment 9 Fred Rolland 2016-07-17 05:43:43 UTC
I had only one path per LUN.

Comment 10 Nir Soffer 2016-07-17 17:23:22 UTC
The time is reasonable for the number of devices and the operations we are doing when scanning devices. This could take several minutes in the past, and after we
optimized it, it take now couple of seconds as we see in comment 5.

Comment 11 Allon Mureinik 2016-07-26 07:33:53 UTC
The same issue has existed for several versions. Pushing out to 4.0.4 to give time to properly verify this.

Comment 12 Allon Mureinik 2016-09-12 06:57:19 UTC
Fred, can you please explain exactly what was fixed here and how this fix can be verified?
I doubt this patch reduces the runtime below 6 seconds.

Comment 13 Nir Soffer 2016-09-12 07:38:14 UTC
(In reply to Allon Mureinik from comment #12)
> Fred, can you please explain exactly what was fixed here and how this fix
> can be verified?
> I doubt this patch reduces the runtime below 6 seconds.

This patch removes the multipath call consuming 29% of the time in vdsm, shortening
the delay in engine when opening the manage domain dialog.

The change will probably be noticeable only when using 100 or more devices on 
the host.

The remove call was also known to get stuck if the server is misbehaving, removing
it prevent unlimited blocking at this point.

Verification should focus on correct operation:
- detect new devices added before creating new domain or editing existing domain
- detect resize devices
- normal operation otherwise

Improved performance can be verified by using lot of devices (e.g. 500), 
comparing the time to open the manage domain dialog with 4.0.4 and with
the version including this fix.

Comment 14 Eldad Marciano 2016-11-03 15:02:45 UTC
verifying on top of 4.0.5 
it still takes ~5 sec.

lab profile:
1 SD with 100 luns.

[root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null

real	0m5.454s
user	0m0.430s
sys	0m0.012s

Comment 15 Nir Soffer 2016-11-03 15:07:40 UTC
(In reply to Eldad Marciano from comment #14)
> verifying on top of 4.0.5 
> it still takes ~5 sec.
> 
> lab profile:
> 1 SD with 100 luns.
> 
> [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> 
> real	0m5.454s
> user	0m0.430s
> sys	0m0.012s

Please run the same tests with 4.0.5, and with the latest version that does not
include this fix.

You should expect about 30% improvement in 4.0.5. See comment 5 to understand why.

Comment 16 Fred Rolland 2016-11-03 15:14:52 UTC
(In reply to Eldad Marciano from comment #14)
> verifying on top of 4.0.5 
> it still takes ~5 sec.
> 
> lab profile:
> 1 SD with 100 luns.
> 
> [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> 
> real	0m5.454s
> user	0m0.430s
> sys	0m0.012s

You should use :
 time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

This is how the engine calls this verb. It will skip the pvcreate test for each lun.

Comment 17 Eldad Marciano 2016-11-06 09:36:57 UTC
(In reply to Nir Soffer from comment #15)
> (In reply to Eldad Marciano from comment #14)
> > verifying on top of 4.0.5 
> > it still takes ~5 sec.
> > 
> > lab profile:
> > 1 SD with 100 luns.
> > 
> > [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> > 
> > real	0m5.454s
> > user	0m0.430s
> > sys	0m0.012s
> 
> Please run the same tests with 4.0.5, and with the latest version that does
> not
> include this fix.
> 
> You should expect about 30% improvement in 4.0.5. See comment 5 to
> understand why.

Mordechai mentioned 6-10 sec with the following build vdsm-4.17.17-0.el7ev.noarch
(see the bug description)

now we have an avg of 3-5 sec, by using what fred asked to run:
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m4.013s
user	0m0.431s
sys	0m0.014s
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m5.343s
user	0m0.446s
sys	0m0.032s
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m3.949s
user	0m0.439s
sys	0m0.009s

Comment 18 Allon Mureinik 2016-11-24 09:40:42 UTC
4.0.5 has been released, closing.


Note You need to log in before you can comment on or make changes to this bug.