Bug 1303578

Summary: [scale] GetDeviceListVDSCommand takes 6-10s on average for large storage domains
Product: [oVirt] ovirt-engine Reporter: mlehrer
Component: BLL.Storage    Assignee: Fred Rolland <frolland>
Status: CLOSED CURRENTRELEASE QA Contact: Eldad Marciano <emarcian>
Severity: high Docs Contact:
Priority: medium    
Version: 3.6.2.5    CC: amureini, bugs, frolland, gklein, nsoffer, tnisan
Target Milestone: ovirt-4.0.5    Keywords: Performance
Target Release: 4.0.5    Flags: rule-engine: ovirt-4.0.z+
rule-engine: planning_ack+
amureini: devel_ack+
gklein: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-24 09:40:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1303583    
Attachments:
ovirt log showing GetDeviceList execution start/stop time (flags: none)
Profiling of GetDeviceList (flags: none)

Description mlehrer 2016-02-01 11:06:35 UTC
Created attachment 1120068 [details]
ovirt log showing GetDeviceList execution start/stop time

Description of problem:

GetDeviceListVDSCommand takes 6-10s per execution.

In certain admin flows, such as editing an existing storage domain, this call may be executed several times sequentially, leading to long waits for the dialog box to appear in the ovirt-engine UI and a sluggish user experience.

Required for this environment:

100 LUNs exposed to each host.
Data domains built from multiple LUNs per individual iSCSI data storage domain,
e.g. 7 LUNs in 1 SD, with 100 LUNs total exposed to the host.


Version-Release number of selected component (if applicable):


vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch

ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
libgovirt-0.3.2-1.el6.x86_64
rhevm-setup-plugin-ovirt-engine-common-3.6.2.5-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.2.5-0.1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch


How reproducible:
Very

Steps to Reproduce:
1. Create an iSCSI data domain from multiple LUNs, ensuring at least 100 LUNs are exposed to the host.
2. Create data domain
3. Click "edit domain"
4. See ovirt-engine.log for GetDeviceListVDSCommand execution time.

Actual results:

See attached ovirt.log

Expected results:

Faster response time, or a UI message warning that the number of exposed LUNs may require additional time to perform the action.


Additional info:

Comment 1 Sandro Bonazzola 2016-05-02 10:04:39 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 2 Yaniv Lavi 2016-05-23 13:19:17 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 3 Yaniv Lavi 2016-05-23 13:23:09 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 5 Fred Rolland 2016-06-16 12:49:00 UTC
From my testing on 4.0, the command takes less than 4 sec with 100 LUNs exposed to the host:

Engine :
2016-06-16 11:45:07,436 START, GetDeviceListVDSCommand
2016-06-16 11:45:10,931 FINISH, GetDeviceListVDSCommand
3495 ms

VDSM:
jsonrpc.Executor/4::INFO::2016-06-16 11:45:07,452::logUtils::49::dispatcher::(wrapper) Run and protect: getDeviceList
jsonrpc.Executor/4::INFO::2016-06-16 11:45:10,683::logUtils::52::dispatcher::(wrapper) Run and protect: getDeviceList, Return response: 
3231 ms

The verb has 3 main parts:
1 - Refresh storage (50% of the time): rescan iSCSI/FC, multipath, resize devices
2 - Reload PVs (25% of the time)
3 - Get device and path info (25% of the time)

From the profiling, the only thing that may be worth considering is to investigate whether the multipath call is needed (900 ms).

I am not sure that we should invest in improving this verb, as it is not called frequently.
Fixing the UI bug will have much more impact than any other improvement in this flow (https://bugzilla.redhat.com/show_bug.cgi?id=1259353)

Detailed profiling :
time filename:lineno(function)
3.178 |hsm.py:1948(HSM._getDeviceList)
1.546 |--> sdc.py:80(StorageDomainCache.refreshStorage)
1.489     |--> multipath.py:65(rescan)
0.481         |--> iscsi.py:431(rescan)**** iscsi scan cmd ****
0.076         |--> storage/hba.py:53(rescan) **** fc scan cmd ****
0.912         |--> multipath.py:88(multipath)**** 'multipath' cmd ****
0.056     |--> multipath.py:92(resize_devices)
0.756 |--> lvm.py:843(getAllPVs)
0.756     |--> lvm.py:317(LVMCache._reloadpvs)
0.755         |--> lvm.py:290(LVMCache.cmd)   ****'pvs' cmd ****
0.871 |--> multipath.py:213(pathListIter)
0.029     |-->devicemapper.py:176(getPathsStatus) **** 'dmsetup status' cmd****
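
For reference, the phase shares quoted above can be recomputed directly from these timings; the snippet below is plain arithmetic over the numbers in the tree, not vdsm code:

# Recompute the phase shares of _getDeviceList from the profile timings above.
total = 3.178  # hsm.py:_getDeviceList
phases = [
    ("refresh storage (rescan, multipath, resize)", 1.546),
    ("reload PVs ('pvs')", 0.756),
    ("device and path info", 0.871),
]
for name, seconds in phases:
    print("%-45s %.3f s  (%4.1f%% of total)" % (name, seconds, 100.0 * seconds / total))

# The 'multipath' command alone accounts for 0.912 s, i.e. roughly 29% of the
# verb (the share cited in comment 13 below).
print("multipath command alone: %.1f%% of total" % (100.0 * 0.912 / total))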

Comment 6 Fred Rolland 2016-06-16 12:53:52 UTC
Created attachment 1168725 [details]
Profiling of GetDeviceList

To view the profiling, use:
contrib/profile-stats -r 500 -s cumu device.prof | less

To create the file, add the following decorator to the method:

    @profile('/run/vdsm/device.prof', clock='wall', threads=False)
    def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
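
For context only: the @profile decorator above is vdsm's own, and contrib/profile-stats is its matching reader. As a general illustration of the technique (not vdsm's implementation), a minimal cProfile-based stand-in for such a decorator could look like this:

import cProfile
import functools
import time

def profile(path, clock='wall', threads=False):
    """Dump a wall-clock profile of the decorated function to `path`.

    Minimal stand-in for vdsm's decorator quoted above; `clock` and
    `threads` are accepted only for signature parity and are ignored
    (a wall-clock timer is always used).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            prof = cProfile.Profile(time.time)  # wall-clock timer
            try:
                return prof.runcall(func, *args, **kwargs)
            finally:
                prof.dump_stats(path)
        return wrapper
    return decorator

A dump produced this way can be read with Python's pstats module sorted by cumulative time, similar in spirit to the contrib/profile-stats invocation above.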

Comment 7 Fred Rolland 2016-06-28 10:58:56 UTC
*** Bug 1303583 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Kaul 2016-07-15 19:30:59 UTC
I wonder if the multipath issue will be worse if there are more paths per LUN (how many did you have)? For example, with 4 paths per LUN (a reasonable value for performance and high availability).

But I agree the UI bug is more critical to look at.
Defer this to 4.1?

Comment 9 Fred Rolland 2016-07-17 05:43:43 UTC
I had only one path per LUN.

Comment 10 Nir Soffer 2016-07-17 17:23:22 UTC
The time is reasonable for the number of devices and the operations we are doing
when scanning devices. This could take several minutes in the past; after we
optimized it, it now takes a couple of seconds, as we see in comment 5.

Comment 11 Allon Mureinik 2016-07-26 07:33:53 UTC
The same issue has existed for several versions. Pushing out to 4.0.4 to give time to properly verify this.

Comment 12 Allon Mureinik 2016-09-12 06:57:19 UTC
Fred, can you please explain exactly what was fixed here and how this fix can be verified?
I doubt this patch reduces the runtime below 6 seconds.

Comment 13 Nir Soffer 2016-09-12 07:38:14 UTC
(In reply to Allon Mureinik from comment #12)
> Fred, can you please explain exactly what was fixed here and how this fix
> can be verified?
> I doubt this patch reduces the runtime below 6 seconds.

This patch removes the multipath call, which consumed 29% of the time in vdsm,
shortening the delay in the engine when opening the manage domain dialog.

The change will probably be noticeable only when using 100 or more devices on 
the host.

The removed call was also known to get stuck if the server is misbehaving;
removing it prevents unlimited blocking at this point.

Verification should focus on correct operation:
- detect new devices added before creating a new domain or editing an existing domain
- detect resized devices
- normal operation otherwise

Improved performance can be verified by using a lot of devices (e.g. 500) and
comparing the time to open the manage domain dialog with 4.0.4 and with the
version including this fix.

Comment 14 Eldad Marciano 2016-11-03 15:02:45 UTC
verifying on top of 4.0.5 
it still takes ~5 sec.

lab profile:
1 SD with 100 luns.

[root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null

real	0m5.454s
user	0m0.430s
sys	0m0.012s

Comment 15 Nir Soffer 2016-11-03 15:07:40 UTC
(In reply to Eldad Marciano from comment #14)
> verifying on top of 4.0.5 
> it still takes ~5 sec.
> 
> lab profile:
> 1 SD with 100 luns.
> 
> [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> 
> real	0m5.454s
> user	0m0.430s
> sys	0m0.012s

Please run the same tests with 4.0.5, and with the latest version that does not
include this fix.

You should expect about 30% improvement in 4.0.5. See comment 5 to understand why.

Comment 16 Fred Rolland 2016-11-03 15:14:52 UTC
(In reply to Eldad Marciano from comment #14)
> verifying on top of 4.0.5 
> it still takes ~5 sec.
> 
> lab profile:
> 1 SD with 100 luns.
> 
> [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> 
> real	0m5.454s
> user	0m0.430s
> sys	0m0.012s

You should use:
 time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

This is how the engine calls this verb. It will skip the pvcreate test for each LUN.
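
For a repeatable comparison of the two invocations (the default one from comment 14 and the engine-style one above), a small timing harness along the following lines could be used. This is only a sketch wrapping the vdsClient commands already shown in this bug; it assumes vdsClient is installed on the host:

#!/usr/bin/env python
# Time the two getDeviceList invocations discussed in comments 14-16.
import os
import subprocess
import time

COMMANDS = [
    ["vdsClient", "-s", "0", "getDeviceList"],                    # default (comment 14)
    ["vdsClient", "-s", "0", "getDeviceList", "ISCSI", "False"],  # engine-style (comment 16)
]

with open(os.devnull, "wb") as devnull:
    for cmd in COMMANDS:
        start = time.time()
        subprocess.check_call(cmd, stdout=devnull)
        print("%-55s %.3f s" % (" ".join(cmd), time.time() - start))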

Comment 17 Eldad Marciano 2016-11-06 09:36:57 UTC
(In reply to Nir Soffer from comment #15)
> (In reply to Eldad Marciano from comment #14)
> > verifying on top of 4.0.5 
> > it still takes ~5 sec.
> > 
> > lab profile:
> > 1 SD with 100 luns.
> > 
> > [root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null
> > 
> > real	0m5.454s
> > user	0m0.430s
> > sys	0m0.012s
> 
> Please run the same tests with 4.0.5, and with the latest version that does
> not
> include this fix.
> 
> You should expect about 30% improvement in 4.0.5. See comment 5 to
> understand why.

Mordechai mentioned 6-10 sec with build vdsm-4.17.17-0.el7ev.noarch
(see the bug description).

Now we have an average of 3-5 sec, using the invocation Fred asked to run:
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m4.013s
user	0m0.431s
sys	0m0.014s
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m5.343s
user	0m0.446s
sys	0m0.032s
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real	0m3.949s
user	0m0.439s
sys	0m0.009s
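
As a quick arithmetic check of the 3-5 sec average reported above (values copied from the three 'real' times; nothing else assumed):

# Average wall-clock time of the three vdsClient runs quoted in this comment.
runs = [4.013, 5.343, 3.949]  # 'real' times in seconds
print("average: %.2f s over %d runs" % (sum(runs) / len(runs), len(runs)))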

Comment 18 Allon Mureinik 2016-11-24 09:40:42 UTC
4.0.5 has been released, closing.