Bug 1303578
Summary: [scale] GetDeviceListVDSCommand takes 6-10s on average for large storage domains

| Field | Value | Field | Value |
|---|---|---|---|
| Product | oVirt ovirt-engine | Reporter | mlehrer |
| Component | BLL.Storage | Assignee | Fred Rolland <frolland> |
| Status | CLOSED CURRENTRELEASE | QA Contact | Eldad Marciano <emarcian> |
| Severity | high | Priority | medium |
| Version | 3.6.2.5 | CC | amureini, bugs, frolland, gklein, nsoffer, tnisan |
| Target Milestone | ovirt-4.0.5 | Keywords | Performance |
| Target Release | 4.0.5 | Flags | rule-engine: ovirt-4.0.z+, rule-engine: planning_ack+, amureini: devel_ack+, gklein: testing_ack+ |
| Hardware | Unspecified | OS | Unspecified |
| Doc Type | Bug Fix | Type | Bug |
| Last Closed | 2016-11-24 09:40:42 UTC | oVirt Team | Storage |
| Bug Blocks | 1303583 | | |
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

oVirt 4.0 beta has been released, moving to the RC milestone.

From my testing on 4.0, the command takes less than 4 seconds with 100 LUNs exposed to the host.

Engine:

```
2016-06-16 11:45:07,436 START, GetDeviceListVDSCommand
2016-06-16 11:45:10,931 FINISH, GetDeviceListVDSCommand (3495 ms)
```

VDSM:

```
jsonrpc.Executor/4::INFO::2016-06-16 11:45:07,452::logUtils::49::dispatcher::(wrapper) Run and protect: getDeviceList
jsonrpc.Executor/4::INFO::2016-06-16 11:45:10,683::logUtils::52::dispatcher::(wrapper) Run and protect: getDeviceList, Return response (3231 ms)
```

The verb has 3 main parts:

1. Refresh storage (~50% of the time): rescan iSCSI/FC, run multipath, resize devices
2. Reload PVs (~25% of the time)
3. Get device and path info (~25% of the time)

From the profiling, the only thing that may be worth considering is investigating whether the multipath call is needed (900 ms). I am not sure we should invest in improving this verb, as it is not called frequently. Fixing the UI bug will have much more impact than any other improvement in this flow (https://bugzilla.redhat.com/show_bug.cgi?id=1259353).

Detailed profiling:

```
time  filename:lineno(function)
3.178 |hsm.py:1948(HSM._getDeviceList)
1.546 |--> sdc.py:80(StorageDomainCache.refreshStorage)
1.489 |--> multipath.py:65(rescan)
0.481 |--> iscsi.py:431(rescan)                 **** iscsi scan cmd ****
0.076 |--> storage/hba.py:53(rescan)            **** fc scan cmd ****
0.912 |--> multipath.py:88(multipath)           **** 'multipath' cmd ****
0.056 |--> multipath.py:92(resize_devices)
0.756 |--> lvm.py:843(getAllPVs)
0.756 |--> lvm.py:317(LVMCache._reloadpvs)
0.755 |--> lvm.py:290(LVMCache.cmd)             **** 'pvs' cmd ****
0.871 |--> multipath.py:213(pathListIter)
0.029 |--> devicemapper.py:176(getPathsStatus)  **** 'dmsetup status' cmd ****
```

Created attachment 1168725 [details]
Profiling of GetDeviceList

To view the profiling data, use:

```
contrib/profile-stats -r 500 -s cumu device.prof | less
```

To create the file, add the following decorator to the method:

```
@profile('/run/vdsm/device.prof', clock='wall', threads=False)
def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
```
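For readers without a vdsm checkout: assuming the `.prof` file written by the `@profile` decorator is pstats-compatible (which is what `contrib/profile-stats` parses), the Python standard library alone can produce a similar cumulative-time view. This is a standalone sketch, not the vdsm tool; it creates its own small demo profile so it runs anywhere, and `demo.prof` is a hypothetical file name.

```python
import cProfile
import pstats

def show_top(path, limit=15):
    """Print the top entries of a pstats-compatible profile dump,
    sorted by cumulative time (roughly what `profile-stats -s cumu` shows)."""
    stats = pstats.Stats(path)
    stats.sort_stats("cumulative")
    stats.print_stats(limit)

# Demo: create a small profile file, then inspect it.  With vdsm you would
# instead point show_top() at the file written by the @profile decorator,
# e.g. "/run/vdsm/device.prof".
profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(100_000))   # stand-in workload
profiler.disable()
profiler.dump_stats("demo.prof")

show_top("demo.prof", limit=5)
```

The output lists `cumtime` per function, which is the column the 3.178 s / 0.912 s figures above come from.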
*** Bug 1303583 has been marked as a duplicate of this bug. ***

I wonder if the multipath issue gets worse with more paths per LUN (how many did you have?). For example, with 4 paths per LUN (a reasonable value for performance and high availability). But I agree the UI bug is more critical to look at. Defer this to 4.1?

I had only one path per LUN.

The time is reasonable for the number of devices and the operations we perform when scanning devices. This could take several minutes in the past; after we optimized it, it now takes a couple of seconds, as shown in comment 5. The same issue has existed for several versions.

Pushing out to 4.0.4 to give time to properly verify this.

Fred, can you please explain exactly what was fixed here and how this fix can be verified? I doubt this patch reduces the runtime below 6 seconds.

(In reply to Allon Mureinik from comment #12)

This patch removes the multipath call, which consumed 29% of the time in vdsm, shortening the delay in the engine when opening the manage-domain dialog. The change will probably be noticeable only when using 100 or more devices on the host. The removed call was also known to get stuck if the server misbehaves; removing it prevents unbounded blocking at this point.

Verification should focus on correct operation:

- detect new devices added before creating a new domain or editing an existing domain
- detect resized devices
- normal operation otherwise

Improved performance can be verified by using a lot of devices (e.g. 500) and comparing the time to open the manage-domain dialog with 4.0.4 and with the version that includes this fix.

Verifying on top of 4.0.5: it still takes ~5 sec.

Lab profile: 1 SD with 100 LUNs.

```
[root@vega09 ~]# time vdsClient -s 0 getDeviceList > /dev/null

real    0m5.454s
user    0m0.430s
sys     0m0.012s
```

(In reply to Eldad Marciano from comment #14)

Please run the same tests with 4.0.5 and with the latest version that does not include this fix. You should expect about a 30% improvement in 4.0.5; see comment 5 to understand why.

(In reply to Eldad Marciano from comment #14)

You should use:

```
time vdsClient -s 0 getDeviceList ISCSI False > /dev/null
```

This is how the engine calls this verb. It will skip the pvcreate test for each LUN.

(In reply to Nir Soffer from comment #15)

Mordechai mentioned 6-10 sec with build vdsm-4.17.17-0.el7ev.noarch (see the bug description). We now have an average of 3-5 sec, using the command Fred asked to run:

```
[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real    0m4.013s
user    0m0.431s
sys     0m0.014s

[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real    0m5.343s
user    0m0.446s
sys     0m0.032s

[root@vega09 ~]# time vdsClient -s 0 getDeviceList ISCSI False > /dev/null

real    0m3.949s
user    0m0.439s
sys     0m0.009s
```

4.0.5 has been released, closing.
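As a quick cross-check of the figures quoted in this thread (a standalone sketch; the input numbers are copied verbatim from the comment 5 profiling and the final vdsClient timing runs):

```python
# 1) The fix removes the 'multipath' command, said to consume "29% of the
#    time in vdsm".  From the profiling in comment 5:
total_s = 3.178       # hsm.py:1948 (HSM._getDeviceList), total time
multipath_s = 0.912   # multipath.py:88, the 'multipath' cmd

share = multipath_s / total_s
print(f"multipath share of _getDeviceList: {share:.1%}")  # -> 28.7%, i.e. the ~29% quoted

# 2) The final verification reports "an avg of 3-5 sec" from three runs:
runs = [4.013, 5.343, 3.949]   # 'real' times from the vdsClient runs above
avg = sum(runs) / len(runs)
print(f"average runtime: {avg:.3f} s")  # -> 4.435 s, inside the 3-5 s range
```

Both quoted numbers are consistent with the raw data.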
Created attachment 1120068 [details]
ovirt log showing GetDeviceList execution start/stop time

Description of problem:

GetDeviceListVDSCommand takes 6-10s per execution. In certain admin flows, such as editing an existing storage domain, this call may be executed several times sequentially, leading to long waits for the dialog box to appear in the ovirt-engine UI and a sluggish user experience.

Required for this environment: 100 LUNs exposed to each host, with data domains built from multiple LUNs per individual iSCSI data storage domain (e.g. 7 LUNs in 1 SD, 100 total LUNs exposed to the host).

Version-Release number of selected component (if applicable):

```
vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
libgovirt-0.3.2-1.el6.x86_64
rhevm-setup-plugin-ovirt-engine-common-3.6.2.5-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.2.5-0.1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
```

How reproducible:

Very

Steps to Reproduce:
1. Create an iSCSI data domain from multiple LUNs, ensuring at least 100 LUNs are exposed to the host.
2. Create the data domain.
3. Click "edit domain".
4. See ovirt-engine.log for the GetDeviceListVDSCommand execution time.

Actual results:
See the attached ovirt.log.

Expected results:
Faster response time, or a UI message warning that the number of exposed LUNs may require additional time to perform the action.

Additional info:
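The verification in this bug times single `vdsClient` runs by hand; run-to-run variance (3.9-5.3 s above) makes a multi-run average more meaningful. A rough shell sketch of that workflow follows. It is not part of the bug; `CMD` uses a cheap stand-in so the script runs anywhere, and on a real host you would set it to `vdsClient -s 0 getDeviceList ISCSI False` as suggested in the comments. `date +%s%N` assumes GNU date.

```shell
# Time a command RUNS times and report the average wall-clock time.
RUNS=3
CMD="sleep 0.1"    # stand-in; on a host: CMD="vdsClient -s 0 getDeviceList ISCSI False"

total=0
i=0
while [ "$i" -lt "$RUNS" ]; do
    start=$(date +%s%N)          # wall clock in nanoseconds (GNU date)
    $CMD > /dev/null 2>&1
    end=$(date +%s%N)
    total=$(( total + (end - start) ))
    i=$(( i + 1 ))
done

avg_ms=$(( total / RUNS / 1000000 ))
echo "average over ${RUNS} runs: ${avg_ms} ms"
```

Comparing this average on 4.0.4 and 4.0.5 with the same LUN layout is the apples-to-apples test Nir asks for in comment 15.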