Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1533035

Summary: [Scale] getDeviceList slowdown when passing big list of guids
Product: [oVirt] vdsm
Reporter: Idan Shaby <ishaby>
Component: Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: guy chen <guchen>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.15
CC: amureini, bugs, ebenahar, ishaby, lveyde, nsoffer, tnisan
Target Milestone: ovirt-4.2.2
Keywords: Performance
Target Release: 4.20.18
Flags: rule-engine: ovirt-4.2+, rule-engine: exception+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm v4.20.18
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-18 12:24:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Idan Shaby 2018-01-10 10:19:52 UTC
Description of problem:
Passing a list of LUN guids to getDeviceList makes it run slower than calling it without the list, i.e. on all of the LUNs.

Version-Release number of selected component (if applicable):
122953f3f160b3c23510e4e151ea5d55c616186e

How reproducible:
100%

Steps to Reproduce:
1. Open the "New Domain" pop up, and navigate to a block storage type (iSCSI for example).
2. See in the vdsm and engine logs how much time it takes for getDeviceList to execute.
3. Now select a few LUNs and try to add the domain.

Actual results:
getDeviceList takes a lot more time to run when it receives a guids list than when it does not, i.e. when it returns the list of all the visible devices.

Expected results:
When running on a few devices, I would expect getDeviceList to take no longer than when it runs on all of the devices.


Additional info:
Note that the two calls are not exactly equivalent, as in the first call we use checkStatus = false and in the second call we use checkStatus = true.
A better (though less convenient) way to see the real difference is by measuring the execution time of getDeviceList when called by vdsm-client:

- Without the guids list:
vdsm-client Host getDeviceList

- With the guids list:
cat << EOF | vdsm-client -f - Host getDeviceList
{"guids" : [<a list of 25 guids>]}
EOF
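
A minimal sketch of timing both invocations from Python, assuming vdsm-client is on the PATH of the host being tested. The helper names (build_command, timed_run) are hypothetical and not part of vdsm; only the two command lines come from the description above. It measures client-side wall-clock time, so process startup is included.

```python
import json
import subprocess
import time


def build_command(guids=None):
    """Return (argv, stdin_text) for one getDeviceList call.

    With no guids this is the bare call from the report; with guids the
    parameters are passed as JSON on stdin via "-f -".
    """
    if guids is None:
        return ["vdsm-client", "Host", "getDeviceList"], None
    argv = ["vdsm-client", "-f", "-", "Host", "getDeviceList"]
    return argv, json.dumps({"guids": list(guids)})


def timed_run(argv, stdin_text=None):
    """Run argv once and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(argv, input=stdin_text, text=True,
                   stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start
```

For example, `timed_run(*build_command(["guid1", "guid2"]))` would time one call with two placeholder guids; real LUN guids have to be substituted.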

Comment 1 Nir Soffer 2018-01-10 16:26:07 UTC
Idan, can you add more info on how you tested this issue? What topology was tested?

Comment 2 Nir Soffer 2018-01-10 16:26:52 UTC
I don't think this bug will be ready for 4.2.1. This requires testing in a scale
environment to make sure we don't create regressions for the common case of
creating a domain with one or two devices.

Comment 3 Natalie Gavrielov 2018-02-28 14:10:11 UTC
I tried verifying this with a short script that issues vdsm-client Host getDeviceList 50 times and calculates the average time a call takes, using the time that's reported in vdsm.log.
So:
1. Without supplying a list of devices, avg response time: 1.9788
2. With a full list (all devices), avg response time: 1.9704
3. With a partial list of devices, avg response time: 2.217
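
The averaging step can be sketched as follows. This is only an illustration: it times a caller-supplied function from the client side, whereas the numbers above were taken from the times reported in vdsm.log. The names average_seconds and run_once are hypothetical.

```python
import statistics
import time


def average_seconds(run_once, runs=50):
    """Call run_once() `runs` times and return the mean elapsed seconds."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        run_once()
        samples.append(time.monotonic() - start)
    return statistics.mean(samples)
```

Here run_once could be any zero-argument callable that issues one vdsm-client Host getDeviceList request.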

Used builds:
vdsm-4.20.19-1.el7ev.x86_64
rhvm-4.2.2.1-0.1.el7.noarch

I'm not so sure about the verification/fix.. any ideas?

Comment 4 Nir Soffer 2018-02-28 16:05:14 UTC
(In reply to Natalie Gavrielov from comment #3)
The difference between 1.97 and 2.2 is not interesting, and this is not what we
tried to fix. The bug title is misleading.

The fix was to avoid the slowdown when you pass a big number of guids to getDeviceList.

We use getDeviceList in two ways:

1. Getting all devices
Call without guids list, and with checkStatus=False

2. Check if some devices are used
Call with list of guids, and with checkStatus=True

This call may take more time than the first call since we run lvm pvcreate --test.
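
As a sketch, the parameters for the two kinds of call can be generated from Python, which also ensures valid JSON (true/false rather than Python's True/False). The function names are hypothetical; the guids and checkStatus keys are the ones named in this bug.

```python
import json


def all_devices_params():
    """Parameters for listing all devices: no guids, no status check."""
    return {"checkStatus": False}


def check_devices_params(guids):
    """Parameters for checking whether specific devices are in use."""
    return {"guids": list(guids), "checkStatus": True}


def to_json(params):
    """Serialize parameters to JSON text that can be piped to the client."""
    return json.dumps(params, indent=4)
```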

The way to verify this is:

1. Reproduce with older vdsm

- Set up a system with many LUNs, at least 100
- 50 block storage domains
- Call vdsm-client with a guids list using this JSON:

{
    "checkStatus": true,
    "guids": [
        "guid1",
        "guid2",
        ...
    ]
}

If you compare calling with one guid and with 16, you will see a big difference
in the time to complete the request.

See https://github.com/oVirt/vdsm/commit/77e182420b613a3a535cf303e7777932dffde354
for more details.

2. Verify the fix

With new vdsm, calling with one guid or many should take mostly the same time.

Comment 5 Nir Soffer 2018-02-28 16:12:10 UTC
(In reply to Idan Shaby from comment #0)
> Additional info:
> Note that the two calls are not exactly equivalent, as in the first call we
> use checkStatus = false and in the second call we use checkStatus = true.

The calls are not related and it does not make sense to compare them; they have
different purposes and are used in different flows.

> A better (though less convenient) way to see the real difference is by
> measuring the execution time of getDeviceList when called by vdsm-client:
> 
> - Without the guids list:
> vdsm-client Host getDeviceList

This example is wrong; we never call getDeviceList without parameters. When calling
without a guids list, you must specify "checkStatus": false in the JSON.

> - With the guids list:
> cat << EOF | vdsm-client -f - Host getDeviceList
> {"guids" : [<a list of 25 guids>]}
> EOF

The right way to compare is to call with different numbers of guids.
We expect a similar run time regardless of the number of LUNs.
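
A sketch of such a comparison, assuming a caller-supplied function that performs one getDeviceList request for a given list of guids (for example by piping JSON to vdsm-client). The names time_by_guid_count and call are hypothetical, and the default counts are only examples.

```python
import time


def time_by_guid_count(call, guids, counts=(1, 16, 100)):
    """Time call(subset) for growing prefixes of guids.

    Returns a dict mapping the number of guids passed to elapsed seconds.
    Counts larger than len(guids) are skipped.
    """
    results = {}
    for count in counts:
        if count > len(guids):
            continue
        subset = guids[:count]
        start = time.monotonic()
        call(subset)
        results[count] = time.monotonic() - start
    return results
```

If the fix works as intended, the values in the returned dict should be close to each other rather than growing with the count.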

Comment 6 guy chen 2018-04-09 10:34:37 UTC
Tested with new VDSM 4.20.23:
with 1 GUID took 11 seconds
with 100 GUIDs took 15 seconds

Tested with old VDSM 4.19.50-1:
with 1 GUID took 30 seconds
with 100 GUIDs took 2m43s

Thus moving bug to verified.
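
To make the comparison explicit, the slowdown from 1 to 100 GUIDs can be computed from the timings above (2m43s is 163 seconds). This is a small sketch; slowdown is a hypothetical helper.

```python
def slowdown(one_guid_seconds, many_guids_seconds):
    """Return how many times slower the many-guids call is."""
    return many_guids_seconds / one_guid_seconds


old = slowdown(30, 2 * 60 + 43)   # VDSM 4.19.50-1: 30s vs 2m43s
new = slowdown(11, 15)            # VDSM 4.20.23: 11s vs 15s

print(f"old: {old:.1f}x, new: {new:.1f}x")  # old: 5.4x, new: 1.4x
```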

Comment 7 Sandro Bonazzola 2018-04-18 12:24:41 UTC
This bug is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.