Bug 1533035
| Summary: | [Scale] getDeviceList slowdown when passing big list of guids | ||
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Idan Shaby <ishaby> |
| Component: | Core | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | guy chen <guchen> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.20.15 | CC: | amureini, bugs, ebenahar, ishaby, lveyde, nsoffer, tnisan |
| Target Milestone: | ovirt-4.2.2 | Keywords: | Performance |
| Target Release: | 4.20.18 | Flags: | rule-engine: ovirt-4.2+, rule-engine: exception+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | vdsm v4.20.18 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-04-18 12:24:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Idan, can you add more info on how you tested this issue? What topology was tested?

I don't think this bug will be ready for 4.2.1. This requires testing in a scale environment to make sure we don't create regressions for the common case of creating a domain with one or two devices.

I tried verifying this with a short script that issues vdsm-client Host getDeviceList 50 times and calculates the average time a call takes, using the time that's reported in vdsm.log. So:

1. Without supplying a list of devices, avg response time: 1.9788
2. With a full list (all devices), avg response time: 1.9704
3. With a partial list of devices, avg response time: 2.217

Used builds:
vdsm-4.20.19-1.el7ev.x86_64
rhvm-4.2.2.1-0.1.el7.noarch

I'm not so sure about the verification/fix. Any ideas?

(In reply to Natalie Gavrielov from comment #3)
The difference between 1.97 and 2.2 is not interesting, and this is not what we tried to fix. The bug title is misleading. The fix was to avoid the slowdown when you pass a big number of guids to getDeviceList.

We use getDeviceList in two ways:

1. Getting all devices
   Call without a guids list, and with checkStatus=False.

2. Checking if some devices are used
   Call with a list of guids, and with checkStatus=True. This call may take more time than the first call since we run lvm pvcreate --test.

The way to verify this is:

1. Reproduce with older vdsm
   - Set up a system with many LUNs, at least 100
   - 50 block storage domains
   - Call vdsm-client with a guids list using this json:

     { "checkStatus": true, "guids": [ "guid1", "guid2", ... ] }

   If you compare calling with one guid and with 16, you will see a big difference in the time to complete the request. See https://github.com/oVirt/vdsm/commit/77e182420b613a3a535cf303e7777932dffde354 for more details.

2. Verify the fix
   With new vdsm, calling with one guid or with many should take mostly the same time.
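The two usages of getDeviceList described above, and the "call it N times and average" check, can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`build_request`, `average_seconds`, `run_vdsm_client`) are hypothetical, and the timing here is client-side wall-clock time, which includes vdsm-client startup overhead and is therefore not the same number as the time reported in vdsm.log. The only command used is the `vdsm-client -f - Host getDeviceList` invocation already quoted in this report.

```python
import json
import statistics
import subprocess
import time
from typing import Callable, Optional, Sequence


def build_request(guids: Optional[Sequence[str]] = None) -> str:
    """Return JSON parameters for one getDeviceList call (hypothetical helper).

    Mirrors the two usages described in this thread:
    - no guids: list all devices, with checkStatus false
    - guids given: check whether those devices are used, with checkStatus true
    """
    if guids:
        params = {"checkStatus": True, "guids": list(guids)}
    else:
        params = {"checkStatus": False}
    # json.dumps emits lowercase true/false, as JSON requires.
    return json.dumps(params)


def average_seconds(call: Callable[[str], object], request: str,
                    repeat: int = 50) -> float:
    """Invoke call(request) `repeat` times and return the mean wall-clock time."""
    samples = []
    for _ in range(repeat):
        start = time.monotonic()
        call(request)
        samples.append(time.monotonic() - start)
    return statistics.mean(samples)


def run_vdsm_client(request: str) -> None:
    """Feed the request to vdsm-client on stdin. Needs a host running vdsm."""
    subprocess.run(
        ["vdsm-client", "-f", "-", "Host", "getDeviceList"],
        input=request,
        text=True,
        check=True,
        stdout=subprocess.DEVNULL,
    )
```

On a host with vdsm, comparing `average_seconds(run_vdsm_client, build_request(["guid1"]))` with the same call for a longer guids list would give the one-versus-many comparison suggested in the comment; the guid values are placeholders to be replaced with real LUN guids.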
(In reply to Idan Shaby from comment #0)
> Additional info:
> Note that the two calls are not exactly equivalent, as in the first call we
> use checkStatus = false and in the second call we use checkStatus = true.

The calls are not related and it does not make sense to compare them; they have different purposes and are used in different flows.

> A better (though less convenient) way to see the real difference is by
> measuring the execution time of getDeviceList when called by vdsm-client:
>
> - Without the guids list:
> vdsm-client Host getDeviceList

This example is wrong; we never call getDeviceList without parameters. When calling without a guids list, you must specify checkStatus: false in the json.

> - With the guids list:
> cat << EOF | vdsm-client -f - Host getDeviceList
> {"guids" : [<a list of 25 guids>]}
> EOF

The right way to compare is to compare calls with different numbers of guids. We expect to have similar run time regardless of the number of LUNs.

Tested with new VDSM 4.20.23:
with 1 GUID took 11 seconds
with 100 GUIDs took 15 seconds

Tested with old VDSM 4.19.50-1:
with 1 GUID took 30 seconds
with 100 GUIDs took 2m43s

Thus moving bug to verified.

This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
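To make the verification numbers easier to compare, the short Python snippet below works out how much slower the 100-GUID call is than the 1-GUID call on each version. The timings are the ones reported in the verification comment; the `timings` layout and the `slowdown` helper are illustrative names only, not part of vdsm.

```python
# Timings reported in the verification comment, in seconds,
# keyed by the number of GUIDs passed to getDeviceList.
timings = {
    "4.19.50-1 (old)": {1: 30, 100: 2 * 60 + 43},  # 2m43s = 163 seconds
    "4.20.23 (new)": {1: 11, 100: 15},
}


def slowdown(by_guid_count):
    """Ratio of the 100-GUID run time to the 1-GUID run time."""
    return by_guid_count[100] / by_guid_count[1]


for version, result in timings.items():
    print(f"{version}: 100 GUIDs take {slowdown(result):.2f}x as long as 1 GUID")
# 4.19.50-1 (old): 100 GUIDs take 5.43x as long as 1 GUID
# 4.20.23 (new): 100 GUIDs take 1.36x as long as 1 GUID
```

The ratio drops from roughly 5.4x to roughly 1.4x, which matches the expectation stated above that run time should be similar regardless of the number of guids.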
Description of problem:
Passing a list of LUN guids to getDeviceList makes it run slower than calling it without the list, i.e. on all of the LUNs.

Version-Release number of selected component (if applicable):
122953f3f160b3c23510e4e151ea5d55c616186e

How reproducible:
100%

Steps to Reproduce:
1. Open the "New Domain" pop-up, and navigate to a block storage type (iSCSI for example).
2. See in the vdsm and engine logs how much time it takes for getDeviceList to execute.
3. Now select a few LUNs and try to add the domain.

Actual results:
getDeviceList takes a lot more time to run when it receives a guids list than when it does not, i.e. when it returns the list of all the visible devices.

Expected results:
When running on a few devices, I would expect getDeviceList to take no longer than when it runs on all of the devices.

Additional info:
Note that the two calls are not exactly equivalent, as in the first call we use checkStatus = false and in the second call we use checkStatus = true.
A better (though less convenient) way to see the real difference is by measuring the execution time of getDeviceList when called by vdsm-client:

- Without the guids list:
vdsm-client Host getDeviceList

- With the guids list:
cat << EOF | vdsm-client -f - Host getDeviceList
{"guids" : [<a list of 25 guids>]}
EOF