Bug 1557147 - No efficient api method to collect metadata of all volumes from sd
Summary: No efficient api method to collect metadata of all volumes from sd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: 4.4.1
Assignee: Amit Bawer
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks: 1489145 1839444
 
Reported: 2018-03-16 05:46 UTC by Germano Veit Michel
Modified: 2021-03-02 14:27 UTC
CC List: 14 users

Fixed In Version: vdsm-4.40.18
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 13:26:06 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
abawer: needinfo+


Attachments
wall-clock profile dump per comment #15, can be viewed using kcachegrind (to install: dnf install kcachegrind graphviz). (609.48 KB, text/plain)
2020-01-16 09:16 UTC, Amit Bawer
no flags Details
bulk API runs and profile on scale host 1030 vms (258.12 KB, application/gzip)
2020-05-21 18:26 UTC, Amit Bawer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5805271 0 None None None 2021-03-02 14:25:59 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:27:05 UTC
oVirt gerrit 97859 0 'None' ABANDONED sos: increase dump-volume-chains timeout 2021-02-15 17:58:34 UTC
oVirt gerrit 106393 0 master MERGED lvm: Mark stalelv = False when done with reloadlvs operation for all lvs 2021-02-15 17:58:21 UTC
oVirt gerrit 106406 0 master MERGED lvm_test: Amend test_reload_lvs_with_stale_lv for cached LVM 2021-02-15 17:58:22 UTC
oVirt gerrit 106542 0 master ABANDONED tool: Use concurrent calls for getVolumeInfo() in dump-volume-chains 2021-02-15 17:58:21 UTC
oVirt gerrit 106569 0 master ABANDONED API: Add getAllVolumesImages() API 2021-02-15 17:58:22 UTC
oVirt gerrit 106570 0 master ABANDONED tool: Use getVolumesImages() API in dump-volume-chains 2021-02-15 17:58:22 UTC
oVirt gerrit 106647 0 master ABANDONED API: Add dumpStorageDomain() API 2021-02-15 17:58:22 UTC
oVirt gerrit 106648 0 master ABANDONED tool: Add dump_volumes_metadata module 2021-02-15 17:58:23 UTC
oVirt gerrit 106828 0 master ABANDONED BlockSD: Add metadata parsing failure info to dump-volumes-metadata 2021-02-15 17:58:23 UTC
oVirt gerrit 106829 0 master ABANDONED FileSD: Add MD parsing failure info to getStorageDomainBulkMetadata() 2021-02-15 17:58:23 UTC
oVirt gerrit 106874 0 master ABANDONED blocksd_test: Add test_dump_sd_metadata 2021-02-15 17:58:23 UTC
oVirt gerrit 106887 0 master ABANDONED localfssd_test: Add test_dump_sd_metadata 2021-02-15 17:58:29 UTC
oVirt gerrit 106888 0 master ABANDONED nfssd_test: Add test_dump_sd_metadata 2021-02-15 17:58:29 UTC
oVirt gerrit 106957 0 master ABANDONED volumemetadata: Use todict() in storage_format() 2021-02-15 17:58:29 UTC
oVirt gerrit 106997 0 master MERGED tests: move chmod() contextmanager to storagetestlib 2021-02-15 17:58:29 UTC
oVirt gerrit 107033 0 master MERGED API: Add dumpStorageDomain() API 2021-02-15 17:58:29 UTC
oVirt gerrit 107052 0 master MERGED API: Remove 'storagedomainID' from StorageDomain.ctorArgs 2021-02-15 17:58:29 UTC
oVirt gerrit 107055 0 master MERGED volumemetadata: Add dump() method for dumping the MD dict per volume 2021-02-15 17:58:30 UTC
oVirt gerrit 107064 0 master ABANDONED volumemetadata: Add strict option to from_lines() method. 2021-02-15 17:58:30 UTC
oVirt gerrit 107084 0 master MERGED API, blockSD: Add 'volumes' section to dumpStorageDomain() 2021-02-15 17:58:30 UTC
oVirt gerrit 107125 0 master MERGED API, fileSD: Add 'volumes' section for dumpStorageDomain() 2021-02-15 17:58:30 UTC
oVirt gerrit 107313 0 master ABANDONED fileSD: concurrent the meta file reads for volumes dump API 2021-02-15 17:58:30 UTC
oVirt gerrit 107999 0 master MERGED blockSD: Don't break for bad md tag on occupied_metadata_slots() 2021-02-15 17:58:30 UTC
oVirt gerrit 108286 0 master MERGED fileSD: Add 'leases' section to dumpStorageDomain() 2021-02-15 17:58:30 UTC
oVirt gerrit 108414 0 master MERGED sanlock_direct: Add module for sanlock direct command 2021-02-15 17:58:31 UTC
oVirt gerrit 108415 0 master MERGED blockSD: Add 'leases' section to dumpStorageDomain() 2021-02-15 17:58:31 UTC
oVirt gerrit 108465 0 master MERGED blockSD: Make occupied_metadata_slots a module function 2021-02-15 17:58:31 UTC
oVirt gerrit 108469 0 master MERGED blockSD: Add 'xleases' section to dumpStorageDomain() 2021-02-15 17:58:31 UTC
oVirt gerrit 108516 0 master MERGED fileSD: Add 'xleases' section to dumpStorageDomain() 2021-02-15 17:58:32 UTC
oVirt gerrit 108537 0 master MERGED blockSD: Add 'lockspace' section to dumpStorageDomain() 2021-02-15 17:58:31 UTC
oVirt gerrit 108538 0 master MERGED fileSD: Add 'lockspace' section to dumpStorageDomain() 2021-02-15 17:58:31 UTC
oVirt gerrit 108561 0 master MERGED fakesanlock: Add dump_leases and dump_lockspace methods 2021-02-15 17:58:32 UTC
oVirt gerrit 108636 0 master MERGED sd: Move RESERVED_LEASES to sd constants 2021-02-15 17:58:32 UTC
oVirt gerrit 108637 0 master ABANDONED fileSD, blockSD: Amend relative offsets in leases dump 2021-02-15 17:58:32 UTC
oVirt gerrit 108867 0 master MERGED blockSD: Include removed image volumes in dumpStorageDomain() 2021-02-15 17:58:32 UTC
oVirt gerrit 108975 0 master MERGED fileSD: Include removed image volumes in dumpStorageDomain() 2021-02-15 17:58:32 UTC
oVirt gerrit 109075 0 master MERGED blockSD: Dump volume status as invalid if failed to get its size 2021-02-15 17:58:32 UTC
oVirt gerrit 109122 0 master MERGED API, fileSD, blockSD: Add full parameter to dumpStorageDomain() 2021-02-15 17:58:33 UTC
oVirt gerrit 109162 0 master MERGED xlease: Don't break on bad index record when dumping leases 2021-02-15 17:58:33 UTC
oVirt gerrit 109188 0 master MERGED xlease_test: Assert on set of leases keys 2021-02-15 17:58:33 UTC
oVirt gerrit 109344 0 master MERGED fileSD: Dump image UUID for volumes with invalid metadata 2021-02-15 17:58:33 UTC
oVirt gerrit 109358 0 master MERGED API: Add annotations to VolumeDump optional fields 2021-02-15 17:58:34 UTC

Description Germano Veit Michel 2018-03-16 05:46:07 UTC
Description of problem:

The vdsm tool dump-volume-chains uses a succession of calls to capture the metadata of the volumes of a particular SD. This is inefficient and takes quite some time once the number of volumes increases. I'm afraid it does this because there is no other mechanism available.

With just 300 volumes in my lab
real    1m26.832s                              
user    0m1.483s                               
sys     0m0.219s

Basically:
getConnectedStoragePools                               
foreach image in getImages()
    foreach volume in getVolumes()
        getInfo()

We may want to have a verb to optimize this, collecting the metadata for an image, or for the entire SD, at once. As the number of volumes increases (now limited to ~1500 IIRC), the time necessary to capture the metadata seems to grow linearly.
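
To put rough numbers on the linear growth (back-of-the-envelope sketch only, assuming one volume per image as in the lab run above):

# Call count for the polling pattern above: 1 getConnectedStoragePools,
# 1 getImages, then one getVolumes per image and one getInfo per volume.
n_volumes = 300
calls = 1 + 1 + n_volumes + n_volumes
print(calls)          # 602 jsonrpc calls for 300 volumes
print(86.8 / calls)   # ~0.14s per call for the 1m26.8s run above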

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Populate a SD with hundreds of images
2. Run vdsm-tool dump-volume-chains

Comment 4 Nir Soffer 2018-11-07 10:53:53 UTC
We have the same issue when importing a storage domain during DR failover.

The main problem is that we store metadata per volume, so we need to read every
volume's metadata separately, and we don't have an API for getting image metadata,
or all volumes' metadata, in one call.

We can add an API collecting metadata from all volumes on a storage domain.

On file storage, this API needs to read the volume.meta file for every volume, which
may be pretty slow.

On block storage all volume metadata is stored in the metadata LV, so we can read
the data from storage much faster.

This issue also affects changes to metadata. Because we store every volume's
metadata in a separate file or block on storage, there is no way to perform changes
atomically. This leads to failures during snapshots and live merge when the process
fails after we have changed some volume metadata but not all.
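
For illustration, a minimal sketch (not vdsm code) of what such a bulk API would do on file storage: walk all images/*/*.meta files under the domain directory in one pass instead of issuing one getInfo call per volume. The path layout here is an assumption based on the standard file-domain layout:

import glob
import os

def read_all_volume_metadata(domain_path):
    # Collect the raw metadata of every volume under a file-based domain
    # in a single pass. Sketch only; not the vdsm implementation.
    result = {}
    pattern = os.path.join(domain_path, "images", "*", "*.meta")
    for meta_path in glob.glob(pattern):
        img_uuid = os.path.basename(os.path.dirname(meta_path))
        vol_uuid = os.path.basename(meta_path)[:-len(".meta")]
        with open(meta_path) as f:
            # Each file is a small KEY=VALUE block; parsing it is cheap,
            # the cost is the many small per-file reads.
            result.setdefault(img_uuid, {})[vol_uuid] = f.read()
    return result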

Comment 5 Sandro Bonazzola 2019-01-28 09:44:35 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 7 Tal Nisan 2019-02-18 11:56:10 UTC
Michal, can the field engineering team look into changing the SOS report tool to avoid that timeout? We cannot supply a fix in VDSM for that issue

Comment 8 Nir Soffer 2019-02-18 12:53:28 UTC
(In reply to Tal Nisan from comment #7)
> Michal, can the field engineering team look into changing the SOS report
> tool to avoid that timeout? We cannot supply a fix in VDSM for that issue

vdsm sos plugin is part of vdsm, so maybe we can change the default timeout
for running this command.

Comment 9 Michal Skrivanek 2019-02-18 15:18:43 UTC
that looks quite simple. What would be a desired timeout then?

Comment 10 Germano Veit Michel 2019-02-19 01:13:24 UTC
(In reply to Nir Soffer from comment #8)
> vdsm sos plugin is part of vdsm, so maybe we can change the default timeout
> for running this command.

Yes, it seems we just need to pass the timeout parameter when calling dump-volume-chains.

(In reply to Michal Skrivanek from comment #9)
> that looks quite simple. What would be a desired timeout then?

Taking into account one local RHV lab we have with slow block storage, and that we have AlertOnNumberOfLVs=1300, a timeout of ~450-500s should do it for most cases. If it goes over AlertOnNumberOfLVs then I assume it's already in unsupported/untested territory.
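
For reference, something like the following in vdsm's sos plugin would do it. This is only a sketch: the plugin skeleton and command line are illustrative, the real plugin presumably determines the SD UUIDs itself, and older sos releases import from sos.plugins instead of sos.report.plugins; only the timeout= keyword of add_cmd_output() is the point.

from sos.report.plugins import Plugin, RedHatPlugin

class Vdsm(Plugin, RedHatPlugin):
    """Illustrative skeleton only, not the real vdsm sos plugin."""

    plugin_name = "vdsm"

    def setup(self):
        # Placeholder UUID; the real plugin would iterate over the
        # connected storage domains.
        sd_uuid = "d1c7782c-53a2-4a11-8b19-a83c39850fd0"
        # Allow dump-volume-chains up to ~500s as estimated above,
        # instead of relying on the default command timeout.
        self.add_cmd_output(
            "vdsm-tool dump-volume-chains %s" % sd_uuid,
            timeout=500)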

Comment 12 Germano Veit Michel 2019-02-19 03:11:09 UTC
Back to NEW state, that change is just a workaround to make it stop timing out on larger/busier environments.
I would still like to see a more efficient mechanism implemented to get the metadata.

Comment 13 Nir Soffer 2019-08-18 10:42:30 UTC
I think the main issue here is using an API that gets single volume info
to collect info for all volumes in a storage domain.

180     images_uuids = cli.StorageDomain.getImages(storagedomainID=sd_uuid)
181 
182     volumes_info = {}
183 
184     for img_uuid in images_uuids:
185         img_volumes_info = {}
186 
187         volumes_uuids = cli.StorageDomain.getVolumes(
188             storagedomainID=sd_uuid, storagepoolID=sp_uuid,
189             imageID=img_uuid)
190 
191         for vol_uuid in volumes_uuids:
192             vol_info = cli.Volume.getInfo(
193                 volumeID=vol_uuid, storagepoolID=sp_uuid,
194                 storagedomainID=sd_uuid, imageID=img_uuid)
195 
196             img_volumes_info[vol_uuid] = vol_info
197 
198         volumes_info[img_uuid] = img_volumes_info

These APIs are good for one call, but do not scale for getting information
about an entire domain.

Some APIs also return runtime status, like sanlock lease status, which is not
required when dumping entire domain metadata.

We need a new API for the DR/support use case, dumping the entire storage domain
metadata in one call.

On file base domains:
- reading sd-id/dom_md/metadata
- reading sd-id/images/*.meta
- some processing to determine which volumes are templates
- getting the backing chain for every image using qemu-img info --backing-chain
  (requires activating all volumes in an image)

On block domains:
- dropping vg's lv/vg/pvs cache
- reading vg's lvs/vg/pvs info
- reading and parsing entire metadata volume
- getting the backing chain for every image using qemu-img info --backing-chain
  (requires activating all volumes in an image)

These APIs should be available both from the vdsm API and from vdsm-tool, so
we can collect the info even if vdsm cannot run.

One issue with a bulk API is the handling of corrupted metadata. Typically vdsm
will fail to handle a volume with corrupted metadata, but in the dump context,
it should try to get what it can from the metadata, and mark the volume
as corrupted.
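
To illustrate that last point, the dump path could parse leniently, keeping whatever key=value pairs it can and only marking the volume, e.g. (sketch only; the line format and the INVALID marker are assumptions, not the vdsm implementation):

def dump_volume_md(lines):
    # Lenient parse for dump purposes: never raise on a corrupted entry,
    # return whatever was readable plus a status field.
    md = {"status": "OK"}
    for line in lines:
        if line.strip() == "EOF":
            break
        if "=" not in line:
            # Corrupted or truncated entry: keep what we have and
            # mark the volume instead of failing the whole dump.
            md["status"] = "INVALID"
            continue
        key, value = line.split("=", 1)
        md[key.strip()] = value.strip()
    return md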

Comment 14 Amit Bawer 2020-01-16 09:16:56 UTC
Created attachment 1652689 [details]
wall-clock profile dump per comment #15, can be viewed using kcachegrind (to install: dnf install kcachegrind graphviz).

Comment 15 Amit Bawer 2020-01-16 17:36:59 UTC
(In reply to Nir Soffer from comment #13)
> I think the main issue here is using an API that gets single volume info
> to collect info for all volumes in a storage domain.
> 
> 180     images_uuids = cli.StorageDomain.getImages(storagedomainID=sd_uuid)
> 181 
> 182     volumes_info = {}
> 183 
> 184     for img_uuid in images_uuids:
> 185         img_volumes_info = {}
> 186 
> 187         volumes_uuids = cli.StorageDomain.getVolumes(
> 188             storagedomainID=sd_uuid, storagepoolID=sp_uuid,
> 189             imageID=img_uuid)
> 190 
> 191         for vol_uuid in volumes_uuids:
> 192             vol_info = cli.Volume.getInfo(
> 193                 volumeID=vol_uuid, storagepoolID=sp_uuid,
> 194                 storagedomainID=sd_uuid, imageID=img_uuid)
> 195 
> 196             img_volumes_info[vol_uuid] = vol_info
> 197 
> 198         volumes_info[img_uuid] = img_volumes_info
> 
> These APIs are good for one call, but do not scale for getting information
> about an entire domain.

Profiling on an SD with 460 disks (LVs), we wait 22 wall-clock minutes for "time vdsm-tool dump-volume-chains d1c7782c-53a2-4a11-8b19-a83c39850fd0" to finish.

Since every disk in the SD has a base image, outer loop of line 184 is carried out 460 times:

The cli.StorageDomain.getVolumes call at line 187 causes Vdsm to run the LVMCache._reloadlvs() operation (called 472 times during profiling). Strangely, this happens on every execution of Vdsm's getVolumesList API, invoking the "lvs" command for the VG instead of using the existing LVM cache.

Running a single "lvs d1c7782c-53a2-4a11-8b19-a83c39850fd0" command on the tested host machine took ~2.8 seconds to finish,
so we already have 460*2.8s ≈ 21.5 minutes out of the total 22 minutes spent just getting the volume UUIDs for each image UUID.


Repeating the test with the following patch for LVMCache._reloadlvs():

diff --git a/lib/vdsm/storage/lvm.py b/lib/vdsm/storage/lvm.py
index 2a6c439a7..f973955a3 100644
--- a/lib/vdsm/storage/lvm.py
+++ b/lib/vdsm/storage/lvm.py
@@ -612,6 +612,9 @@ class LVMCache(object):
 
             log.debug("lvs reloaded")
 
+            if not lvNames:
+                self._stalelv = False
+
         return updatedLVs
 
     def _loadAllLvs(self):


we drop the total execution time to only 47 seconds for an SD with 460 disks:


time vdsm-tool dump-volume-chains d1c7782c-53a2-4a11-8b19-a83c39850fd0
...

   image:    f3398374-53af-4cca-8b4b-11ad48f6a8d4

             - fa572b7f-cf47-457b-ae61-12a29d65fc4a
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    eaf1544d-0bc3-4b16-8ca9-8e5dfbc3d5b1

             - 88d0318b-b3a2-4754-ab8b-569dc8d511c3
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    3f51ba58-5fc9-4ae6-a09f-d5125569a7a4

             - aa724c4b-2d62-4458-b642-6094d8b06d17
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    bbe4e48d-e666-4cf6-a2ae-3bcd122ff416

             - f9a5112a-83a8-4a23-adf4-ee257245ee6e
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    35ebf208-3694-4bc1-b368-5cb438947e5d

             - 89096229-988f-469e-a6b1-dcbfa927b458
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


real	0m47.305s
user	0m1.472s
sys	0m0.371s

Comment 16 Nir Soffer 2020-01-19 13:43:28 UTC
(In reply to Amit Bawer from comment #15)

> Since every disk in the SD has a base image, outer loop of line 184 is
> carried out 460 times:

This is bad.

> we drop the total execution time to only 47 seconds for an SD with 460
> disks:

Nice, but we support up to 1500 lvs per domain.

Can you test with at least 1000 LVs?

Also 2.8 seconds for one lvm command sounds slow. Maybe you are using
very old hardware?

Comment 17 Amit Bawer 2020-01-19 13:59:47 UTC
(In reply to Nir Soffer from comment #16)
> (In reply to Amit Bawer from comment #15)
> 
> > Since every disk in the SD has a base image, outer loop of line 184 is
> > carried out 460 times:
> 
> This is bad.
> 
> > we drop the total execution time to only 47 seconds for an SD with 460
> > disks:
> 
> Nice, but we support up to 1500 lvs per domain.
> 
> Can you test with at least 1000 LVs?

Would try.

> 
> Also 2.8 seconds for one lvm command sounds slow. Maybe you are using
> very old hardware?

Could be, we have a prime stock of old wines here, but accessing a RAM dict (LVMCache) should be 100 times faster than using LVM commands, which access files as well.

Comment 18 Amit Bawer 2020-01-21 15:35:23 UTC
Tested an SD with 1000 disks containing images (LVs); this time I used a host with 8x2GHz CPUs and 8GB RAM to gather comparative results faster.


1. Initial baseline run with master 4.40 Vdsm without patching LVMCache:

time vdsm-tool  dump-volume-chains 367a3231-8bb7-480a-9c0d-e69b0c6a9922

...


   image:    f8c0de22-ae8c-40d7-b850-ea439c957f0c

             - 6e782a5c-92ca-495a-a2a8-9dc9d4c2c51b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    c8d9e800-3640-4e7e-9fa5-699d39b41c66

             - 87f3d1f3-868d-45c2-8cea-6227181a5873
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    fa727b76-91eb-487c-90a5-87e3518410c5

             - cb1a729e-feca-4700-92de-8b1c5283af6a
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    e43c09b8-2f59-4f76-97f4-5a38a731f0e8

             - 885f44d5-da74-4ed4-9439-664443a9aa15
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


real	18m10.264s
user	0m3.397s
sys	0m0.798s


2. For the patched Vdsm [1], execution time is about 17 times faster:

time vdsm-tool  dump-volume-chains 367a3231-8bb7-480a-9c0d-e69b0c6a9922

...

   image:    61ecbdf5-737b-4078-8113-5026b98a5a22

             - f894163b-6ab1-4ef1-abe4-55677c150a3b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    ebf09f5c-5ef0-4fbb-a67f-78d9d6737d3f

             - ffca5f4b-0152-4034-9694-620c39bbf123
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    577f3055-b8b9-46f4-9cc7-fed4b728d620

             - 56e36cdb-184f-45dd-b348-6f6a0252ddef
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


   image:    18e8c4ec-56d6-4d22-a760-8675ace8e7e1

             - 4dab0000-5d55-406e-b70d-318ba28d6410
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824


real	1m3.531s
user	0m2.921s
sys	0m0.635s


On that host, running the "lvs VG" command took 0.9 seconds:

# time lvs 367a3231-8bb7-480a-9c0d-e69b0c6a9922 | wc -l
1011

real	0m0.929s
user	0m0.527s
sys	0m0.330s


[1] https://gerrit.ovirt.org/#/c/106393/15/

Comment 20 Amit Bawer 2020-05-21 18:26:19 UTC
Created attachment 1690768 [details]
bulk API runs and profile on scale host 1030 vms

Comment 21 Amit Bawer 2020-05-21 18:29:46 UTC
Attachment 1690768 [details] contains the results from testing the bulk API over a 1030-VM block SD.
There were no HA VMs there (no xleases to dump, but older tests at such scale showed this part takes 0.1 seconds for 1000 entries on a slow environment).

Host spec:

256GB RAM
64 logical CPUs

processor : 63
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 1331.933
cache size : 22528 KB
physical id : 1
siblings : 32
core id : 15
cpu cores : 16

Storage transfer rate:

# time dd if=leases  bs=1M count=1100 of=/dev/null iflag=direct status=progress
1045430272 bytes (1.0 GB, 997 MiB) copied, 5 s, 209 MB/s
1100+0 records in
1100+0 records out
1153433600 bytes (1.2 GB, 1.1 GiB) copied, 5.44578 s, 212 MB/s

real 0m5.447s
user 0m0.001s
sys 0m0.090s


A thin (SD + volumes) dump for the 1030-VM block SD took 1.2 seconds:

[root@f17-h30-000-1029p mnt]# time vdsm-client StorageDomain dump sd_id=38ca2952-9ee4-4443-8f59-3884e294deb9 > /tmp/log

real 0m1.293s
user 0m0.174s
sys 0m0.026s


A full dump for the 1030-VM block SD took 8.4 seconds:

[root@f17-h30-000-1029p mnt]# time vdsm-client StorageDomain dump sd_id=38ca2952-9ee4-4443-8f59-3884e294deb9 full=1  > /tmp/log_full

real 0m8.430s
user 0m0.196s
sys 0m0.020s

Profiling full dump calls:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.042    7.042 API.py:1063(StorageDomain.dump)
        1    0.000    0.000    7.042    7.042 <decorator-gen-173>:1(HSM.dumpStorageDomain)
        1    0.000    0.000    7.030    7.030 hsm.py:3321(HSM.dumpStorageDomain)
        1    0.000    0.000    7.029    7.029 blockSD.py:1730(BlockStorageDomain.dump)
        1    0.000    0.000    6.023    6.023 blockSD.py:1827(BlockStorageDomain._dump_leases)
     1043    0.001    0.000    5.929    0.006 sanlock_direct.py:123(_dump)
        2    0.000    0.000    5.924    2.962 sanlock_direct.py:86(run_dump)
        2    0.000    0.000    5.924    2.962 <string>:1(AutoProxy[instance].sanlock_direct_run_dump)
        1    0.003    0.003    0.688    0.688 blockSD.py:1751(BlockStorageDomain._dump_volumes)
        1    0.000    0.000    0.031    0.031 sd.py:1554(BlockStorageDomain.dump_lockspace)
        1    0.000    0.000    0.027    0.027 sd.py:1455(BlockStorageDomain.dump_external_leases)
        1    0.000    0.000    0.000    0.000 sanlock_direct.py:54(dump_leases)
        1    0.000    0.000    0.000    0.000 sanlock_direct.py:70(dump_lockspace)


Profiling the sanlock direct dump leases call used in the API run:

# time strace -w -c /usr/sbin/sanlock direct dump leases:0:1191182336 -Z 512 -A 1M > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.92    5.409791        4762      1136           io_getevents
  1.35    0.075329          66      1136           io_submit
  1.06    0.058976          51      1139           munmap
  0.44    0.024315       24314         1           io_destroy
  0.18    0.009850           8      1169           mmap
  0.01    0.000752          57        13           poll
  0.01    0.000544          28        19           openat
  0.01    0.000425          84         5           socket
  0.00    0.000250          50         5         4 connect
  0.00    0.000245         244         1           execve
  0.00    0.000242           7        32           read
  0.00    0.000239           7        31           write
  0.00    0.000201          10        20           mprotect
  0.00    0.000196           8        22           close
  0.00    0.000190           7        27         1 fstat
  0.00    0.000080           6        12           lseek
  0.00    0.000060          59         1           stat
  0.00    0.000053           8         6           sendto
  0.00    0.000036          36         1           io_setup
  0.00    0.000035           6         5           fcntl
  0.00    0.000029           7         4           brk
  0.00    0.000021           7         3         1 ioctl
  0.00    0.000021           6         3           getpid
  0.00    0.000014           7         2           rt_sigaction
  0.00    0.000014           6         2         1 arch_prctl
  0.00    0.000010          10         1         1 access
  0.00    0.000007           7         1           fadvise64
  0.00    0.000007           7         1           prlimit64
  0.00    0.000007           7         1           set_robust_list
  0.00    0.000007           6         1           set_tid_address
  0.00    0.000007           6         1           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00    5.581954                  4801         8 total

real	0m6.163s
user	0m0.288s
sys	0m0.502s
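
For completeness, the same thin/full dumps can be driven from a script around the vdsm-client calls above. A minimal sketch only: it assumes vdsm-client prints the JSON response on stdout as in the runs above, and the top-level section names are assumptions taken from the 'volumes'/'leases'/'xleases'/'lockspace' patches listed in the Links section.

import json
import subprocess

SD_ID = "38ca2952-9ee4-4443-8f59-3884e294deb9"

def dump_sd(sd_id, full=False):
    # Wrap the vdsm-client CLI used in the runs above and parse its output.
    cmd = ["vdsm-client", "StorageDomain", "dump", "sd_id=%s" % sd_id]
    if full:
        cmd.append("full=1")
    return json.loads(subprocess.check_output(cmd))

thin = dump_sd(SD_ID)
full = dump_sd(SD_ID, full=True)

print("thin sections:", sorted(thin))
# Section names assumed from the patches above.
for section in ("volumes", "leases", "xleases", "lockspace"):
    print(section, len(full.get(section, {})))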

Comment 22 Avihai 2020-07-01 06:36:29 UTC
This is clearly a scale/performance bug.
Moving to the scale/performance QE team lead, Mordechai.

Mordechai, this is targeted to GA 4.4.1; be aware, and ACK/NACK whether you can make the testing by then.

Comment 23 Tzahi Ashkenazi 2020-07-01 14:08:51 UTC
tested on:
      version:
              rhv-release-4.4.1-3-001.noarch
              vdsm-common-4.40.19-1.el8ev.noarch
      host:
              f17-h31-000-1029p.rdu2.scalelab.redhat.com

      hardware:
              Supermicro, CPU Model Name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
              Physical Memory: 257336 MB total
              CPU Threads per Core: 2
              CPU Cores per Socket: 16 (32)

On an SD with 747 LVs:

          [root@f17-h31-000-1029p ~]# vgs
          VG                                   #PV #LV #SN Attr   VSize    VFree
          38ca2952-9ee4-4443-8f59-3884e294deb9   1 747   0 wz--n-   <5.00t <3.99t


 time vdsm-tool dump-volume-chains 38ca2952-9ee4-4443-8f59-3884e294deb9
 real	3m51.936s
 user	0m1.778s
 sys	0m0.309s


Will test again on the same host with the correct vdsm version 4.40.21 for comparison.

Comment 24 Nir Soffer 2020-07-01 14:27:03 UTC
(In reply to Tzahi Ashkenazi from comment #23)
>  time vdsm-tool dump-volume-chains 38ca2952-9ee4-4443-8f59-3884e294deb9
>  real	3m51.936s
>  user	0m1.778s
>  sys	0m0.309s
> 
> 
> will tested again on the same host with the correct vdsm version 4.40.21 for
> Comparison

dump-volume-chains was not modified yet to use StorageDomain.dump(), so we
cannot test it yet. We have a pending patch here:
https://gerrit.ovirt.org/c/109325/

What we can test now is StorageDomain.dump API, using vdsm-client, as you can
see in comment 21.

For testing it you need to use a storage domain with a lot of disks. 747 disks is
half of the maximum, so I would test with the maximum instead.

Since we did not finish the work and dump-volume-chains is not using the new
API yet, I don't think this bug is ready for testing, and it is indeed not
in ON_QA.

Comment 25 Nir Soffer 2020-07-01 15:16:24 UTC
I learned from Mordechai that we have bug 1839444 for using the new
StorageDomain.dump API in dump-volume-chains, so in this bug we can
test the new API, which is included in 4.4 since vdsm v4.40.18.

Using current dump-volume-chains as a baseline for comparing the 
new API sounds like a good idea.

So this bug can move to ON_QA, and it can be tested now with current
release of vdsm (vdsm-common-4.40.19-1), no need to wait for new
release.

Comment 26 Tzahi Ashkenazi 2020-07-02 09:35:29 UTC
tested on:
          f01-h23-000-r620.rdu2.scalelab.redhat.com
          rhv-release-4.4.1-3-001.noarch
          vdsm-4.40.19-1.el8ev.x86_64
          with 1356 disks on a single SD

time vdsm-tool dump-volume-chains f32458fb-7cb5-431f-8767-5b3cf63fb85b

real    18m44.384s
user    0m4.729s
sys     0m1.094s

time vdsm-client StorageDomain dump sd_id=f32458fb-7cb5-431f-8767-5b3cf63fb85b

real    0m3.818s
user    0m0.475s
sys     0m0.093s

Comment 29 Nir Soffer 2020-07-02 11:16:28 UTC
(In reply to Tzahi Ashkenazi from comment #26) 
> time vdsm-tool dump-volume-chains f32458fb-7cb5-431f-8767-5b3cf63fb85b 
> 
> real    18m44.384s 
... 
> time vdsm-client StorageDomain dump
> sd_id=f32458fb-7cb5-431f-8767-5b3cf63fb85b 
> 
> real    0m3.818s 

295 times faster, looks verified.

Comment 34 errata-xmlrpc 2020-08-04 13:26:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

