Bug 1557147
| Field | Value |
|---|---|
| Summary | No efficient API method to collect metadata of all volumes from SD |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.1.9 |
| Target Milestone | ovirt-4.4.1 |
| Target Release | 4.4.1 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | vdsm-4.40.18 |
| Doc Type | No Doc Update |
| Keywords | Performance |
| Type | Bug |
| oVirt Team | Storage |
| Last Closed | 2020-08-04 13:26:06 UTC |
| Reporter | Germano Veit Michel <gveitmic> |
| Assignee | Amit Bawer <abawer> |
| QA Contact | mlehrer |
| CC | abawer, aefrat, jortialc, lsurette, michal.skrivanek, mlehrer, nsoffer, pelauter, rdlugyhe, sfishbai, srevivo, tashkena, tnisan, ycui |
| Flags | abawer: needinfo+ |
| Bug Blocks | 1489145, 1839444 |
Description (Germano Veit Michel, 2018-03-16 05:46:07 UTC)
We have the same issue when importing a storage domain during DR failover. The main problem is that we store metadata per volume, so we need to read every volume's metadata separately, and we don't have an API for getting image metadata, or all volumes' metadata, in one call.

We can add an API that collects metadata from all volumes on a storage domain. On file storage, this API needs to read the volume.meta file for every volume, which may be pretty slow. On block storage all volume metadata is stored in the metadata LV, so we can read the data from storage much faster.

This issue also affects changes to metadata. Because we store every volume's metadata in a separate file or block on storage, there is no way to perform changes atomically. This leads to failures during snapshots and live merge when the process fails after we have changed some volume metadata but not all.

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Michal, can the field engineering team look into changing the SOS report tool to avoid that timeout? We cannot supply a fix in VDSM for that issue.

(In reply to Tal Nisan from comment #7)
> Michal, can the field engineering team look into changing the SOS report
> tool to avoid that timeout? We cannot supply a fix in VDSM for that issue

The vdsm sos plugin is part of vdsm, so maybe we can change the default timeout for running this command.

That looks quite simple. What would be the desired timeout then?

(In reply to Nir Soffer from comment #8)
> vdsm sos plugin is part of vdsm, so maybe we can change the default timeout
> for running this command.

Yes, it seems we just need to pass the timeout parameter when calling dump chains.

(In reply to Michal Skrivanek from comment #9)
> that looks quite simple. What would be the desired timeout then?

Taking into account one local RHV lab we have with slow block storage, and that we have AlertOnNumberOfLVs=1300, a timeout of ~450-500s should do it for most cases. If it goes over AlertOnNumberOfLVs, then I assume it is already in unsupported/untested territory.

Back to NEW state; that change is just a workaround to make it stop timing out on larger/busier environments. I would still like to see a more efficient mechanism implemented to get the metadata.

I think the main issue here is using an API that gets single-volume info to collect info for all volumes in a storage domain:

```
180     images_uuids = cli.StorageDomain.getImages(storagedomainID=sd_uuid)
181
182     volumes_info = {}
183
184     for img_uuid in images_uuids:
185         img_volumes_info = {}
186
187         volumes_uuids = cli.StorageDomain.getVolumes(
188             storagedomainID=sd_uuid, storagepoolID=sp_uuid,
189             imageID=img_uuid)
190
191         for vol_uuid in volumes_uuids:
192             vol_info = cli.Volume.getInfo(
193                 volumeID=vol_uuid, storagepoolID=sp_uuid,
194                 storagedomainID=sd_uuid, imageID=img_uuid)
195
196             img_volumes_info[vol_uuid] = vol_info
197
198         volumes_info[img_uuid] = img_volumes_info
```

These APIs are good for one call, but they do not scale for getting information about an entire domain. Some APIs also return runtime status, such as sanlock lease status, which is not required when dumping entire-domain metadata.

We need a new API for the DR/support use case, dumping the entire storage domain metadata in one call.
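For contrast, here is a minimal sketch of what the client side looks like once a bulk dump verb exists, using the StorageDomain.dump call that this bug eventually added (invoked later in this bug via vdsm-client). The connection details and the layout of the returned dictionary are assumptions for illustration, not confirmed API details:

```python
# Sketch only: one bulk call instead of one getVolumes/getInfo call per
# image/volume. Assumes the vdsm jsonrpc client and the StorageDomain.dump
# verb added for this bug; the host/port and the "volumes" key in the
# returned dict are assumptions, while the sd_id parameter name comes from
# the vdsm-client commands shown later in this bug.
from vdsm import client

cli = client.connect("localhost", 54321)

dump = cli.StorageDomain.dump(sd_id="d1c7782c-53a2-4a11-8b19-a83c39850fd0")

# Iterating an in-memory dict is cheap compared with issuing a separate
# Volume.getInfo call (and an lvs reload) per volume.
for vol_id, vol_md in dump.get("volumes", {}).items():
    print(vol_id, vol_md.get("image"), vol_md.get("status"))
```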
What the new dump API needs to do, on file-based domains:

- reading sd-id/dom_md/metadata
- reading sd-id/images/*.meta
- some processing to determine which volumes are templates
- getting the backing chain for every image using qemu-img info --backing-chain (requires activating all volumes in an image)

On block domains:

- dropping the VG's lv/vg/pv cache
- reading the VG's lv/vg/pv info
- reading and parsing the entire metadata volume
- getting the backing chain for every image using qemu-img info --backing-chain (requires activating all volumes in an image)

These APIs should be available both from the vdsm API and from vdsm-tool, so we can collect the info even if vdsm cannot run.

One issue with a bulk API is handling of corrupted metadata. Typically vdsm will fail to handle a volume with corrupted metadata, but in a dump context it should try to get what it can from the metadata and mark the volume as corrupted, as in the sketch below.
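A minimal sketch of that dump-tolerant policy, assuming a hypothetical read_metadata() helper that raises when a volume's metadata cannot be parsed; this illustrates the behavior described above and is not vdsm's actual implementation:

```python
# Illustration only: a dump collects what it can and flags bad volumes
# instead of failing the whole operation. read_metadata() is a hypothetical
# parser that raises ValueError on corrupted metadata; the "INVALID" status
# marker is likewise an assumption made for this sketch.
def dump_volumes(read_metadata, vol_ids):
    volumes = {}
    for vol_id in vol_ids:
        try:
            volumes[vol_id] = read_metadata(vol_id)
        except ValueError:
            # Do not abort the dump; record the volume as corrupted.
            volumes[vol_id] = {"status": "INVALID"}
    return volumes
```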
Created attachment 1652689 [details]
Wall-clock profile dump per comment #15; can be viewed using kcachegrind (to install: dnf install kcachegrind graphviz).

(In reply to Nir Soffer from comment #13)
> I think the main issue here is using an API that gets single-volume info
> to collect info for all volumes in a storage domain.
>
> These APIs are good for one call, but they do not scale for getting
> information about an entire domain.

Profiling on an SD with 460 disks (LVs), we wait 22 wall-clock minutes for "time vdsm-tool dump-volume-chains d1c7782c-53a2-4a11-8b19-a83c39850fd0" to finish.

Since every disk in the SD has a base image, the outer loop at line 184 is carried out 460 times: the cli.StorageDomain.getVolumes call at line 187 causes vdsm to run the LVMCache._reloadlvs() operation (called 472 times during profiling); strangely, this happens on every execution of vdsm's getVolumesList API, invoking the "lvs" command for the VG instead of using the existing lvm cache.

Running a single "lvs d1c7782c-53a2-4a11-8b19-a83c39850fd0" command on the tested host took ~2.8 seconds to finish, so we already have 460 * 2.8s = 21.5 minutes out of the total 22 minutes spent just on getting the volume UUIDs for each image UUID.

Repeating the test with the following patch for LVMCache._reloadlvs():

```
diff --git a/lib/vdsm/storage/lvm.py b/lib/vdsm/storage/lvm.py
index 2a6c439a7..f973955a3 100644
--- a/lib/vdsm/storage/lvm.py
+++ b/lib/vdsm/storage/lvm.py
@@ -612,6 +612,9 @@ class LVMCache(object):
 
         log.debug("lvs reloaded")
 
+        if not lvNames:
+            self._stalelv = False
+
         return updatedLVs
 
     def _loadAllLvs(self):
```

we drop the total execution time to only 47 seconds for an SD with 460 disks:

```
time vdsm-tool dump-volume-chains d1c7782c-53a2-4a11-8b19-a83c39850fd0
...
image: f3398374-53af-4cca-8b4b-11ad48f6a8d4
  - fa572b7f-cf47-457b-ae61-12a29d65fc4a
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: eaf1544d-0bc3-4b16-8ca9-8e5dfbc3d5b1
  - 88d0318b-b3a2-4754-ab8b-569dc8d511c3
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: 3f51ba58-5fc9-4ae6-a09f-d5125569a7a4
  - aa724c4b-2d62-4458-b642-6094d8b06d17
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: bbe4e48d-e666-4cf6-a2ae-3bcd122ff416
  - f9a5112a-83a8-4a23-adf4-ee257245ee6e
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: 35ebf208-3694-4bc1-b368-5cb438947e5d
  - 89096229-988f-469e-a6b1-dcbfa927b458
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824

real    0m47.305s
user    0m1.472s
sys     0m0.371s
```

(In reply to Amit Bawer from comment #15)
> Since every disk in the SD has a base image, the outer loop at line 184 is
> carried out 460 times:

This is bad.

> we drop the total execution time to only 47 seconds for an SD with 460
> disks:

Nice, but we support up to 1500 LVs per domain. Can you test with at least 1000 LVs?

Also, 2.8 seconds for one lvm command sounds slow. Maybe you are using very old hardware?

(In reply to Nir Soffer from comment #16)
> Can you test with at least 1000 LVs?

Will try.

> Also, 2.8 seconds for one lvm command sounds slow. Maybe you are using
> very old hardware?

Could be, we have a prime stock of old wines here, but accessing a RAM dict (LVMCache) should be 100 times faster than using lvm commands, which access files as well.

Tested an SD with 1000 disks containing images (LVs); this time a host with 8 x 2GHz CPUs and 8GB RAM was used to gather comparative results faster.

1. Initial baseline run with master 4.40 vdsm, without patching LVMCache:

```
time vdsm-tool dump-volume-chains 367a3231-8bb7-480a-9c0d-e69b0c6a9922
...
image: f8c0de22-ae8c-40d7-b850-ea439c957f0c
  - 6e782a5c-92ca-495a-a2a8-9dc9d4c2c51b
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: c8d9e800-3640-4e7e-9fa5-699d39b41c66
  - 87f3d1f3-868d-45c2-8cea-6227181a5873
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: fa727b76-91eb-487c-90a5-87e3518410c5
  - cb1a729e-feca-4700-92de-8b1c5283af6a
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: e43c09b8-2f59-4f76-97f4-5a38a731f0e8
  - 885f44d5-da74-4ed4-9439-664443a9aa15
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824

real    18m10.264s
user    0m3.397s
sys     0m0.798s
```

2. For the patched vdsm [1], execution time is 14 times faster:
```
time vdsm-tool dump-volume-chains 367a3231-8bb7-480a-9c0d-e69b0c6a9922
...
image: 61ecbdf5-737b-4078-8113-5026b98a5a22
  - f894163b-6ab1-4ef1-abe4-55677c150a3b
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: ebf09f5c-5ef0-4fbb-a67f-78d9d6737d3f
  - ffca5f4b-0152-4034-9694-620c39bbf123
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: 577f3055-b8b9-46f4-9cc7-fed4b728d620
  - 56e36cdb-184f-45dd-b348-6f6a0252ddef
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824
image: 18e8c4ec-56d6-4d22-a760-8675ace8e7e1
  - 4dab0000-5d55-406e-b70d-318ba28d6410
    status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 1073741824

real    1m3.531s
user    0m2.921s
sys     0m0.635s
```

On that host, running the "lvs VG" command took 0.9 seconds:

```
# time lvs 367a3231-8bb7-480a-9c0d-e69b0c6a9922 | wc -l
1011

real    0m0.929s
user    0m0.527s
sys     0m0.330s
```

[1] https://gerrit.ovirt.org/#/c/106393/15/

Created attachment 1690768 [details]
bulk API runs and profile on scale host 1030 vms
Attachment 1690768 [details] contains the results from testing the bulk API over a 1030-VM block SD. There were no HA VMs there (no xleases to dump, but older tests at this scale showed that this part takes ~0.1 seconds for 1000 entries on a slow environment).
Host spec:

- 256 GB RAM
- 64 logical CPUs (Intel Xeon Gold 6130)

/proc/cpuinfo excerpt:

```
processor   : 63
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping    : 4
microcode   : 0x2000064
cpu MHz     : 1331.933
cache size  : 22528 KB
physical id : 1
siblings    : 32
core id     : 15
cpu cores   : 16
```
Storage transfer rate:

```
# time dd if=leases bs=1M count=1100 of=/dev/null iflag=direct status=progress
1045430272 bytes (1.0 GB, 997 MiB) copied, 5 s, 209 MB/s
1100+0 records in
1100+0 records out
1153433600 bytes (1.2 GB, 1.1 GiB) copied, 5.44578 s, 212 MB/s

real    0m5.447s
user    0m0.001s
sys     0m0.090s
```
Thin (SD + volumes) dump for the 1030-VM block SD took ~1.3 seconds:

```
[root@f17-h30-000-1029p mnt]# time vdsm-client StorageDomain dump sd_id=38ca2952-9ee4-4443-8f59-3884e294deb9 > /tmp/log

real    0m1.293s
user    0m0.174s
sys     0m0.026s
```

Full dump for the 1030-VM block SD took 8.4 seconds:

```
[root@f17-h30-000-1029p mnt]# time vdsm-client StorageDomain dump sd_id=38ca2952-9ee4-4443-8f59-3884e294deb9 full=1 > /tmp/log_full

real    0m8.430s
user    0m0.196s
sys     0m0.020s
```
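For reference, the same two calls through the Python client, as a sketch under the same assumptions as the earlier snippet: connection details are assumed, the flag name follows the `full=1` CLI parameter above, and the boolean keyword form is an assumption.

```python
# Sketch only: thin vs. full dump through the vdsm jsonrpc client.
# Host/port are assumptions; sd_id and the full flag come from the
# vdsm-client commands shown above. The full dump also walks the sanlock
# leases/lockspace, which is where most of its extra time goes (see the
# profile below).
import time
from vdsm import client

cli = client.connect("localhost", 54321)
sd_id = "38ca2952-9ee4-4443-8f59-3884e294deb9"

start = time.monotonic()
thin = cli.StorageDomain.dump(sd_id=sd_id)
print("thin dump: %.2fs" % (time.monotonic() - start))

start = time.monotonic()
full = cli.StorageDomain.dump(sd_id=sd_id, full=True)
print("full dump: %.2fs" % (time.monotonic() - start))
```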
Profiling the full dump calls:

```
Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000    7.042    7.042  API.py:1063(StorageDomain.dump)
     1    0.000    0.000    7.042    7.042  <decorator-gen-173>:1(HSM.dumpStorageDomain)
     1    0.000    0.000    7.030    7.030  hsm.py:3321(HSM.dumpStorageDomain)
     1    0.000    0.000    7.029    7.029  blockSD.py:1730(BlockStorageDomain.dump)
     1    0.000    0.000    6.023    6.023  blockSD.py:1827(BlockStorageDomain._dump_leases)
  1043    0.001    0.000    5.929    0.006  sanlock_direct.py:123(_dump)
     2    0.000    0.000    5.924    2.962  sanlock_direct.py:86(run_dump)
     2    0.000    0.000    5.924    2.962  <string>:1(AutoProxy[instance].sanlock_direct_run_dump)
     1    0.003    0.003    0.688    0.688  blockSD.py:1751(BlockStorageDomain._dump_volumes)
     1    0.000    0.000    0.031    0.031  sd.py:1554(BlockStorageDomain.dump_lockspace)
     1    0.000    0.000    0.027    0.027  sd.py:1455(BlockStorageDomain.dump_external_leases)
     1    0.000    0.000    0.000    0.000  sanlock_direct.py:54(dump_leases)
     1    0.000    0.000    0.000    0.000  sanlock_direct.py:70(dump_lockspace)
```
Profiling the sanlock direct dump leases call used in the API run:

```
# time strace -w -c /usr/sbin/sanlock direct dump leases:0:1191182336 -Z 512 -A 1M > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.92    5.409791        4762      1136           io_getevents
  1.35    0.075329          66      1136           io_submit
  1.06    0.058976          51      1139           munmap
  0.44    0.024315       24314         1           io_destroy
  0.18    0.009850           8      1169           mmap
  0.01    0.000752          57        13           poll
  0.01    0.000544          28        19           openat
  0.01    0.000425          84         5           socket
  0.00    0.000250          50         5         4 connect
  0.00    0.000245         244         1           execve
  0.00    0.000242           7        32           read
  0.00    0.000239           7        31           write
  0.00    0.000201          10        20           mprotect
  0.00    0.000196           8        22           close
  0.00    0.000190           7        27         1 fstat
  0.00    0.000080           6        12           lseek
  0.00    0.000060          59         1           stat
  0.00    0.000053           8         6           sendto
  0.00    0.000036          36         1           io_setup
  0.00    0.000035           6         5           fcntl
  0.00    0.000029           7         4           brk
  0.00    0.000021           7         3         1 ioctl
  0.00    0.000021           6         3           getpid
  0.00    0.000014           7         2           rt_sigaction
  0.00    0.000014           6         2         1 arch_prctl
  0.00    0.000010          10         1         1 access
  0.00    0.000007           7         1           fadvise64
  0.00    0.000007           7         1           prlimit64
  0.00    0.000007           7         1           set_robust_list
  0.00    0.000007           6         1           set_tid_address
  0.00    0.000007           6         1           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00    5.581954                  4801         8 total

real    0m6.163s
user    0m0.288s
sys     0m0.502s
```
This is clearly a scale performance bug. Moving to the scale/performance QE team lead, Mordechai.

Mordechai, this is targeted to GA 4.4.1; be aware, and ACK/NACK whether you can make the testing by then.

Tested on:

```
version : rhv-release-4.4.1-3-001.noarch
          vdsm-common-4.40.19-1.el8ev.noarch
host    : f17-h31-000-1029p.rdu2.scalelab.redhat.com
hardware: supermicro
          CPU Model Name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
          Physical Memory: 257336 MB total
          CPU Threads per Core: 2
          CPU Cores per Socket: 16 (32)
```

On an SD with 747 LVs:

```
[root@f17-h31-000-1029p ~]# vgs
  VG                                   #PV #LV #SN Attr   VSize  VFree
  38ca2952-9ee4-4443-8f59-3884e294deb9   1 747   0 wz--n- <5.00t <3.99t

time vdsm-tool dump-volume-chains 38ca2952-9ee4-4443-8f59-3884e294deb9

real    3m51.936s
user    0m1.778s
sys     0m0.309s
```

Will test again on the same host with the correct vdsm version, 4.40.21, for comparison.

(In reply to Tzahi Ashkenazi from comment #23)
> time vdsm-tool dump-volume-chains 38ca2952-9ee4-4443-8f59-3884e294deb9
> real 3m51.936s
> user 0m1.778s
> sys 0m0.309s
>
> Will test again on the same host with the correct vdsm version 4.40.21 for
> comparison

dump-volume-chains was not modified yet to use StorageDomain.dump(), so we cannot test it yet. We have a pending patch here:
https://gerrit.ovirt.org/c/109325/

What we can test now is the StorageDomain.dump API, using vdsm-client, as you can see in comment 21. For testing it you need to use a storage domain with a lot of disks. 747 disks is half of the maximum, so I would test with the maximum instead.

Since we did not finish the work and dump-volume-chains is not using the new API yet, I don't think this bug is ready for testing, and it is indeed not in ON_QA.

I learned from Mordechai that we have bug 1839444 for using the new StorageDomain.dump API in dump-volume-chains, so in this bug we can test the new API, which is included in 4.4 since vdsm v4.40.18. Using the current dump-volume-chains as a baseline for comparing the new API sounds like a good idea. So this bug can move to ON_QA, and it can be tested now with the current release of vdsm (vdsm-common-4.40.19-1); no need to wait for a new release.

Tested on f01-h23-000-r620.rdu2.scalelab.redhat.com with rhv-release-4.4.1-3-001.noarch and vdsm-4.40.19-1.el8ev.x86_64, with 1356 disks on a single SD:

```
time vdsm-tool dump-volume-chains f32458fb-7cb5-431f-8767-5b3cf63fb85b

real    18m44.384s
user    0m4.729s
sys     0m1.094s

time vdsm-client StorageDomain dump sd_id=f32458fb-7cb5-431f-8767-5b3cf63fb85b

real    0m3.818s
user    0m0.475s
sys     0m0.093s
```

(In reply to Tzahi Ashkenazi from comment #26)
> time vdsm-tool dump-volume-chains f32458fb-7cb5-431f-8767-5b3cf63fb85b
>
> real 18m44.384s
...
> time vdsm-client StorageDomain dump
> sd_id=f32458fb-7cb5-431f-8767-5b3cf63fb85b
>
> real 0m3.818s

295 times faster, looks verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246