Description of problem: The current version of dump-volume-chains uses inefficient loops, issuing several calls per volume to gather the data. With BZ 1557147 fixed, a new API is available that gathers the same data much faster. So improve dump-volume-chains to use the new API.
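As a rough illustration of the change, a sketch using the vdsm-client CLI (the verb and argument names here are assumptions based on the vdsm API schema, not taken from this bug, and the placeholder UUIDs must be filled in):

# Before: the tool effectively issued a separate call like this for every
# volume, so a domain with hundreds of volumes meant hundreds of round trips:
vdsm-client Volume getInfo storagepoolID=<sp_uuid> storagedomainID=<sd_uuid> \
    imageID=<img_uuid> volumeID=<vol_uuid>

# After: a single bulk call returns the metadata of every volume on the
# domain at once:
vdsm-client StorageDomain dump sd_id=<sd_uuid>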
Tal, once you set the flags feel free to assign this to me. Thanks.
Hi Germano,

Please provide a clear and short verification scenario. I see this bug is related to bug 1557147, which is a scale performance bug, so I'm guessing the scenario is to have a scale environment to verify this bug as well. Moving this bug to the scale QE team too.

Mordechai, please check this bug as well and see if you can test it by GA.
Tal, please update the target milestone, as the fix is not ready for 4.4.1.
(In reply to Avihai from comment #2)
> Please provide a clear and short verification scenario. I see this bug is
> related to bug 1557147, which is a scale performance bug, so I'm guessing
> the scenario is to have a scale environment to verify this bug as well.
> Moving this bug to the scale QE team too.

No real need for scale here. Using a storage domain with a high number of volumes (e.g. 300+ volumes; they can all be 1G in size), it can be tested the following way:

$ time vdsm-tool dump-volume-chains SD_UUID

The newer version should complete the command much faster.

(In reply to mlehrer from comment #4)
> Tal, please update the target milestone, as the fix is not ready for 4.4.1.

Yes, let's try 4.4.3, as I also need to check whether any changes will be needed in the discrepancy tool.
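To confirm the domain really has 300+ volumes before timing the run, a hedged sketch (the first command assumes a block storage domain, where vdsm names the VG after the SD UUID and the count includes a few internal LVs such as metadata and leases; the second assumes the usual file-domain mount layout):

# Block domain: each vdsm volume is an LV in a VG named after the SD UUID.
lvs --noheadings -o lv_name SD_UUID | wc -l

# File domain: each volume has a companion .meta file under images/.
find /rhev/data-center/mnt/*/SD_UUID/images -name '*.meta' | wc -l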
Is this still planned to be delivered to QE in 4.4.3?
(In reply to mlehrer from comment #6)
> Is this still planned to be delivered to QE in 4.4.3?

I think so. The patch was merged 3 hours ago, so it should be in the next build:
https://gerrit.ovirt.org/#/c/109325/

Verification scenario:

1. Run the following on file and block storage domains containing VMs with snapshots:

$ vdsm-tool dump-volume-chains <sd_uuid>

2. Optionally, make some random modifications to the storage volume metadata and confirm the tool still runs cleanly and its output is sane (see the sketch after this comment). Some suggestions:
- Change the PUUID of a volume to a wrong UUID
- Remove the PUUID tag
- Remove Image (disk) tags
- Remove some random metadata key (e.g. legality, capacity)
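A minimal sketch of the metadata tampering in step 2 on a file storage domain; the /rhev/data-center path layout and the KEY=VALUE format of the .meta file are assumptions based on the usual vdsm file-domain layout, so verify them on the actual setup, and only do this on a disposable test domain:

# Pick one volume's metadata file (path layout assumed):
META=/rhev/data-center/mnt/<server:_export>/<sd_uuid>/images/<img_uuid>/<vol_uuid>.meta

# Back it up so the change can be reverted after the test:
cp "$META" "$META.bak"

# Point PUUID at a UUID that does not exist on the domain:
sed -i 's/^PUUID=.*/PUUID=00000000-dead-beef-0000-000000000000/' "$META"

# Or drop a key entirely, e.g. LEGALITY:
sed -i '/^LEGALITY=/d' "$META"

# The tool should still run cleanly and flag the broken chain:
vdsm-tool dump-volume-chains <sd_uuid>

# Restore the original metadata:
cp "$META.bak" "$META"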
Hey Germano, some open questions regarding the verification scenario.

We are going to test this with a storage domain that contains 750 VMs:
1. How many snapshots per single VM?
2. How many VMs should have snapshots?
3. Should the snapshots include VM memory?

Thanks
(In reply to Tzahi Ashkenazi from comment #8)
> We are going to test this with a storage domain that contains 750 VMs:
> 1. How many snapshots per single VM?
> 2. How many VMs should have snapshots?

Hi Tzahi,

Since you already have 750 VMs, you already have plenty of volumes. You can create a few snapshots on a few VMs just to cover the case of having snapshots, but there is no need for hundreds of snapshots in total. I'd say 1 to 3 snapshots on 1 to 10 VMs is enough.

> 3. Should the snapshots include VM memory?

No, but you can have 1-2 snapshots with memory in total, just to cover this scenario too.

So:
- 750 VMs in total, out of which 10 have snapshots
- The 10 have between 1 and 3 snapshots each
- 1 or 2 of the 10 have 1 or 2 snapshots with memory

This should be more than enough. Thanks!
Baseline results before the new API:

> vdsm-tool dump-volume-chains

Environment Red-01:
* 760 VMs
* 10 VMs with 10 snapshots each, including memory = 100 snapshots in total

Version 1: vdsm-4.40.32-1.el8ev.x86_64

rhv-release-4.4.3-7-001.noarch
L0_group_1_vdsm_4.40.32-1

$ time vdsm-tool dump-volume-chains bf38d2c2-99a4-421c-b847-58eabd782f52

real 11m13.253s
user 0m3.363s
sys 0m0.973s

Version 2: vdsm-4.40.25-1.el8ev.x86_64

rhv-release-4.4.2-3-001.noarch
L0_group_2_vdsm_4.40.25-1

$ time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442

real 10m34.614s
user 0m3.453s
sys 0m0.821s

No high CPU/memory consumption during either test on the engine or on the host running the API call.
Tzahi, can you please move to VERIFIED then?
(In reply to Tal Nisan from comment #11)
> Tzahi, can you please move to VERIFIED then?

Hi Tal,

The current results that I published are a baseline, made on vdsm version vdsm-4.40.32. Once we upgrade red01 to vdsm-4.40.33 (which contains the new API), I will run the test again and publish the results.

Thanks
Germano,

The performance improvement seems to be about 6%; is this in line with what you expected? Earlier comments made it seem it would be better than that.
(In reply to Peter Lauterbach from comment #13)
> The performance improvement seems to be about 6%; is this in line with what
> you expected? Earlier comments made it seem it would be better than that.

Hi Peter,

Yes, we are expecting a good improvement, but where are you seeing the new results? Both measurements from comment #10 are from before the improvement, and they are even from different storage domain UUIDs, so they cannot be compared. Did you see comment #12? Or do you have the results somewhere else?
> The current results that I published are a baseline, made on vdsm version
> vdsm-4.40.32. Once we upgrade red01 to vdsm-4.40.33 (which contains the
> new API), I will run the test again and publish the results.
Environment Red-01:
* 760 VMs
* 10 VMs with 10 snapshots each, including memory = 100 snapshots in total

1. Baseline: vdsm-4.40.25

rhv-release-4.4.2-3-001.noarch
vdsm-4.40.25-1.el8ev.x86_64
L0_group_2_vdsm_4.40.25-1

[root@f02-h31-000-r620 ~]# time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442 > vdsm-4.40.25_b0a48885-b1f9-4a03-b46f-76f617aeb442.txt

real 10m45.623s
user 0m3.482s
sys 0m0.816s

The command completed successfully on vdsm 4.40.25!

2. VDSM with the new API: vdsm-4.40.34-1.el8ev.x86_64

The command FAILED on the same engine and on the same SD with the same amount of VMs and snapshots, L0_Group_2!

Example of the error from the output file:

image: c3889918-7c2e-4aed-8c86-7fd5c6dec436

    Error: no volume with a parent volume Id _BLANK_UUID found e.g: (a<-b), (b<-c)

    Unordered volumes and children:

    - 7c7696d4-e94c-410c-9e8c-a9f3030264d1 <- ff02ce5f-7dd1-48ef-b252-39253579bf13
      status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 107374182400, truesize: 1073741824

image: adc7ea8c-1a62-4e1b-a73f-45739e8e228f

    Error: no volume with a parent volume Id _BLANK_UUID found e.g: (a<-b), (b<-c)

    Unordered volumes and children:

    - 7c7696d4-e94c-410c-9e8c-a9f3030264d1 <- ffc89e28-fbfd-41b5-9bac-7d5a2faa03dc
      status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 107374182400, truesize: 1073741824

The full output logs of each command can be found here:
https://drive.google.com/drive/folders/1cFgTZSkceYrzzp8xueikYiPWE_SVyW8Q?usp=sharing
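Side note for anyone triaging this kind of error on a block storage domain: the parent pointer can also be inspected directly on the LVs. This assumes the standard vdsm convention that the VG is named after the SD UUID and that each volume LV carries IU_<image_uuid> and PU_<parent_uuid> tags:

# List every volume LV with its tags; a volume cloned from a template shows
# a PU_ tag pointing at the template's base volume, which lives under a
# different image -- the case the new dump code tripped on here:
lvs --noheadings -o lv_name,lv_tags b0a48885-b1f9-4a03-b46f-76f617aeb442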
(In reply to Tzahi Ashkenazi from comment #17)
> 2. VDSM with the new API: vdsm-4.40.34-1.el8ev.x86_64
>
> The command FAILED on the same engine and on the same SD with the same
> amount of VMs and snapshots, L0_Group_2!

Thanks Tzahi, uploaded a fix patch for the missing base volume error in the chain dumps; currently on POST.
(In reply to Amit Bawer from comment #18)
> Thanks Tzahi, uploaded a fix patch for the missing base volume error in the
> chain dumps; currently on POST.

Clearly I didn't have VMs based on thin templates when testing. Thank you :)
Target release is planned for 4.4.4; is this still accurate?
(In reply to mlehrer from comment #24)
> Target release is planned for 4.4.4; is this still accurate?

Yes, all patches are in vdsm-4.40.36:

$ git log --tags --oneline | egrep 'New release|dump-volume-chains'
41dc49a1a New release: 4.40.36
67f1bda25 tool: Handle template parent volumes in dump-volume-chains
20cc9164e tool: Normalize parent volume info in dump-volume-chains
5e4c5102e New release: 4.40.35.1
ab74826c9 New release: 4.40.35
578417cbb New release: 4.40.34
99468f66d New release: 4.40.33
1354aa468 dump-volume-chains: use storage.dump() api

I'm not sure what happened in comments 21-23, but this should be ON_QA.
Tested and verified with the new vdsm version 4.40.37-1 on environment Red-01:

* 760 VMs
* 10 VMs with 10 snapshots each, including memory = 100 snapshots in total

Version:
rhv-release-4.4.4-2-001.noarch
vdsm-4.40.37-1.el8ev.x86_64
L0_group_2

$ time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442

real 0m3.262s
user 0m0.653s
sys 0m0.084s

An amazing and huge improvement, from 11m down to 3s!

No high CPU/memory consumption during the test on the engine or on the host running the API call.

The full test results for all vdsm versions can be found here:
https://drive.google.com/drive/folders/1cFgTZSkceYrzzp8xueikYiPWE_SVyW8Q?usp=sharing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0382