Bug 1839444 - [RFE] Use more efficient dumpStorageDomain() in dump-volume-chains
Summary: [RFE] Use more efficient dumpStorageDomain() in dump-volume-chains
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4
Assignee: Germano Veit Michel
QA Contact: Tzahi Ashkenazi
URL:
Whiteboard:
Depends On: 1557147 1870435 1870887
Blocks:
 
Reported: 2020-05-24 04:47 UTC by Germano Veit Michel
Modified: 2021-02-02 13:59 UTC
CC List: 13 users

Fixed In Version: vdsm-4.40.36
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-02 13:59:36 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:0382 0 None None None 2021-02-02 13:59:55 UTC
oVirt gerrit 109325 0 master MERGED dump-volume-chains: use storage.dump() api 2021-02-12 15:33:32 UTC
oVirt gerrit 111866 0 master MERGED tool: Handle template parent volumes in dump-volume-chains 2021-02-12 15:33:32 UTC
oVirt gerrit 111867 0 ovirt-4.4.3 ABANDONED tool: Handle template parent volumes in dump-volume-chains 2021-02-12 15:33:32 UTC
oVirt gerrit 112013 0 master MERGED tool: Normalize parent volume info in dump-volume-chains 2021-02-12 15:33:32 UTC
oVirt gerrit 112014 0 master ABANDONED tool: Handle template parent volumes in dump-volume-chains 2021-02-12 15:33:32 UTC

Description Germano Veit Michel 2020-05-24 04:47:17 UTC
Description of problem:

The current version of dump-volume-chains uses inefficient loops, issuing several API calls to gather its data.

With BZ1557147 fixed, a new API was introduced that gathers the same data much faster. So, improve dump-volume-chains to use the new API.
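
For illustration, a minimal sketch of the difference in Python. The verb names in the old-style loop are hypothetical placeholders; only the dump() API itself comes from the patch linked above (gerrit 109325).

    # Sketch only: contrasts the per-volume call pattern with a single dump.
    # The old-style verb names below are hypothetical placeholders.

    def dump_chains_old(client, sp_uuid, sd_uuid, image_uuids):
        """Old approach: one call per image, plus one call per volume."""
        info = {}
        for img in image_uuids:
            for vol in client.getVolumesList(sp_uuid, sd_uuid, img):
                info[vol] = client.getVolumeInfo(sp_uuid, sd_uuid, img, vol)
        return info

    def dump_chains_new(client, sd_uuid):
        """New approach: one dump() call returns all volume metadata at once."""
        return client.dump(sd_uuid)["volumes"]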

Comment 1 Germano Veit Michel 2020-05-24 04:48:20 UTC
Tal, once you set the flags feel free to assign this to me. Thanks.

Comment 2 Avihai 2020-07-01 06:39:41 UTC
Hi Germano, 

Please provide a clear and short verification scenario.
I see this bug is related to bug 1557147, which is a scale performance bug.

So I'm guessing the scenario is to use a scale environment to verify this bug as well.
Moving this bug to the scale QE team too.

Mordechai, please also check this bug and see if you can test it by GA.

Comment 4 mlehrer 2020-07-01 15:12:23 UTC
Tal, please update the target milestone, as the fix is not ready for 4.4.1.

Comment 5 Germano Veit Michel 2020-07-29 23:01:15 UTC
(In reply to Avihai from comment #2)
> Hi Germano, 
> 
> Please provide a clear and short verification scenario.
> I see this bug is related to bug 1557147, which is a scale performance bug.
> 
> So I'm guessing the scenario is to use a scale environment to verify this bug as well.
> Moving this bug to the scale QE team too.

No real need for scale here.

Using a storage domain with a high number of volumes (e.g. 300+ volumes; they can all be 1G in size), it can be tested as follows:
$ time vdsm-tool dump-volume-chains SD_UUID

The newer version should complete the command much faster.
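
If you need to populate a test storage domain, here is a rough sketch using the oVirt Python SDK; the engine URL, credentials, and the storage domain name 'mysd' are placeholders for your environment.

    # Rough sketch: create 300 small thin (COW) disks on one storage domain.
    # Note: disk creation is asynchronous; in a real test, wait for each
    # disk to reach OK status before timing the tool.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,
    )
    disks_service = connection.system_service().disks_service()
    for i in range(300):
        disks_service.add(
            types.Disk(
                name='perf-test-%03d' % i,
                format=types.DiskFormat.COW,
                sparse=True,
                provisioned_size=1 * 2**30,  # 1 GiB
                storage_domains=[types.StorageDomain(name='mysd')],
            )
        )
    connection.close()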

(In reply to mlehrer from comment #4)
> Tal, please update the target milestone, as the fix is not ready for 4.4.1.

Yes, let's try 4.4.3, as I also need to check whether any changes will be needed in the discrepancy tool.

Comment 6 mlehrer 2020-09-30 10:29:19 UTC
Is this still planned to be delivered to QE in 4.4.3?

Comment 7 Germano Veit Michel 2020-09-30 21:55:23 UTC
(In reply to mlehrer from comment #6)
> Is this still planned to be delivered to QE in 4.4.3?
I think so. The patch was merged 3 hours ago, so it should be in the next build?
https://gerrit.ovirt.org/#/c/109325/

Verification Scenario:
1. Run the following on file and block storage domains containing VMs with snapshots:
$ vdsm-tool dump-volume-chains <sd_uuid>
2. Optionally, make some random modifications to the storage volume metadata and confirm the
tool runs cleanly and the output is sane. Some suggestions (a sketch follows this list):
- Change the PUUID of a volume to a wrong UUID
- Remove the PUUID tag
- Remove the Image (Disk) tags
- Remove some random metadata key (e.g. legality, capacity...)
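
For a file storage domain, a rough sketch of the kind of tampering suggested above. The .meta path layout and key names (PUUID, IMAGE, LEGALITY, CAP) are my assumptions about the file-domain metadata format, so double-check them, and only do this on a disposable test domain.

    # Sketch: change or drop one KEY=VALUE line in a volume .meta file.
    # Path layout and key names are assumptions; use only on a test SD.

    META = ('/rhev/data-center/mnt/server:_export/SD_UUID/images/'
            'IMG_UUID/VOL_UUID.meta')  # placeholder path

    def corrupt_key(path, key, new_value=None):
        """Replace (or drop, when new_value is None) one KEY=VALUE line."""
        with open(path) as f:
            lines = f.readlines()
        with open(path, 'w') as f:
            for line in lines:
                if line.startswith(key + '='):
                    if new_value is not None:
                        f.write('%s=%s\n' % (key, new_value))
                    # when new_value is None the line is dropped
                else:
                    f.write(line)

    # corrupt_key(META, 'PUUID', '11111111-1111-1111-1111-111111111111')
    # corrupt_key(META, 'PUUID')     # remove the PUUID tag entirely
    # corrupt_key(META, 'LEGALITY')  # remove a random metadata key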

Comment 8 Tzahi Ashkenazi 2020-10-05 10:34:12 UTC
Hey Germano,
some open questions regarding the verification scenario.

We are going to test this with a storage domain that contains 750 VMs:
1. How many snapshots per single VM?
2. How many VMs should have snapshots?
3. Should the snapshots include VM memory?

Thanks

Comment 9 Germano Veit Michel 2020-10-05 22:32:06 UTC
(In reply to Tzahi Ashkenazi from comment #8)
> Hey Germano,
> some open questions regarding the verification scenario.
> 
> We are going to test this with a storage domain that contains 750 VMs:
> 1. How many snapshots per single VM?
> 2. How many VMs should have snapshots?

Hi Tzahi,

Since you already have 750 VMs, you already have plenty of volumes.
You can create a few snapshots on a few VMs just to cover the case
of having snapshots; there is no need for hundreds of snapshots in total.
I'd say take 1 to 3 snapshots on 1 to 10 VMs and that's enough.
> 3. Should the snapshots include VM memory?
No, but you can have 1-2 snapshots with memory in total,
just to cover this scenario too.

So:
- 750 VMs in total, out of which 10 have snapshots
- The 10 have between 1 and 3 snapshots each
- 1 or 2 of the 10 have 1 or 2 snapshots with memory

This should be more than enough.

Thanks!

Comment 10 Tzahi Ashkenazi 2020-10-06 14:18:08 UTC
Baseline results before the new API > vdsm-tool dump-volume-chains

environment Red-01:

      * 760 VMs
      * 10 VMs with 10 snapshots each, including memory = 100 snapshots total

Version:

  1.)   vdsm-4.40.32-1.el8ev.x86_64 
        rhv-release-4.4.3-7-001.noarch 
        L0_group_1_vdsm_4.40.32-1 

       command > time vdsm-tool dump-volume-chains bf38d2c2-99a4-421c-b847-58eabd782f52 
    
          real  11m13.253s 
          user  0m3.363s 
          sys   0m0.973s 

Version:

  2.)  rhv-release-4.4.2-3-001.noarch 
       vdsm-4.40.25-1.el8ev.x86_64 
       L0_group_2_vdsm_4.40.25-1 

       command > time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442 

          real  10m34.614s 
          user  0m3.453s 
          sys   0m0.821s 

No high CPU/memory consumption during either test on the engine or the host running the API call.

Comment 11 Tal Nisan 2020-10-13 10:36:04 UTC
Tzahi, can you please move to VERIFIED then?

Comment 12 Tzahi Ashkenazi 2020-10-13 13:31:37 UTC
(In reply to Tal Nisan from comment #11)
> Tzahi, can you please move to VERIFIED then?


Hi Tal,
the current results I published are the baseline, taken on vdsm version vdsm-4.40.32.
Once we upgrade Red-01 to vdsm-4.40.33 (which contains the new API),
I will run the test again and publish the results.
Thanks

Comment 13 Peter Lauterbach 2020-10-15 19:35:07 UTC
Germano,
The performance improvement seems to be about 6%; is this in line with what you expected?
Earlier comments made it seem it would be better than that.

Comment 14 Germano Veit Michel 2020-10-16 01:52:24 UTC
(In reply to Peter Lauterbach from comment #13)
> Germano,
> The performance improvement seems to be about 6%; is this in line with what
> you expected?
> Earlier comments made it seem it would be better than that.

Hi Peter, 

Yes, we are expecting a good improvement, but where are you seeing the new results?

Both measurements from comment #10 are from before the improvement; they are even from different storage domain UUIDs, so they cannot be compared.
Did you see comment #12? Or do you have the results somewhere else?

Comment 15 Tzahi Ashkenazi 2020-10-20 07:29:21 UTC
The current results I published are the baseline, taken on vdsm version vdsm-4.40.32.
Once we upgrade Red-01 to vdsm-4.40.33 (which contains the new API),
I will run the test again and publish the results.

Comment 17 Tzahi Ashkenazi 2020-10-25 09:02:44 UTC
environment Red-01:

      * 760 VMs

      * 10 VMs with 10 snapshots each, including memory = 100 snapshots total


1. Baseline > vdsm-4.40.25:

  
       rhv-release-4.4.2-3-001.noarch  
       vdsm-4.40.25-1.el8ev.x86_64  

       L0_group_2_vdsm_4.40.25-1  

       [root@f02-h31-000-r620 ~]# time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442  > vdsm-4.40.25_b0a48885-b1f9-4a03-b46f-76f617aeb442.txt  

       real 10m45.623s  
       user 0m3.482s  
       sys 0m0.816s  

The command completed successfully on vdsm 4.40.25!



2. VDSM with the new API > vdsm-4.40.34-1.el8ev.x86_64:


The command FAILED on the same engine and on the same SD, with the same number of VMs and snapshots (L0_Group_2)!

Example of the error from the output file:

             image:    c3889918-7c2e-4aed-8c86-7fd5c6dec436
 
                  Error: no volume with a parent volume Id _BLANK_UUID found e.g: (a<-b), (b<-c)
 
                  Unordered volumes and children:

                 - 7c7696d4-e94c-410c-9e8c-a9f3030264d1 <- ff02ce5f-7dd1-48ef-b252-39253579bf13
                  status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 107374182400, truesize: 1073741824


             image:    adc7ea8c-1a62-4e1b-a73f-45739e8e228f

                    Error: no volume with a parent volume Id _BLANK_UUID found e.g: (a<-b), (b<-c)

                    Unordered volumes and children:

                   - 7c7696d4-e94c-410c-9e8c-a9f3030264d1 <- ffc89e28-fbfd-41b5-9bac-7d5a2faa03dc
                   status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 107374182400, truesize: 1073741824



The full output logs of each command can be found here:

https://drive.google.com/drive/folders/1cFgTZSkceYrzzp8xueikYiPWE_SVyW8Q?usp=sharing

Comment 18 Amit Bawer 2020-10-25 09:05:08 UTC
(In reply to Tzahi Ashkenazi from comment #17)
> The command FAILED on the same engine and on the same SD, with the same
> number of VMs and snapshots (L0_Group_2)!

Thanks Tzahi, uploaded a fix patch for the missing base volume error in the chain dumps, currently on POST.

Comment 19 Germano Veit Michel 2020-10-26 02:10:13 UTC
(In reply to Amit Bawer from comment #18)
> Thanks Tzahi, uploaded a fix patch for the missing base volume error in the
> chain dumps, currently on POST.

Clearly I didn't have VMs based on thin templates when testing.

Thank you :)
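
For context, a minimal sketch of the ordering logic involved, using made-up structures rather than the real vdsm code: the tool assumed every chain starts at a volume whose parent is BLANK_UUID, which breaks when the base volume was cloned from a template, because its parent is then a template volume that is not part of the image.

    # Made-up sketch of chain ordering; parents maps volume -> parent UUID.
    # Assumes a linear chain: at most one child per parent.
    BLANK_UUID = '00000000-0000-0000-0000-000000000000'

    def order_chain(parents):
        children = {p: v for v, p in parents.items()}
        # Original assumption: the base volume's parent is BLANK_UUID.
        base = children.get(BLANK_UUID)
        if base is None:
            # Template case: the base volume's parent is the template
            # volume, which is not among this image's volumes, hence the
            # "no volume with a parent volume Id _BLANK_UUID" error above.
            # The fix must treat an out-of-image parent as a base too.
            known = set(parents)
            bases = [v for v, p in parents.items() if p not in known]
            if len(bases) != 1:
                raise ValueError('no unique base volume found')
            base = bases[0]
        chain, vol = [], base
        while vol is not None:
            chain.append(vol)
            vol = children.get(vol)
        return chain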

Comment 24 mlehrer 2020-11-29 11:23:48 UTC
Target release is planned for 4.4.4; is this still accurate?

Comment 25 Germano Veit Michel 2020-11-29 21:13:13 UTC
(In reply to mlehrer from comment #24)
> Target release is planned for 4.4.4; is this still accurate?

Yes, all patches are on vdsm-4.40.36:

$ git log --tags --oneline | egrep 'New release|dump-volume-chains'
41dc49a1a New release: 4.40.36
67f1bda25 tool: Handle template parent volumes in dump-volume-chains
20cc9164e tool: Normalize parent volume info in dump-volume-chains
5e4c5102e New release: 4.40.35.1
ab74826c9 New release: 4.40.35
578417cbb New release: 4.40.34
99468f66d New release: 4.40.33
1354aa468 dump-volume-chains: use storage.dump() api

I'm not sure what happened in comments 21-23, but this should be ON_QA.

Comment 27 Tzahi Ashkenazi 2020-12-03 08:19:46 UTC
Tested and verified with the new vdsm version 4.40.37-1
on environment Red-01:
      * 760 VMs
      * 10 VMs with 10 snapshots each, including memory = 100 snapshots total
Version:

       rhv-release-4.4.4-2-001.noarch 
       vdsm-4.40.37-1.el8ev.x86_64 
       L0_group_2 

command > time vdsm-tool dump-volume-chains b0a48885-b1f9-4a03-b46f-76f617aeb442  

       real    0m3.262s 
       user    0m0.653s 
       sys     0m0.084s 


Amazing, a huge improvement: from 11m to 3s!

No high CPU/memory consumption during either test on the engine or the host running the API call.

The full test results for all vdsm versions can be found here:

                    https://drive.google.com/drive/folders/1cFgTZSkceYrzzp8xueikYiPWE_SVyW8Q?usp=sharing

Comment 33 errata-xmlrpc 2021-02-02 13:59:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0382

