Bug 1395941

Summary: [Scale] Average time for VM Snapshot to complete degrades once a VM contains multiple snapshots [Fixed for RHEL >= 7.5]
Product: [oVirt] ovirt-engine Reporter: mlehrer
Component: General    Assignee: Ala Hino <ahino>
Status: CLOSED CURRENTRELEASE QA Contact: guy chen <guchen>
Severity: medium Docs Contact:
Priority: high    
Version: 4.0.5.1    CC: ahino, amureini, bugs, guchen, mlehrer, nsoffer, rgolan, tjelinek, tnisan, ylavi
Target Milestone: ovirt-4.2.2    Keywords: Performance
Target Release: 4.2.2.2    Flags: rule-engine: ovirt-4.2+
ylavi: exception+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-03-29 11:07:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1213786    
Bug Blocks: 1551684    

Description mlehrer 2016-11-17 00:50:13 UTC
Description of problem:

The time it takes for a VM snapshot to complete (with or without memory) against a VM with multiple disks degrades once the VM being snapshotted already has multiple existing snapshots.

Dataset: 200 VMs with 5 disks each, 1,000 disks in total

Version-Release number of selected component (if applicable):
vdsm-4.18.15.2-1
RHEVM 4.0.5
	
How reproducible:
Very

Steps to Reproduce:
1. Load a dataset of 200 VMs with 5 disks per VM
2. Take one VM snapshot of a VM
3. Take additional VM snapshots of the same VM; after 2-3 snapshots exist, VM snapshot execution time degrades by 30s or more

Actual results:
VM snapshot execution time degrades once multiple VM snapshots exist

Expected results:
VM snapshot time should stay consistent regardless of the number of previous VM snapshots

Additional info:
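A minimal timing sketch for the reproduction above, assuming the Python SDK (ovirtsdk4); the engine URL, credentials, and VM name are placeholders rather than values from this report:

    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder connection details -- substitute a real engine and credentials.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        insecure=True,
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=scale_vm_001')[0]  # placeholder VM name
    snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

    # Create snapshots back to back and print how long each one takes; per this
    # report, the per-snapshot time is expected to grow as snapshots accumulate.
    for i in range(10):
        start = time.time()
        snapshot = snapshots_service.add(
            types.Snapshot(description='perf-test-%d' % i, persist_memorystate=False),
        )
        snapshot_service = snapshots_service.snapshot_service(snapshot.id)
        while snapshot_service.get().snapshot_status == types.SnapshotStatus.LOCKED:
            time.sleep(2)
        print('snapshot %d took %.1f seconds' % (i + 1, time.time() - start))

    connection.close()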

Comment 1 Yaniv Kaul 2016-11-17 07:11:16 UTC
Does it have anything to do with the dataset in step 1? 
Can you attach relevant logs?

Comment 2 Tomas Jelinek 2016-11-23 09:14:24 UTC
ping, can you please attach the logs?

Comment 3 mlehrer 2016-11-23 10:57:07 UTC
(In reply to Yaniv Kaul from comment #1)
> Does it have anything to do with the dataset in step 1? 

It may contribute; we'll need to test a less populated environment to gain more insight.

> Can you attach relevant logs?
Yes see link below [1]

(In reply to Tomas Jelinek from comment #2)
> ping, can you please attach the logs?
Yes see link below [1]

The following activity was done sequentially 4 times, with no concurrency, against the same VM. The actions were: VM snapshot (with memory), then VM snapshot (without memory).

In the shared logs link [1] you'll find a comparison table of the snapshot samples; you'll see the execution time increasing per iteration.

[1] https://drive.google.com/open?id=0B8V1DXeGhPPWempHNlJjNVNMU1U

Comment 4 mlehrer 2016-11-30 11:26:03 UTC
Based on conversations with Tomas, we have agreed to re-test this scenario manually to verify that automated scripts which ran immediately before this test scenario did not in any way contribute to this issue.

I will update the BZ with the results of this manual test.

Comment 6 Roy Golan 2016-12-01 07:28:22 UTC
Tal can you or one of the team take that?

Comment 7 Allon Mureinik 2016-12-04 11:41:15 UTC
Let's retest this at the 4.1 DC level, please. The compat=1.1 parameter we use for qcow2 volumes there should improve things, and may render this BZ obsolete (or not, of course ;-))
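For illustration only (my reading of the comment, not text from it): compat=1.1 selects the qcow2 v3 on-disk format when a volume is created. A minimal sketch of such a creation via qemu-img, with a placeholder path and size:

    import subprocess

    # Create a qcow2 v3 (compat=1.1) volume; the path and size are placeholders.
    subprocess.check_call([
        'qemu-img', 'create',
        '-f', 'qcow2',
        '-o', 'compat=1.1',
        '/rhev/data-center/example/new_volume',
        '10G',
    ])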

Comment 8 guy chen 2017-01-10 13:07:15 UTC
I have retested this on the latest 4.1 build 4 and the bug reproduced: the first snapshot took 46 seconds, each subsequent snapshot takes longer, and after 10 snapshots creating a snapshot takes 1 minute and 49 seconds.

Comment 9 Yaniv Lavi 2017-02-23 11:24:08 UTC
Moving out all non-blockers/exceptions.

Comment 10 Ala Hino 2017-06-01 12:41:30 UTC
(In reply to guy chen from comment #8)
> I have retested this on the latest 4.1 build 4 and the bug reproduced: the
> first snapshot took 46 seconds, each subsequent snapshot takes longer, and
> after 10 snapshots creating a snapshot takes 1 minute and 49 seconds.

Guy,

Can you provide info about the setup used to test?
Mainly, I'd like to know how many hosts there were in the deployment.
In addition, can you run the same test while the VM is running on an HSM host?

Comment 11 guy chen 2017-06-05 08:28:22 UTC
The system had 1 host, 1 iSCSI SD, and 200 VMs with 5 thin-provisioned disks per VM.
Currently we don't have a setup with more than 1 host, but when we do I will run it on HSM.

Comment 12 Nir Soffer 2017-06-13 20:55:19 UTC
Ala, see https://bugzilla.redhat.com/show_bug.cgi?id=1213786#c5 - maybe we'd
like to depend on that bug?

Comment 13 Allon Mureinik 2017-06-14 07:40:25 UTC
(In reply to Nir Soffer from comment #12)
> Ala, see https://bugzilla.redhat.com/show_bug.cgi?id=1213786#c5 - maybe we'd
> like to depend on that bug?
Sounds legit.

Comment 14 Ala Hino 2017-06-14 07:45:30 UTC
(In reply to Nir Soffer from comment #12)
> Ala, see https://bugzilla.redhat.com/show_bug.cgi?id=1213786#c5 - maybe we'd
> like to depend on that bug?

Allon was quick and marked this bug to depend on BZ 1213786.

Comment 15 Allon Mureinik 2017-06-14 07:49:19 UTC
(In reply to Ala Hino from comment #14)
> (In reply to Nir Soffer from comment #12)
> > Ala, see https://bugzilla.redhat.com/show_bug.cgi?id=1213786#c5 - maybe we'd
> > like to depend on that bug?
> 
> Allon was quick and marked this bug to depend on BZ 1213786.

Comment 16 Allon Mureinik 2017-06-18 10:57:32 UTC
Bug 1213786 has been suggested for 7.4.z; let's see where it ends up.

Comment 17 Yaniv Kaul 2017-10-15 08:59:19 UTC
(In reply to Allon Mureinik from comment #16)
> Bug 1213786 has been suggested for 7.4.z; let's see where it ends up.

It looks like it's only going to 7.5 - but is already in VERIFIED state. Anything we need to do here?

In any case, moving to ASSIGNED, as the patch above is no longer valid.

Comment 18 Ala Hino 2017-10-15 09:06:28 UTC
The patch was in POST state, not verified.
In any case, this will wait for 7.5.

Comment 19 Yaniv Kaul 2017-10-15 09:09:38 UTC
(In reply to Ala Hino from comment #18)
> The patch was in POST state, not verified.

The platform BZ is in VERIFIED state.

> In any case, this will wait for 7.5.

Which means you can take the package and test if it fixes the issue already - what do we need to wait for?

Do we have any work on our side to do here?

Comment 20 Ala Hino 2017-10-15 09:14:27 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to Ala Hino from comment #18)
> > The patch was in POST state, not verified.
> 
> The platform BZ is in VERIFIED state.
> 
> > In any case, this will wait for 7.5.
> 
> Which means you can take the package and test if it fixes the issue already
> - what do we need to wait for?
> 
> Do we have any work on our side to do here?

Yes, we have to change our code to use the new option.
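A rough sketch of the kind of change meant here, as I understand it (an assumption, not the actual vdsm patch): with qemu 2.10+, qemu-img create accepts -u, which skips opening and validating the backing chain, so creating a new top volume no longer slows down as snapshots accumulate. Because the backing file is not opened, the virtual size must be passed explicitly. Paths below are placeholders.

    import subprocess

    def create_top_volume(new_vol, backing_vol, virtual_size, backing_fmt='qcow2'):
        # -u (unsafe) skips opening/validating the backing chain; since the backing
        # image is not read, the virtual size has to be supplied explicitly.
        subprocess.check_call([
            'qemu-img', 'create',
            '-f', 'qcow2',
            '-o', 'compat=1.1,backing_file=%s,backing_fmt=%s' % (backing_vol, backing_fmt),
            '-u',
            new_vol,
            str(virtual_size),
        ])

    # Example with placeholder paths and a 10 GiB virtual size.
    create_top_volume('/rhev/data-center/example/vol_new',
                      '/rhev/data-center/example/vol_prev',
                      10 * 1024**3)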

Comment 21 Allon Mureinik 2018-02-08 10:36:09 UTC
Ala, all the patches attached are merged.
Are we waiting for something else?

Comment 22 Allon Mureinik 2018-02-08 10:36:46 UTC
Sorry Mordehai, meant to direct this needinfo at Ala - please ignore.

Comment 23 Ala Hino 2018-02-08 11:42:55 UTC
(In reply to Allon Mureinik from comment #21)
> Ala, all the patches attached are merged.
> Are we waiting for something else?

This can only be verified on RHEL 7.5, where we have the full qemu unsafe support.

Comment 24 Allon Mureinik 2018-02-08 12:19:14 UTC
Moving to MODIFIED then.
QA contact - note this requires qemu-*-[rh]ev-2.10 to verify.

Comment 25 guy chen 2018-03-07 12:32:44 UTC
Retested with oVirt 4.2.2, vdsm 4.20.20, and RHEL 7.5.
Created 10 snapshots on a VM with 2 HDDs; the issue did not reproduce and the snapshot duration remained stable, thus the bug is verified.

Comment 26 Sandro Bonazzola 2018-03-29 11:07:45 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.