Bug 1275652 - Removing snapshot operation takes around 2 hours using NFS domain
Removing snapshot operation takes around 2 hours using NFS domain
Status: CLOSED DUPLICATE of bug 1246114
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage (Show other bugs)
3.6.0.2
Unspecified Unspecified
unspecified Severity medium (vote)
: ovirt-3.6.2
: 3.6.2
Assigned To: Maor
Aharon Canan
storage
: Automation
Depends On:
Blocks: 1275655
  Show dependency treegraph
 
Reported: 2015-10-27 08:37 EDT by Raz Tamir
Modified: 2016-03-10 02:25 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-11 11:40:18 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amureini: ovirt‑3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments (Terms of Use)
engine and vdsm logs (487.04 KB, application/x-bzip)
2015-10-27 08:37 EDT, Raz Tamir
no flags Details
3.5 logs (659.86 KB, application/x-bzip)
2015-10-27 11:22 EDT, Raz Tamir
no flags Details
3.6 logs (683.72 KB, application/x-bzip)
2015-10-27 11:23 EDT, Raz Tamir
no flags Details

  None (edit)
Description Raz Tamir 2015-10-27 08:37:08 EDT
Created attachment 1086837 [details]
engine and vdsm logs

Description of problem:
Setup: vm with ~7 disks: 1 bootable 10GB + 6 * 3GB
Removing a snapshot which is in middle of the snapshot chain (not the leaf volume) takes almost 2 hours on an NFS based storage domain.


Version-Release number of selected component (if applicable):
rhevm-backend-3.6.0.2-0.1.el6.noarch
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.10-5.el7ev.noarch
qemu-img-rhev-2.3.0-31.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
Setup: vm with ~7 disks: 1 bootable 10GB + 6 * 3GB
1. Create a partition on all disks
2. Create snapshot and write file on all disks except the bootable
3. Repeat step 2 two more times
4. The chain contains now 3 snapshots, delete the second snapshot

Actual results:
The operation will take nearly 2 hours on an NFS based storage

Expected results:


Additional info:
Comment 1 Allon Mureinik 2015-10-27 08:56:48 EDT
Please re-run on 3.5 vs the same storage and attach the logs, so we can see where the performance supposedly regressed.

Also, it's unclear from the description - is the VM up or down during this operation?
Comment 2 Raz Tamir 2015-10-27 11:22 EDT
Created attachment 1086886 [details]
3.5 logs

Hi Allon,
We are talking about regular snapshot deletion, not live merge.
The operation in 3.5 took 10 minutes.

Update to the steps to reproduce:
Even without the partitions and file creation it reproduced.

I'm attaching logs from 3.5 with the updated steps to reproduce and also new 3.6 log with the updated steps to reproduce for better comparison. 

BTW, in 3.5 with the new steps to reproduce the operation takes 1 hour 35 min
Comment 3 Raz Tamir 2015-10-27 11:23 EDT
Created attachment 1086889 [details]
3.6 logs
Comment 4 Raz Tamir 2015-10-27 11:24:35 EDT
(In reply to ratamir from comment #2)
> Created attachment 1086886 [details]
> 3.5 logs
> 
> Hi Allon,
> We are talking about regular snapshot deletion, not live merge.
> The operation in 3.5 took 10 minutes.
> 
> Update to the steps to reproduce:
> Even without the partitions and file creation it reproduced.
> 
> I'm attaching logs from 3.5 with the updated steps to reproduce and also new
> 3.6 log with the updated steps to reproduce for better comparison. 
> 
> BTW, in 3.5 with the new steps to reproduce the operation takes 1 hour 35 min

Sorry, in 3.6
Comment 5 Allon Mureinik 2015-10-28 09:55:49 EDT
Thanks - and Yikes!

Tal - can we please have someone take a look at this?
Comment 6 Tal Nisan 2015-10-28 11:04:03 EDT
Maor, please have a look
Comment 7 Maor 2015-10-29 05:37:42 EDT
It looks like the issue is with qemu-img rebase (see [1]).
Basically it is a heavy operation that might take a long time (not sure about 2 hours though).
most of the tasks were waiting for the command to finished.

all the 7 volumes were executed with "qemu-img rebase" around the same time.
only two were finished:
470a1f1c-20d6-4772-aa52-f7fd5e9f2743 : still running
82fe78ed-b315-4190-894b-fe0069fa4d09 : finished      end: 09:57
d7b4cc01-e637-4b9a-ad6d-3fdfce5160f3 : still running
50aa93e3-d061-4c7a-9d28-d417e0eb7ed8 : still running
bcd1832a-3448-4186-b45b-b2ed57c79116 : still running
fcd22d8f-7b18-4854-886d-17f11facdbe2 : still running
eb4ee80b-e846-4c73-80d1-1934b438f523 : finished      end: 10:04


I can notice that in the middle of the process the domain was refreshed, but I don't think it is relevant so much:
(_refreshDomain) Refreshing domain a15a0887-6069-41ba-b293-fbc4dfbea1ff

and the delay after the refresh was a bit long:
'delay': '1.54356'
10:06:51: 'delay': '2.97009'
10:23:11: 'delay': '5.63808'


Raz, "qemu-img rebase" should repors its progress using SIGUSR1. (see https://bugzilla.redhat.com/621482)
Please try to use SIGUSR1 once it happens again on the running "qemu-img rebase" processes and share the output.


[1]
the rebase command: 
/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img rebase -t none -T none -f qcow2 -F raw -b ../ab41c149-4a1a-4059-afca-e0393c1d8371/1a3a4c72-caca-4174-825f-fd5fb76ccc85 /rhev/data-center/mnt/10.35.160.108:_Storage_storage__ge3__nfs__1/a15a0887-6069-41ba-b293-fbc4dfbea1ff/images/ab41c149-4a1a-4059-afca-e0393c1d8371/a5a83bc3-6f7b-436f-a982-cb62aeeb0bfb (cwd /rhev/data-center/mnt/10.35.160.108:_Storage_storage__ge3__nfs__1/a15a0887-6069-41ba-b293-fbc4dfbea1ff/images/ab41c149-4a1a-4059-afca-e0393c1d8371
Comment 8 Yaniv Lavi 2015-10-29 08:54:38 EDT
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Comment 9 Raz Tamir 2015-11-10 07:33:31 EST
Hi Maor,
based on comment 26 here https://bugzilla.redhat.com/show_bug.cgi?id=621482
I ran:
1) pgrep qemu-img
>> 
8758
8769
8836

2) 
for ((;;)); do kill -SIGUSR1 8758 ; sleep 5; done
for ((;;)); do kill -SIGUSR1 8769 ; sleep 5; done
for ((;;)); do kill -SIGUSR1 8836 ; sleep 5; done

on 3 different sessions and all stuck with no output.
Comment 10 Maor 2015-11-11 11:40:18 EST
looks like BZ1246114 is already taken care of this issue. Closing it as duplicate

*** This bug has been marked as a duplicate of bug 1246114 ***

Note You need to log in before you can comment on or make changes to this bug.