Bug 1275652

Summary: Removing snapshot operation takes around 2 hours using NFS domain
Product: [oVirt] ovirt-engine
Reporter: Raz Tamir <ratamir>
Component: BLL.Storage
Assignee: Maor <mlipchuk>
Status: CLOSED DUPLICATE
QA Contact: Aharon Canan <acanan>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0.2
CC: amureini, bugs, mlipchuk, ratamir, tnisan
Target Milestone: ovirt-3.6.2
Keywords: Automation
Target Release: 3.6.2
Flags: amureini: ovirt-3.6.z?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-11 16:40:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1275655
Attachments:
engine and vdsm logs (flags: none)
3.5 logs (flags: none)
3.6 logs (flags: none)

Description Raz Tamir 2015-10-27 12:37:08 UTC
Created attachment 1086837 [details]
engine and vdsm logs

Description of problem:
Setup: VM with 7 disks: 1 bootable 10GB disk + 6 x 3GB disks
Removing a snapshot from the middle of the snapshot chain (i.e. not the leaf volume) takes almost 2 hours on an NFS-based storage domain.
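
For illustration, a minimal sketch of the chain for one disk and of the merge that deleting a middle snapshot performs on a file domain (file names here are hypothetical; per the vdsm log quoted in comment 7 below, the merge is done with "qemu-img rebase", whose safe mode compares clusters against both the old and the new backing chain and therefore reads a large share of the chain over NFS):

# chain for one disk: base.raw <- snap1.qcow2 <- snap2.qcow2 (leaf)
# deleting snap1 rebases snap2 directly onto base.raw, copying any
# data present only in snap1 into snap2:
qemu-img rebase -t none -T none -f qcow2 -F raw -b base.raw snap2.qcow2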


Version-Release number of selected component (if applicable):
rhevm-backend-3.6.0.2-0.1.el6.noarch
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.10-5.el7ev.noarch
qemu-img-rhev-2.3.0-31.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
Setup: VM with 7 disks: 1 bootable 10GB disk + 6 x 3GB disks
1. Create a partition on all disks
2. Create snapshot and write file on all disks except the bootable
3. Repeat step 2 two more times
4. The chain now contains 3 snapshots; delete the second (middle) snapshot (a rough standalone timing sketch follows below)
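
As a rough standalone approximation of what step 4 triggers per disk (a sketch only, not the exact vdsm flow; sizes and names are made up), one can build a similar chain on the NFS mount and time the same style of rebase that vdsm runs:

qemu-img create -f raw base.raw 3G
qemu-img create -f qcow2 -o backing_file=base.raw,backing_fmt=raw snap1.qcow2
qemu-img create -f qcow2 -o backing_file=snap1.qcow2,backing_fmt=qcow2 snap2.qcow2
# merge "snap1" away by rebasing the leaf onto the raw base, with
# caching disabled as in the vdsm invocation:
time qemu-img rebase -t none -T none -f qcow2 -F raw -b base.raw snap2.qcow2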

Actual results:
The operation takes nearly 2 hours on an NFS-based storage domain

Expected results:
The snapshot removal completes in a reasonable amount of time.

Additional info:

Comment 1 Allon Mureinik 2015-10-27 12:56:48 UTC
Please re-run on 3.5 against the same storage and attach the logs, so we can see where the performance supposedly regressed.

Also, it's unclear from the description - is the VM up or down during this operation?

Comment 2 Raz Tamir 2015-10-27 15:22:41 UTC
Created attachment 1086886 [details]
3.5 logs

Hi Allon,
We are talking about regular snapshot deletion, not live merge.
The operation in 3.5 took 10 minutes.

Update to the steps to reproduce:
Even without the partitions and file creation, the issue reproduces.

I'm attaching logs from 3.5 with the updated steps to reproduce and also new 3.6 log with the updated steps to reproduce for better comparison. 

BTW, in 3.5 with the new steps to reproduce the operation takes 1 hour 35 min

Comment 3 Raz Tamir 2015-10-27 15:23:17 UTC
Created attachment 1086889 [details]
3.6 logs

Comment 4 Raz Tamir 2015-10-27 15:24:35 UTC
(In reply to ratamir from comment #2)
> Created attachment 1086886 [details]
> 3.5 logs
> 
> Hi Allon,
> We are talking about regular snapshot deletion, not live merge.
> The operation in 3.5 took 10 minutes.
> 
> Update to the steps to reproduce:
> Even without the partitions and file creation, the issue reproduces.
> 
> I'm attaching logs from 3.5 with the updated steps to reproduce and also new
> 3.6 log with the updated steps to reproduce for better comparison. 
> 
> BTW, in 3.5 with the new steps to reproduce the operation takes 1 hour 35 min

Sorry, in 3.6

Comment 5 Allon Mureinik 2015-10-28 13:55:49 UTC
Thanks - and Yikes!

Tal - can we please have someone take a look at this?

Comment 6 Tal Nisan 2015-10-28 15:04:03 UTC
Maor, please have a look

Comment 7 Maor 2015-10-29 09:37:42 UTC
It looks like the issue is with qemu-img rebase (see [1]).
Basically it is a heavy operation that can take a long time (though I'm not sure that explains 2 hours).
Most of the tasks were waiting for the command to finish.

"qemu-img rebase" was started on all 7 volumes at around the same time.
Only two had finished:
470a1f1c-20d6-4772-aa52-f7fd5e9f2743 : still running
82fe78ed-b315-4190-894b-fe0069fa4d09 : finished      end: 09:57
d7b4cc01-e637-4b9a-ad6d-3fdfce5160f3 : still running
50aa93e3-d061-4c7a-9d28-d417e0eb7ed8 : still running
bcd1832a-3448-4186-b45b-b2ed57c79116 : still running
fcd22d8f-7b18-4854-886d-17f11facdbe2 : still running
eb4ee80b-e846-4c73-80d1-1934b438f523 : finished      end: 10:04


I can see that the domain was refreshed in the middle of the process, but I don't think it is very relevant:
(_refreshDomain) Refreshing domain a15a0887-6069-41ba-b293-fbc4dfbea1ff

and the delay after the refresh was a bit long:
'delay': '1.54356'
10:06:51: 'delay': '2.97009'
10:23:11: 'delay': '5.63808'


Raz, "qemu-img rebase" should repors its progress using SIGUSR1. (see https://bugzilla.redhat.com/621482)
Please try to use SIGUSR1 once it happens again on the running "qemu-img rebase" processes and share the output.


[1]
The rebase command:
/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img rebase -t none -T none -f qcow2 -F raw -b ../ab41c149-4a1a-4059-afca-e0393c1d8371/1a3a4c72-caca-4174-825f-fd5fb76ccc85 /rhev/data-center/mnt/10.35.160.108:_Storage_storage__ge3__nfs__1/a15a0887-6069-41ba-b293-fbc4dfbea1ff/images/ab41c149-4a1a-4059-afca-e0393c1d8371/a5a83bc3-6f7b-436f-a982-cb62aeeb0bfb (cwd /rhev/data-center/mnt/10.35.160.108:_Storage_storage__ge3__nfs__1/a15a0887-6069-41ba-b293-fbc4dfbea1ff/images/ab41c149-4a1a-4059-afca-e0393c1d8371)
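
One hedged way to check from the host whether those rebases are still making I/O progress, independent of qemu-img's own progress reporting (a sketch; assumes Linux procfs, root access, and picks one PID):

pid=$(pgrep -xo qemu-img)   # oldest qemu-img process; adjust to the PID of interest
watch -n 5 "grep -E '^(read|write)_bytes' /proc/$pid/io"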

Comment 8 Yaniv Lavi 2015-10-29 12:54:38 UTC
In oVirt, testing is done on a single release by default, so I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Note that we might not have the testing resources to handle the 4.0 clone.

Comment 9 Raz Tamir 2015-11-10 12:33:31 UTC
Hi Maor,
Based on comment 26 in https://bugzilla.redhat.com/show_bug.cgi?id=621482,
I ran:
1) pgrep qemu-img
>> 
8758
8769
8836

2) 
for ((;;)); do kill -SIGUSR1 8758 ; sleep 5; done
for ((;;)); do kill -SIGUSR1 8769 ; sleep 5; done
for ((;;)); do kill -SIGUSR1 8836 ; sleep 5; done

in 3 different sessions; none of them produced any output.
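
A possible explanation for the missing output (an assumption, not something verified in this bug): qemu-img prints the progress line to its own stdout/stderr, which vdsm captured when it spawned the processes, so nothing would appear in the shell sending the signals. A sketch of one way to see the output anyway, by tracing the process's writes (assumes strace is installed and run as root; 8758 is one of the PIDs above):

strace -p 8758 -e trace=write -s 200 -o /tmp/rebase-progress.out &
tracer=$!
kill -SIGUSR1 8758        # ask qemu-img for a progress line
sleep 2
kill "$tracer"            # detach strace
grep '%' /tmp/rebase-progress.out   # e.g. a "(12.34/100%)" style line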

Comment 10 Maor 2015-11-11 16:40:18 UTC
It looks like BZ 1246114 has already taken care of this issue. Closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1246114 ***