Bug 1552059 - [Regression] Cannot delete VM's snapshot
Summary: [Regression] Cannot delete VM's snapshot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: pre-dev-freeze
Assignee: Fam Zheng
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On:
Blocks: 1554650 1554946
 
Reported: 2018-03-06 12:31 UTC by Jan Zmeskal
Modified: 2018-11-01 11:06 UTC
CC: 24 users

Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7
Doc Type: Bug Fix
Doc Text:
Under certain circumstances, snapshots of guests created in Red Hat Virtualization (RHV) could not be deleted due to an error in the snapshot locking mechanism. This update fixes RHV snapshot locking, and the affected snapshots can now be removed as expected.
Clone Of:
Environment:
Last Closed: 2018-11-01 11:06:51 UTC
Target Upstream Version:


Attachments

Description Jan Zmeskal 2018-03-06 12:31:41 UTC
Created attachment 1404777 [details]
engine.log

Description of problem:
You can create a snapshot of a VM, but it cannot be deleted, no matter whether the VM is running or not. The error can be found in engine.log.

Version-Release number of selected component (if applicable):
CAUTION! This bug was found on:
ovirt-engine 4.1.10
However, this version was not available in the Version field when I was creating this bug.

How reproducible:
100 %

Steps to Reproduce:
1. Have some running VM (in my log it is "jzmeskal_3")
2. Create a snapshot (in my log it is "jzmeskal_3_snapshot_1"). I did not include memory with the snapshot.
3. Now try to remove the snapshot. It will fail. Error message in Tasks: Merging snapshots (Active VM into jzmeskal_3_snapshot_1) of disk golden_mixed_virtio_template on host host_mixed_2
4. Shut down the VM and try to remove the snapshot again. You will get the same result.

Actual results:
VM snapshot cannot be deleted. Several errors are found in engine.log.

Expected results:
The VM's snapshot can be deleted.

Additional info:
RHEL 7.5
engine.log in attachment

Comment 1 Allon Mureinik 2018-03-06 13:55:21 UTC
Jan, can you please include the vdsm log too?

Comment 2 Red Hat Bugzilla Rules Engine 2018-03-06 13:55:26 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 4 Ala Hino 2018-03-06 15:23:20 UTC
Jan,

Can you please share the version of vdsm?

In the vdsm log that you uploaded I don't see any error, which is a bit odd.
How many hosts do you have in the env?
I need the SPM logs and the logs of the host running the VM, assuming there are multiple hosts.

Comment 5 Jan Zmeskal 2018-03-06 16:01:14 UTC
VDSM info:

Name        : vdsm
Arch        : x86_64
Version     : 4.19.46
Release     : 1.el7ev
Size        : 2.6 M
Repo        : installed
From repo   : rhv-4.1.10
Summary     : Virtual Desktop Server Manager
URL         : http://www.ovirt.org/develop/developer-guide/vdsm/vdsm/
License     : GPLv2+
Description : The VDSM service is required by a Virtualization Manager to manage the
            : Linux hosts. VDSM manages and monitors the host's storage, memory and
            : networks as well as virtual machine creation, other host administration
            : tasks, statistics gathering, and log collection.

The engine has only two hosts. The vdsm.log I provided is from the SPM. I can reproduce once again and provide logs from both of the hosts. Is vdsm.log enough, or do you need some other logs as well?

Comment 6 Ala Hino 2018-03-06 19:15:34 UTC
(In reply to Jan Zmeskal from comment #5)
> [...]
> The engine has only two hosts. The vdsm.log I provided is from the SPM. I can
> reproduce once again and provide logs from both of the hosts. Is
> vdsm.log enough or do you need some other logs as well?

If you can reproduce again and send the engine log and hosts (both) logs, that would be great.

Comment 7 Ala Hino 2018-03-06 19:19:10 UTC
Please also describe the storage that you are using: NFS, Gluster, iSCSI, etc.
If NFS or Gluster, please provide the paths of the storage.

Basically, the scenario you are trying is really basic and should work without errors.
I just wonder whether there is something that is configured differently.
Please provide any info you think might be helpful.

Thanks!

Comment 9 Ala Hino 2018-03-07 11:43:10 UTC
Thanks Jan.

What OS are the hosts running: RHEL 7.4 or 7.5?

Can you please share libvirt version (rpm -qa | grep libvirt)?

Comment 11 Ala Hino 2018-03-07 11:44:52 UTC
And the qemu version, please: rpm -qa | grep qemu

Comment 13 Ala Hino 2018-03-07 13:43:07 UTC
Can you please try the following scenarios and share the results?

1. Create a VM that is *not* from a template, add a disk, run the VM, create a snapshot and merge it (live merge)

2. Create a VM *from* a template, add a disk, create a snapshot and merge it (cold merge)

3. Create a VM *not* from a template, add a disk, create a snapshot and merge it (cold merge #2)

In all cases, there is no need to install an OS.

Thanks.

Comment 15 Ala Hino 2018-03-07 15:03:05 UTC
Thanks Jan.

Fam,

As reported in comment #14, it seems that when the VM is created from a template, the merge (live and cold) fails.

I remember you asking me about shared images; please note that when a VM is created from a template, there is a shared image in the chain.

Please advise how you would like to proceed.

Comment 16 Allon Mureinik 2018-03-07 15:10:44 UTC
The relevant error seems to be:

QImgError: cmd=['/usr/bin/taskset', '--cpu-list', '0-3', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'commit', '-p', '-t', 'none', '-b', u'/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge-cfme-integration-2-nfs-0/f146e846-8dd3-4a0e-993f-cb256642c3cc/images/2ec1fe78-490a-49d3-abb5-308c45df7912/8eb638f6-a286-42a6-8371-4ce554f37622', '-f', 'qcow2', u'/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge-cfme-integration-2-nfs-0/f146e846-8dd3-4a0e-993f-cb256642c3cc/images/2ec1fe78-490a-49d3-abb5-308c45df7912/bab214b7-7838-43fe-899b-5bf53aebba0d'], ecode=1, stdout=, stderr=qemu-img: Failed to get "write" lock
Is another process using the image?
, message=None

Ala - shouldn't we be using the -U flag?

Comment 17 Ala Hino 2018-03-07 15:23:09 UTC
Per BZ 1535992, it was agreed that for RHV 4.1 on RHEL 7.5, the -U flag should be provided by default. There is supposed to be a downstream patch for this.

Please note that per comment #14, if the VM is not based on template, live and cold merge operations successfully work.

Comment 18 Yaniv Kaul 2018-03-08 11:11:57 UTC
(In reply to Ala Hino from comment #17)
> Per BZ 1535992, it was agreed that for RHV 4.1 on RHEL 7.5, the -U flag
> should be provided by default. There is supposed to be a downstream patch for this.
> 
> Please note that per comment #14, if the VM is not based on a template, live
> and cold merge operations work successfully.

Ala, what's the latest here?

Comment 20 Ala Hino 2018-03-08 16:24:00 UTC
The root cause of this bug is having a shared base image that is of qcow2 format.

Please use the following steps to reproduce:
1. Create a qcow2 image - base.img
   $ qemu-img create -f qcow2 base.img 10M

2. Create a child of the base image:
   $ qemu-img create -f qcow2 -b base.img child1.img 10M

3. Run a qemu process that uses child1 image - at this point, the qemu process takes a write-lock on the entire chain, including the base image.

4. Create another child of the base image:
   $ qemu-img create -f qcow2 -b base.img child2.img 10M

5. Create a child of child2 image:
   $ qemu-img create -f qcow2 -b child2.img child2_2.img 10M

6. Commit child2_2 into child2
   $ qemu-img commit -p -t none -b child2.img child2_2.img

Step #6 fails because qemu-img commit tries to lock the entire chain, including the base image that is already locked by the qemu process created in step #3. This ends with the following error:

Error: Command ['/usr/bin/taskset', '--cpu-list', '0-3', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'commit', '-p', '-t', 'none', '-b', u'/rhev/data-center/mnt/rich-nfs-server2.usersys.redhat.com:_home_storage_sd9/a033c56a-d8bd-4313-ab19-4eaabd366c69/images/211bdb3a-befe-4ace-a0bf-e09fa32abb31/f6df22ce-d5c2-4180-b847-afdc24a0f913', '-f', 'qcow2', u'/rhev/data-center/mnt/rich-nfs-server2.usersys.redhat.com:_home_storage_sd9/a033c56a-d8bd-4313-ab19-4eaabd366c69/images/211bdb3a-befe-4ace-a0bf-e09fa32abb31/38c100c2-6db8-44bd-aa00-6d508409b3bd'] failed with rc=1 out='' err=bytearray(b'qemu-img: Failed to get "write" lock\nIs another process using the image?\n')
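The lock contention above can be illustrated with plain advisory file locks. This is only an analogy using flock(1) from util-linux (qemu's actual image locking is implemented with fcntl byte-range locks, not flock), but it shows the same failure mode: a second exclusive lock on a file that is already exclusively locked fails immediately, just as qemu-img commit fails on the base image held by the running qemu process.

```shell
# Analogy only: hold an exclusive lock on a file (like the qemu process
# holding a write lock on base.img), then try to take a second one
# (like qemu-img commit does during the merge).
lockfile=$(mktemp)
exec 9>"$lockfile"
flock -x 9                          # first holder: the running qemu process
if flock -n -x "$lockfile" -c true; then
    result="lock acquired"
else
    result="lock busy"              # what qemu-img commit runs into on base.img
fi
echo "$result"                      # prints "lock busy"
```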

Comment 26 Fam Zheng 2018-03-09 04:54:49 UTC
Thank you, Ala. I reproduced the issue in comment 20. I believe both "block-commit" and "qemu-img commit" have this problem.

The root cause is that the file lock we try to acquire on the base image during commit is more restrictive than necessary.

While I could quickly come up with a small fix, shown below, I'm still testing it. It would be better if Kevin could help, because of the subtlety of the permission system.

---

diff --git a/block.c b/block.c
index 4f76714f6b..a6b2bf89da 100644
--- a/block.c
+++ b/block.c
@@ -1003,7 +1003,7 @@ static void bdrv_backing_options(int *child_flags, QDict *child_options,

     /* backing files always opened read-only */
     qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
-    flags &= ~BDRV_O_COPY_ON_READ;
+    flags &= ~(BDRV_O_COPY_ON_READ | BDRV_O_RDWR);

     /* snapshot=on is handled on the top layer */
     flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_TEMPORARY);
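The effect of the patch can also be pictured with the flock(1) analogy (again, qemu really uses fcntl locks, so this is only an illustration): clearing BDRV_O_RDWR means the backing file is opened read-only, which corresponds to requesting a shared rather than an exclusive lock. Shared locks coexist, so a commit no longer conflicts with the running guest's read-only use of the base image, while a genuine writer would still be rejected.

```shell
# Analogy only: read-only access ~ shared lock; write access ~ exclusive lock.
lockfile=$(mktemp)
exec 8<"$lockfile"
flock -s 8                                 # reader 1: backing file opened read-only
if flock -n -s "$lockfile" -c true; then   # reader 2: another read-only opener
    shared_ok=yes
else
    shared_ok=no
fi
if flock -n -x "$lockfile" -c true; then   # a writer still conflicts
    excl_ok=yes
else
    excl_ok=no
fi
echo "second shared lock acquired: $shared_ok"   # yes - readers coexist
echo "exclusive lock acquired:     $excl_ok"     # no - writer is rejected
```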

Comment 27 Fam Zheng 2018-03-09 08:57:04 UTC
Sent a more complete series to upstream:

https://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg02606.html

Comment 29 Fam Zheng 2018-03-09 09:26:54 UTC
This will be a call for RHV.

Comment 34 Fam Zheng 2018-03-12 07:37:49 UTC
Kevin, could you help review the upstream patches mentioned in comment 27, please?

Comment 40 Raz Tamir 2018-03-15 08:31:12 UTC
Hi Jan,

Can you please re-test this flow with the proposed build in comment #36?
The qemu-kvm-rhev should be qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64

Thanks

Comment 44 Ala Hino 2018-04-17 09:02:49 UTC
Not sure what info is required from me, Kevin?

Comment 46 Raz Tamir 2018-04-25 10:08:12 UTC
It was a mistake.
We set it to '+' only when it is covered by automation.

Comment 52 Tingting Mao 2018-05-10 09:10:38 UTC
Verified the bug with an external snapshot using the same steps as the following BZ, and it passed:
https://bugzilla.redhat.com/show_bug.cgi?id=1554946#c8


However, the regression test produced ONE new bug:

Total number of cases: 71
Results: 67 passed; 4 failed

Packages Tested:
kernel-3.10.0-880.el7.x86_64
qemu-kvm-rhev-2.12.0-1.el7

New Bugs (1):
Bug 1575578 - Failed to convert a source image to the qcow2 image encrypted by luks

Reproduced Bugs (4):
Bug 1523459 - RFE: support to dump metadata of the image encrypted by LUKS using default human format
Bug 1331279 - test cases of qemu-iotests failed
Bug 1527085 - The copied flag should be updated during '-r leaks'
Bug 1191402 - The error info is not accurate when create image with specify wrong backing_fmt

Comment 53 errata-xmlrpc 2018-11-01 11:06:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

