Bug 1746730 - [downstream clone - 4.3.6] Engine deletes the leaf volume when SnapshotVDSCommand timed out without checking if the volume is still used by the VM
Summary: [downstream clone - 4.3.6] Engine deletes the leaf volume when SnapshotVDSCommand timed out without checking if the volume is still used by the VM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.6
Target Release: 4.3.6
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 1899578
Depends On: 1737684
Blocks: 1687345
 
Reported: 2019-08-29 07:34 UTC by RHV bug bot
Modified: 2024-03-25 15:24 UTC
CC List: 8 users

Fixed In Version: ovirt-engine-4.3.6.5
Doc Type: No Doc Update
Doc Text:
Clone Of: 1737684
Environment:
Last Closed: 2019-10-10 15:36:58 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5587341 0 None None None 2020-11-24 06:45:00 UTC
Red Hat Product Errata RHEA-2019:3010 0 None None None 2019-10-10 15:37:11 UTC
oVirt gerrit 102610 0 'None' MERGED core: ensure volume is not the chain before removing 2021-01-13 14:22:40 UTC
oVirt gerrit 102907 0 'None' MERGED core: ensure volume is not the chain before removing 2021-01-13 14:22:40 UTC

Description RHV bug bot 2019-08-29 07:34:01 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1737684 +++
======================================================================

Description of problem:

The SnapshotVDSCommand can time out on the engine side yet still succeed on the VDSM side. This can happen because of a network drop between the manager and the hypervisor while the command is in flight, or because of bugs like Bug 1687345. However, when the command times out on the engine side, the engine directly sends a destroy command to remove the newly created volume without checking whether it is still used by the VM. As a result, the SPM deletes the new volume even though the VM is still using it. When the VM is running on any host other than the SPM, the dm device created for this volume still exists on that host, and the VM keeps writing to the disk even though the LV no longer exists.

This can corrupt, or cause an outage of, more than one VM. From the LVM point of view the LV is already deleted, so when a user creates a new snapshot or disk, the SPM will happily allocate the same disk blocks that belonged to the recently deleted LV, since LVM considers them free. However, on the other host a VM is still using those blocks. Both VMs then write to the same blocks, which leads to I/O errors, outages and corruption.
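
To make the failure mode concrete, the following is a minimal, self-contained Java illustration (not ovirt-engine code) of why an engine-side timeout does not imply that the host-side operation failed: the caller simply stops waiting, while the remote work keeps running and can still succeed.

    // Illustrative only: a caller that gives up after a timeout while the
    // submitted task (standing in for the host-side snapshot) still completes.
    import java.util.concurrent.*;

    public class TimeoutDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService host = Executors.newSingleThreadExecutor();

            // Stand-in for the snapshot operation running on the VDSM side.
            Future<String> snapshot = host.submit(() -> {
                Thread.sleep(3_000);          // the "slow" snapshot
                return "snapshot created";
            });

            try {
                // Stand-in for the engine waiting on SnapshotVDSCommand.
                snapshot.get(1, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // The engine gave up, but the host-side work was not cancelled.
                System.out.println("engine: command timed out");
            }

            // The operation still completes successfully on the "host".
            System.out.println("host: " + snapshot.get());
            host.shutdown();
        }
    }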

Version-Release number of selected component (if applicable):

rhvm-4.3.4.3-0.1.el7.noarch
ovirt-engine-4.3.4.3-0.1.el7.noarch

How reproducible:

100%

Steps to Reproduce:

To reproduce, the SnapshotVDSCommand must time out in the engine. This can be done by blocking connectivity between the manager and the hypervisor immediately after the engine sends the SnapshotVDSCommand.

The reproducer steps in bug 1687345 should also work.

Actual results:

The engine deletes the leaf volume when SnapshotVDSCommand times out, without checking whether the volume is still used by the VM.

Expected results:

The engine should check whether the volume is still in use by the VM (by inspecting the VM's XML) before reverting the snapshot operation and deleting the volume.
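
The merged patches ("core: ensure volume is not the chain before removing") add this kind of guard. Below is a hedged sketch only: the names VolumeChainGuard, VmChainSource, fetchVolumeChain and removeVolume are hypothetical, not the actual ovirt-engine API. It illustrates the idea of confirming that the new leaf volume is absent from the running VM's current volume chain (e.g. as reported in the VM's domain XML on the host) before destroying it during rollback.

    // Hedged sketch, not ovirt-engine code: delete the new leaf volume only
    // after confirming the running VM no longer references it.
    import java.util.List;
    import java.util.UUID;

    public class VolumeChainGuard {

        /** Hypothetical source of the VM's current volume chain (e.g. parsed
         *  from the VM's domain XML on the host it runs on). */
        interface VmChainSource {
            List<UUID> fetchVolumeChain(UUID vmId);
        }

        /** Hypothetical volume removal operation (e.g. an SPM destroy call). */
        interface VolumeRemover {
            void removeVolume(UUID volumeId);
        }

        private final VmChainSource chainSource;
        private final VolumeRemover remover;

        public VolumeChainGuard(VmChainSource chainSource, VolumeRemover remover) {
            this.chainSource = chainSource;
            this.remover = remover;
        }

        /**
         * Removes the leaf volume created for the snapshot only if the running
         * VM does not reference it. If the snapshot actually succeeded on the
         * host despite the engine-side timeout, the volume is still in the
         * chain and must not be deleted.
         */
        public boolean removeIfUnused(UUID vmId, UUID newLeafVolumeId) {
            List<UUID> chain = chainSource.fetchVolumeChain(vmId);
            if (chain.contains(newLeafVolumeId)) {
                // Volume is live: skip the rollback deletion so the running VM
                // keeps a valid chain; reconciliation happens elsewhere.
                return false;
            }
            remover.removeVolume(newLeafVolumeId);
            return true;
        }
    }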

Additional info:

(Originally by Nijin Ashok)

Comment 3 RHV bug bot 2019-08-29 07:34:07 UTC
Removing "orphan" volumes after a snapshot failure seems to have been introduced by BZ1497355 (https://gerrit.ovirt.org/#/c/91658/), so IIUC 4.2.4 and higher are affected by this bug.
These exceptions need to be better handled.

(Originally by Germano Veit Michel)

Comment 5 RHV bug bot 2019-08-29 07:34:11 UTC
sync2jira

(Originally by Daniel Gur)

Comment 6 RHV bug bot 2019-08-29 07:34:12 UTC
sync2jira

(Originally by Daniel Gur)

Comment 7 Avihai 2019-09-01 15:12:07 UTC
The fix for this bug was not merged into rhv-4.3.6-6 (ovirt-engine-4.3.6.4).

see:
https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/heads/ovirt-engine-4.3


As this is the last downstream build in 4.3.6, please retarget to 4.3.7.

Comment 8 Avihai 2019-09-01 15:14:33 UTC
Benny, 
See the last comment for why this bug will not be verified in 4.3.6.

Also, please provide as clear and simple a scenario as possible so we can qa_ack it.

Comment 9 Benny Zlotnik 2019-09-01 15:23:06 UTC
Steps to reproduce are available in the description

Comment 10 RHV bug bot 2019-09-05 13:28:37 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 11 RHV bug bot 2019-09-05 13:34:21 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 13 RHV bug bot 2019-09-25 08:46:37 UTC
INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Tag 'ovirt-engine-4.3.5.6' doesn't contain patch 'https://gerrit.ovirt.org/102907']
gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.5.6

For more info please contact: rhv-devops

Comment 15 errata-xmlrpc 2019-10-10 15:36:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3010

Comment 18 Liran Rotenberg 2020-11-24 06:45:00 UTC
*** Bug 1899578 has been marked as a duplicate of this bug. ***

