+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1737684 +++
======================================================================

Description of problem:

The SnapshotVDSCommand can time out on the engine side yet still succeed on the VDSM side. This can happen because of a network drop between the manager and the hypervisor while the command is in flight, or because of bugs like Bug 1687345. However, when the command times out on the engine side, the engine directly sends a destroy command to remove the newly created volume without checking whether the volume is still used by the VM. As a result, the SPM deletes the new volume even though the VM is still using it.

When the VM is running on any host other than the SPM, the dm device created for this volume still exists, and the VM keeps writing to the disk even though the LV no longer exists. This can corrupt, or cause an outage of, more than one VM.

From LVM's point of view, this LV is already deleted. So when a user creates a new snapshot or a disk, the SPM will happily allocate the same disk blocks that belonged to the recently deleted LV, since LVM considers those blocks free. Meanwhile, on the other host, a VM is already using those blocks. Both VMs now write to the same blocks, which leads to I/O errors, outages, and corruption.

Version-Release number of selected component (if applicable):
rhvm-4.3.4.3-0.1.el7.noarch
ovirt-engine-4.3.4.3-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
To reproduce, we have to make the SnapshotVDSCommand time out in the engine. We can block the connectivity between the manager and the hypervisor immediately after the engine sends the SnapshotVDSCommand. The reproducer steps in bug 1687345 should also work.
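The core of the failure mode above is that an engine-side timeout is not evidence of a VDSM-side failure. A minimal, self-contained Python sketch of this race (not oVirt code; the function and variable names are illustrative only):

```python
import concurrent.futures
import time

def snapshot_on_vdsm(state):
    """Simulates the VDSM side: the snapshot eventually succeeds,
    even though the caller may have stopped waiting for it."""
    time.sleep(0.2)                     # e.g. slow network / slow host
    state["volume_in_use"] = True       # VM now writes to the new leaf
    return "done"

state = {"volume_in_use": False}
engine_saw_timeout = False
with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(snapshot_on_vdsm, state)
    try:
        # Engine-side timeout fires before the remote call completes.
        future.result(timeout=0.05)
    except concurrent.futures.TimeoutError:
        # A naive cleanup here would destroy the new volume,
        # although the VDSM side is about to start using it.
        engine_saw_timeout = True

future.result()  # the VDSM side actually succeeded
print(engine_saw_timeout, state["volume_in_use"])  # True True
```

The engine observed a timeout, yet the volume ended up in use: deleting it on timeout alone is exactly the unsafe step this bug describes.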
Actual results:
The engine deletes the leaf volume when SnapshotVDSCommand times out, without checking whether the volume is still used by the VM.

Expected results:
The engine should check whether the volume is in use by the VM, by inspecting the VM's domain XML, before reverting the snapshot operation and deleting the volume.

Additional info:

(Originally by Nijin Ashok)
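The expected behavior — checking the domain XML before destroying the volume — can be sketched as follows. This is a minimal illustration, not the actual engine fix; the function name, the sample XML, and the volume identifiers are all hypothetical:

```python
import xml.etree.ElementTree as ET

def volume_in_domain_xml(domain_xml: str, volume_id: str) -> bool:
    """Return True if any <disk> source in the libvirt domain XML
    references the given volume ID, i.e. the VM still uses it."""
    root = ET.fromstring(domain_xml)
    for disk in root.iter("disk"):
        source = disk.find("source")
        if source is None:
            continue
        # Block, file, and network disks carry the path in
        # different attributes.
        for attr in ("file", "dev", "name"):
            if volume_id in source.get(attr, ""):
                return True
    return False

SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/rhev/data-center/mnt/blockSD/sd-id/images/img-id/new-vol-id'/>
    </disk>
  </devices>
</domain>
"""

# The new leaf volume is referenced, so it must not be destroyed.
print(volume_in_domain_xml(SAMPLE_XML, "new-vol-id"))  # True
```

Only if the check returns False would it be safe for the engine to proceed with removing the new volume after a timed-out SnapshotVDSCommand.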
Removing "orphan" volumes after snapshot failure seems to have been introduced by BZ1497355 (https://gerrit.ovirt.org/#/c/91658/), so IIUC 4.2.4 and higher are affected by this bug. These exceptions need to be handled better. (Originally by Germano Veit Michel)
sync2jira (Originally by Daniel Gur)
This bug was not merged into rhv-4.3.6-6 (ovirt-engine-4.3.6.4). See: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/heads/ovirt-engine-4.3 As this is the last downstream build in 4.3.6, please retarget to 4.3.7.
Benny, see the last comment as to why this bug will not be verified in 4.3.6. Also, please provide as clear and simple a scenario as possible so we can qa_ack it.
Steps to reproduce are available in the description
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}', ] For more info please contact: rhv-devops
INFO: Bug status (VERIFIED) wasn't changed, but the following should be fixed: [Tag 'ovirt-engine-4.3.5.6' doesn't contain patch 'https://gerrit.ovirt.org/102907'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.5.6 For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3010
*** Bug 1899578 has been marked as a duplicate of this bug. ***