Description of problem:
The SnapshotVDSCommand can time out in the engine while still succeeding on the VDSM side. This can happen because of a network drop between the manager and the hypervisor after the engine issues the command, or because of bugs like Bug 1687345. However, if the command times out on the engine side, the engine immediately sends a destroy command to remove the newly created volume without checking whether the volume is still used by the VM. As a result, the SPM deletes the new volume even though the VM is still using it. When the VM is running on any host other than the SPM, the dm device created for this volume still exists and the VM keeps writing to the disk even though the LV no longer exists. This can corrupt, or cause an outage of, more than one VM. From LVM's point of view, this LV is already deleted, so when a user creates a new snapshot or disk, the SPM will happily allocate the same disk blocks that belonged to the recently deleted LV, since LVM considers those blocks free. However, a VM on another host is already using these blocks. Both VMs then write to the same blocks, which leads to I/O errors, outages, and corruption.

Version-Release number of selected component (if applicable):
rhvm-4.3.4.3-0.1.el7.noarch
ovirt-engine-4.3.4.3-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
To reproduce, we have to make the SnapshotVDSCommand time out in the engine. We can block connectivity between the manager and the hypervisor immediately after the engine sends the SnapshotVDSCommand. The reproducer steps in bug 1687345 should also work.

Actual results:
When SnapshotVDSCommand times out, the engine deletes the leaf volume without checking whether the volume is still used by the VM.

Expected results:
Before reverting the snapshot operation and deleting the volume, the engine should check the VM's XML to verify whether the volume is still in use.

Additional info:
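The race described above can be modeled in a few lines: the engine-side caller gives up after a timeout, but the host-side snapshot operation still completes, leaving the volume in use. This is a minimal, hypothetical sketch (all names are illustrative, not oVirt/VDSM code):

```python
# Illustrative model of the SnapshotVDSCommand timeout race.
# "engine_call" waits only up to its timeout; "host_snapshot" simulates
# VDSM finishing the snapshot after the engine has already given up.
import threading
import time

host_state = {"snapshot_done": False}

def host_snapshot(delay: float) -> None:
    # Simulates the host-side snapshot completing after a delay.
    time.sleep(delay)
    host_state["snapshot_done"] = True

def engine_call(timeout: float, host_delay: float) -> str:
    worker = threading.Thread(target=host_snapshot, args=(host_delay,))
    worker.start()
    worker.join(timeout)                    # engine waits only this long
    result = "timeout" if worker.is_alive() else "ok"
    worker.join()                           # the host still completes the work
    return result

result = engine_call(timeout=0.05, host_delay=0.2)
print(result, host_state["snapshot_done"])  # timeout True
```

The engine sees "timeout" and treats the snapshot as failed, yet the host-side state says the snapshot succeeded; destroying the new volume at this point is exactly the unsafe step the bug describes.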
Removing "orphan" volumes after a snapshot failure seems to have been introduced by BZ 1497355 (https://gerrit.ovirt.org/#/c/91658/), so IIUC 4.2.4 and later are affected by this bug. These exceptions need to be handled better.
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
(In reply to Evelina Shames from comment #11)
> The same bug was verified for ovirt-engine-4.3.6.5-0.1 and vdsm-4.30.30-1 -
> Bug 1746730, and I'm trying to verify on:
> ovirt-engine-4.4.0-0.13.master.el7.noarch
> vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
>
> With the same steps:
> - On host, change in /usr/lib/python3.6/site-packages/vdsm/API.py:
>     s = vm.snapshot(snapDrives, memoryParams, frozen=frozen)
>     import time
>     time.sleep(190)
>     return s
> - Restart vdsm
> - Power on VM on this host
> - Try to create live snapshot
> - Operation fails -> But in engine log the following doesn't appear:
>   'appears to be in use by VM'
>
> Benny, is it ok?
>
> Engine log is attached
> (2020-01-02 11:33:10,858+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (EE-ManagedThreadFactory-engine-Thread-187417) [e4a80c85-1e5e-45c6-9c71-87b12238e336] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='b3e907a2-cea0-4eda-b1f9-9d451aa5a571', ignoreFailoverLimit='false', storageDomainId='f69faf83-6f96-40f9-86e6-e865df601bbd', imageGroupId='5007efa1-9f41-479a-bb86-315e9eada5ae', imageSizeInBytes='1358954496', volumeFormat='RAW', newImageId='13aa556b-4ffe-4ee7-a7fd-28241448ca94', imageType='Sparse', newImageDescription='{"DiskAlias":"golden_env_mixed_virtio_1_0_snapshot_memory","DiskDescription":"Memory snapshot disk for snapshot 'snp1' of VM 'golden_env_mixed_virtio_1_0' (VM ID: '6d0b5d42-59f6-42d1-9088-180b63c410f6')"}', imageInitialSizeInBytes='0'}), log id: 299cd3d0)

I see it:

2020-01-02 11:36:27,179+02 WARN [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [e4a80c85-1e5e-45c6-9c71-87b12238e336] Image '12e06ed6-5045-47e7-89e7-0ce7efd7465e' appears to be in use by VM '6d0b5d42-59f6-42d1-9088-180b63c410f6', skipping deletion
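The "appears to be in use by VM ... skipping deletion" guard above amounts to checking the VM's domain XML for the volume before destroying it. A minimal sketch of that kind of check (the function name and XML shape are assumptions for illustration, not engine code):

```python
# Hypothetical sketch: skip volume deletion if the volume still appears
# in any <disk><source> element of the VM's libvirt domain XML.
import xml.etree.ElementTree as ET

def volume_in_use(domain_xml: str, volume_id: str) -> bool:
    """Return True if any <source> attribute in the domain XML references volume_id."""
    root = ET.fromstring(domain_xml)
    for source in root.iter("source"):
        for value in source.attrib.values():
            if volume_id in value:
                return True
    return False

# Sample domain XML using the volume IDs from the quoted log, for illustration.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/rhev/data-center/mnt/blockSD/f69faf83-6f96-40f9-86e6-e865df601bbd/images/5007efa1-9f41-479a-bb86-315e9eada5ae/13aa556b-4ffe-4ee7-a7fd-28241448ca94'/>
    </disk>
  </devices>
</domain>
"""

# The new leaf volume is still referenced, so deletion must be skipped.
print(volume_in_use(DOMAIN_XML, "13aa556b-4ffe-4ee7-a7fd-28241448ca94"))   # True
print(volume_in_use(DOMAIN_XML, "deadbeef-0000-0000-0000-000000000000"))   # False
```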
Ohh sorry, missed it. Moving to verified.
ovirt-engine-4.4.0-0.13.master.el7.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247