1842375 – Failed snapshot creation can cause data corruption of other VMs [RHV clone - 4.3.10]

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-4.3.10
Target Release:	---
Assignee:	Liran Rotenberg
QA Contact:	Shir Fishbain
Docs Contact:
URL:
Whiteboard:
Depends On:	1821164
Blocks:

Reported:	2020-06-01 06:55 UTC by RHV bug bot
Modified:	2023-12-15 18:02 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: An unsuccessful freeze command from VDSM ran into the snapshot command's timeout of 3 minutes. Consequence: The snapshot command does not start. The engine assumes that the volume chain was updated when checking for volume usage. But in this case it has no reliable way to tell whether the volume is in use, making data corruption possible. Fix: A new value is set in engine-config: ' '. If this value is set to true, the freeze command is performed from the engine. This prevents the situation above. Result: If 'LiveSnapshotPerformFreezeInEngine' is set to true, the freeze happens in the engine, before the snapshot command is called. In this case, no data corruption is possible.
Clone Of:	1821164
Environment:
Last Closed:	2020-06-09 10:20:28 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5219611	None	None	None	2020-07-20 22:42:09 UTC
oVirt gerrit	108539	master	MERGED	core: snapshot: allow force freeze in engine	2021-02-16 08:12:20 UTC
oVirt gerrit	108572	master	MERGED	core: snapshot: allow inconsistent snapshot	2021-02-16 08:12:20 UTC
oVirt gerrit	108666	ovirt-engine-4.3	MERGED	core: snapshot: allow inconsistent snapshot	2021-02-16 08:12:20 UTC
oVirt gerrit	108673	ovirt-engine-4.3	MERGED	core: snapshot: allow force freeze in engine	2021-02-16 08:12:21 UTC

Description RHV bug bot 2020-06-01 06:55:57 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1821164 +++
======================================================================

Description of problem:
When snapshot creation fails on a timeout, the engine triggers a rollback of the operation and removes the snapshot volumes, even though the snapshot has finished on the hypervisor.

Version-Release number of selected component (if applicable):
4.3.7

How reproducible:
100%

Steps to Reproduce:
1. Trigger a live snapshot of a VM that is not running on the SPM
2. Make the fs freeze hang for 10 min (to exceed all the timeouts)


Actual results:
The related volumes are removed on the SPM, but the VM has finished the snapshot and is using it.

Expected results:
The snapshot is either stopped completely or the volumes are not removed without the confirmation that they are not used by any VM.

Additional info:

This is a very dangerous situation, as other VMs can allocate the extents of the removed LVs. This will cause data corruption, as two VMs may write to the same area.

(Originally by Roman Hodain)
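
The extent reuse described above can be illustrated with a toy model. This is only a sketch: `ToyVolumeGroup` and the LV names are hypothetical and do not reflect LVM's or VDSM's real interfaces. It shows how extents freed by a rollback can be handed to a second LV while the first VM is still writing to them.

```python
class ToyVolumeGroup:
    """Hypothetical first-fit extent allocator; not LVM's real API."""

    def __init__(self, extent_count):
        self.free = list(range(extent_count))
        self.lvs = {}

    def create_lv(self, name, extents):
        if extents > len(self.free):
            raise ValueError("not enough free extents")
        allocated, self.free = self.free[:extents], self.free[extents:]
        self.lvs[name] = allocated
        return allocated

    def remove_lv(self, name):
        # Nothing here checks whether a running VM still uses the LV.
        self.free = sorted(self.free + self.lvs.pop(name))


vg = ToyVolumeGroup(extent_count=8)
vg.create_lv("vm_a_base", 4)
snapshot_extents = vg.create_lv("vm_a_snapshot", 2)

# Rollback on the SPM removes the snapshot LV while VM A still writes to it.
vg.remove_lv("vm_a_snapshot")

# A new disk for another VM receives the same extents.
other_extents = vg.create_lv("vm_b_disk", 2)
overlap = set(snapshot_extents) & set(other_extents)
print("extents written by both VMs:", sorted(overlap))
```

Any real allocator is more involved, but the failure mode is the same: once the LV is removed, nothing protects its former extents.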

Comment 1 RHV bug bot 2020-06-01 06:55:59 UTC

This issue may get fixed by 

    Bug 1749284 - Change the Snapshot operation to be asynchronous

But there may still be potential for this behaviour if we do not handle all the corner cases properly.

(Originally by Roman Hodain)

Comment 9 RHV bug bot 2020-06-01 06:56:14 UTC

Benny, you think it will be possible to check the Domain XML dump to figure out if the VM is currently using an image that we are going to rollback and in that case roll forward?

(Originally by Tal Nisan)

Comment 10 RHV bug bot 2020-06-01 06:56:16 UTC

(In reply to Tal Nisan from comment #9)
> Benny, you think it will be possible to check the Domain XML dump to figure
> out if the VM is currently using an image that we are going to rollback and
> in that case roll forward?

It's strange because we have this check [1]. I checked the logs and it seems the XML dump didn't contain the new volumes, so from the engine's POV they weren't used (they are part of the dump later on, when live merge runs). I didn't find logs from the host from when the VM runs, so I'm not entirely sure what happened.



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateSnapshotCommand.java#L216

(Originally by Benny Zlotnik)

Comment 11 RHV bug bot 2020-06-01 06:56:18 UTC

well, there's timing to consider between getVolumeChain() and actual removal. Would be best to have such a check on vdsm side, perhaps? As a safeguard in case engine decides to delete an active volume....for whatever reason.

(Originally by michal.skrivanek)

Comment 12 RHV bug bot 2020-06-01 06:56:20 UTC

(In reply to Michal Skrivanek from comment #11)
> well, there's timing to consider between getVolumeChain() and actual
> removal. Would be best to have such a check on vdsm side, perhaps? As a
> safeguard in case engine decides to delete an active volume....for whatever
> reason.

yes, read the bug over, and if the snapshot creation didn't reach the call to libvirt we'll still see the original chain (in the previous bug freeze passed fine, the memory dump took too long)... so we can't really do this reliably in the engine

(Originally by Benny Zlotnik)
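
The timing problem discussed in comments 11 and 12 is a check-then-act race. A minimal sketch, assuming hypothetical helpers (`rollback_snapshot`, `before_remove`) that only stand in for the engine's chain check and the later removal; this is not the actual engine code:

```python
def rollback_snapshot(get_volume_chain, remove_volume, volume_id,
                      before_remove=lambda: None):
    """Hypothetical rollback: check the chain, then remove the volume."""
    if volume_id in get_volume_chain():
        return "kept"
    # Window between the check and the removal; the VM may switch to the
    # new volume here if the snapshot completes late on the host.
    before_remove()
    remove_volume(volume_id)
    return "removed"


chain = ["base"]   # what the domain XML reports at check time
removed = []

result = rollback_snapshot(
    get_volume_chain=lambda: list(chain),
    remove_volume=removed.append,
    volume_id="snap",
    # Simulates the snapshot finishing on the host after the check.
    before_remove=lambda: chain.append("snap"),
)
print(result, "| chain now:", chain, "| removed:", removed)
```

The check passes because the chain is read before the snapshot lands, which is why a check in the engine alone cannot be fully reliable.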

Comment 13 RHV bug bot 2020-06-01 06:56:21 UTC

Do we have a way to tell if a volume is used by a VM in vdsm though? Image removal is an SPM operation
Maybe we can acquire a volume lease and inquire when trying to delete

(Originally by Benny Zlotnik)
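
The lease idea from comment 13 could look roughly like the sketch below. `ToyLeaseRegistry` and `remove_if_unleased` are hypothetical names for illustration only; they are not VDSM or sanlock APIs, and a real implementation would need a lease that is shared across hosts.

```python
class ToyLeaseRegistry:
    """Hypothetical in-memory stand-in for per-volume leases."""

    def __init__(self):
        self._holders = {}

    def acquire(self, volume_id, holder):
        # Succeeds only if the lease is free or already held by `holder`.
        return self._holders.setdefault(volume_id, holder) == holder

    def release(self, volume_id, holder):
        if self._holders.get(volume_id) == holder:
            del self._holders[volume_id]


def remove_if_unleased(leases, volume_id, remove_volume, holder="spm-delete"):
    """Remove the volume only if its lease could be taken for the deletion."""
    if not leases.acquire(volume_id, holder):
        return False
    try:
        remove_volume(volume_id)
        return True
    finally:
        leases.release(volume_id, holder)


leases = ToyLeaseRegistry()
removed = []

leases.acquire("snap", "vm-on-host-2")     # the VM is using the volume
first_try = remove_if_unleased(leases, "snap", removed.append)

leases.release("snap", "vm-on-host-2")     # the VM no longer uses it
second_try = remove_if_unleased(leases, "snap", removed.append)
print(first_try, second_try, removed)
```

With this shape, the decision to delete no longer depends on a chain snapshot taken earlier by the engine.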

Comment 21 RHV bug bot 2020-06-01 06:56:36 UTC

Here is one observation.

The snapshot creation continued after we received:

    2020-03-30 20:45:13,918+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='cdb7c691-41be-4f96-808c-4d4421462a36') Unable to freeze guest filesystems: internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': timeout when try to receive Frozen event from VSS provider: Unspecified error (vm:4262)

This is generated by the qemu agent. The agent waits for the fsFreeze event for 10s, but this message was reported minutes after the fsFreeze was initiated. So the guest agent may get stuck even before triggering the freeze. Would it be better not to rely on the agent and simply fail the fsFreeze according to a timeout suitable for the vdsm workflow? We can see that this operation can be blocking.

(Originally by Roman Hodain)
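
A caller-side timeout of the kind suggested above could be sketched as follows. `freeze_with_deadline` and `FreezeTimeout` are hypothetical names, and the `freeze` callable is only a stand-in for whatever call blocks on the guest agent; this is not VDSM code.

```python
import concurrent.futures
import time


class FreezeTimeout(Exception):
    """Raised when the freeze call does not return within the deadline."""


def freeze_with_deadline(freeze, timeout_seconds):
    """Run a blocking freeze call, giving up after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(freeze)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise FreezeTimeout("freeze exceeded %.2fs" % timeout_seconds) from None
    finally:
        # Do not wait for the stuck worker; it keeps running in the background.
        executor.shutdown(wait=False)


def stuck_freeze():
    time.sleep(0.3)   # stands in for a guest agent that hangs
    return "frozen"


try:
    freeze_with_deadline(stuck_freeze, timeout_seconds=0.05)
    timed_out = False
except FreezeTimeout:
    timed_out = True

quick_result = freeze_with_deadline(lambda: "frozen", timeout_seconds=1.0)
print(timed_out, quick_result)
```

Note that giving up on the call does not cancel it: the blocked freeze may still complete later, so the caller would also have to handle a late freeze (for example by thawing) before treating the snapshot as failed.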

Comment 23 Shir Fishbain 2020-06-01 13:47:54 UTC

The snapshot creation completed successfully and the snapshot is ready to be used.
Verified with the following steps:
1. Add a sleep on the host in /usr/lib/python2.7/site-packages/vdsm/virt/vm.py
2. Restart vdsmd
3. On the engine: engine-config -s LiveSnapshotPerformFreezeInEngine=true
4. Restart ovirt-engine service
5. Run the new VM on the host [1]
6. Create a snapshot without memory

**At the moment, LiveSnapshotPerformFreezeInEngine is set to true by default.

ovirt-engine-4.3.10.4-0.1.el7.noarch
vdsm-4.30.46-1.el7ev.x86_64
libvirt-4.5.0-33.el7_8.1.x86_64
