Bug 1899578 - Export to OVA operation failed, then removed active volumes that were in use, resulting in data loss when the VM was restarted
Summary: Export to OVA operation failed, then removed active volumes that were in use, resulting in data loss when the VM was restarted
Keywords:
Status: CLOSED DUPLICATE of bug 1746730
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nobody
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-19 15:31 UTC by Gordon Watson
Modified: 2024-03-25 17:10 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 06:45:02 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Knowledge Base (Solution) 5587341 - Last Updated: 2020-11-19 16:59:15 UTC

Description Gordon Watson 2020-11-19 15:31:09 UTC
Description of problem:

An Export to OVA operation was performed. A new volume was created for each disk and the snapshot operation was executed on the host (albeit a day later), but the operation then failed on the engine and was rolled back. The rollback/reversion sequence removed the new volumes, even though they were in use as the active volumes of the running 'qemu-kvm' process.

When the VM was later restarted, it came up with the parent volumes as its active volumes, and all of the data that had been written to the removed volumes was lost.
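
To make the failure mode concrete, here is a minimal, purely illustrative sketch (not oVirt code; the volume names are made up) of why removing the new top volume of a qcow2 chain while the VM is still running on it loses everything written after the snapshot:

    import java.util.*;

    public class ChainSketch {
        public static void main(String[] args) {
            // Snapshot chain after the export created a new top volume:
            // base <- parent <- newActive (the layer qemu-kvm is writing to).
            Deque<String> chain = new ArrayDeque<>(List.of("newActive", "parent", "base"));
            Map<String, String> writes = new HashMap<>();
            writes.put("newActive", "all guest writes made since the snapshot");

            // Faulty rollback: the new volume is deleted while the VM still runs on it.
            chain.removeFirst();
            writes.remove("newActive");

            // On the next VM start, the stale parent becomes the active layer.
            System.out.println("VM restarts on: " + chain.peekFirst());
            System.out.println("Guest writes since the snapshot: " + writes.getOrDefault("newActive", "LOST"));
        }
    }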


Version-Release number of selected component (if applicable):

RHV 4.3.4
RHVH 4.3-0.8:
  libvirt-4.5.0-10.el7_6.10.x86_64
  qemu-kvm-rhev-2.12.0-18.el7_6.5.x86_64
  vdsm-4.30.17-1.el7ev.x86_64


How reproducible:

No, at least not yet.


Steps to Reproduce:
1.
2.
3.

Actual results:

Active volumes were removed while in use by the VM, instead of a Live Merge being issued, resulting in loss of data when the VM was later restarted.


Expected results:

A Live Merge should have been performed. Even performing no rollback of the operation at all would have been better than what happened.


Additional info:

Comment 11 Arik 2020-11-23 18:46:42 UTC
ExportVmToOva creates a snapshot.
It seems the new volumes were created successfully and thus SnapshotVDSCommand was called:

2020-09-22 08:20:00,088+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [2c2b4425-3c9f-44e7-9bab-d5c864f319b2] START, SnapshotVDSCommand
2020-09-22 08:20:02,354+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [2c2b4425-3c9f-44e7-9bab-d5c864f319b2] FINISH, SnapshotVDSCommand

It succeeded - an indication of that is the call to DumpXmls, which returned a modified domain XML, meaning the VM had started using the new volumes afterwards.

But the create-snapshot task was later detected as failed (probably expired - I see other commands that expired around that time), so endVmCommand of CreateSnapshotForVm was probably called and decided to end the actions on the disks with failure -
and what that does is remove the created volumes.

So in that very unlikely situation where CreateSnapshotForVm failed after calling SnapshotVDSCommand (note that it's called differently in 4.4), we should not remove the created volumes on failure (we can probably determine that from the phase the command reached).
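
A minimal sketch of the guard suggested above, using hypothetical names (the real CreateSnapshotForVm / endVmCommand flow is more involved): track the phase the command reached and skip volume removal once SnapshotVDSCommand has been sent to the host.

    import java.util.List;

    // Hypothetical phases of the snapshot command; not the engine's real state machine.
    enum SnapshotPhase { CREATE_VOLUMES, SNAPSHOT_VDS_SENT, COMPLETED }

    class SnapshotRollbackSketch {
        private final SnapshotPhase reachedPhase;

        SnapshotRollbackSketch(SnapshotPhase reachedPhase) {
            this.reachedPhase = reachedPhase;
        }

        // On failure, remove the new volumes only if the host was never told to
        // switch to them; once SnapshotVDSCommand has been sent, the running VM
        // may already be writing to those volumes.
        void endWithFailure(List<String> newVolumeIds) {
            if (reachedPhase.ordinal() >= SnapshotPhase.SNAPSHOT_VDS_SENT.ordinal()) {
                System.out.println("Snapshot reached the host; keeping volumes " + newVolumeIds);
                return;
            }
            newVolumeIds.forEach(id -> System.out.println("Removing unused volume " + id));
        }
    }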

Also note that the likelihood of this happening is probably reduced significantly in 4.4, where our virt ansible tasks no longer block CoCo from monitoring other commands.

And yes, that may also happen in clone VM, I suppose, since it also calls create-snapshot.

Comment 12 Liran Rotenberg 2020-11-24 06:45:02 UTC
Hi,
As Arik said, it can happen in any flow in which we create a snapshot, as well as in a standalone snapshot operation.

We had many bugs in that area and I think we solved all of them in 4.4. In this case, the CreateSnapshotCommand failure leads to unwanted cleanup of the volume.
It was fixed, and the fix was backported to 4.3: this bug report is against RHV 4.3.4, while the fix is in RHV 4.3.6.
The engine now checks whether the volume is in use by the VM and skips its deletion if it is (a rough sketch of that check follows below).
Therefore I'm closing this as a duplicate.
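
For readers hitting this on 4.3.4, the behaviour described above amounts to roughly the following check. This is a hedged sketch with hypothetical names, not the actual code of the fix tracked in bug 1746730:

    import java.util.Set;

    class VolumeCleanupSketch {
        // Image IDs the running VM currently uses in its disk chains
        // (how the engine obtains this list is outside the sketch).
        private final Set<String> volumesInUseByVm;

        VolumeCleanupSketch(Set<String> volumesInUseByVm) {
            this.volumesInUseByVm = volumesInUseByVm;
        }

        // Cleanup step: delete a leftover snapshot volume only if the VM is not using it.
        void removeIfUnused(String volumeId) {
            if (volumesInUseByVm.contains(volumeId)) {
                System.out.println("Volume " + volumeId + " is in use by the VM; skipping deletion");
                return;
            }
            System.out.println("Deleting volume " + volumeId);
        }
    }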

*** This bug has been marked as a duplicate of bug 1746730 ***

