Bug 1899578
| Summary: | Export to OVA operation failed, then removed active volumes that were in use, resulting in data loss when the VM was restarted | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson> |
| Component: | ovirt-engine | Assignee: | Nobody <nobody> |
| Status: | CLOSED DUPLICATE | QA Contact: | Avihai <aefrat> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.4 | CC: | ahadas, eshenitz, lrotenbe, mavital, tnisan |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-24 06:45:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Gordon Watson
2020-11-19 15:31:09 UTC
ExportVmToOva creates a snapshot. It seems the new volumes were created successfully, and SnapshotVDSCommand was therefore called:

2020-09-22 08:20:00,088+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [2c2b4425-3c9f-44e7-9bab-d5c864f319b2] START, SnapshotVDSCommand
2020-09-22 08:20:02,354+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [2c2b4425-3c9f-44e7-9bab-d5c864f319b2] FINISH, SnapshotVDSCommand

It succeeded; one indication is the call to DumpXmls, which returned a modified domain XML, so the VM started using the new volumes afterwards. But the create-snapshot task was later detected as failed (probably expired; I see other commands that expired around that time), so endVmCommand of CreateSnapshotForVm was presumably called and decided to end the actions on the disks with failure, and what that does is remove the created volumes.

So in that very unlikely situation where CreateSnapshotForVm fails after calling SnapshotVDSCommand (note that it's called differently in 4.4), we should not remove the created volumes on failure (we can probably determine this from the phase the command reached). Also note that the likelihood of this happening was probably reduced significantly in 4.4, where our virt ansible tasks no longer block CoCo from monitoring other commands. And yes, this may also happen in clone VM, I suppose, since it also calls create-snapshot.

Hi,

As Arik said, this can happen in every flow in which we make a snapshot, and in the snapshot operation as a standalone. We had many bugs in that area and I think we solved all of them in 4.4. In this case, the CreateSnapshotCommand failure leads to unwanted cleanup of the volume. It was fixed, and the fix was backported to 4.3. This bug report uses RHV 4.3.4, while the fix is in RHV 4.3.6.
The engine now checks whether the volume is in use by the VM and skips its deletion if it is. Therefore I'm closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 1746730 ***
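The guard described above can be sketched roughly as follows. This is an illustrative, hypothetical snippet, not the actual ovirt-engine code: class and method names are invented, and the real fix lives in the engine's snapshot command cleanup path. The idea is simply that rollback after a failed snapshot command must not delete volumes the VM's domain XML already references, since SnapshotVDSCommand may have succeeded on the host even though the engine-side task expired.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the in-use guard: on snapshot-command failure,
// only roll back (delete) newly created volumes that the VM is NOT
// already using. Names are illustrative, not real ovirt-engine APIs.
public class SnapshotRollback {

    /** Returns the subset of newly created volumes that are safe to delete. */
    public static List<String> volumesSafeToDelete(List<String> createdVolumes,
                                                   List<String> volumesInUseByVm) {
        List<String> safe = new ArrayList<>();
        for (String vol : createdVolumes) {
            // Skip deletion when the VM already references the volume:
            // the snapshot took effect on the host even if the engine
            // later marked the task as failed/expired.
            if (!volumesInUseByVm.contains(vol)) {
                safe.add(vol);
            }
        }
        return safe;
    }
}
```

With this check in place, the failure path in the original report would have found the new volumes listed in the VM's active disk chain and left them alone, avoiding the data loss on restart.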