1658589 – [downstream clone - 4.2.8] Engine sent duplicate SnapshotVDSCommand, causing data corruption

Bug 1658589 - [downstream clone - 4.2.8] Engine sent duplicate SnapshotVDSCommand, causing data corruption

Summary: [downstream clone - 4.2.8] Engine sent duplicate SnapshotVDSCommand, causing ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.2.7
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-4.2.8
Target Release:	---
Assignee:	Eyal Shenitzky
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:	1654891
Blocks:
TreeView+	depends on / blocked

Reported:	2018-12-12 13:22 UTC by RHV bug bot
Modified:	2022-03-13 16:53 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1654891
Environment:
Last Closed:	2019-01-22 12:44:51 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1640225	unspecified	CLOSED	Failed to migrate VM disk: unsupported configuration: nothing selected for snapshot	2021-02-22 00:41:40 UTC
Red Hat Issue Tracker	RHV-45242	None	None	None	2022-03-13 16:53:41 UTC
Red Hat Knowledge Base (Solution)	3728131	None	None	None	2018-12-12 13:23:21 UTC
Red Hat Product Errata	RHBA-2019:0121	None	None	None	2019-01-22 12:44:57 UTC
oVirt gerrit	96139	master	MERGED	core: persist CreateSnapshotForVmCommand#performNextOperation	2018-12-12 13:23:21 UTC
oVirt gerrit	96165	ovirt-engine-4.2	MERGED	core: persist CreateSnapshotForVmCommand#performNextOperation	2018-12-13 08:29:33 UTC

Description RHV bug bot 2018-12-12 13:22:32 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1654891 +++
======================================================================

Description of problem:

The VM was apparently healthy. Then the user initiates a snapshot operation.

1. The engine creates the new volume (SPM). All good.
2. The engine sends SnapshotVDSCommand to the host running the VM. The VM starts using the new leaf volume. All good
3. The engine sends the exact same SnapshotVDSCommand again(!!!) It fails.
4. The engine tells the SPM to delete the LV that the VM just switched to in [2]. 

Now those extents are free but the VM is using them. A new volume create on the SD will overlap with the extents of the image used by the VM, causing data corruption for both.

Engine side:

1. Creates volume (IRS)

2018-11-26 03:35:15,453+13 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-24) [4b9931a3-c789-4af4-be9c-9bec8b6b8ba2] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='58117215-0261-0167-0311-0000000002b6', ignoreFailoverLimit='false', storageDomainId='0b0107b2-9d2a-4871-b4d6-76bdffba84cd', imageGroupId='5cbd57df-bf57-4fd3-a209-039e9e8d9629', imageSizeInBytes='34359738368', volumeFormat='COW', newImageId='1da1d6d6-e9a1-43ac-9882-4e787961f4ea', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='452916ee-5837-455a-bfd0-f9a236ab37f6', sourceImageGroupId='5cbd57df-bf57-4fd3-a209-039e9e8d9629'}), log id: 2b40013a

2. 1st snapshot (succeeds)

2018-11-26 03:35:32,208+13 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [4b9931a3-c789-4af4-be9c-9bec8b6b8ba2] START, SnapshotVDSCommand(HostName = blade04, SnapshotVDSCommandParameters:{hostId='c4127bdd-d2b9-4eab-a551-9ad4936fbdd5', vmId='67711673-32a1-4803-873c-bfd029cfd1ca'}), log id: 32a6dad7

3. 2nd Snapshot (fails)
2018-11-26 03:35:38,564+13 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [4b9931a3-c789-4af4-be9c-9bec8b6b8ba2] START, SnapshotVDSCommand(HostName = blade04, SnapshotVDSCommandParameters:{hostId='c4127bdd-d2b9-4eab-a551-9ad4936fbdd5', vmId='67711673-32a1-4803-873c-bfd029cfd1ca'}), log id: 3b23ffdd

2018-11-26 03:35:38,607+13 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [4b9931a3-c789-4af4-be9c-9bec8b6b8ba2] Failed in 'SnapshotVDS' method

4. Destroy Volume (IRS)

2018-11-26 03:35:39,980+13 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-5) [4b9931a3-c789-4af4-be9c-9bec8b6b8ba2] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='58117215-0261-0167-0311-0000000002b6', ignoreFailoverLimit='false', storageDomainId='0b0107b2-9d2a-4871-b4d6-76bdffba84cd', imageGroupId='5cbd57df-bf57-4fd3-a209-039e9e8d9629', imageId='00000000-0000-0000-0000-000000000000', imageList='[1da1d6d6-e9a1-43ac-9882-4e787961f4ea]', postZero='false', force='false'}), log id: 194548ed

VDSM side:

2018-11-26 03:35:32,221+1300 INFO  (jsonrpc/6) [api.virt] START snapshot(snapDrives=[{u'baseVolumeID': u'452916ee-5837-455a-bfd0-f9a236ab37f6', u'domainID': u'0b0107b2-9d2a-4871-b4d6-76bdffba84cd', u'volumeID': u'1da1d6d6-e9a1-43ac-9882-4e787961f4ea', u'imageID': u'5cbd57df-bf57-4fd3-a209-039e9e8d9629'}], snapMemory=None, frozen=False) from=::ffff:10.1.224.10,52876, flow_id=4b9931a3-c789-4af4-be9c-9bec8b6b8ba2, vmId=67711673-32a1-4803-873c-bfd029cfd1ca (api:46)

2018-11-26 03:35:36,399+1300 INFO  (jsonrpc/6) [api.virt] FINISH snapshot return={'status': {'message': 'Done', 'code': 0}, 'quiesce': False} from=::ffff:10.1.224.10,52876, flow_id=4b9931a3-c789-4af4-be9c-9bec8b6b8ba2, vmId=67711673-32a1-4803-873c-bfd029cfd1ca (api:52)

Here comes the second snapshot:

2018-11-26 03:35:38,562+1300 INFO  (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{u'baseVolumeID': u'452916ee-5837-455a-bfd0-f9a236ab37f6', u'domainID': u'0b0107b2-9d2a-4871-b4d6-76bdffba84cd', u'volumeID': u'1da1d6d6-e9a1-43ac-9882-4e787961f4ea', u'imageID': u'5cbd57df-bf57-4fd3-a209-039e9e8d9629'}], snapMemory=None, frozen=False) from=::ffff:10.1.224.10,52876, flow_id=4b9931a3-c789-4af4-be9c-9bec8b6b8ba2, vmId=67711673-32a1-4803-873c-bfd029cfd1ca (api:46)

2018-11-26 03:35:38,563+1300 INFO  (jsonrpc/4) [virt.vm] (vmId='67711673-32a1-4803-873c-bfd029cfd1ca') <?xml version='1.0' encoding='utf-8'?>
<domainsnapshot><disks /></domainsnapshot> (vm:4600)

And it fails:

2018-11-26 03:35:38,602+1300 INFO  (jsonrpc/4) [api.virt] FINISH snapshot return={'status': {'message': 'Snapshot failed', 'code': 48}} from=::ffff:10.1.224.10,52876, flow_id=4b9931a3-c789-4af4-be9c-9bec8b6b8ba2, vmId=67711673-32a1-4803-873c-bfd029cfd1ca (api:52)

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.5-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

How reproducible:
Unknown

Steps to Reproduce:
Unknown

Actual results:
VM using stale LV, data corruption

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-12-12 13:22:44 UTC

There is a single POST to snapshots of this VM via API, just in case...

[26/Nov/2018:03:35:15 +1300] "POST /ovirt-engine/api/vms/67711673-32a1-4803-873c-bfd029cfd1ca/snapshots HTTP/1.1" 201 13478

(Originally by Germano Veit Michel)

Comment 5 RHV bug bot 2018-12-12 13:22:45 UTC

No progress, but just to document... 

1) I tried to send a POST to /snapshots with the same disk duplicated as below, the engine fails with Internal Sever Error. It does not trigger this bug.

<snapshot>
   <description>test2</description>
   <persist_memorystate>false</persist_memorystate>
   <disk_attachments>
      <disk_attachment>
         <disk id="0764dcb3-8b21-4cda-9899-5d18ecefc67f"/>
      </disk_attachment>
      <disk_attachment>
         <disk id="0764dcb3-8b21-4cda-9899-5d18ecefc67f"/>
      </disk_attachment>
  </disk_attachments>
</snapshot>

So I'm not sure this could have been triggered by an incorrect API request.

2) Also, the SnapshotVDS command seems to depend on what is returned here, it doesn't look duplicate:

engine=> select count(*) from all_disks_for_vms  where vm_id = '67711673-32a1-4803-873c-bfd029cfd1ca';
 count 
-------
     1

(Originally by Germano Veit Michel)

Comment 7 RHV bug bot 2018-12-12 13:22:54 UTC

Following an offline conversation, this issue was reported here - bug 1640225 (upstream).

(Originally by Elad Ben Aharon)

Comment 8 RHV bug bot 2018-12-12 13:22:56 UTC

Germano, did you see any disconnection from vdsm or any other error since the first
snapshot request was sent?

This may be also an infra issue, if infra try to resend the same command twice.

Martin, do we have protection from sending the same command twice?

On storage side we may also need to have protection from receiving response more
then once. If we sent a command and received a response, the command should change
the internal state so another response for the same command should be dropped.

(Originally by Nir Soffer)

Comment 9 RHV bug bot 2018-12-12 13:22:57 UTC

(In reply to Nir Soffer from comment #8)
> Germano, did you see any disconnection from vdsm or any other error since
> the first
> snapshot request was sent?

Hi Nir,

No, not on these logs and also not on BZ1640225 which is a very similar issue. The difference between this and BZ1640225 is that the latter is during a LSM, so the parent of CreateSnapshotForVm command is LiveMigrateDisk.

(Originally by Germano Veit Michel)

Comment 10 RHV bug bot 2018-12-12 13:22:58 UTC

I haven't dug too deep yet, but given 2 different threads executed each of the SnapshotVDS commands, the only explanation coming to mind is that both of them ran the CreateSnapshotForVmCommand#performNextOperation, which might be due to some race condition in the polling stage, causing two scheduling threads to see the command as "SUCCEEDED" at the same time, and execute its "next operation" at the same time.

Though it seems kind of strange given there was a 6 second delay between each snapshot attempt, but this could be due to the system being under load

(Originally by Benny Zlotnik)

Comment 17 Germano Veit Michel 2018-12-13 00:48:51 UTC

Eyal, thanks for the fix.

Is this in good shape so we can build a 4.2.7-5 with this patch on top of it as a hotfix for the customer? 
There are no additional patches expected right?

If we have green light regarding this patch, I will discuss with the customer regarding the hotfix option.

Thanks!

Comment 19 Eyal Edri 2018-12-16 17:00:02 UTC

Added to the Errata

Comment 21 Eyal Shenitzky 2018-12-18 05:02:57 UTC

(In reply to Germano Veit Michel from comment #17)
> Eyal, thanks for the fix.
> 
> Is this in good shape so we can build a 4.2.7-5 with this patch on top of it
> as a hotfix for the customer? 
> There are no additional patches expected right?
> 
> If we have green light regarding this patch, I will discuss with the
> customer regarding the hotfix option.
> 
> Thanks!

No, this is the only patch.
It is already merged and ready.

Sorry for the late response

Comment 22 Elad 2019-01-16 13:50:24 UTC

Moving to VERIFIED based on latest 4.2.8 regression cycles results.
Also executed few times the test case (TestCase21907) that this bug was originally discovered by.
The issue wasn't seen again.

Used:
vdsm-4.20.46-1.el7ev.x86_64
ovirt-engine-4.2.8.2-0.1.el7

Comment 24 errata-xmlrpc 2019-01-22 12:44:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121

Note You need to log in before you can comment on or make changes to this bug.