+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1526260 +++
======================================================================

Description of problem:

This happened multiple times on a customer environment, affecting several important VMs and causing an outage.

When previewing a snapshot for VM A, CreateSnapshotVDSCommand is sent for the disks of a completely unrelated VM B. Both VMs are affected and won't start without manual intervention. The result is that both VMs are down and the DB is inconsistent:

* VM B now has an extra active = 't' image and shows duplicate disks in the Disks tab.
* VM B cannot start (two images of the same chain are set with active = 't' in the DB).
* VM A cannot start; it is stuck in ImageLocked (the command failed with an NPE at the end).
* The preview failed, but the leaf image was not removed.

1. VM 4035e61c-7ef5-49be-b462-c4f994548bb8 has the following disks:

             image_group_id
   --------------------------------------
    304ef195-99ba-4ec4-ab19-b4dc068e4551
    5cc0063b-02b6-4d25-9c57-05a0f0749607

2. On previewing a snapshot of VM 4035e61c-7ef5-49be-b462-c4f994548bb8, the engine does this:

START, SetVmStatusVDSCommand(
    vmId='4035e61c-7ef5-49be-b462-c4f994548bb8',
    status='ImageLocked',
    exitStatus='Normal'})

START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{
    storageDomainId='0b0107b2-9d2a-4871-b4d6-76bdffba84cd',
    imageGroupId='91ac4bc4-d843-4849-b7dc-e9959a26f83d',   <----- ??? this is from another VM
    newImageId='29db02bf-ebd8-4ac9-8b93-a337486fdc0a',
    imageId='62011c11-c268-475a-a6d6-a86f30329927',
    sourceImageGroupId='91ac4bc4-d843-4849-b7dc-e9959a26f83d'})

ERROR [org.ovirt.engine.core.bll.snapshots.TryBackToSnapshotCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] Can't find image to update to active 'false', snapshot type 'PREVIEW', original image id '43b29055-fd4e-459a-8e01-1cc74ea70648'

2017-12-12 22:39:46,612+13 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-18) [] [within thread]: endAction for action type TryBackToAllSnapshotsOfVm threw an exception.: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.synchronizeDisksFromSnapshot(SnapshotsManager.java:727) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.attempToRestoreVmConfigurationFromSnapshot(SnapshotsManager.java:520) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.TryBackToAllSnapshotsOfVmCommand.restoreVmConfigFromSnapshot(TryBackToAllSnapshotsOfVmCommand.java:157) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.TryBackToAllSnapshotsOfVmCommand.endSuccessfully(TryBackToAllSnapshotsOfVmCommand.java:136) [bll.jar:]

Version-Release number of selected component (if applicable):
rhevm-4.1.6.2-0.1.el7.noarch

How reproducible:
0%

Additional info:
The customer reports UI exceptions at the time this happened, and I can see some UI exceptions with snapshot commands in the traceback. Could this be triggered by a misbehaving UI sending incorrect IDs? The browser was Safari.

(Originally by Germano Veit Michel)
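A quick way to check whether an environment has already ended up in this broken state is to query the engine database for disks with more than one active image. A minimal sketch using psycopg2, assuming the default "engine" database and the images table with the image_group_id and active columns shown in the output above; connection parameters are placeholders:

#!/usr/bin/env python
# Sketch: list image groups (disks) that have more than one image with
# active = 't' -- the inconsistent state described above.
# Connection parameters are placeholders; adjust for the actual setup.
import psycopg2

conn = psycopg2.connect(dbname="engine", user="engine",
                        password="...", host="localhost")
cur = conn.cursor()
cur.execute("""
    SELECT image_group_id, count(*) AS active_images
      FROM images
     WHERE active = 't'
  GROUP BY image_group_id
    HAVING count(*) > 1
""")
for image_group_id, active_images in cur.fetchall():
    print("disk %s has %d active images" % (image_group_id, active_images))
cur.close()
conn.close()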
Forgot this: rhevm-4.1.6.2-0.1.el7.noarch (Originally by Germano Veit Michel)
Just finished the analysis for 2 VMs. Here are some more detailed logs; the ones in comment #0 are incomplete, as 3 disks were involved, not 1.

Previewing a snapshot of VM 4035e61c-7ef5-49be-b462-c4f994548bb8 ends up snapshotting all 3 disks of VM 793e01cb-077d-4768-8491-464d0b8d80b8:

2017-12-12 22:33:51,869+13 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] START, SetVmStatusVDSCommand( SetVmStatusVDSCommandParameters:{runAsync='true', vmId='4035e61c-7ef5-49be-b462-c4f994548bb8', status='ImageLocked', exitStatus='Normal'}), log id: 7dffada0

engine.log-20171213:2017-12-12 22:33:52,100+13 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='58117215-0261-0167-0311-0000000002b6', ignoreFailoverLimit='false', storageDomainId='0b0107b2-9d2a-4871-b4d6-76bdffba84cd', imageGroupId='91ac4bc4-d843-4849-b7dc-e9959a26f83d', imageSizeInBytes='42949672960', volumeFormat='COW', newImageId='29db02bf-ebd8-4ac9-8b93-a337486fdc0a', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='62011c11-c268-475a-a6d6-a86f30329927', sourceImageGroupId='91ac4bc4-d843-4849-b7dc-e9959a26f83d'}), log id: 3ba2e87d

engine.log-20171213:2017-12-12 22:34:00,160+13 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='58117215-0261-0167-0311-0000000002b6', ignoreFailoverLimit='false', storageDomainId='b391474a-566e-47ca-b5b5-5bb902ae9fd7', imageGroupId='01e88f43-249c-4e07-a3e1-8986e8a178d1', imageSizeInBytes='1099511627776', volumeFormat='COW', newImageId='d149e890-d29f-4294-a0d6-84828655f2a5', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='0cf30697-ae86-419e-8f6c-f85c46546d46', sourceImageGroupId='01e88f43-249c-4e07-a3e1-8986e8a178d1'}), log id: 2d2f6ab5

engine.log-20171213:2017-12-12 22:34:01,305+13 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='58117215-0261-0167-0311-0000000002b6', ignoreFailoverLimit='false', storageDomainId='0b0107b2-9d2a-4871-b4d6-76bdffba84cd', imageGroupId='6bc3f4b5-ef35-49c0-a5b3-1bb6c515a700', imageSizeInBytes='161061273600', volumeFormat='COW', newImageId='67afa840-528b-4d79-9070-7fdf139e59f0', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='43b29055-fd4e-459a-8e01-1cc74ea70648', sourceImageGroupId='6bc3f4b5-ef35-49c0-a5b3-1bb6c515a700'}), log id: 5cecd0ac

engine.log-20171213:2017-12-12 22:33:59,913+13 ERROR [org.ovirt.engine.core.bll.snapshots.TryBackToSnapshotCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] Can't find image to update to active 'false', snapshot type 'PREVIEW', original image id '62011c11-c268-475a-a6d6-a86f30329927'

engine.log-20171213:2017-12-12 22:34:01,109+13 ERROR [org.ovirt.engine.core.bll.snapshots.TryBackToSnapshotCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] Can't find image to update to active 'false', snapshot type 'PREVIEW', original image id '0cf30697-ae86-419e-8f6c-f85c46546d46'

engine.log-20171213:2017-12-12 22:34:02,600+13 ERROR [org.ovirt.engine.core.bll.snapshots.TryBackToSnapshotCommand] (default task-9) [80c29919-3826-4ed4-b8a0-d0863e61ca2c] Can't find image to update to active 'false', snapshot type 'PREVIEW', original image id '43b29055-fd4e-459a-8e01-1cc74ea70648'

(Originally by Germano Veit Michel)
Fred, I have a vague recollection of you handling a "duplicate disks" issue, although I'm not sure if this is the same issue or not. Could you take a look, please? Thanks! (Originally by Allon Mureinik)
Hi, I could not find the root cause of this problem; maybe the UI exceptions messed with the parameters that were sent. In any case, I can reproduce it via the REST API by providing a disk from another VM in the restore request. I will add validation in the command to avoid this issue. (Originally by Fred Rolland)
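Conceptually, the validation only needs to refuse any disk in the preview request that is not attached to the VM being previewed. An illustrative sketch of that check in Python (the real fix belongs in the engine's Java command validation; the function and names here are hypothetical):

# Hypothetical sketch: reject a preview/restore request that references
# disks which are not attached to the target VM.
def validate_disks_attached_to_vm(vm_disk_ids, requested_disk_ids):
    not_attached = set(requested_disk_ids) - set(vm_disk_ids)
    if not_attached:
        raise ValueError("vm disk snapshot not attached to vm: %s"
                         % ", ".join(sorted(not_attached)))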
Hi Fred, that sounds good enough to prevent what the customer hit, and it will also stop any incorrect API usage from breaking VMs. Can you still get this into 4.1.z? We don't really know what caused the issue (likely the misbehaving UI on Safari?), and the customer is a bit worried it may happen again (though I think that's unlikely). Thank you. (Originally by Germano Veit Michel)
Verified using:
ovirt-engine-4.1.9.1-0.1.el7.noarch
vdsm-4.19.45-1.el7ev.x86_64

Scenario tested:
1. Created 2 VMs with snapshots.
2. Tried to preview a snapshot of VM1 with a disk from VM2 using the REST API.

Result: the operation is blocked.

Method: POST
URL: https://ovirt-engine-url/ovirt-engine/api/vms/vm-id/previewsnapshot
Body:

<action>
    <async>false</async>
    <disks>
        <disk id="disk-of-second-vm">
            <image_id>image-id-of-second-vm</image_id>
            <snapshot id="snapshot-id-of-snapshot-in-vm1-or-vm2"/>
        </disk>
    </disks>
    <restore_memory>false</restore_memory>
    <snapshot id="snapshot-id-of-vm1"/>
</action>

Output:
...
<fault>
    <detail>[action type failed vm disk snapshot not attached to vm]</detail>
    <reason>Operation Failed</reason>
</fault>
...
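For reference, the same negative test can be driven from the Python SDK instead of raw REST. A minimal sketch using ovirtsdk4, assuming SDK version 4; the URL, credentials, and all IDs are placeholders, and the engine is expected to block the call with the same "not attached to vm" failure:

# Sketch of the verification call via the oVirt Python SDK (ovirtsdk4).
# All connection details and IDs below are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://ovirt-engine-url/ovirt-engine/api',
    username='admin@internal',
    password='...',
    insecure=True,
)
vm_service = connection.system_service().vms_service().vm_service('vm1-id')
try:
    vm_service.preview_snapshot(
        snapshot=types.Snapshot(id='snapshot-id-of-vm1'),
        restore_memory=False,
        disks=[
            types.Disk(
                id='disk-of-second-vm',
                image_id='image-id-of-second-vm',
                snapshot=types.Snapshot(id='snapshot-id-of-snapshot-in-vm1-or-vm2'),
            ),
        ],
    )
except sdk.Error as e:
    # Expected: blocked with "vm disk snapshot not attached to vm".
    print('Blocked as expected: %s' % e)
finally:
    connection.close()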
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0135