Bug 1032807 - TryBackToAllSnapshotsOfVm threw NullPointerException during snapshot-preview because of random disk attached to VM
Summary: TryBackToAllSnapshotsOfVm threw NullPointerException during snapshot-preview because of random disk attached to VM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Liron Aravot
QA Contact: Gadi Ickowicz
URL:
Whiteboard: storage
Depends On:
Blocks: 3.3snap4 1067344
 
Reported: 2013-11-20 21:36 UTC by Derrick Ornelas
Modified: 2018-12-04 16:25 UTC
CC List: 17 users

Fixed In Version: is26
Doc Type: Bug Fix
Doc Text:
Previously, the snapshot manager attempted to remove any disk device that was not part of a snapshot, even when the device was not a disk known to the engine. Now, if a device is not a disk known to the engine, it is not removed, similar to the behavior of disks that do not support snapshots.
Clone Of:
: 1067344 (view as bug list)
Environment:
Last Closed: 2014-01-21 17:39:31 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2014:0038 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.3.0 update 2014-01-21 22:03:06 UTC
oVirt gerrit 21654 0 None MERGED core: SnapshotManager - disk devices might not represent engine disks 2020-12-03 16:48:17 UTC
oVirt gerrit 21802 0 None MERGED core: further handling of unmanaged disk devices 2020-12-03 16:48:17 UTC

Description Derrick Ornelas 2013-11-20 21:36:22 UTC
Description of problem:

snapshot-preview fails:

2013-11-04 17:04:43,572 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-50) EntityAsyncTask::EndCommandAction [within thread]: EndAction for
action type TryBackToAllSnapshotsOfVm threw an exception: javax.ejb.EJBException: java.lang.NullPointerException
[...]
Caused by: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.removeDisksNotInSnapshot(SnapshotsManager.java:411) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.synchronizeDisksFromSnapshot(SnapshotsManager.java:398) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.attempToRestoreVmConfigurationFromSnapshot(SnapshotsManager.java:247) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand.restoreVmConfigFromSnapshot(TryBackToAllSnapshotsOfVmCommand.java:103) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand.endSuccessfully(TryBackToAllSnapshotsOfVmCommand.java:87) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:537) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:484) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1521) [engine-bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:141) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:110) [engine-utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:430) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]


ovirt-engine/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/SnapshotsManager.java
---
401     /**
402      * Remove all the disks which are allowed to be snapshot but not exist in the snapshot.
403      * @param vmId - The vm id which is being snapshot.
404      * @param diskIdsFromSnapshot - An image group id list for images which are part of the VM.
405      */
406     private void removeDisksNotInSnapshot(Guid vmId, List<Guid> diskIdsFromSnapshot) {
407         for (VmDevice vmDevice : getVmDeviceDao().getVmDeviceByVmIdTypeAndDevice(
408                 vmId, VmDeviceType.DISK.getName(), VmDeviceType.DISK.getName())) {
409             if (!diskIdsFromSnapshot.contains(vmDevice.getDeviceId())) {
410                 Disk disk = getDiskDao().get(vmDevice.getDeviceId());
411                 if (disk.isAllowSnapshot()) {
412                     getBaseDiskDao().remove(vmDevice.getDeviceId());
413                     getVmDeviceDao().remove(vmDevice.getId());
414                 }
415             }
416         }
417     }
---
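
The NPE at line 411 happens because getDiskDao().get() returns null for a device_id that has no row in the engine's disks table, and disk.isAllowSnapshot() is then invoked on that null. Below is a minimal sketch of the kind of null guard that would avoid it (illustration only, not necessarily the actual patch; see the gerrit changes linked above):

---
    // Hypothetical sketch only -- not the merged patch.
    private void removeDisksNotInSnapshot(Guid vmId, List<Guid> diskIdsFromSnapshot) {
        for (VmDevice vmDevice : getVmDeviceDao().getVmDeviceByVmIdTypeAndDevice(
                vmId, VmDeviceType.DISK.getName(), VmDeviceType.DISK.getName())) {
            if (!diskIdsFromSnapshot.contains(vmDevice.getDeviceId())) {
                Disk disk = getDiskDao().get(vmDevice.getDeviceId());
                // A device unknown to the engine has no row in the disks table,
                // so the DAO returns null; skip it rather than dereferencing it.
                if (disk != null && disk.isAllowSnapshot()) {
                    getBaseDiskDao().remove(vmDevice.getDeviceId());
                    getVmDeviceDao().remove(vmDevice.getId());
                }
            }
        }
    }
---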

Here are the snapshots:

engine=> select snapshot_id,snapshot_type,status,description,creation_date,_create_date,_update_date from snapshots where vm_id =
'5adbadf8-b090-4ae8-818b-2d34244ec895'\x\g\x
Expanded display is on.
-[ RECORD 1 ]-+-------------------------------------
snapshot_id   | e174b341-c5cf-405a-a835-960c829c8441
snapshot_type | REGULAR
status        | OK
description   | EN BLANCO
creation_date | 2013-11-01 15:30:43.709-04
_create_date  | 2013-11-01 15:30:43.648312-04
_update_date  |
-[ RECORD 2 ]-+-------------------------------------
snapshot_id   | 194c6163-3e9f-40b2-a93f-52e455aaa628
snapshot_type | PREVIEW
status        | LOCKED
description   | Active VM before the preview
creation_date | 2013-11-04 15:54:31.709-05
_create_date  | 2013-11-04 15:54:31.648832-05
_update_date  |
-[ RECORD 3 ]-+-------------------------------------
snapshot_id   | 0ef11d8a-52b0-40c3-b864-72b207412c72
snapshot_type | ACTIVE
status        | OK
description   | Active VM
creation_date | 2013-11-04 15:54:31.72-05
_create_date  | 2013-11-04 15:54:31.648832-05
_update_date  |


And the disks/volumes:

engine=> select image_guid,parentid,image_group_id,vm_snapshot_id from images where image_group_id in (select device_id from vm_device where vm_id =
'5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk');
              image_guid              |               parentid               |            image_group_id            |            vm_snapshot_id
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------
 10f9a791-8b7f-470b-859f-3dec2ea58309 | 7beb1725-5394-441a-a331-df39c7078e5f | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | 194c6163-3e9f-40b2-a93f-52e455aaa628
 d75de395-1ad7-480d-adde-33630f8a49f5 | 7beb1725-5394-441a-a331-df39c7078e5f | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | 0ef11d8a-52b0-40c3-b864-72b207412c72
 7beb1725-5394-441a-a331-df39c7078e5f | 00000000-0000-0000-0000-000000000000 | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | e174b341-c5cf-405a-a835-960c829c8441
(3 rows)



My understanding is that we're failing here because the VM has a second disk attached for some unknown reason:


engine=> select * from vm_device where vm_id = '5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk'\x\g\x
Expanded display is on.
-[ RECORD 1 ]+-------------------------------------------------------------
device_id    | eb95f080-1458-41ec-b4f5-a3b691c9d98d
vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
type         | disk
device       | disk
address      | {bus=0x00, domain=0x0000, type=pci, slot=0x06, function=0x0}
boot_order   | 0
spec_params  | { }
is_managed   | f
is_plugged   | t
is_readonly  | f
_create_date | 2013-11-01 16:38:01.562009-04
_update_date |
alias        | virtio-disk701
-[ RECORD 2 ]+-------------------------------------------------------------
device_id    | 700be63c-f2dd-4403-a342-f8b88ec7ba18
vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
type         | disk
device       | disk
address      | {unit=0, bus=0, target=0, controller=0, type=drive}
boot_order   | 0
spec_params  | { }
is_managed   | t
is_plugged   | t
is_readonly  | f
_create_date | 2013-11-01 13:57:31.389081-04
_update_date | 2013-11-01 16:38:01.548234-04
alias        | ide0-0-0


This disk, eb95f080-1458-41ec-b4f5-a3b691c9d98d, has no volumes in the database nor on the physical storage.  The _create_date timestamp is ~2.75 hours after
the VM was created, but there's not a single reference to this device_id value in the engine or SPM logs, nor in the logs of the host that was running the VM.



Version-Release number of selected component (if applicable):
rhevm-3.2.2-0.41.el6ev



How reproducible:

I have not been able to reproduce it internally, and the customer has only encountered this once.



Steps to reproduce:

1.  Create VM
2.  Add nic1
3.  Add nic2
4.  Add IDE disk
5.  Start VM
6.  Attach ISO
7.  Create live snapshot
8.  Eject ISO
9.  Attach ISO
10. Shutdown VM
11. Initiate snapshot-preview


Actual results:

snapshot-preview fails


Expected results:

snapshot-preview action succeeds and VM can be started to preview snapshot state

Comment 6 Federico Simoncelli 2013-11-22 14:13:21 UTC
(In reply to Derrick Ornelas from comment #0)
> engine=> select * from vm_device where vm_id =
> '5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk'\x\g\x
> Expanded display is on.
> -[ RECORD 1 ]+-------------------------------------------------------------
> device_id    | eb95f080-1458-41ec-b4f5-a3b691c9d98d
> vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
> type         | disk
> device       | disk
> address      | {bus=0x00, domain=0x0000, type=pci, slot=0x06, function=0x0}
> boot_order   | 0
> spec_params  | { }
> is_managed   | f
> is_plugged   | t
> is_readonly  | f
> _create_date | 2013-11-01 16:38:01.562009-04
> _update_date |
> alias        | virtio-disk701
> -[ RECORD 2 ]+-------------------------------------------------------------
> device_id    | 700be63c-f2dd-4403-a342-f8b88ec7ba18
> vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
> type         | disk
> device       | disk
> address      | {unit=0, bus=0, target=0, controller=0, type=drive}
> boot_order   | 0
> spec_params  | { }
> is_managed   | t
> is_plugged   | t
> is_readonly  | f
> _create_date | 2013-11-01 13:57:31.389081-04
> _update_date | 2013-11-01 16:38:01.548234-04
> alias        | ide0-0-0
> 
> 
> This disk, eb95f080-1458-41ec-b4f5-a3b691c9d98d, has no volumes in the
> database nor on the physical storage.  The _create_date timestamp is ~2.75
> hours after the VM was created, but there's not a single reference to this
> device_id value in the engine or SPM logs, nor in the logs of the host that
> was running the VM.

At ~2013-11-01 16:38:01:

2013-11-01 16:38:01,489 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (QuartzScheduler_Worker-20) START, FullListVdsCommand( HostId = 5ade55e9-6ccc-426a-bd4d-4bfc4b84eead, vds=null, vmIds=[5adbadf8-b090-4ae8-818b-2d34244ec895]), log id: 3bcf9b80
2013-11-01 16:38:01,516 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (QuartzScheduler_Worker-20) FINISH, FullListVdsCommand, return: [Lorg.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcStruct;@3aea0f2d, log id: 3bcf9b80

Following FullListVdsCommand > getVmInfo > updateVmDevices > processVmDevices, it seems that the disk could have been created/added there (addNewVmDevice):

        for (Object o : (Object[]) vm.get(VdsProperties.Devices)) {
            Map device = (Map<String, Object>) o;
            if (device.get(VdsProperties.Address) == null) {
                logDeviceInformation(vmId, device);
                continue;
            }

            Guid deviceId = getDeviceId(device);
            VmDevice vmDevice = deviceMap.get(new VmDeviceId(deviceId, vmId));
            if (deviceId == null || vmDevice == null) {
                deviceId = addNewVmDevice(vmId, device);
            } else {
                vmDevice.setAddress(((Map<String, String>) device.get(VdsProperties.Address)).toString());
                vmDevice.setAlias(StringUtils.defaultString((String) device.get(VdsProperties.Alias)));
                addVmDeviceToList(vmDevice);
            }

            processedDevices.add(deviceId);
        }

(note that addNewVmDevice generates a new deviceId)

To understand what happened we need the VDSM response to FullListVdsCommand (list).

Comment 7 Liron Aravot 2013-11-24 08:58:30 UTC
After the VM is created, the user attempts to run it.
As we can see, when the user attempts to run the VM there is a hook which I suspect leads to the final issue:

Thread-2681917::DEBUG::2013-11-01 17:57:41,230::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd' (cwd None)


It can be seen that the XML passed to libvirt contains that "disk" whose source is "/dev/shm/vhostmd0" (matching the hook name):

Thread-2681917::DEBUG::2013-11-01 17:57:41,440::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-2681917::INFO::2013-11-01 17:57:41,441::hooks::76::root::(_runHooksDir) 
Thread-2681917::DEBUG::2013-11-01 17:57:41,446::libvirtvm::1509::vm.Vm::(_run) vmId=`5adbadf8-b090-4ae8-818b-2d34244ec895`::<?xml version="1.0" encoding="utf-8"?><domain typ
e="kvm">
        <name>voipswitch_01</name>

....

        <disk device="disk" type="file"><source file="/dev/shm/vhostmd0"/><target bus="virtio" dev="vdzz"/><readonly/></disk></devices>



After the VM is running, that device is added to the engine by the VdsUpdateRuntimeInfo flow, since it is unknown to the engine. As can be seen, because its type is disk, it is added as a disk device in the engine.

Later on, the preview-snapshot flow assumes that all the disk devices in the engine DB are actually existing disks, and therefore an NPE occurs.

We need to find out:
1. What happens in the hook: why is that device passed as a disk to libvirt? I suspect the hook causes it; please verify what's going on there.

2. Regardless of (1), in the engine the assumption that disk devices always correspond to disks that exist in the engine should be fixed (see the sketch below), though if (1) is solved it's less "urgent" for this case (it will be fixed anyway).
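
A rough illustration of (2): the vm_device records shown above carry an is_managed flag, so the snapshot flows could also skip unmanaged devices up front. A hypothetical sketch, assuming VmDevice exposes that flag through a getIsManaged() accessor (an assumption for illustration, not a confirmed API):

---
        // Hypothetical sketch -- not the merged patch. Filter out devices the
        // engine does not manage (is_managed = f in vm_device, e.g. the
        // hook-injected disk) before treating them as engine disks.
        for (VmDevice vmDevice : diskDevices) {
            if (!vmDevice.getIsManaged()) {
                continue; // unmanaged device: never look it up in the disks table
            }
            // ... existing handling for managed disk devices ...
        }
---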


*Note that there is a 4-hour difference between the engine and VDSM logs.
Relevant logs:
hsm - vdsm.log.6.xz
engine - engine.log.1-20131104_033102_0088.xz

Comment 9 Liron Aravot 2013-11-25 16:29:49 UTC
Derrick,
can we somehow manage without having this property set in order to fix the issue for the customer, or is it mandatory for their usage?


Regardless, the issue was fixed upstream and now I'm backporting it to the 3.3 branch, then possibly to z-stream.

Comment 14 Liron Aravot 2013-11-28 12:24:34 UTC
Aharon,
as Marina wrote:
I'd add unmanaged disk devices and test the preview/restore/run-stateless flows.

It could be done with a hook, as the customer did, or by manually adding devices to the vm_device table.

Comment 15 Liron Aravot 2013-11-28 12:30:19 UTC
Derrick, thanks.
The fix that we applied basically removes the assumption that any disk device is a disk known to the engine.

Note that, as with other unmanaged devices, if you take a snapshot while this unknown disk device is attached, then when you preview that snapshot you'll have this device back for the next VM run.

Regardless of this bug, we need to look into that script: it seems the same device would potentially be added on each "regular" VM run, as it's always added to the XML (we need to see how libvirt/VDSM act on it), and it might also be sent by the engine.

Comment 16 Gadi Ickowicz 2014-01-01 07:26:52 UTC
Verified on is29.

Used the hook to add a disk to the VM that is not managed by the engine:

Thread-41794::DEBUG::2014-01-01 09:18:04,795::utils::489::root::(execCmd) '/usr/libexec/vdsm/hooks/before_vm_start/the_hook' (cwd None)
Thread-41794::DEBUG::2014-01-01 09:18:04,832::utils::509::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-41794::INFO::2014-01-01 09:18:04,833::hooks::88::root::(_runHooksDir) 


Started the vm:
Thread-41794::DEBUG::2014-01-01 09:18:04,833::vm::3031::vm.Vm::(_run) vmId=`4b4cd142-0339-4520-83a6-a79f1f282166`::<?xml version="1.0" encoding="utf-8"?><domain type="kvm">
        <name>the_vm</name>
        <uuid>4b4cd142-0339-4520-83a6-a79f1f282166</uuid>
.....
        <disk device="disk" type="file"><source file="/tmp/the_disk"/><target bus="virtio" dev="vdzz"/><readonly/></disk></devices>


Took a live snapshot and then was able to preview/commit the snapshot:

2014-01-01 09:21:25,886 INFO  [org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand] (ajp-/127.0.0.1:8702-3) [139b8a69] Lock Acquired to object EngineLock [exclusiveLocks=
 key: 4b4cd142-0339-4520-83a6-a79f1f282166 value: VM
, sharedLocks= ]
2014-01-01 09:21:25,910 INFO  [org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand] (ajp-/127.0.0.1:8702-3) [139b8a69] Running command: TryBackToAllSnapshotsOfVmCommand i
nternal: false. Entities affected :  ID: 4b4cd142-0339-4520-83a6-a79f1f282166 Type: VM
<.... snip.....>
70 (Parent Command TryBackToAllSnapshotsOfVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2014-01-01 09:21:31,251 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-88) BaseAsyncTask::OnTaskEndSuccess: Task 931b0761-0dd0-4a9a-a969-daae139d9f70 (Parent Command TryBackToAllSnapshotsOfVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.

Comment 17 errata-xmlrpc 2014-01-21 17:39:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

