Description of problem:

snapshot-preview fails:

2013-11-04 17:04:43,572 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-50) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type TryBackToAllSnapshotsOfVm threw an exception: javax.ejb.EJBException: java.lang.NullPointerException
[...]
Caused by: java.lang.NullPointerException
	at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.removeDisksNotInSnapshot(SnapshotsManager.java:411) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.synchronizeDisksFromSnapshot(SnapshotsManager.java:398) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.snapshots.SnapshotsManager.attempToRestoreVmConfigurationFromSnapshot(SnapshotsManager.java:247) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand.restoreVmConfigFromSnapshot(TryBackToAllSnapshotsOfVmCommand.java:103) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand.endSuccessfully(TryBackToAllSnapshotsOfVmCommand.java:87) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:537) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:484) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1521) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:141) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:110) [engine-utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:430) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]
ovirt-engine/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/SnapshotsManager.java
---
401    /**
402     * Remove all the disks which are allowed to be snapshot but not exist in the snapshot.
403     * @param vmId - The vm id which is being snapshot.
404     * @param diskIdsFromSnapshot - An image group id list for images which are part of the VM.
405     */
406    private void removeDisksNotInSnapshot(Guid vmId, List<Guid> diskIdsFromSnapshot) {
407        for (VmDevice vmDevice : getVmDeviceDao().getVmDeviceByVmIdTypeAndDevice(
408                vmId, VmDeviceType.DISK.getName(), VmDeviceType.DISK.getName())) {
409            if (!diskIdsFromSnapshot.contains(vmDevice.getDeviceId())) {
410                Disk disk = getDiskDao().get(vmDevice.getDeviceId());
411                if (disk.isAllowSnapshot()) {
412                    getBaseDiskDao().remove(vmDevice.getDeviceId());
413                    getVmDeviceDao().remove(vmDevice.getId());
414                }
415            }
416        }
417    }
---

Here are the snapshots:

engine=> select snapshot_id,snapshot_type,status,description,creation_date,_create_date,_update_date from snapshots where vm_id = '5adbadf8-b090-4ae8-818b-2d34244ec895'\x\g\x
Expanded display is on.
-[ RECORD 1 ]-+-------------------------------------
snapshot_id   | e174b341-c5cf-405a-a835-960c829c8441
snapshot_type | REGULAR
status        | OK
description   | EN BLANCO
creation_date | 2013-11-01 15:30:43.709-04
_create_date  | 2013-11-01 15:30:43.648312-04
_update_date  |
-[ RECORD 2 ]-+-------------------------------------
snapshot_id   | 194c6163-3e9f-40b2-a93f-52e455aaa628
snapshot_type | PREVIEW
status        | LOCKED
description   | Active VM before the preview
creation_date | 2013-11-04 15:54:31.709-05
_create_date  | 2013-11-04 15:54:31.648832-05
_update_date  |
-[ RECORD 3 ]-+-------------------------------------
snapshot_id   | 0ef11d8a-52b0-40c3-b864-72b207412c72
snapshot_type | ACTIVE
status        | OK
description   | Active VM
creation_date | 2013-11-04 15:54:31.72-05
_create_date  | 2013-11-04 15:54:31.648832-05
_update_date  |

And disks/volumes:

engine=> select image_guid,parentid,image_group_id,vm_snapshot_id from images where image_group_id in (select device_id from vm_device where vm_id = '5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk');
              image_guid              |               parentid               |            image_group_id            |            vm_snapshot_id
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------
 10f9a791-8b7f-470b-859f-3dec2ea58309 | 7beb1725-5394-441a-a331-df39c7078e5f | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | 194c6163-3e9f-40b2-a93f-52e455aaa628
 d75de395-1ad7-480d-adde-33630f8a49f5 | 7beb1725-5394-441a-a331-df39c7078e5f | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | 0ef11d8a-52b0-40c3-b864-72b207412c72
 7beb1725-5394-441a-a331-df39c7078e5f | 00000000-0000-0000-0000-000000000000 | 700be63c-f2dd-4403-a342-f8b88ec7ba18 | e174b341-c5cf-405a-a835-960c829c8441
(3 rows)

My understanding is that we're failing here because the VM has a second disk attached for some unknown reason:

engine=> select * from vm_device where vm_id = '5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk'\x\g\x
Expanded display is on.
-[ RECORD 1 ]+-------------------------------------------------------------
device_id    | eb95f080-1458-41ec-b4f5-a3b691c9d98d
vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
type         | disk
device       | disk
address      | {bus=0x00, domain=0x0000, type=pci, slot=0x06, function=0x0}
boot_order   | 0
spec_params  | { }
is_managed   | f
is_plugged   | t
is_readonly  | f
_create_date | 2013-11-01 16:38:01.562009-04
_update_date |
alias        | virtio-disk701
-[ RECORD 2 ]+-------------------------------------------------------------
device_id    | 700be63c-f2dd-4403-a342-f8b88ec7ba18
vm_id        | 5adbadf8-b090-4ae8-818b-2d34244ec895
type         | disk
device       | disk
address      | {unit=0, bus=0, target=0, controller=0, type=drive}
boot_order   | 0
spec_params  | { }
is_managed   | t
is_plugged   | t
is_readonly  | f
_create_date | 2013-11-01 13:57:31.389081-04
_update_date | 2013-11-01 16:38:01.548234-04
alias        | ide0-0-0

This disk, eb95f080-1458-41ec-b4f5-a3b691c9d98d, has no volumes in the database nor on the physical storage. The _create_date timestamp is ~2.75 hours after the VM was created, but there's not a single reference to this device_id value in the engine or SPM logs, nor in the logs of the host that was running the VM.

Version-Release number of selected component (if applicable):
rhevm-3.2.2-0.41.el6ev

How reproducible:
I have not been able to reproduce it internally, and the customer has only encountered this once.

Steps to reproduce:
1. Create VM
2. Add nic1
3. Add nic2
4. Add IDE disk
5. Start VM
6. Attach ISO
7. Create live snapshot
8. Eject ISO
9. Attach ISO
10. Shutdown VM
11. Initiate snapshot-preview

Actual results:
snapshot-preview fails

Expected results:
snapshot-preview action succeeds and VM can be started to preview snapshot state
(In reply to Derrick Ornelas from comment #0)
> engine=> select * from vm_device where vm_id =
> '5adbadf8-b090-4ae8-818b-2d34244ec895' and device = 'disk'\x\g\x
> <.... snip.....>
> This disk, eb95f080-1458-41ec-b4f5-a3b691c9d98d, has no volumes in the
> database nor on the physical storage. The _create_date timestamp is ~2.75
> hours after the VM was created, but there's not a single reference to this
> device_id value in the engine or SPM logs, nor in the logs of the host that
> was running the VM.
At ~2013-11-01 16:38:01:

2013-11-01 16:38:01,489 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (QuartzScheduler_Worker-20) START, FullListVdsCommand( HostId = 5ade55e9-6ccc-426a-bd4d-4bfc4b84eead, vds=null, vmIds=[5adbadf8-b090-4ae8-818b-2d34244ec895]), log id: 3bcf9b80
2013-11-01 16:38:01,516 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (QuartzScheduler_Worker-20) FINISH, FullListVdsCommand, return: [Lorg.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcStruct;@3aea0f2d, log id: 3bcf9b80

Following FullListVdsCommand > getVmInfo > updateVmDevices > processVmDevices, it seems that the disk could have been created/added there (addNewVmDevice):

    for (Object o : (Object[]) vm.get(VdsProperties.Devices)) {
        Map device = (Map<String, Object>) o;
        if (device.get(VdsProperties.Address) == null) {
            logDeviceInformation(vmId, device);
            continue;
        }
        Guid deviceId = getDeviceId(device);
        VmDevice vmDevice = deviceMap.get(new VmDeviceId(deviceId, vmId));
        if (deviceId == null || vmDevice == null) {
            deviceId = addNewVmDevice(vmId, device);
        } else {
            vmDevice.setAddress(((Map<String, String>) device.get(VdsProperties.Address)).toString());
            vmDevice.setAlias(StringUtils.defaultString((String) device.get(VdsProperties.Alias)));
            addVmDeviceToList(vmDevice);
        }
        processedDevices.add(deviceId);
    }

(Note that addNewVmDevice generates a new deviceId.)

To understand what happened, we need the VDSM response to FullListVdsCommand (list).
After the VM is created, the user attempts to run it. As we can see, when the user attempts to run the VM there is a hook which, I suspect, leads to the final issue:

Thread-2681917::DEBUG::2013-11-01 17:57:41,230::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd' (cwd None)

It can be seen that the XML passed to libvirt contains that "disk" whose source is "/dev/shm/vhostmd0" (like the hook name):

Thread-2681917::DEBUG::2013-11-01 17:57:41,440::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-2681917::INFO::2013-11-01 17:57:41,441::hooks::76::root::(_runHooksDir)
Thread-2681917::DEBUG::2013-11-01 17:57:41,446::libvirtvm::1509::vm.Vm::(_run) vmId=`5adbadf8-b090-4ae8-818b-2d34244ec895`::<?xml version="1.0" encoding="utf-8"?><domain type="kvm">
    <name>voipswitch_01</name>
    ....
    <disk device="disk" type="file"><source file="/dev/shm/vhostmd0"/><target bus="virtio" dev="vdzz"/><readonly/></disk></devices>

After the VM is running, that device is added to the engine from the VdsUpdateRuntimeInfo flow, as it's unknown to the engine. As can be seen, since its type is disk, it's added as a disk device in the engine. Later on, the preview snapshot flow assumes that all the disk devices in the engine DB correspond to actually existing disks - therefore an NPE occurs.

We need to find out:
1. What happens in the hook - why is that device passed as a disk to libvirt? I suspect the hook causes it; please verify what's going on there.
2. Regardless of (1), in the engine the assumption that disk devices belong only to disks that exist in the engine should be fixed, though if (1) is solved it's less "urgent" for this case (it will be fixed anyway).

Note that there is a 4-hour difference between the engine and vdsm logs.

Relevant logs:
hsm - vdsm.log.6.xz
engine - engine.log.1-20131104_033102_0088.xz
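To illustrate point (2): at SnapshotsManager.java:411, getDiskDao().get() returns null for the unmanaged hook device because no disk row exists for that device_id, so disk.isAllowSnapshot() dereferences null. Below is a minimal, self-contained sketch of the missing guard - hypothetical class and method names with simplified types, not the actual engine patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Simplified model of the failing loop: an unmanaged vm_device row (e.g. the
// vhostmd device) has no matching disk row, so the DAO lookup returns null.
public class RemoveDisksSketch {

    // Stand-in for the disks table: deviceId -> allow_snapshot flag.
    static Map<UUID, Boolean> diskTable = new HashMap<>();

    // Hypothetical fixed version: devices with no backing disk entry are skipped
    // instead of triggering an NPE. Returns the ids that would be removed.
    static List<UUID> removeDisksNotInSnapshot(List<UUID> vmDiskDevices, List<UUID> diskIdsFromSnapshot) {
        List<UUID> removed = new ArrayList<>();
        for (UUID deviceId : vmDiskDevices) {
            if (!diskIdsFromSnapshot.contains(deviceId)) {
                Boolean allowSnapshot = diskTable.get(deviceId); // null for unmanaged devices
                if (allowSnapshot != null && allowSnapshot) {    // the null check is the fix
                    removed.add(deviceId);
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        UUID managed = UUID.randomUUID();
        UUID unmanaged = UUID.randomUUID(); // no disk row exists for this one
        diskTable.put(managed, true);
        List<UUID> removed = removeDisksNotInSnapshot(
                Arrays.asList(managed, unmanaged), Collections.<UUID>emptyList());
        // Only the managed disk is removed; the unmanaged device no longer causes an NPE.
        if (!(removed.size() == 1 && removed.contains(managed))) {
            throw new AssertionError("unexpected removal set: " + removed);
        }
        System.out.println("ok");
    }
}
```

Without the null check, the unmanaged UUID would hit the equivalent of line 411 with a null disk, which matches the stack trace in comment #0.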
Derrick, can we somehow manage without having this property set, to fix the issue for the customer? Or is it mandatory for his usage? Regardless, the issue was fixed upstream and I'm now backporting it to the 3.3 branch, then possibly to zstream.
Aahron, as Marina wrote - I'd add unmanaged disk devices and test the preview/restore/run-stateless flows. It could be done with a hook, as the customer did, or by manually adding devices to the vm_device table.
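As a hypothetical illustration of the second approach (column list based on the vm_device output in comment #0; the device_id is a placeholder UUID, and the schema should be checked against the engine version under test before running anything like this):

```sql
-- Simulate an unmanaged disk device with no backing disk/image rows,
-- similar to the one the vhostmd hook produced.
INSERT INTO vm_device
    (device_id, vm_id, type, device, address,
     boot_order, spec_params, is_managed, is_plugged, is_readonly, alias)
VALUES
    ('11111111-2222-3333-4444-555555555555',  -- placeholder: fresh UUID, no disk row
     '4b4cd142-0339-4520-83a6-a79f1f282166',  -- the test VM's id
     'disk', 'disk',
     '{bus=0x00, domain=0x0000, type=pci, slot=0x06, function=0x0}',
     0, '{ }', false, true, false, 'virtio-disk701');
```

Then run the preview/restore/run-stateless flows and confirm the TryBackToAllSnapshotsOfVm NPE no longer occurs.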
Derrick, thanks - the fix that we applied basically removes the assumption that any disk device is a known disk in the engine. Note that, as with other unmanaged devices, if you take a snapshot with this unknown disk device and then preview that snapshot, you'll have this device back for the next VM run. Regardless of this bug, we need to look into that script; it seems like the same device could potentially be added on each "regular" VM run, as it's always added to the XML (we need to see how libvirt/vdsm acts on it) and might also be sent by the engine.
Verified on is29. Used the hook to add a disk to the VM that is not managed by the engine:

Thread-41794::DEBUG::2014-01-01 09:18:04,795::utils::489::root::(execCmd) '/usr/libexec/vdsm/hooks/before_vm_start/the_hook' (cwd None)
Thread-41794::DEBUG::2014-01-01 09:18:04,832::utils::509::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-41794::INFO::2014-01-01 09:18:04,833::hooks::88::root::(_runHooksDir)

Started the VM:

Thread-41794::DEBUG::2014-01-01 09:18:04,833::vm::3031::vm.Vm::(_run) vmId=`4b4cd142-0339-4520-83a6-a79f1f282166`::<?xml version="1.0" encoding="utf-8"?><domain type="kvm">
    <name>the_vm</name>
    <uuid>4b4cd142-0339-4520-83a6-a79f1f282166</uuid>
    .....
    <disk device="disk" type="file"><source file="/tmp/the_disk"/><target bus="virtio" dev="vdzz"/><readonly/></disk></devices>

Took a live snapshot and then was able to preview snapshot / commit snapshot:

2014-01-01 09:21:25,886 INFO [org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand] (ajp-/127.0.0.1:8702-3) [139b8a69] Lock Acquired to object EngineLock [exclusiveLocks= key: 4b4cd142-0339-4520-83a6-a79f1f282166 value: VM , sharedLocks= ]
2014-01-01 09:21:25,910 INFO [org.ovirt.engine.core.bll.TryBackToAllSnapshotsOfVmCommand] (ajp-/127.0.0.1:8702-3) [139b8a69] Running command: TryBackToAllSnapshotsOfVmCommand internal: false. Entities affected : ID: 4b4cd142-0339-4520-83a6-a79f1f282166 Type: VM
<.... snip.....>
70 (Parent Command TryBackToAllSnapshotsOfVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2014-01-01 09:21:31,251 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-88) BaseAsyncTask::OnTaskEndSuccess: Task 931b0761-0dd0-4a9a-a969-daae139d9f70 (Parent Command TryBackToAllSnapshotsOfVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html