Bug 1236061

Summary: Vm becomes unusable (NPE) when restarting vdsm during snapshot creation
Product: Red Hat Enterprise Virtualization Manager
Reporter: Carlos Mestre González <cmestreg>
Component: ovirt-engine
Assignee: Shmuel Melamud <smelamud>
Status: CLOSED CURRENTRELEASE
QA Contact: Nisim Simsolo <nsimsolo>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.3
CC: amureini, gklein, hkim, istein, lpeer, lsurette, mgoldboi, michal.skrivanek, nsimsolo, rbalakri, Rhev-m-bugs, srevivo, tjelinek, tnisan, ykaul, ylavi
Target Milestone: ovirt-3.6.0-rc3
Keywords: ZStream
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 3.6.0-11
Doc Type: Bug Fix
Doc Text:
Previously, when VDSM was restarted during VM snapshot creation, it sometimes corrupted the VM and made it unusable. This issue has been resolved: the VM is now correctly rolled back to its previous state if snapshot creation is interrupted for any reason.
Story Points: ---
Clone Of:
: 1274717 (view as bug list)
Environment:
Last Closed: 2016-04-20 01:36:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1274717
Attachments:
engine.log (flags: none)
vdsm.log (flags: none)

Description Carlos Mestre González 2015-06-26 12:55:49 UTC
Description of problem:
I've been testing this scenario by manually restarting vdsm at different points during snapshot creation, expecting the rollback to kick in. *Sometimes* the rollback doesn't work and leaves the VM unusable (corrupted?). Trying to start the VM returns an NPE.

Version-Release number of selected component (if applicable):
rhevm-3.5.3.1-1.4.el6ev.noarch

How reproducible:
25%

Steps to Reproduce:
1. create a vm with multiple disks of different types
2. add a snapshot to the vm (all disks)
3. when the engine.log shows the CreateAllSnapshotsFromVmCommand, restart vdsm on the SPM host

Actual results:
- Creation of the snapshots fails as expected, but the VM becomes unusable (it cannot be started and no new snapshots can be created), for example:

2015-06-26 15:25:13,585 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-1) [vms_syncAction_5a9b28e3-c382-4271] Command org.ovirt.engine.core.bll.RunVmCommand throw exception: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.RunVmCommand.getMemoryFromSnapshot(RunVmCommand.java:154) [bll.jar:]


Expected results:
- Creation of the snapshots fails and the VM keeps working.

Additional info:
- Tested using NFS storage domains
- Hypervisors: RHEL 7.1 with:
vdsm-4.16.20-1.el7ev.x86_64
libvirt-1.2.8-16.el7_1.3.x86_64
qemu-img-rhev-2.1.2-23.el7_1.3.x86_64

Comment 1 Carlos Mestre González 2015-06-26 12:56:33 UTC
Created attachment 1043502 [details]
engine.log

Comment 2 Carlos Mestre González 2015-06-26 12:56:59 UTC
Created attachment 1043503 [details]
vdsm.log

Comment 3 Tal Nisan 2015-06-28 13:35:28 UTC
Seems that the NPE is in this line: 
cachedMemoryVolumeFromSnapshot = archSupportSnapshot && FeatureSupported.memorySnapshot(getVm().getVdsGroupCompatibilityVersion()) ?
        getActiveSnapshot().getMemoryVolume() : StringUtils.EMPTY;

Thus I reckon it's more of a virt-ish issue. Michal, can one of your guys have a look?
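
For illustration, the crash is consistent with getActiveSnapshot() returning null after the interrupted snapshot creation leaves the VM without a valid active snapshot record. Below is a minimal null-safe sketch of the quoted statement; everything beyond the quoted code (the local variables and their placement) is an assumption about RunVmCommand internals, not the actual fix:

    // Sketch only: guard against a missing active snapshot instead of
    // dereferencing it. Names other than those quoted above are assumptions.
    Snapshot activeSnapshot = getActiveSnapshot();
    boolean memorySnapshotSupported = archSupportSnapshot
            && FeatureSupported.memorySnapshot(getVm().getVdsGroupCompatibilityVersion());
    cachedMemoryVolumeFromSnapshot = (memorySnapshotSupported && activeSnapshot != null)
            ? activeSnapshot.getMemoryVolume()
            : StringUtils.EMPTY;

Of course this would only mask the symptom; the real problem is that the VM is left without its active snapshot in the first place.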

Comment 4 Michal Skrivanek 2015-06-28 13:44:53 UTC
Tal, we'll take a look, but it seems to me the actual snapshot is not aborted/reverted correctly. We can surely fix the NPE, but it looks like the state of the VM is not correct, and that's more in your area.
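
As a rough illustration of the rollback being asked for here (and which the Doc Text above describes for the eventual fix), the failure path of the snapshot command needs to discard the half-created snapshot and return the VM to its previous state. This is a hypothetical sketch only; apart from the command class name, the method names below are assumptions and not the real engine API:

    // Hypothetical compensation path for CreateAllSnapshotsFromVmCommand.
    @Override
    protected void endWithFailure() {
        // Remove the snapshot record that never completed, so the VM keeps a
        // valid active snapshot to start from (avoiding the NPE in RunVmCommand).
        removeIncompleteSnapshot(createdSnapshotId);
        // Release the lock and restore the previous VM status so it can run again.
        restorePreviousVmStatusAndUnlock(getVmId());
        setSucceeded(true);
    }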

Comment 5 Tal Nisan 2015-07-06 11:12:39 UTC
Any insights Michal?

Comment 6 Nisim Simsolo 2015-09-22 07:57:17 UTC
Verified: rhevm-3.6.0-0.13.master.el6
vdsm-4.17.6-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-22.el7.x86_64
sanlock-3.2.4-1.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario:
1. create a vm with multiple disks of different types
2. add a snapshot to the vm (all disks)
3. when the engine.log shows the CreateAllSnapshotsFromVmCommand, restart vdsm on the SPM host

Actual result:
The VM remains locked until VDSM is running again.

4. Wait till the VM is available again and preview the created snapshot.
5. Commit the snapshot and verify the VM is running properly.