1357361 – RunVm endAction throws NPE when starting VM from Pool

Bug 1357361 - RunVm endAction throws NPE when starting VM from Pool

Summary: RunVm endAction throws NPE when starting VM from Pool

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.6.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-4.0.2
Target Release:	4.0.2
Assignee:	Liron Aravot
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1362001
TreeView+	depends on / blocked

Reported:	2016-07-18 05:02 UTC by Germano Veit Michel
Modified:	2019-12-16 06:11 UTC (History)
CC List:	15 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1362001 (view as bug list)
Environment:
Last Closed:	2016-08-23 20:44:17 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:1743	normal	SHIPPED_LIVE	Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine)	2016-09-02 21:54:01 UTC
oVirt gerrit	61315	master	ABANDONED	core: CreateAllSnapshotsFromVm - CreateAllCinderSnapshots execution	2016-07-25 12:42:17 UTC
oVirt gerrit	61316	ovirt-engine-3.6	MERGED	core: CreateAllSnapshotsFromVm - CreateAllCinderSnapshots execution	2016-07-25 13:31:41 UTC

Description Germano Veit Michel 2016-07-18 05:02:19 UTC

Description of problem:

The creation of the snapshot for the stateless VM seems to succeed, but right after in RunVm (endAction) throws a NPE and CreateVmVDSCommand is not reached. I don't fully understand what is going on, but it's apparently related to ExecutionContext.

2016-07-15 11:20:19,049 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-204) [27fc4fcb] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-00000000009a', ignoreFailoverLimit='false', storageDomainId='cce0c8f0-cc8c-4d6c-a96b-998046d7b725', imageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6', imageSizeInBytes='21474836480', volumeFormat='COW', newImageId='e2a4529a-f2dc-4730-b3b2-ed54df99f52d', newImageDescription='', imageInitialSizeInBytes='0', imageId='3d8defe6-7fd8-47fc-b8e8-4c3477874385', sourceImageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6'}), log id: 1bad7115

2016-07-15 11:20:40,370 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-42) [27fc4fcb] START, GetImageInfoVDSCommand( GetImageInfoVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-00000000009a', ignoreFailoverLimit='false', storageDomainId='cce0c8f0-cc8c-4d6c-a96b-998046d7b725', imageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6', imageId='e2a4529a-f2dc-4730-b3b2-ed54df99f52d'}), log id: f63cdc9
2016-07-15 11:20:40,782 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-42) [27fc4fcb] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@dd255f12, log id: f63cdc9
2016-07-15 11:20:41,007 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (ajp-/127.0.0.1:8702-809) [ebec59] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{runAsync='true', vmId='4504655c-fa6d-4de4-b6df-7752bf19f8e6'}), log id: 3a996a07
2016-07-15 11:20:41,007 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (ajp-/127.0.0.1:8702-809) [ebec59] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 3a996a07
2016-07-15 11:20:41,023 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-42) [] Trying to release exclusive lock which does not exist, lock key: '30316f1a-efa5-403a-8d6e-0575c5d53487VM'
2016-07-15 11:20:41,038 INFO  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-6-thread-42) [] Lock freed to object 'EngineLock:{exclusiveLocks='[30316f1a-efa5-403a-8d6e-0575c5d53487=<VM, ACTION_TYPE_FAILED_SNAPSHOT_IS_BEING_TAKEN_FOR_VM$VmName VDPool1-237>]', sharedLocks='null'}'
2016-07-15 11:20:41,062 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-42) [] [within thread]: endAction for action type RunVm threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.RunVmCommand.createContextForRunStatelessVm(RunVmCommand.java:1146) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.endSuccessfully(RunVmCommand.java:1124) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:651) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:595) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1984) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:524) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.DecoratedCommand.endAction(DecoratedCommand.java:17) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:360) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:89) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_101]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_101]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_101]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_101]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_101]


According to the backtrace, the NPE is generated here:
Job job = JobRepositoryFactory.getJobRepository().getJobWithSteps(step.getJobId());

From what I gathered it is quite likely 'step' is the culprit. So

Step step = getExecutionContext().getStep();

Could be Null? I don't properly understand exactly how this works and why this would happen. Will add more information once I dig deeper on it.

Version-Release number of selected component (if applicable):
rhevm-3.6.5.3-0.1.el6.noarch

How reproducible:
0% - Simple reproduction using API does not reproduce this. Likely some race condition/locking involved.

Actual results:
NPE

Expected results:
No NPE

Additional info:
This is a quite big environment, many VMs running and returning to the Pool all the time. Stateless snapshots being created and Images Destroyed are constant operation throughout the logs.

Comment 1 Germano Veit Michel 2016-07-18 05:20:14 UTC

This other occurrence shows an interesting error before the NPE

2016-07-09 18:37:38,979 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (org.ovirt.thread.pool-6-thread-3) [4a263009] Error in invocating CTOR of command 'RunVm': null

2016-07-09 18:37:38,980 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-3) [4a263009] [within thread]: endAction for action type RunVm threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.buildCommand(CoCoAsyncTaskHelper.java:376) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:359) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:89) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_101]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_101]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_101]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_101]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_101]

Comment 4 Yaniv Kaul 2016-07-18 07:26:17 UTC

Note that there are multiple issues in this setup (may or may not be related to this specific issue). For example:
2016-07-15 11:13:50,931 WARN  [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler_Worker-7) [4abb9e0a] All hosts are over-utilized,

And:

2016-07-15 11:00:48,274 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (ajp-/127.0.0.1:8702-115) [11057733] Command 'SetVmTicketVDSCommand(HostName = rhevh2, SetVmTicketVDSCommandParamet
ers:{runAsync='true', hostId='4ac07f03-9f7a-46aa-a5b7-63e63968bc61', vmId='45923bc6-6ad6-4da9-9d81-90624a6ccb07', protocol='SPICE', ticket='SVnKdIBI+LTV', validTime='120', userName='mmasmmi', userId='06e235f5-0a
34-4ebe-8fe7-22b4ad969a0f', disconnectAction='LOCK_SCREEN'})' execution failed: VDSGenericException: VDSErrorException: Failed to SetVmTicketVDS, error = 'NoneType' object has no attribute 'XMLDesc', code = -326
03
2016-07-15 11:00:48,274 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (ajp-/127.0.0.1:8702-115) [11057733] FINISH, SetVmTicketVDSCommand, log id: 56553492
2016-07-15 11:00:48,276 ERROR [org.ovirt.engine.core.bll.SetVmTicketCommand] (ajp-/127.0.0.1:8702-115) [11057733] Command 'org.ovirt.engine.core.bll.SetVmTicketCommand' failed: EngineException: org.ovirt.engine.
core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SetVmTicketVDS, error = 'NoneType' object has no attribute 'XMLDesc', code = -32603 (Failed with error unexpected and
 code 16)


Looks like the host are over-utilized. Not sure it's related, but worth checking and vdsm logs (and host logs in general) might be helpful as well.

Comment 6 Arik 2016-07-20 20:03:48 UTC

I suspect this is a consequence of early phase of cinder support because I see the following pattern whenever NPE is thrown:
VM is attached to user and run
Starting to create the stateless snapshot
starting to create a snapshot for non-cinder disk
starting to create a snapshot for cinder disk
The SPM async task is finished (the snapshot for the non-cinder disk is finished)
 -> ending the CreateAllSnapshotCommand
Then the callback is triggered (the snapshot for the cinder disk is finished)
 -> ending the CreateAllSnapshotCommand again
and in that second time we get the exception.

Germano, before going any further with the investigation can you please:
1. Confirm that the pool is based on a template with cinder disk and non-cinder disk

2. Elaborate on the problem from the user's point of view (putting aside these NPEs that should be resolved. IIUC the user should have seen that the operation succeeded)

3. You wrote that you didn't manage to reproduce it with a simple API call, but could you please try to create a regular VM based on the template of the pool and then try to run it as stateless?

Comment 7 Arik 2016-07-20 20:30:36 UTC

(In reply to Arik from comment #6)
> 2. Elaborate on the problem from the user's point of view (putting aside
> these NPEs that should be resolved. IIUC the user should have seen that the
> operation succeeded)

No no, this is incorrect..
CreateVm is not called because the only attempt to end RunVmCommand fails.

Comment 8 Germano Veit Michel 2016-07-21 02:20:06 UTC

Hi Arik, 

First of all thanks for looking at this.

Wait wait wait... 

There is not a single Cinder disk in whole setup. The template has just one disk, and it's coming from the only storage domain they have, which is NFS.

Then I went back to my "simple API reproducer" and saw that it also ran the CreateAllCinderSnapshotsCommand command!

So:

1. Create VM with single disk (NFS/iSCSI)
2. Set it as stateless
3. Run it
RESULT: CreateAllCinderSnapshotsCommand is run.

I see createSnapshotsForDisks() has roughly 2 phases:

First: 

List<DiskImage> images = ImagesHandler.filterImageDisks(getDisksList()...
        for (DiskImage diskImage : images) {
        ....
        }

Then it goes for the Cinder Images

List<CinderDisk> cinderDisks = ...
CreateAllCinderSnapshotsParameters params = new CreateAllCinderSnapshotsParameters(...);
Future<VdcReturnValueBase> future = CommandCoordinatorUtil.executeAsyncCommand(...

I speak C and Python, but my first impression of this is that we always run the Cinder task, even with no Cinder images. 

It looks like an "empty" task, as it had no Child commands?

[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler_Worker-31) [7a70c079] Command 'CreateAllCinderSnapshots' id: '6ce61573-bd30-4568-866e-fe5309045631' child commands '[]' executions were completed, status 'SUCCEEDED'

Which shouldn't do much harm. Except that when this Cinder callback is the LAST one to get called we get an NPE which prevents the RunVm from succeeding?

Well, it sounds like CreateAllCinderSnapshotsCommand shouldn't be called in the first place, right?

Does this make sense? How wrong am I? =D

Comment 9 Arik 2016-07-21 04:59:11 UTC

Thanks. It makes a lot of sense, I think we are getting toward the issue here.

Not having a single cinder disk in the setup - that explains why CreateAllCinderSnapshots doesn't fire a single CreateSingleSnapshot. executing CreateAllCinderSnapshots seems indeed like a bug.

So now I have a scenario for a race that would make more sense than what I described in comment 7:
- VM is attached to user and run
- starting to create the stateless snapshot
- starting to create a snapshot for the disk
- scheduling CreateAllCinderSnapshots (it is an async command)
- SPM async task is created for the snapshot for the disk
- the command coordinator infrastructure that is used to execute CreateAllCinderSnapshots changes the steps of the execution context of RunVmCommand somewhere (see https://gerrit.ovirt.org/#/c/53493/3)
- the SPM async task is finished and therefore we try to end RunVmCommand but its step was cleared and thus get the NPE in RunVmCommand
- the callback that monitors CreateAllCinderSnapshots that doesn't actually have what to monitor is triggered and tries to finish CreateAllCinderSnapshots command but the execution job was changed in the previous step and therefore the other exception we see:
ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
 
Side note: This part changed a lot since 3.6.5 on the master branch, we no longer have this CreateAllCinderSnapshotsCommand and we have a protection from getting NPE in RunVmCommand because of having step==null.

Allon, if what I wrote above is true then it seems more storage related than virt. Could someone from the storage team take a look?

Comment 10 Allon Mureinik 2016-07-21 09:39:52 UTC

Liron, can you take a look please?
I think this was already solved in 4.0, but you know this area much better than me.

Comment 11 Liron Aravot 2016-07-21 14:31:50 UTC

Hi,
As it seems this issue was solved as well by the solution to BZ 1348079 in 3.6.7.
I'll do quick verification of it - but i'd say that we should be fine with regards to that issue.

Comment 12 Liron Aravot 2016-07-21 14:35:21 UTC

Correcting myself (yes, quickly) - this is a slightly different one.
will fix.

Comment 20 Liron Aravot 2016-07-26 07:10:56 UTC

Try to run a attach a vm from pool and run it. (add permission to a user and run from the userportal).

Comment 23 Allon Mureinik 2016-08-01 06:58:35 UTC

The provided fix is only relevant to the 3.6.z branch. For the 4.0 branch this is essentially a "test only" bug. Setting to ON_QA so the QE stakeholders can decided whether to run a sanity test here (which I believe they do anyway), or just close as CURRENTRELEASE.

Comment 24 Kevin Alon Goldblatt 2016-08-03 15:06:32 UTC

Tested with the following code:
----------------------------------------
rhevm-4.0.2.3-0.1.el7ev.noarch
vdsm-4.18.9-1.el7ev.x86_64


Tested with the following scenario:


Steps to Reproduce:
1. Create a VM Pool of a few VMs from a template with a disk
2. Create a User with permissions to the pool
3. Access the User Portal as the newly created user
4. Start a VM from the pool >>>>> The vm is started successfully and no NPEs are thrown

Actual results:
The vm is started successfully and no NPEs are thrown


Moving to VERIFIED!

Comment 27 errata-xmlrpc 2016-08-23 20:44:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

Note You need to log in before you can comment on or make changes to this bug.