Bug 1025183
| Field | Value |
|---|---|
| Summary | VmPoolMonitor throws a NullPointerException while starting a guest that in turn remains down with its images locked. |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | rhev-integ |
| Component | ovirt-engine |
| Assignee | Tomas Jelinek <tjelinek> |
| Status | CLOSED ERRATA |
| QA Contact | Pavel Novotny <pnovotny> |
| Severity | urgent |
| Priority | urgent |
| Docs Contact | |
| Version | 3.1.5 |
| CC | acathrow, ahadas, ahoness, byount, cboyle, iheim, lpeer, lyarwood, mavital, michal.skrivanek, pspacek, rgolan, Rhev-m-bugs, rnori, srevivo, tjelinek, yeylon, yzaslavs |
| Target Milestone | --- |
| Keywords | ZStream |
| Target Release | 3.2.5 |
| Hardware | All |
| OS | Linux |
| Whiteboard | virt |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | VmPoolMonitor could throw a NullPointerException while starting a guest, which then remained down with its images locked: prestarted pool VMs could end up in an "image locked" state when they were powered off and the engine decided to start them again immediately. This was caused by a race condition in async task management. The race condition is now fixed by proper locking of the shared resources, so prestarted VMs can be freely powered off without ending up in an "image locked" state. (A minimal, hypothetical sketch of this kind of locking follows this table.) |
| Story Points | --- |
| Clone Of | 1015638 |
| Environment | |
| Last Closed | 2013-12-18 14:10:02 UTC |
| Type | --- |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1015638 |
| Bug Blocks | |
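The Doc Text above describes the fix only at a high level. Below is a minimal, hypothetical Java sketch of the kind of check-then-act race it describes and of the lock-based fix; none of the class or method names are taken from the actual ovirt-engine source, they are illustrative only.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of the race described in the Doc Text:
// the pool monitor reads a VM's status and decides to start it while the
// async-task layer is still finalizing the shutdown (releasing the image
// lock). Without mutual exclusion, the monitor can act on a stale,
// half-updated state and hit a NullPointerException.
class PrestartedVmStarter {

    // One lock per shared VM record; a real engine would key this by VM id
    // rather than using a single global lock.
    private final ReentrantLock vmStateLock = new ReentrantLock();

    void onVmWentDown(VmRecord vm) {
        vmStateLock.lock();
        try {
            vm.releaseImageLock();         // async-task cleanup
            vm.setStatus(VmStatus.DOWN);
        } finally {
            vmStateLock.unlock();
        }
    }

    void prestartIfNeeded(VmRecord vm) {
        vmStateLock.lock();
        try {
            // The same lock guarantees the cleanup above has either fully
            // completed or not yet started, so this check-then-act is atomic.
            if (vm.getStatus() == VmStatus.DOWN && !vm.isImageLocked()) {
                vm.start();
            }
        } finally {
            vmStateLock.unlock();
        }
    }
}

// Minimal supporting types so the sketch compiles; purely illustrative.
enum VmStatus { UP, DOWN }

class VmRecord {
    private volatile VmStatus status = VmStatus.UP;
    private volatile boolean imageLocked = true;

    VmStatus getStatus()       { return status; }
    void setStatus(VmStatus s) { status = s; }
    boolean isImageLocked()    { return imageLocked; }
    void releaseImageLock()    { imageLocked = false; }
    void start()               { status = VmStatus.UP; imageLocked = true; }
}
```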
Description
rhev-integ
2013-10-31 08:03:37 UTC
Steps to Reproduce (by Tomas):
1. Have a pool with a bunch of VMs (e.g. 20), all of them prestarted.
2. Set VmPoolMonitorIntervalInMinutes to a small value (e.g. 1) and VmPoolMonitorBatchSize to a large value (e.g. 20); both come from vdc_options. (A hypothetical sketch of how these two options drive the pool monitor appears at the end of this report.)
3. Wait until all the VMs start.
4. Select half of the VMs and stop them.
5. If it does not fail, wait until they start again and then stop half of them again. Repeat until it fails.

This bug is currently attached to errata RHBA-2013:16431. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four pieces of information:
* Cause: what actions or circumstances cause this bug to present.
* Consequence: what happens when the bug presents.
* Fix: what was done to fix the bug.
* Result: what now happens when the actions or circumstances above occur. (NB: this is not the same as "the bug doesn't present anymore".)
Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes. Thanks in advance.

Verified in rhevm-3.2.5-0.48.el6ev.noarch (sf22), according to the steps in comment 1:
1. Set VmPoolMonitorBatchSize=20 and VmPoolMonitorIntervalInMinutes=1.
2a. Create 2 prestarted pools with 10 VMs in each.
2b. Create 1 prestarted pool with 20 VMs.
3. Wait until all VMs are up, then power off or shut down approximately half of them at once.
4. Repeat step 3 with different VMs and a slightly different number of them.
Results: all stopped VMs were automatically started up again. No errors were observed during the whole process, no NPE appeared in engine.log or server.log, and there were no errors or warnings in events.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2013-1831.html
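As referenced in the reproduction steps above, here is a minimal, hypothetical Java sketch of how a pool monitor driven by VmPoolMonitorIntervalInMinutes and VmPoolMonitorBatchSize might behave: every interval it tries to start up to batch-size prestarted VMs that are currently down. This reading of the two options is an assumption based on their names and the reproduction scenario, not the actual ovirt-engine implementation, and the sketch reuses the hypothetical VmRecord and PrestartedVmStarter types from the locking sketch above.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a periodic job that tries to bring a prestarted pool
// back up to its target size. The two parameters mirror the vdc_options used
// in the reproduction steps (an assumed interpretation, not ovirt-engine code).
class PoolMonitorSketch {

    private final int intervalMinutes; // VmPoolMonitorIntervalInMinutes
    private final int batchSize;       // VmPoolMonitorBatchSize
    private final PrestartedVmStarter starter;
    private final VmPool pool;

    PoolMonitorSketch(int intervalMinutes, int batchSize,
                      PrestartedVmStarter starter, VmPool pool) {
        this.intervalMinutes = intervalMinutes;
        this.batchSize = batchSize;
        this.starter = starter;
        this.pool = pool;
    }

    void schedule() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // With a 1-minute interval and a batch size covering the whole pool,
        // every VM that is shut down becomes an immediate restart candidate,
        // which is what widens the race window exercised in the steps above.
        scheduler.scheduleAtFixedRate(this::runOnce,
                intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    private void runOnce() {
        List<VmRecord> downVms = pool.findDownPrestartedVms();
        downVms.stream()
               .limit(batchSize)
               .forEach(starter::prestartIfNeeded);
    }
}

// Minimal pool abstraction so the sketch is self-contained.
interface VmPool {
    List<VmRecord> findDownPrestartedVms();
}
```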