Bug 1015638 - VmPoolMonitor throws a NullPointerException while starting a guest that in turn remains down with its images locked.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Tomas Jelinek
QA Contact: Pavel Novotny
URL:
Whiteboard: virt
Duplicates: 1020909 (view as bug list)
Depends On:
Blocks: 1025183 3.3snap2
 
Reported: 2013-10-04 16:30 UTC by Lee Yarwood
Modified: 2018-12-04 15:58 UTC (History)
17 users (show)

Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Previously, pre-started virtual machines in a virtual machine pool would sometimes remain in an image locked state after being shut down. This was caused by a race condition in asynchronous task handling that would occur when Red Hat Enterprise Virtualization Manager attempted to start a pre-started virtual machine while that virtual machine was being shut down. With this update, shared resources are now locked correctly, allowing pre-started virtual machines in virtual machine pools to be stopped and started without issue.
Clone Of:
Cloned To: 1025183 (view as bug list)
Environment:
Last Closed: 2014-01-21 17:37:44 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 383663 0 None None None Never
Red Hat Product Errata RHSA-2014:0038 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.3.0 update 2014-01-21 22:03:06 UTC
oVirt gerrit 20183 0 'None' MERGED engine: fix race in CommandAsyncTask 2020-06-30 18:36:02 UTC
oVirt gerrit 20841 0 'None' MERGED engine: fix race in CommandAsyncTask 2020-06-30 18:36:02 UTC

Description Lee Yarwood 2013-10-04 16:30:42 UTC
Description of problem:

VmPoolMonitor throws a NullPointerException while starting a guest that in turn remains down with its images locked.

11942:2013-09-17 14:00:45,784 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] Running command: RunVmCommand internal: true. Entities affected :  ID: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 Type: VM
11943:2013-09-17 14:00:45,809 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] VdcBll.RunVmCommand.RunVmAsStateless - Creating snapshot for stateless vm 312HPFMATEM-2 - 0cbfbabf-91bc-4335-a40f-b6671afe2b60
[..]
11988:2013-09-17 14:00:55,554 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] START, CreateSnapshotVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = b5fe10e9-63ac-4d7a-9bf6-f561a126d526, imageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb, imageSizeInBytes = 53687091200, volumeFormat = COW, newImageId = fa18b204-c6d5-4ab4-b021-dae975a1bd50, newImageDescription = , imageId = debf3c4c-2f74-4690-82a4-2d3eb847ffc8, sourceImageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb), log id: 3a4d2b71
11990:2013-09-17 14:00:55,818 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
11991:2013-09-17 14:00:55,818 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- createVolume parameters:
12011:2013-09-17 14:00:56,555 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] FINISH, CreateSnapshotVDSCommand, return: fa18b204-c6d5-4ab4-b021-dae975a1bd50, log id: 3a4d2b71
12025:2013-09-17 14:00:56,692 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-79) [421a2518] EntityMultiAsyncTasks::AttachTask: Attaching task d44b59d4-46a0-46b8-b5f6-dc360c060510 to entity 0cbfbabf-91bc-4335-a40f-b6671afe2b60.
12026:2013-09-17 14:00:56,694 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 to DataBase
12028:2013-09-17 14:00:56,703 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
12029:2013-09-17 14:00:56,717 INFO  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (QuartzScheduler_Worker-79) [421a2518] Lock freed to object EngineLock [exclusiveLocks= key: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 value: VM
12031:2013-09-17 14:00:56,750 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-79) [421a2518] BaseAsyncTask::StartPollingTask: Starting to poll task d44b59d4-46a0-46b8-b5f6-dc360c060510.
[..]
12032 2013-09-17 14:00:56,753 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (QuartzScheduler_Worker-79) failed to invoke sceduled method managePrestartedVmsInAllVmPools: java.lang.reflect.InvocationTargetException
12033     at sun.reflect.GeneratedMethodAccessor628.invoke(Unknown Source) [:1.7.0_25]
12034     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_25]
12035     at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_25]
12036     at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
12037     at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
12038     at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
12039 Caused by: javax.ejb.EJBException: java.lang.NullPointerException
12040     at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:191) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12041     at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:237) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12042     at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:374) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12043     at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:218) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12044     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12045     at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12046     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12047     at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12048     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12049     at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12050     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12051     at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12052     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12053     at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12054     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12055     at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12056     at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12057     at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:182) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12058     at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12059     at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12060     at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12061     at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view9.runInternalAction(Unknown Source) [engine-bll.jar:]
12062     at org.ovirt.engine.core.bll.VmPoolMonitor.runVmAsStateless(VmPoolMonitor.java:142) [engine-bll.jar:]
12063     at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVm(VmPoolMonitor.java:123) [engine-bll.jar:]
12064     at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVms(VmPoolMonitor.java:79) [engine-bll.jar:]
12065     at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInPool(VmPoolMonitor.java:60) [engine-bll.jar:]
12066     at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInAllVmPools(VmPoolMonitor.java:32) [engine-bll.jar:]
12067     ... 6 more
12068 Caused by: java.lang.NullPointerException

Version-Release number of selected component (if applicable):
rhevm-3.1.0-55.el6ev.noarch                                 jue 27 jun 2013 09:19:25 CEST
rhevm-backend-3.1.0-55.el6ev.noarch                         jue 27 jun 2013 09:19:05 CEST
rhevm-cli-3.1.1.2-1.el6ev.noarch                            jue 27 jun 2013 09:17:45 CEST
rhevm-config-3.1.0-55.el6ev.noarch                          jue 27 jun 2013 09:18:56 CEST
rhevm-dbscripts-3.1.0-55.el6ev.noarch                       jue 27 jun 2013 09:19:24 CEST
rhevm-doc-3.1.0-21.el6eng.noarch                            jue 27 jun 2013 09:19:24 CEST
rhevm-dwh-3.1.3-2.el6ev.noarch                              lun 01 jul 2013 13:47:00 CEST
rhevm-genericapi-3.1.0-55.el6ev.noarch                      jue 27 jun 2013 09:18:56 CEST
rhevm-image-uploader-3.1.0-8.el6ev.noarch                   jue 27 jun 2013 09:17:22 CEST
rhevm-iso-uploader-3.1.0-9.el6ev.noarch                     jue 27 jun 2013 09:17:22 CEST
rhevm-log-collector-3.1.0-10.el6ev.noarch                   jue 27 jun 2013 09:17:22 CEST
rhevm-notification-service-3.1.0-55.el6ev.noarch            jue 27 jun 2013 09:19:24 CEST
rhevm-reports-3.1.0-20.el6ev.noarch                         lun 01 jul 2013 13:47:29 CEST
rhevm-restapi-3.1.0-55.el6ev.noarch                         jue 27 jun 2013 09:19:24 CEST
rhevm-sdk-3.1.0.16-1.el6ev.noarch                           jue 27 jun 2013 09:17:08 CEST
rhevm-setup-3.1.0-55.el6ev.noarch                           jue 27 jun 2013 09:18:56 CEST
rhevm-spice-client-x64-cab-3.1-8.el6.noarch                 jue 27 jun 2013 09:18:57 CEST
rhevm-spice-client-x86-cab-3.1-8.el6.noarch                 jue 27 jun 2013 09:18:56 CEST
rhevm-tools-common-3.1.0-55.el6ev.noarch                    jue 27 jun 2013 09:18:56 CEST
rhevm-userportal-3.1.0-55.el6ev.noarch                      jue 27 jun 2013 09:19:04 CEST
rhevm-webadmin-portal-3.1.0-55.el6ev.noarch                 jue 27 jun 2013 09:19:16 CEST


How reproducible:
10 instances seen in the current DB.

Steps to Reproduce:
Unclear at present.

Actual results:
Stateless pool guests remain down with their images locked.

Expected results:
Stateless pool guests start fully.

Additional info:
In addition, we also see some guests remain in a 'waiting for launch' state even after launching successfully on a host. I'm not familiar enough with the VmPoolMonitor to know whether the two are related.

Comment 2 Tomas Jelinek 2013-10-07 14:55:27 UTC
I have not succeeded in simulating it yet. 

According to our short chat with Lee, this looks like some kind of threading problem that can occur when many VMs in a pool are started in parallel. 
Investigating...

Comment 3 Lee Yarwood 2013-10-08 12:37:31 UTC
(In reply to Tomas Jelinek from comment #2)
> I have not succeeded in simulating it yet. 
> 
> According to our short chat with Lee, this looks like some kind of threading 
> problem that can occur when many VMs in a pool are started in parallel. 
> Investigating...

Digging a little deeper, I *think* this is a simple race between the current managePrestartedVmsInAllVmPools call and the previous one (the interval is defined by VmPoolMonitorIntervalInMinutes). If the thread is still prestarting the last batch of VMs when the managePrestartedVmsInAllVmPools timer fires again, we get this traceback and a guest left 'waiting on launch'.

Trying to reproduce now, but as a workaround we can either decrease the number of guests being prestarted in each batch (as defined by VmPoolMonitorBatchSize) or increase VmPoolMonitorIntervalInMinutes to allow a batch of guests to be prestarted before the monitor runs again.
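For reference, both values can be inspected and adjusted with the engine-config tool (the exact values below are just the ones discussed in this bug; an engine restart is needed to pick up changes):

```shell
# Inspect the current values
engine-config -g VmPoolMonitorBatchSize
engine-config -g VmPoolMonitorIntervalInMinutes

# Workaround: smaller prestart batches, or a longer interval between runs
engine-config -s VmPoolMonitorBatchSize=2
engine-config -s VmPoolMonitorIntervalInMinutes=10

# Restart the engine so the new values take effect
service ovirt-engine restart
```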

Comment 4 Tomas Jelinek 2013-10-09 13:32:22 UTC
OK, I see where the race is.

the flow which causes the NPE is:

1: [thread 1] EntityAsyncTask's constructor reaches the point 
   "entityInfo.AttachTask(this)" while the "_multiTasksByEntities" still contains 
   one task. After this it gets interrupted

2: [thread 2] the EntityAsyncTask.handleEndActionResult is executed in the 
   meantime and issues the "_multiTasksByEntities.remove(entityInfo.getContainerId());"

3: [thread 1] the first thread continues the execution and calls the 
   "entityInfo.AttachTask(this);"

4: [thread 1] the EntityAsyncTask.ConcreteStartPollingTask() is called. It asks 
   for the GetEntityMultiAsyncTasks(), which returns null (because thread 2 has 
   deleted it), and fails with an NPE on the next line

The fix should be simple, but first I need to simulate this (I have not succeeded yet) to be sure that the fix indeed fixes the issue.
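A minimal, single-threaded sketch of that interleaving (simplified, made-up names, not the actual org.ovirt.engine.core.bll classes): once thread 2's remove() has run, thread 1's later lookup of the container returns null, and the unguarded dereference in ConcreteStartPollingTask throws.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for EntityAsyncTask and its _multiTasksByEntities map.
// Replays the four steps above in order on one thread to show where the
// null comes from.
public class EntityTaskRaceSketch {

    private final Map<String, Object> multiTasksByEntities = new HashMap<>();

    // Returns the container thread 1 sees after the interleaving.
    public Object replayInterleaving(String entityId) {
        // step 1: [thread 1] one task is still attached to the entity
        multiTasksByEntities.put(entityId, new Object());

        // step 2: [thread 2] handleEndActionResult removes the container
        multiTasksByEntities.remove(entityId);

        // steps 3-4: [thread 1] AttachTask / ConcreteStartPollingTask look the
        // container up again; it is gone, so the lookup yields null and the
        // next dereference would throw a NullPointerException
        return multiTasksByEntities.get(entityId);
    }
}
```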

Comment 5 Tomas Jelinek 2013-10-10 14:41:34 UTC
OK, some new findings - the behavior described above happens because:

1: the EntityAsyncTask._multiTasksByEntities is not properly locked
2: the starting of the VM has not yet finished, but 
   VmPoolMonitor.managePrestartedVmsInAllVmPools has already returned because the 
   VMs are being processed by other threads. The problem is that when 
   managePrestartedVmsInAllVmPools returns, the VMs can still be in status Down, 
   so they are available for the next managePrestartedVmsInAllVmPools run to pick 
   up and attempt to start (causing the race from comment 4).
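The direction of the eventual fix can be sketched as follows (illustrative names only, not the actual "fix race in CommandAsyncTask" patch): guard both the attach path and the remove-if-finished path with the same lock, so the container cannot vanish between attaching a task and starting to poll it.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: made-up names standing in for EntityAsyncTask and its
// _multiTasksByEntities map. Both paths synchronize on the map, so cleanup
// cannot slip in between attaching a task and fetching its container.
public class LockedTaskRegistry {

    public static class TaskContainer {
        private int attached;
        void attach() { attached++; }
        boolean detachAndCheckEmpty() { return --attached == 0; }
    }

    private final Map<String, TaskContainer> tasksByEntity = new HashMap<>();

    // Attach a task and hand back its container under one lock; the caller
    // keeps the reference, so no second (racy) lookup is needed.
    public TaskContainer attachTask(String entityId) {
        synchronized (tasksByEntity) {
            TaskContainer c = tasksByEntity.computeIfAbsent(entityId, k -> new TaskContainer());
            c.attach();
            return c;
        }
    }

    // Cleanup path: remove the container only while holding the same lock.
    public void taskEnded(String entityId) {
        synchronized (tasksByEntity) {
            TaskContainer c = tasksByEntity.get(entityId);
            if (c != null && c.detachAndCheckEmpty()) {
                tasksByEntity.remove(entityId);
            }
        }
    }
}
```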

@Lee: you are right, enlarging the VmPoolMonitorIntervalInMinutes and also the VmPoolMonitorBatchSize should work around the above NPE until a fix is available. Have you already tried it? Did it help?

Thanx

Comment 6 Lee Yarwood 2013-10-10 14:53:10 UTC
(In reply to Tomas Jelinek from comment #5)
> OK, some new findings - the behavior described above happens because:
> 
> 1: the EntityAsyncTask._multiTasksByEntities is not properly locked
> 2: the starting of the VM has not yet finished, but 
>    VmPoolMonitor.managePrestartedVmsInAllVmPools has already returned because 
>    the VMs are being processed by other threads. The problem is that when 
>    managePrestartedVmsInAllVmPools returns, the VMs can still be in status 
>    Down, so they are available for the next managePrestartedVmsInAllVmPools 
>    run to pick up and attempt to start (causing the race from comment 4).
> 
> @Lee: you are right, enlarging the VmPoolMonitorIntervalInMinutes and also 
> the VmPoolMonitorBatchSize should work around the above NPE until a fix is 
> available. Have you already tried it? Did it help?

We asked the customer to decrease the VmPoolMonitorBatchSize and thus far they've not reported any additional issues to us.

Thanks,

Lee

Comment 8 Tomas Jelinek 2013-10-15 09:23:57 UTC
So the race we have seen was actually between the RestoreAllSnapshotsCommand and RunVmCommand, and it was caused by the issue described in comment 4.

It is less likely to happen in 3.3 because the infrastructure has been changed to attach tasks to command IDs instead of entity IDs. It is still possible, however, so I have created a patch.

In older oVirt versions it is still quite likely to happen.

Comment 9 Lee Yarwood 2013-10-16 11:03:07 UTC
(In reply to Lee Yarwood from comment #6)
> We asked the customer to decrease the VmPoolMonitorBatchSize and thus far
> they've not reported any additional issues to us.

The customer has hit this again after changing VmPoolMonitorBatchSize to 2. Reviewing the logs, I see that they have 21 pools in this environment and it is taking over 5 minutes to prestart even 2 guests from each. As such, we are going to suggest they double VmPoolMonitorIntervalInMinutes to 10 minutes. 

I'm also going to push a few log.infoFormat() lines upstream that I'd like to include for this BZ. At present it's hard to tell when managePrestartedVmsInAllVmPools() is first called, and hard to track our progress through the pools when no guests need to be prestarted.
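The sort of extra visibility meant here could look roughly like the following. The message texts are hypothetical (not the lines actually pushed upstream), and this sketch assumes MessageFormat-style {0} placeholders, which is what log.infoFormat() in the engine codebase accepts:

```java
import java.text.MessageFormat;

// Hypothetical log messages only: mark when the monitor run starts, and
// report per-pool progress even when nothing needs to be prestarted.
public class PoolMonitorMessages {

    static String runStarted(int poolCount) {
        return MessageFormat.format(
            "VmPoolMonitor: managePrestartedVmsInAllVmPools started, checking {0} pools",
            poolCount);
    }

    static String poolChecked(String poolId, int missingPrestartedVms) {
        return MessageFormat.format(
            "VmPoolMonitor: pool {0} is missing {1} prestarted VMs",
            poolId, missingPrestartedVms);
    }
}
```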

Comment 10 Lee Yarwood 2013-10-16 16:08:01 UTC
(In reply to Tomas Jelinek from comment #4)
> OK, I see where the race is.
> 
> the flow which causes the NPE is:
> 
> 1: [thread 1] EntityAsyncTask's constructor reaches the point 
>    "entityInfo.AttachTask(this)" while the "_multiTasksByEntities" still
> contains 
>    one task. After this it gets interrupted
> 
> 2: [thread 2] the EntityAsyncTask.handleEndActionResult is executed in
> meanwhile 
>    and issues the
> "_multiTasksByEntities.remove(entityInfo.getContainerId());"
> 
> 3: [thread 1] the first thread continues the execution and calls the 
>    "entityInfo.AttachTask(this);"
> 
> 4: [thread 1] the EntityAsyncTask.ConcreteStartPollingTask() is called. It
> asks 
>    for the GetEntityMultiAsyncTasks() which returns null (because thread 2
> have
>    deleted it) and fails on NPE on the next line
> 
> The fix should be simple, but first I need to simulate this (I have not
> succeeded yet) to be sure that the fix indeed fixes the issue.

I *think* we have another example of this when manually starting a stateless guest in the same environment. Again, the guest is left down with locked images.

Just before http-/0.0.0.0:8443-43 starts to poll the snapshot task, pool-4-thread-43 calls EntityAsyncTask.handleEndActionResult and clears the task, leading to the NPE:

2066 2013-10-15 09:38:36,474 INFO  [org.ovirt.engine.core.bll.AddPermissionCommand] (http-/0.0.0.0:8443-43) [10244c60] Running command: AddPermissionCommand internal: true. Entities affected :  ID: ae4e36f1-0e8c-479a-b67a-a048b3b9d625 Type: VM
2067 2013-10-15 09:38:36,481 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [10244c60] Succceeded giving user ac8ff527-5b3e-4397-ace2-0dbe94ef192c permission to Vm ae4e36f1-0e8c-479a-b67a-a048b3b9d625
[..]
2101 2013-10-15 09:38:51,983 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
2122 2013-10-15 09:38:53,089 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] FINISH, CreateSnapshotVDSCommand, return: bb6ccaf3-4f60-4e54-874d-ad96395982ae, log id: 52fe99bd
2123 2013-10-15 09:38:53,093 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) START, HSMClearTaskVDSCommand(HostName = gdcpl3472.statestr.com, HostId = 69628f04-d538-11e1-8011-441ea10961c4, taskId=dd36409d-2fec-435d-9deb-059fe5d9f682), log id: 3d0448b3
2124 2013-10-15 09:38:53,180 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) FINISH, HSMClearTaskVDSCommand, log id: 3d0448b3
2125 2013-10-15 09:38:53,181 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-43) FINISH, SPMClearTaskVDSCommand, log id: dee1566
2126 2013-10-15 09:38:53,183 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-43) BaseAsyncTask::RemoveTaskFromDB: Removed task dd36409d-2fec-435d-9deb-059fe5d9f682 from DataBase
2127 2013-10-15 09:38:53,183 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-43) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625
2128 2013-10-15 09:38:53,184 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (http-/0.0.0.0:8443-43) [28cb8349] EntityMultiAsyncTasks::AttachTask: Attaching task 7a32265b-2355-4dfd-a588-42eab04d6efb to entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625.
2129 2013-10-15 09:38:53,186 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb to DataBase
2130 2013-10-15 09:38:53,201 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
2136 2013-10-15 09:38:53,251 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (http-/0.0.0.0:8443-43) [28cb8349] BaseAsyncTask::StartPollingTask: Starting to poll task 7a32265b-2355-4dfd-a588-42eab04d6efb.
2137 2013-10-15 09:38:53,258 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [28cb8349] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException
[..]
2244 Caused by: java.lang.NullPointerException
2245     at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:]
2246     at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:]
2247     at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:]
2248     at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:]
2249     at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:]
2250     at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:]
2251     at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:]
2252     at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:]
2253     at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:]

Full logs to follow.

Comment 12 Lee Yarwood 2013-10-16 16:44:02 UTC
(In reply to Lee Yarwood from comment #10)
> I *think* we have another example of this when manually starting a stateless
> guest in the same environment. Again the guest is left down with locked
> images.

My mistake, that example was from a different environment; the following is from the same environment as the issue highlighted in the description.

Tracing this through a little more, I see that this version of the race is caused by the engine allowing a user to start a stateless guest before the thread restoring the stateless guest has completed:

41756 2013-10-14 08:27:05,178 INFO  [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (QuartzScheduler_Worker-35) [29c7dd9d] Deleting snapshots for stateless vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
41757 2013-10-14 08:27:05,184 INFO  [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (QuartzScheduler_Worker-35) [e4b18a1] Running command: RestoreStatelessVmCommand internal: true. Entities affected :  ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
[..]
41781 2013-10-14 08:27:09,609 INFO  [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Running command: RestoreAllSnapshotsCommand internal: true. Entities affected :  ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41782 2013-10-14 08:27:09,609 INFO  [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Locking VM(id = 6e80ca4b-cad4-48b6-b307-212073fa7e7f) without compensation.
41783 2013-10-14 08:27:09,614 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] START, SetVmStatusVDSCommand( vmId = 6e80ca4b-cad4-48b6-b307-212073fa7e7f, status = ImageLocked), log id: 4272eea1
41784 2013-10-14 08:27:09,638 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] FINISH, SetVmStatusVDSCommand, log id: 4272eea1
[..]
41844 2013-10-14 08:27:20,773 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e to DataBase
41845 2013-10-14 08:27:20,934 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
41846 2013-10-14 08:27:21,159 INFO  [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [56f8bb24] Lock freed to object EngineLock [exclusiveLocks= key: 6e80ca4b-cad4-48b6-b307-212073fa7e7f value: VM
41847 , sharedLocks= ]
41848 2013-10-14 08:27:21,160 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-35) [56f8bb24] BaseAsyncTask::StartPollingTask: Starting to poll task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e.
41849 2013-10-14 08:27:21,160 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-35) [56f8bb24] EntityMultiAsyncTasks::StartPollingTask: Current Action Type for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f is RestoreAllSnapshots (determined by task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e)
[..]
41914 2013-10-14 08:27:40,362 INFO  [org.ovirt.engine.core.bll.AddPermissionCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Running command: AddPermissionCommand internal: true. Entities affected :  ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41915 2013-10-14 08:27:40,443 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Succceeded giving user 50f2b32a-9508-4add-9683-86efda5b1832 permission to Vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
[..]
41958 2013-10-14 08:27:56,920 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) SPMAsyncTask::PollTask: Polling task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
41959 2013-10-14 08:27:56,920 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) BaseAsyncTask::OnTaskEndSuccess: Task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
42005 2013-10-14 08:28:05,955 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e
42006 2013-10-14 08:28:05,955 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) START, SPMClearTaskVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4275c38e
[..]
42015 2013-10-14 08:28:22,690 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
42028 2013-10-14 08:28:30,044 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] FINISH, CreateSnapshotVDSCommand, return: 4ae5ec67-330e-4b44-ba91-2bbe972ad87e, log id: 68485573
[..]
42039 2013-10-14 08:28:46,232 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) START, HSMClearTaskVDSCommand(HostName = RHVH-4_ucs1b1, HostId = 303d94f6-1611-11e3-bf5e-005056ac1129, taskId=6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4c458325
[..]
42041 2013-10-14 08:28:46,391 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) FINISH, HSMClearTaskVDSCommand, log id: 4c458325
42042 2013-10-14 08:28:46,391 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) FINISH, SPMClearTaskVDSCommand, log id: 4275c38e
42043 2013-10-14 08:28:46,425 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) BaseAsyncTask::RemoveTaskFromDB: Removed task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e from DataBase
42044 2013-10-14 08:28:46,425 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f
42045 2013-10-14 08:28:46,426 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (ajp-/127.0.0.1:8702-4) [5402bccd] EntityMultiAsyncTasks::AttachTask: Attaching task 023b0a69-ff26-4a9f-922c-d1c5c895126c to entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f.
42046 2013-10-14 08:28:46,436 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c to DataBase
42047 2013-10-14 08:28:46,480 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
42050 2013-10-14 08:28:46,738 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (ajp-/127.0.0.1:8702-4) [5402bccd] BaseAsyncTask::StartPollingTask: Starting to poll task 023b0a69-ff26-4a9f-922c-d1c5c895126c.
42051 2013-10-14 08:28:46,752 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException
[..]
42169 Caused by: java.lang.NullPointerException
42170     at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:]
42171     at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:]
42172     at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:]
42173     at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:]
42174     at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:]
42175     at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:]
42176     at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:]
42177     at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:]
42178     at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:]

Again full engine logs to follow. Tomas, do you agree with this?

Thanks again,

Lee

Comment 15 Tomas Jelinek 2013-10-17 06:20:29 UTC
> Tracing this through a little more I see that this version of the race is 
> caused by the engine allowing a user to start a stateless guest before the 
> thread restoring the stateless guest has completed

yes, this is correct. It is the same bug as the original, so the provided fix should solve it.
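Conceptually, the guard that closes the comment 12 variant can be sketched like this (illustrative names only; the engine uses its own EngineLock/lock manager rather than java.util.concurrent): a run attempt must fail fast while the stateless restore still holds the VM's exclusive lock, instead of racing it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: per-VM exclusive lock held for the whole restore flow, with
// the run path rejecting the VM instead of racing the restore.
public class VmExclusiveLockSketch {

    private final ReentrantLock vmLock = new ReentrantLock();

    // Restore path: hold the lock until the restore flow completes.
    public void restoreStateless(Runnable restoreFlow) {
        vmLock.lock();
        try {
            restoreFlow.run();
        } finally {
            vmLock.unlock();
        }
    }

    // Run path: refuse to start the VM if the restore still owns the lock.
    public boolean tryRunVm(Runnable runFlow) {
        if (!vmLock.tryLock()) {
            return false; // validation fails; the user retries after the restore
        }
        try {
            runFlow.run();
            return true;
        } finally {
            vmLock.unlock();
        }
    }
}
```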

Comment 19 Pavel Novotny 2013-11-12 13:12:15 UTC
Verified in rhevm-3.3.0-0.32.beta1.el6ev.noarch (is22).

I set VmPoolMonitorBatchSize=30 and VmPoolMonitorIntervalInMinutes=1 to increase the chance of hitting the race.
Then I created 3 VM pools with 30 VMs in total (10 VMs in each pool) and randomly started, shut down, and powered them off several times. During the process, no NPE or other exception occurred in engine.log/server.log and all VMs behaved normally.

Comment 20 Ravi Nori 2013-11-26 16:02:15 UTC
*** Bug 1020909 has been marked as a duplicate of this bug. ***

Comment 21 Charlie 2013-11-28 00:07:16 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 22 errata-xmlrpc 2014-01-21 17:37:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

