Description of problem:

VmPoolMonitor throws a NullPointerException while starting a guest, which in turn remains down with its images locked.

11942:2013-09-17 14:00:45,784 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] Running command: RunVmCommand internal: true. Entities affected : ID: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 Type: VM
11943:2013-09-17 14:00:45,809 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] VdcBll.RunVmCommand.RunVmAsStateless - Creating snapshot for stateless vm 312HPFMATEM-2 - 0cbfbabf-91bc-4335-a40f-b6671afe2b60
[..]
11988:2013-09-17 14:00:55,554 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] START, CreateSnapshotVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = b5fe10e9-63ac-4d7a-9bf6-f561a126d526, imageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb, imageSizeInBytes = 53687091200, volumeFormat = COW, newImageId = fa18b204-c6d5-4ab4-b021-dae975a1bd50, newImageDescription = , imageId = debf3c4c-2f74-4690-82a4-2d3eb847ffc8, sourceImageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb), log id: 3a4d2b71
11990:2013-09-17 14:00:55,818 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
11991:2013-09-17 14:00:55,818 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- createVolume parameters:
12011:2013-09-17 14:00:56,555 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] FINISH, CreateSnapshotVDSCommand, return: fa18b204-c6d5-4ab4-b021-dae975a1bd50, log id: 3a4d2b71
12025:2013-09-17 14:00:56,692 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-79) [421a2518] EntityMultiAsyncTasks::AttachTask: Attaching task d44b59d4-46a0-46b8-b5f6-dc360c060510 to entity 0cbfbabf-91bc-4335-a40f-b6671afe2b60.
12026:2013-09-17 14:00:56,694 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 to DataBase
12028:2013-09-17 14:00:56,703 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
12029:2013-09-17 14:00:56,717 INFO [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (QuartzScheduler_Worker-79) [421a2518] Lock freed to object EngineLock [exclusiveLocks= key: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 value: VM
12031:2013-09-17 14:00:56,750 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-79) [421a2518] BaseAsyncTask::StartPollingTask: Starting to poll task d44b59d4-46a0-46b8-b5f6-dc360c060510.
[..]
12032 2013-09-17 14:00:56,753 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (QuartzScheduler_Worker-79) failed to invoke sceduled method managePrestartedVmsInAllVmPools: java.lang.reflect.InvocationTargetException
12033 at sun.reflect.GeneratedMethodAccessor628.invoke(Unknown Source) [:1.7.0_25]
12034 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_25]
12035 at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_25]
12036 at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
12037 at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
12038 at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
12039 Caused by: javax.ejb.EJBException: java.lang.NullPointerException
12040 at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:191) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12041 at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:237) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12042 at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:374) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12043 at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:218) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12044 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12045 at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12046 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12047 at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12048 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12049 at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8]
12050 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12051 at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12052 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12053 at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12054 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12055 at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12056 at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12057 at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:182) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12058 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12059 at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
12060 at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
12061 at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view9.runInternalAction(Unknown Source) [engine-bll.jar:]
12062 at org.ovirt.engine.core.bll.VmPoolMonitor.runVmAsStateless(VmPoolMonitor.java:142) [engine-bll.jar:]
12063 at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVm(VmPoolMonitor.java:123) [engine-bll.jar:]
12064 at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVms(VmPoolMonitor.java:79) [engine-bll.jar:]
12065 at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInPool(VmPoolMonitor.java:60) [engine-bll.jar:]
12066 at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInAllVmPools(VmPoolMonitor.java:32) [engine-bll.jar:]
12067 ... 6 more
12068 Caused by: java.lang.NullPointerException

Version-Release number of selected component (if applicable):

rhevm-3.1.0-55.el6ev.noarch                   Thu 27 Jun 2013 09:19:25 CEST
rhevm-backend-3.1.0-55.el6ev.noarch           Thu 27 Jun 2013 09:19:05 CEST
rhevm-cli-3.1.1.2-1.el6ev.noarch              Thu 27 Jun 2013 09:17:45 CEST
rhevm-config-3.1.0-55.el6ev.noarch            Thu 27 Jun 2013 09:18:56 CEST
rhevm-dbscripts-3.1.0-55.el6ev.noarch         Thu 27 Jun 2013 09:19:24 CEST
rhevm-doc-3.1.0-21.el6eng.noarch              Thu 27 Jun 2013 09:19:24 CEST
rhevm-dwh-3.1.3-2.el6ev.noarch                Mon 01 Jul 2013 13:47:00 CEST
rhevm-genericapi-3.1.0-55.el6ev.noarch        Thu 27 Jun 2013 09:18:56 CEST
rhevm-image-uploader-3.1.0-8.el6ev.noarch     Thu 27 Jun 2013 09:17:22 CEST
rhevm-iso-uploader-3.1.0-9.el6ev.noarch       Thu 27 Jun 2013 09:17:22 CEST
rhevm-log-collector-3.1.0-10.el6ev.noarch     Thu 27 Jun 2013 09:17:22 CEST
rhevm-notification-service-3.1.0-55.el6ev.noarch  Thu 27 Jun 2013 09:19:24 CEST
rhevm-reports-3.1.0-20.el6ev.noarch           Mon 01 Jul 2013 13:47:29 CEST
rhevm-restapi-3.1.0-55.el6ev.noarch           Thu 27 Jun 2013 09:19:24 CEST
rhevm-sdk-3.1.0.16-1.el6ev.noarch             Thu 27 Jun 2013 09:17:08 CEST
rhevm-setup-3.1.0-55.el6ev.noarch             Thu 27 Jun 2013 09:18:56 CEST
rhevm-spice-client-x64-cab-3.1-8.el6.noarch   Thu 27 Jun 2013 09:18:57 CEST
rhevm-spice-client-x86-cab-3.1-8.el6.noarch   Thu 27 Jun 2013 09:18:56 CEST
rhevm-tools-common-3.1.0-55.el6ev.noarch      Thu 27 Jun 2013 09:18:56 CEST
rhevm-userportal-3.1.0-55.el6ev.noarch        Thu 27 Jun 2013 09:19:04 CEST
rhevm-webadmin-portal-3.1.0-55.el6ev.noarch   Thu 27 Jun 2013 09:19:16 CEST

How reproducible:
10 instances seen in the current DB.

Steps to Reproduce:
Unclear at present.

Actual results:
Stateless pool guests remain down with their images locked.

Expected results:
Stateless pool guests start fully.

Additional info:
We also see some guests remain in a 'waiting for launch' state even after launching successfully on a host. I'm not familiar enough with the VmPoolMonitor to know whether the two issues are related.
I have not managed to reproduce it yet.

From our short chat with Lee, it looks like a threading problem that can occur when many VMs in a pool are started in parallel.
Investigating...
(In reply to Tomas Jelinek from comment #2)
> I have not managed to reproduce it yet.
>
> From our short chat with Lee, it looks like a threading problem that can
> occur when many VMs in a pool are started in parallel.
> Investigating...

Digging a little deeper, I *think* this is a simple race between the current managePrestartedVmsInAllVmPools call and the previous one (the interval is defined by VmPoolMonitorIntervalInMinutes). If a thread is still prestarting the last batch of VMs when the managePrestartedVmsInAllVmPools timer fires again, we get this traceback and a guest left 'waiting for launch'.

Trying to reproduce now, but as a workaround we can either decrease the number of guests prestarted in each batch (defined by VmPoolMonitorBatchSize) or increase VmPoolMonitorIntervalInMinutes to allow a batch of guests to be prestarted before the monitor runs again.
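The suspected overlap can be sketched as a missing re-entrancy guard on a periodic job. This is only an illustration of the theory above, assuming the overlap is the cause; the class and method names below are hypothetical, not the engine's actual scheduler code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: a re-entrancy guard for a fixed-interval job.
// Names are illustrative, not oVirt's real classes.
public class PrestartJobGuard {
    private final AtomicBoolean running = new AtomicBoolean(false);

    /**
     * Returns true if this tick actually ran the batch; false if it was
     * skipped because the previous tick is still prestarting VMs.
     */
    public boolean managePrestartedVmsInAllVmPools(Runnable batch) {
        if (!running.compareAndSet(false, true)) {
            return false; // previous run still in flight: skip this tick
        }
        try {
            batch.run();  // prestart the next batch of pool VMs
            return true;
        } finally {
            running.set(false);
        }
    }
}
```

For what it's worth, Quartz 2.x (the engine runs quartz-2.1.2 per the traceback) offers the @DisallowConcurrentExecution annotation for the same purpose, though whether it fits the engine's JobWrapper indirection is an open question.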
OK, I see where the race is.

The flow that causes the NPE:

1: [thread 1] EntityAsyncTask's constructor reaches the point "entityInfo.AttachTask(this)" while "_multiTasksByEntities" still contains one task. At this point it gets interrupted.

2: [thread 2] EntityAsyncTask.handleEndActionResult executes in the meantime and issues "_multiTasksByEntities.remove(entityInfo.getContainerId());".

3: [thread 1] the first thread resumes and calls "entityInfo.AttachTask(this);".

4: [thread 1] EntityAsyncTask.ConcreteStartPollingTask() is called. It asks for GetEntityMultiAsyncTasks(), which returns null (because thread 2 has deleted it), and fails with an NPE on the next line.

The fix should be simple, but first I need to simulate this (I have not succeeded yet) to be sure that the fix indeed resolves the issue.
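The four steps above are a classic check-then-act race on a shared map. A minimal Java sketch of the shape of the bug and one way to close it; the names are stand-ins for _multiTasksByEntities/AttachTask, not the engine's real API, and computeIfAbsent assumes Java 8 rather than the Java 7 the engine ran on:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the race only, not oVirt's actual classes.
public class AttachRaceSketch {
    // stand-in for EntityAsyncTask._multiTasksByEntities
    static final Map<String, List<String>> tasksByEntity = new ConcurrentHashMap<>();

    // Racy shape (steps 1-4 above): attach, then look the entry up again
    // later. A handleEndActionResult-style remove() between the two steps
    // makes the second lookup return null, which the caller dereferences.
    static List<String> attachThenLookupLater(String entity, String task) {
        tasksByEntity.computeIfAbsent(entity, k -> new ArrayList<>()).add(task);
        // <-- interleaving point: another thread may remove(entity) here
        return tasksByEntity.get(entity); // may be null -> NPE downstream
    }

    // Safer shape: keep the reference obtained atomically at attach time
    // instead of looking the entry up again, so a concurrent remove cannot
    // null out what this thread already holds.
    static List<String> attachAndHold(String entity, String task) {
        List<String> tasks = tasksByEntity.computeIfAbsent(entity, k -> new ArrayList<>());
        tasks.add(task);
        return tasks; // never null for this caller
    }
}
```

The deterministic version of the interleaving: attach, remove from another "thread", then observe that the second lookup is null while a held reference is not.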
OK, some new findings. The reasons the above behaviour happens are:

1: EntityAsyncTask._multiTasksByEntities is not properly locked.

2: the starting of the VM has not yet finished, but VmPoolMonitor.managePrestartedVmsInAllVmPools has already returned because the VMs are being processed by other threads. When managePrestartedVmsInAllVmPools returns, the VMs can still be in status Down, so the next run can pick them up and attempt to start them again (causing the race from comment 4).

@Lee: you are right, enlarging VmPoolMonitorIntervalInMinutes and also VmPoolMonitorBatchSize should avoid the above NPE until a fix is available. Have you already tried it? Did it help?

Thanks
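Point 2 means a VM can still report status Down while an earlier prestart for it is in flight, so the next monitor pass picks the same VM again. One hypothetical mitigation (a sketch under that assumption, not the patch that actually shipped) is a concurrent set of in-flight VM IDs consulted before each start attempt:

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard, not the actual oVirt patch: track VMs whose prestart
// has been dispatched but not yet completed, so a later monitor pass does
// not pick the same Down VM twice.
public class InFlightPrestarts {
    private final Set<String> inFlight =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    /** Atomically claims vmId; false means a prestart is already in flight. */
    public boolean tryBegin(String vmId) {
        return inFlight.add(vmId); // add() is atomic on this backing map
    }

    /** Called when the async start completes, successfully or not. */
    public void finish(String vmId) {
        inFlight.remove(vmId);
    }
}
```

The monitor would call tryBegin() before dispatching a start and finish() from the task's end-action, making the "still Down but already being started" window invisible to the next pass.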
(In reply to Tomas Jelinek from comment #5)
> OK, some new findings. The reasons the above behaviour happens are:
>
> 1: EntityAsyncTask._multiTasksByEntities is not properly locked.
>
> 2: the starting of the VM has not yet finished, but
>    VmPoolMonitor.managePrestartedVmsInAllVmPools has already returned
>    because the VMs are being processed by other threads. When
>    managePrestartedVmsInAllVmPools returns, the VMs can still be in status
>    Down, so the next run can pick them up and attempt to start them again
>    (causing the race from comment 4).
>
> @Lee: you are right, enlarging VmPoolMonitorIntervalInMinutes and also
> VmPoolMonitorBatchSize should avoid the above NPE until a fix is
> available. Have you already tried it? Did it help?

We asked the customer to decrease the VmPoolMonitorBatchSize and thus far they've not reported any additional issues to us.

Thanks,
Lee
So the race we have seen was actually between RestoreAllSnapshotsCommand and RunVmCommand, and it was caused by the issue described in comment 4. It is less likely to happen in 3.3 because the infrastructure has been changed to attach tasks to command IDs instead of entity IDs, while in older oVirt it is still quite likely. Either way it remains possible, so I have created a patch.
(In reply to Lee Yarwood from comment #6)
> We asked the customer to decrease the VmPoolMonitorBatchSize and thus far
> they've not reported any additional issues to us.

The customer has hit this again after changing VmPoolMonitorBatchSize to 2. Reviewing the logs, I see that they have 21 pools in this environment and it is taking over 5 minutes to prestart even 2 guests from each. As such we are going to suggest they double VmPoolMonitorIntervalInMinutes to 10 minutes.

I'm also going to push a few log.infoFormat() lines upstream that I'd like to include for this BZ. At present it's hard to tell when managePrestartedVmsInAllVmPools() is first called, and then to track progress through the pools when no guests need to be prestarted.
(In reply to Tomas Jelinek from comment #4)
> OK, I see where the race is.
>
> The flow that causes the NPE:
>
> 1: [thread 1] EntityAsyncTask's constructor reaches the point
>    "entityInfo.AttachTask(this)" while "_multiTasksByEntities" still
>    contains one task. At this point it gets interrupted.
>
> 2: [thread 2] EntityAsyncTask.handleEndActionResult executes in the
>    meantime and issues
>    "_multiTasksByEntities.remove(entityInfo.getContainerId());".
>
> 3: [thread 1] the first thread resumes and calls
>    "entityInfo.AttachTask(this);".
>
> 4: [thread 1] EntityAsyncTask.ConcreteStartPollingTask() is called. It asks
>    for GetEntityMultiAsyncTasks(), which returns null (because thread 2 has
>    deleted it), and fails with an NPE on the next line.
>
> The fix should be simple, but first I need to simulate this (I have not
> succeeded yet) to be sure that the fix indeed resolves the issue.

I *think* we have another example of this when manually starting a stateless guest in the same environment. Again the guest is left down with locked images. Just before http-/0.0.0.0:8443-43 starts to poll the snapshot task, pool-4-thread-43 calls EntityAsyncTask.handleEndActionResult and clears the task, leading to the NPE:

2066 2013-10-15 09:38:36,474 INFO [org.ovirt.engine.core.bll.AddPermissionCommand] (http-/0.0.0.0:8443-43) [10244c60] Running command: AddPermissionCommand internal: true. Entities affected : ID: ae4e36f1-0e8c-479a-b67a-a048b3b9d625 Type: VM
2067 2013-10-15 09:38:36,481 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [10244c60] Succceeded giving user ac8ff527-5b3e-4397-ace2-0dbe94ef192c permission to Vm ae4e36f1-0e8c-479a-b67a-a048b3b9d625
[..]
2101 2013-10-15 09:38:51,983 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
2122 2013-10-15 09:38:53,089 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] FINISH, CreateSnapshotVDSCommand, return: bb6ccaf3-4f60-4e54-874d-ad96395982ae, log id: 52fe99bd
2123 2013-10-15 09:38:53,093 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) START, HSMClearTaskVDSCommand(HostName = gdcpl3472.statestr.com, HostId = 69628f04-d538-11e1-8011-441ea10961c4, taskId=dd36409d-2fec-435d-9deb-059fe5d9f682), log id: 3d0448b3
2124 2013-10-15 09:38:53,180 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) FINISH, HSMClearTaskVDSCommand, log id: 3d0448b3
2125 2013-10-15 09:38:53,181 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-43) FINISH, SPMClearTaskVDSCommand, log id: dee1566
2126 2013-10-15 09:38:53,183 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-43) BaseAsyncTask::RemoveTaskFromDB: Removed task dd36409d-2fec-435d-9deb-059fe5d9f682 from DataBase
2127 2013-10-15 09:38:53,183 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-43) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625
2128 2013-10-15 09:38:53,184 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (http-/0.0.0.0:8443-43) [28cb8349] EntityMultiAsyncTasks::AttachTask: Attaching task 7a32265b-2355-4dfd-a588-42eab04d6efb to entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625.
2129 2013-10-15 09:38:53,186 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb to DataBase
2130 2013-10-15 09:38:53,201 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
2136 2013-10-15 09:38:53,251 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (http-/0.0.0.0:8443-43) [28cb8349] BaseAsyncTask::StartPollingTask: Starting to poll task 7a32265b-2355-4dfd-a588-42eab04d6efb.
2137 2013-10-15 09:38:53,258 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [28cb8349] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException
[..]
2244 Caused by: java.lang.NullPointerException
2245 at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:]
2246 at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:]
2247 at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:]
2248 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:]
2249 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:]
2250 at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:]
2251 at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:]
2252 at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:]
2253 at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:]

Full logs to follow.
(In reply to Lee Yarwood from comment #10)
> I *think* we have another example of this when manually starting a stateless
> guest in the same environment. Again the guest is left down with locked
> images.

My mistake, that example was from a different environment; the following is from the same environment as the issue highlighted in the description.

Tracing this through a little more, I see that this version of the race is caused by the engine allowing a user to start a stateless guest before the thread restoring the stateless guest has completed:

41756 2013-10-14 08:27:05,178 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (QuartzScheduler_Worker-35) [29c7dd9d] Deleting snapshots for stateless vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
41757 2013-10-14 08:27:05,184 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (QuartzScheduler_Worker-35) [e4b18a1] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
[..]
41781 2013-10-14 08:27:09,609 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Running command: RestoreAllSnapshotsCommand internal: true. Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41782 2013-10-14 08:27:09,609 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Locking VM(id = 6e80ca4b-cad4-48b6-b307-212073fa7e7f) without compensation.
41783 2013-10-14 08:27:09,614 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] START, SetVmStatusVDSCommand( vmId = 6e80ca4b-cad4-48b6-b307-212073fa7e7f, status = ImageLocked), log id: 4272eea1
41784 2013-10-14 08:27:09,638 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] FINISH, SetVmStatusVDSCommand, log id: 4272eea1
[..]
41844 2013-10-14 08:27:20,773 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e to DataBase
41845 2013-10-14 08:27:20,934 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
41846 2013-10-14 08:27:21,159 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [56f8bb24] Lock freed to object EngineLock [exclusiveLocks= key: 6e80ca4b-cad4-48b6-b307-212073fa7e7f value: VM
41847 , sharedLocks= ]
41848 2013-10-14 08:27:21,160 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-35) [56f8bb24] BaseAsyncTask::StartPollingTask: Starting to poll task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e.
41849 2013-10-14 08:27:21,160 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-35) [56f8bb24] EntityMultiAsyncTasks::StartPollingTask: Current Action Type for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f is RestoreAllSnapshots (determined by task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e)
[..]
41914 2013-10-14 08:27:40,362 INFO [org.ovirt.engine.core.bll.AddPermissionCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Running command: AddPermissionCommand internal: true. Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41915 2013-10-14 08:27:40,443 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Succceeded giving user 50f2b32a-9508-4add-9683-86efda5b1832 permission to Vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
[..]
41958 2013-10-14 08:27:56,920 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) SPMAsyncTask::PollTask: Polling task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
41959 2013-10-14 08:27:56,920 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) BaseAsyncTask::OnTaskEndSuccess: Task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
42005 2013-10-14 08:28:05,955 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e
42006 2013-10-14 08:28:05,955 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) START, SPMClearTaskVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4275c38e
[..]
42015 2013-10-14 08:28:22,690 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
42028 2013-10-14 08:28:30,044 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] FINISH, CreateSnapshotVDSCommand, return: 4ae5ec67-330e-4b44-ba91-2bbe972ad87e, log id: 68485573
[..]
42039 2013-10-14 08:28:46,232 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) START, HSMClearTaskVDSCommand(HostName = RHVH-4_ucs1b1, HostId = 303d94f6-1611-11e3-bf5e-005056ac1129, taskId=6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4c458325
[..]
42041 2013-10-14 08:28:46,391 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) FINISH, HSMClearTaskVDSCommand, log id: 4c458325
42042 2013-10-14 08:28:46,391 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) FINISH, SPMClearTaskVDSCommand, log id: 4275c38e
42043 2013-10-14 08:28:46,425 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) BaseAsyncTask::RemoveTaskFromDB: Removed task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e from DataBase
42044 2013-10-14 08:28:46,425 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f
42045 2013-10-14 08:28:46,426 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (ajp-/127.0.0.1:8702-4) [5402bccd] EntityMultiAsyncTasks::AttachTask: Attaching task 023b0a69-ff26-4a9f-922c-d1c5c895126c to entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f.
42046 2013-10-14 08:28:46,436 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c to DataBase
42047 2013-10-14 08:28:46,480 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
42050 2013-10-14 08:28:46,738 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (ajp-/127.0.0.1:8702-4) [5402bccd] BaseAsyncTask::StartPollingTask: Starting to poll task 023b0a69-ff26-4a9f-922c-d1c5c895126c.
42051 2013-10-14 08:28:46,752 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException
[..]
42169 Caused by: java.lang.NullPointerException
42170 at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:]
42171 at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:]
42172 at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:]
42173 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:]
42174 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:]
42175 at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:]
42176 at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:]
42177 at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:]
42178 at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:]

Again, full engine logs to follow. Tomas, do you agree with this?

Thanks again,
Lee
> Tracing this through a little more, I see that this version of the race is
> caused by the engine allowing a user to start a stateless guest before the
> thread restoring the stateless guest has completed

Yes, this is correct. It is the same bug as the original, so the provided fix should solve it.
Verified in rhevm-3.3.0-0.32.beta1.el6ev.noarch (is22).

I set VmPoolMonitorBatchSize=30 and VmPoolMonitorIntervalInMinutes=1 to increase the chance of hitting the race. Then I created 3 VM pools with 30 VMs in total (10 VMs in each pool) and randomly started, shut down, and powered them off several times. During the process no NPE or other exception occurred in engine.log/server.log, and all VMs behaved normally.
*** Bug 1020909 has been marked as a duplicate of this bug. ***
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information: * Cause: What actions or circumstances cause this bug to present. * Consequence: What happens when the bug presents. * Fix: What was done to fix the bug. * Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore') Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html