+++ This bug is a RHEV-M zstream clone. The original bug is: +++
+++ https://bugzilla.redhat.com/show_bug.cgi?id=1015638. +++
+++ Requested by "mskrivan" +++
======================================================================
----------------------------------------------------------------------
Following comment by lyarwood on October 04 at 16:30:42, 2013

Description of problem:
VmPoolMonitor throws a NullPointerException while starting a guest that in turn remains down with its images locked.

11942:2013-09-17 14:00:45,784 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] Running command: RunVmCommand internal: true. Entities affected : ID: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 Type: VM
11943:2013-09-17 14:00:45,809 INFO [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-79) [e99c11e] VdcBll.RunVmCommand.RunVmAsStateless - Creating snapshot for stateless vm 312HPFMATEM-2 - 0cbfbabf-91bc-4335-a40f-b6671afe2b60
[..]
11988:2013-09-17 14:00:55,554 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] START, CreateSnapshotVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = b5fe10e9-63ac-4d7a-9bf6-f561a126d526, imageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb, imageSizeInBytes = 53687091200, volumeFormat = COW, newImageId = fa18b204-c6d5-4ab4-b021-dae975a1bd50, newImageDescription = , imageId = debf3c4c-2f74-4690-82a4-2d3eb847ffc8, sourceImageGroupId = ac576fe4-4095-4107-b8c1-da8836ef7dcb), log id: 3a4d2b71
11990:2013-09-17 14:00:55,818 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
11991:2013-09-17 14:00:55,818 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] -- createVolume parameters:
12011:2013-09-17 14:00:56,555 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (QuartzScheduler_Worker-79) [421a2518] FINISH, CreateSnapshotVDSCommand, return: fa18b204-c6d5-4ab4-b021-dae975a1bd50, log id: 3a4d2b71
12025:2013-09-17 14:00:56,692 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-79) [421a2518] EntityMultiAsyncTasks::AttachTask: Attaching task d44b59d4-46a0-46b8-b5f6-dc360c060510 to entity 0cbfbabf-91bc-4335-a40f-b6671afe2b60.
12026:2013-09-17 14:00:56,694 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 to DataBase
12028:2013-09-17 14:00:56,703 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-79) [421a2518] Adding task d44b59d4-46a0-46b8-b5f6-dc360c060510 (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
12029:2013-09-17 14:00:56,717 INFO [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (QuartzScheduler_Worker-79) [421a2518] Lock freed to object EngineLock [exclusiveLocks= key: 0cbfbabf-91bc-4335-a40f-b6671afe2b60 value: VM
12031:2013-09-17 14:00:56,750 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-79) [421a2518] BaseAsyncTask::StartPollingTask: Starting to poll task d44b59d4-46a0-46b8-b5f6-dc360c060510.
[..]
12032 2013-09-17 14:00:56,753 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (QuartzScheduler_Worker-79) failed to invoke sceduled method managePrestartedVmsInAllVmPools: java.lang.reflect.InvocationTargetException 12033 at sun.reflect.GeneratedMethodAccessor628.invoke(Unknown Source) [:1.7.0_25] 12034 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_25] 12035 at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_25] 12036 at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:] 12037 at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:] 12038 at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:] 12039 Caused by: javax.ejb.EJBException: java.lang.NullPointerException 12040 at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:191) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12041 at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:237) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12042 at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:374) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12043 at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:218) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12044 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12045 at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12046 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12047 at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64) 
[jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12048 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12049 at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.2.0.Final-redhat-8] 12050 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12051 at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.2.0.Final-redhat-8] 12052 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12053 at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.2.0.Final-redhat-8] 12054 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12055 at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12056 at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.2.0.Final-redhat-8] 12057 at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:182) [jboss-as-ee.jar:7.2.0.Final-redhat-8] 12058 at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12059 at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2] 12060 at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8] 12061 at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view9.runInternalAction(Unknown Source) [engine-bll.jar:] 12062 at 
org.ovirt.engine.core.bll.VmPoolMonitor.runVmAsStateless(VmPoolMonitor.java:142) [engine-bll.jar:] 12063 at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVm(VmPoolMonitor.java:123) [engine-bll.jar:] 12064 at org.ovirt.engine.core.bll.VmPoolMonitor.prestartVms(VmPoolMonitor.java:79) [engine-bll.jar:] 12065 at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInPool(VmPoolMonitor.java:60) [engine-bll.jar:] 12066 at org.ovirt.engine.core.bll.VmPoolMonitor.managePrestartedVmsInAllVmPools(VmPoolMonitor.java:32) [engine-bll.jar:] 12067 ... 6 more 12068 Caused by: java.lang.NullPointerException Version-Release number of selected component (if applicable): rhevm-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:25 CEST rhevm-backend-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:05 CEST rhevm-cli-3.1.1.2-1.el6ev.noarch jue 27 jun 2013 09:17:45 CEST rhevm-config-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:18:56 CEST rhevm-dbscripts-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:24 CEST rhevm-doc-3.1.0-21.el6eng.noarch jue 27 jun 2013 09:19:24 CEST rhevm-dwh-3.1.3-2.el6ev.noarch lun 01 jul 2013 13:47:00 CEST rhevm-genericapi-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:18:56 CEST rhevm-image-uploader-3.1.0-8.el6ev.noarch jue 27 jun 2013 09:17:22 CEST rhevm-iso-uploader-3.1.0-9.el6ev.noarch jue 27 jun 2013 09:17:22 CEST rhevm-log-collector-3.1.0-10.el6ev.noarch jue 27 jun 2013 09:17:22 CEST rhevm-notification-service-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:24 CEST rhevm-reports-3.1.0-20.el6ev.noarch lun 01 jul 2013 13:47:29 CEST rhevm-restapi-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:24 CEST rhevm-sdk-3.1.0.16-1.el6ev.noarch jue 27 jun 2013 09:17:08 CEST rhevm-setup-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:18:56 CEST rhevm-spice-client-x64-cab-3.1-8.el6.noarch jue 27 jun 2013 09:18:57 CEST rhevm-spice-client-x86-cab-3.1-8.el6.noarch jue 27 jun 2013 09:18:56 CEST rhevm-tools-common-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:18:56 CEST rhevm-userportal-3.1.0-55.el6ev.noarch jue 27 
jun 2013 09:19:04 CEST rhevm-webadmin-portal-3.1.0-55.el6ev.noarch jue 27 jun 2013 09:19:16 CEST

How reproducible:
10 instances seen in the current DB.

Steps to Reproduce:
Unclear at present.

Actual results:
Stateless pool guests remain down with their images locked.

Expected results:
Stateless pool guests start fully.

Additional info:
In addition to this we also see some guests remain in a 'waiting for launch' state even after launching successfully on a host. I'm not familiar enough with the VmPoolMonitor to know if the two are related.

----------------------------------------------------------------------
Following comment by lyarwood on October 04 at 16:39:52, 2013

Created attachment 807830 [details]
SFDC#00946299 - example engine.log showing NPE.

Full LC available at:
http://lyarwood-fab.usersys.redhat.com/tickets/00946299/rhvm-2013091811431379497401/
guest.redhat.com:~/tickets/00946299/rhvm-2013091811431379497401/ (passwd redhat)

In the above LC the 10 guests marked as down with locked images are:

engine=> select s.vm_name, d.vm_guid from vm_dynamic as d, vm_static as s where s.vm_guid = d.vm_guid and d.vm_guid in ( select vm_guid from vms where vm_name in (select vm_names from vm_images_view where imagestatus != 1));
    vm_name     |               vm_guid
----------------+--------------------------------------
 312HPALUFIS-14 | b104b412-2e01-41b9-9c7f-7ef30512a24c
 312HPALUFIS-4  | aa1948a9-1958-423f-94fd-9f2fa15930c5
 312HPALUFIS-3  | 138fbb02-9274-44bc-bc95-607b49a123f6
 312HPFMATEM-2  | 0cbfbabf-91bc-4335-a40f-b6671afe2b60
 312HPALUFIS-8  | 7aa75582-1208-4d53-be3f-0d03142c0317
 312HPFMATEM-16 | 22ab3180-436f-4198-b40c-86369eb73001
 312HPFMATEM-8  | 224cabb7-7d99-4699-bf92-35f8314d8a3f
 312HPFMATEM-7  | e620ef7e-b044-4a6e-a21a-04fb9a393dad
 312HPALUFIS-16 | 4a437d13-9bf5-40f0-8169-51d9c3c831dc
 312HPFMATEM-12 | 054026f1-5088-45f8-9cc3-35ac4217c88b
(10 rows)

In addition, the guests marked as 'Waiting for launch' are:

engine=> select vm_static.vm_name, vm_static.vm_guid from vm_static INNER JOIN vm_dynamic ON vm_static.vm_guid=vm_dynamic.vm_guid and vm_dynamic.status='9';
    vm_name     |               vm_guid
----------------+--------------------------------------
 31HPFFarma     | 411cd361-5cfb-4c82-888d-b4dcf7fc48aa
 312HPTARON-50  | c166dfea-99eb-4b0f-9727-b715f2141112
 312HPFMATEM-18 | 6449215f-9b73-4659-81f3-8f7319979506
 312HPTARON-18  | ec1290db-81d1-4be2-959b-69f9e662b538
 312HPTARON-53  | c89a696f-24a7-447a-8340-568cdde94afd
 312HPFMATEM-19 | 4f1d7f18-d448-490f-afcc-e509db4974a2
 312HPTARON-10  | 55bf82ec-0c6a-4c44-8450-7b22a6081415
 312HPTARON-55  | 9bbd7ab3-9bba-4967-9736-c66437ab09e8
 31HPFBiolo     | 869d4a50-0342-448a-8165-f1f382ab1394
 312HPFMATEM-6  | 2a8b4aff-a995-4839-b283-32db044cae7c
 312HPFMATEM-17 | ced5c561-4a39-4637-b5b6-844ae55f265f
 312HPALUFIS-15 | f3f93e4f-b19d-4c81-ac2f-b20827aa1462
(12 rows)

----------------------------------------------------------------------
Following comment by tjelinek on October 07 at 14:55:27, 2013

I have not yet succeeded in simulating it. According to our short chat with Lee, it looks to me like some kind of threading problem that can happen when many VMs in a pool are started in parallel. Investigating...

----------------------------------------------------------------------
Following comment by lyarwood on October 08 at 12:37:31, 2013

(In reply to Tomas Jelinek from comment #2)
> I have not succeeded to simulate it yet.
>
> According to our short chat with Lee it seems to me as some kind of a
> problem with threads which can happen when there are many vms in a pool
> started in parallel.
> Investigating...

Digging a little deeper, I *think* this is a simple race between the current managePrestartedVmsInAllVmPools call and the previous one (the interval is defined by VmPoolMonitorIntervalInMinutes). If the thread is still prestarting the last batch of VMs when the managePrestartedVmsInAllVmPools timer fires again, we get this traceback and a guest left 'waiting on launch'.
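The suspected overlap - the timer firing again while the previous batch is still prestarting - can be modelled with a simple in-flight flag. This is a minimal, hypothetical Java sketch only; PoolMonitorGuard, tryBeginRun and endRun are illustrative names, not the actual engine code or its eventual fix:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard: a scheduled method skips a run entirely while the
// previous run is still in flight, instead of racing against it.
public class PoolMonitorGuard {
    private final AtomicBoolean running = new AtomicBoolean(false);

    // Try to start a monitor run; false means the previous run has not finished.
    public boolean tryBeginRun() {
        return running.compareAndSet(false, true);
    }

    // Mark the current run as finished so the next timer firing may proceed.
    public void endRun() {
        running.set(false);
    }

    public static void main(String[] args) {
        PoolMonitorGuard g = new PoolMonitorGuard();
        System.out.println(g.tryBeginRun()); // true: first timer firing runs
        System.out.println(g.tryBeginRun()); // false: timer fired mid-run, skipped
        g.endRun();
        System.out.println(g.tryBeginRun()); // true: next firing runs again
    }
}
```

Increasing VmPoolMonitorIntervalInMinutes, as suggested above, achieves the same effect indirectly by making it less likely that two runs ever overlap.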
Trying to reproduce now, but as a workaround we can either decrease the number of guests being prestarted in each batch (as defined by VmPoolMonitorBatchSize) or increase VmPoolMonitorIntervalInMinutes to allow a batch of guests to be prestarted before the method is called again.

----------------------------------------------------------------------
Following comment by tjelinek on October 09 at 13:32:22, 2013

OK, I see where the race is. The flow that causes the NPE is:

1: [thread 1] EntityAsyncTask's constructor reaches the point "entityInfo.AttachTask(this)" while "_multiTasksByEntities" still contains one task. After this it gets interrupted.

2: [thread 2] EntityAsyncTask.handleEndActionResult executes in the meantime and issues "_multiTasksByEntities.remove(entityInfo.getContainerId());".

3: [thread 1] the first thread continues execution and calls "entityInfo.AttachTask(this);".

4: [thread 1] EntityAsyncTask.ConcreteStartPollingTask() is called. It asks for GetEntityMultiAsyncTasks(), which returns null (because thread 2 has deleted it), and fails with an NPE on the next line.

The fix should be simple, but first I need to simulate this (I have not succeeded yet) to be sure that the fix indeed fixes the issue.

----------------------------------------------------------------------
Following comment by tjelinek on October 10 at 14:41:34, 2013

OK, some new findings - the reason the behaviour described above happens is that:

1: EntityAsyncTask._multiTasksByEntities is not properly locked.

2: the starting of the VM has not yet finished, but VmPoolMonitor.managePrestartedVmsInAllVmPools has already returned because the VMs are being processed by other threads. The problem is that when managePrestartedVmsInAllVmPools returns, the VMs can still be in status Down and therefore available for the next managePrestartedVmsInAllVmPools run to pick up and attempt to start (causing the race from comment 4).
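The interleaving described in comment 4 can be replayed deterministically. The following is a hypothetical, simplified Java model of it - TaskMapRace, taskMap and the helper names are illustrative, not the actual EntityAsyncTask code - showing both the check-then-act gap on an unsynchronized map and one possible shape of an atomic fix:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the race: thread 1 attaches a task assuming the
// per-entity holder exists, while thread 2 concurrently removes it.
public class TaskMapRace {

    // Replay the four steps from comment 4 single-threaded, so the bad
    // interleaving is deterministic rather than timing-dependent.
    static String replayBuggyInterleaving() {
        Map<String, String> taskMap = new HashMap<>();
        String entityId = "vm-1";
        taskMap.put(entityId, "multi-task-holder"); // step 1: holder present

        taskMap.remove(entityId);                   // step 2: other thread removes it

        // steps 3-4: the first thread resumes and looks the holder up again;
        // it is gone, and dereferencing the null result is the NPE.
        String holder = taskMap.get(entityId);
        return holder == null ? "NPE here" : holder;
    }

    // One possible fix shape: make lookup-or-create atomic, so a concurrent
    // remove can never leave the attach step without a holder.
    static String withAtomicAttach() {
        ConcurrentHashMap<String, String> taskMap = new ConcurrentHashMap<>();
        String entityId = "vm-1";
        taskMap.put(entityId, "multi-task-holder");
        taskMap.remove(entityId);                   // the racing removal

        // computeIfAbsent atomically recreates the holder if it is gone.
        return taskMap.computeIfAbsent(entityId, id -> "multi-task-holder");
    }

    public static void main(String[] args) {
        System.out.println(replayBuggyInterleaving()); // NPE here
        System.out.println(withAtomicAttach());        // multi-task-holder
    }
}
```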
@Lee: you are right, enlarging VmPoolMonitorIntervalInMinutes and also VmPoolMonitorBatchSize should work around the above NPE until a fix is available. Have you already tried it? Did it help? Thanks

----------------------------------------------------------------------
Following comment by lyarwood on October 10 at 14:53:10, 2013

(In reply to Tomas Jelinek from comment #5)
> OK, some new findings - the reason why the above described thing happens is
> that:
>
> 1: the EntityAsyncTask._multiTasksByEntities is not properly locked
> 2: the starting of the VM has not yet finished but the
>    VmPoolMonitor.managePrestartedVmsInAllVmPools already returned because the
>    VMs are being processed by other threads. The problem is that when the
>    managePrestartedVmsInAllVmPools returns, the VMs still can be in status down
>    so available for the next managePrestartedVmsInAllVmPools to be picked and
>    attempted to start (causing the race from comment 4).
>
> @Lee: you are right, enlarging the VmPoolMonitorIntervalInMinutes and also
> the VmPoolMonitorBatchSize should solve the above NPE until a fix will be
> available. Have you already tried it? Did it help?

We asked the customer to decrease the VmPoolMonitorBatchSize and thus far they've not reported any additional issues to us.

Thanks,
Lee

----------------------------------------------------------------------
Following comment by michal.skrivanek on October 15 at 08:01:23, 2013

This seems quite likely to happen in 3.2 as well, so better to fix it in 3.2.z before more people see it.

----------------------------------------------------------------------
Following comment by tjelinek on October 15 at 09:23:57, 2013

So the race we have seen was actually between RestoreAllSnapshotsCommand and RunVmCommand, and it was caused by the issue described in comment 4. It is less likely to happen in 3.3 because the infrastructure has been changed to attach tasks to command IDs instead of entity IDs.
Anyway, it is still possible, so I have created a patch. In older oVirt versions it is still quite likely to happen.

----------------------------------------------------------------------
Following comment by lyarwood on October 16 at 11:03:07, 2013

(In reply to Lee Yarwood from comment #6)
> We asked the customer to decrease the VmPoolMonitorBatchSize and thus far
> they've not reported any additional issues to us.

The customer has hit this again after changing VmPoolMonitorBatchSize to 2. Reviewing the logs, I see that they have 21 pools in this environment and it is taking over 5 minutes to prestart even 2 guests from each. As such we are going to suggest they double VmPoolMonitorIntervalInMinutes to 10 minutes.

I'm also going to push a few log.infoFormat() lines upstream that I'd like to include for this BZ. At present it's hard to tell when managePrestartedVmsInAllVmPools() is first called and then to track our progress through the pools if no guests need to be prestarted.

----------------------------------------------------------------------
Following comment by lyarwood on October 16 at 16:08:01, 2013

(In reply to Tomas Jelinek from comment #4)
> OK, I see where the race is.
>
> the flow which causes the NPE is:
>
> 1: [thread 1] EntityAsyncTask's constructor reaches the point
>    "entityInfo.AttachTask(this)" while the "_multiTasksByEntities" still
>    contains one task. After this it gets interrupted
>
> 2: [thread 2] the EntityAsyncTask.handleEndActionResult is executed in
>    meanwhile and issues the
>    "_multiTasksByEntities.remove(entityInfo.getContainerId());"
>
> 3: [thread 1] the first thread continues the execution and calls the
>    "entityInfo.AttachTask(this);"
>
> 4: [thread 1] the EntityAsyncTask.ConcreteStartPollingTask() is called.
>    It asks for the GetEntityMultiAsyncTasks() which returns null (because
>    thread 2 have deleted it) and fails on NPE on the next line
>
> The fix should be simple, but first I need to simulate this (I have not
> succeeded yet) to be sure that the fix indeed fixes the issue.

I *think* we have another example of this when manually starting a stateless guest in the same environment. Again the guest is left down with locked images. Just before http-/0.0.0.0:8443-43 starts to poll the snapshot task, pool-4-thread-43 calls EntityAsyncTask.handleEndActionResult and clears the task, leading to the NPE:

2066 2013-10-15 09:38:36,474 INFO [org.ovirt.engine.core.bll.AddPermissionCommand] (http-/0.0.0.0:8443-43) [10244c60] Running command: AddPermissionCommand internal: true. Entities affected : ID: ae4e36f1-0e8c-479a-b67a-a048b3b9d625 Type: VM
2067 2013-10-15 09:38:36,481 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [10244c60] Succceeded giving user ac8ff527-5b3e-4397-ace2-0dbe94ef192c permission to Vm ae4e36f1-0e8c-479a-b67a-a048b3b9d625
[..]
2101 2013-10-15 09:38:51,983 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
2122 2013-10-15 09:38:53,089 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (http-/0.0.0.0:8443-43) [28cb8349] FINISH, CreateSnapshotVDSCommand, return: bb6ccaf3-4f60-4e54-874d-ad96395982ae, log id: 52fe99bd
2123 2013-10-15 09:38:53,093 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) START, HSMClearTaskVDSCommand(HostName = gdcpl3472.statestr.com, HostId = 69628f04-d538-11e1-8011-441ea10961c4, taskId=dd36409d-2fec-435d-9deb-059fe5d9f682), log id: 3d0448b3
2124 2013-10-15 09:38:53,180 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-43) FINISH, HSMClearTaskVDSCommand, log id: 3d0448b3
2125 2013-10-15 09:38:53,181 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-43) FINISH, SPMClearTaskVDSCommand, log id: dee1566
2126 2013-10-15 09:38:53,183 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-43) BaseAsyncTask::RemoveTaskFromDB: Removed task dd36409d-2fec-435d-9deb-059fe5d9f682 from DataBase
2127 2013-10-15 09:38:53,183 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-43) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625
2128 2013-10-15 09:38:53,184 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (http-/0.0.0.0:8443-43) [28cb8349] EntityMultiAsyncTasks::AttachTask: Attaching task 7a32265b-2355-4dfd-a588-42eab04d6efb to entity ae4e36f1-0e8c-479a-b67a-a048b3b9d625.
2129 2013-10-15 09:38:53,186 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb to DataBase 2130 2013-10-15 09:38:53,201 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (http-/0.0.0.0:8443-43) [28cb8349] Adding task 7a32265b-2355-4dfd-a588-42eab04d6efb (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet.. [..] 2136 2013-10-15 09:38:53,251 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (http-/0.0.0.0:8443-43) [28cb8349] BaseAsyncTask::StartPollingTask: Starting to poll task 7a32265b-2355-4dfd-a588-42eab04d6efb. 2137 2013-10-15 09:38:53,258 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (http-/0.0.0.0:8443-43) [28cb8349] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException [..] 2244 Caused by: java.lang.NullPointerException 2245 at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:] 2246 at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:] 2247 at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:] 2248 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:] 2249 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:] 2250 at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:] 2251 at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:] 2252 at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:] 2253 at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:] Full logs to follow. 
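The stack trace above ends inside EntityAsyncTask.ConcreteStartPollingTask, where the holder looked up for the entity has already been removed by the racing thread. Independently of the real locking fix, a defensive shape for that lookup might look like the following hypothetical sketch - PollingStart, tasksByEntity and startPolling are illustrative names, not the actual EntityAsyncTask API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fail soft with a clear message when the per-entity
// holder has been concurrently removed, instead of an opaque NPE.
public class PollingStart {
    static final Map<String, Object> tasksByEntity = new HashMap<>();

    // Returns a status message rather than dereferencing a possibly-null holder.
    static String startPolling(String entityId) {
        Object holder = tasksByEntity.get(entityId);
        if (holder == null) {
            return "holder for " + entityId + " already removed; skipping poll";
        }
        return "polling started for " + entityId;
    }

    public static void main(String[] args) {
        tasksByEntity.put("vm-a", new Object());
        System.out.println(startPolling("vm-a")); // polling started for vm-a
        System.out.println(startPolling("vm-b")); // holder already removed; skipping poll
    }
}
```

A null check alone would only mask the lost task, so this is damage limitation, not a substitute for making the attach/remove sequence atomic.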
----------------------------------------------------------------------
Following comment by lyarwood on October 16 at 16:11:19, 2013

Created attachment 813026 [details]
SFDC#00966220 - Example engine.log showing another method of hitting the same NPE.

----------------------------------------------------------------------
Following comment by lyarwood on October 16 at 16:44:02, 2013

(In reply to Lee Yarwood from comment #10)
> I *think* we have another example of this when manually starting a stateless
> guest in the same environment. Again the guest is left down with locked
> images.

My mistake - that example was from a different environment; the following is from the same environment as the issue highlighted in the description.

Tracing this through a little more, I see that this version of the race is caused by the engine allowing a user to start a stateless guest before the thread restoring the stateless guest has completed:

41756 2013-10-14 08:27:05,178 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (QuartzScheduler_Worker-35) [29c7dd9d] Deleting snapshots for stateless vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
41757 2013-10-14 08:27:05,184 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (QuartzScheduler_Worker-35) [e4b18a1] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
[..]
41781 2013-10-14 08:27:09,609 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Running command: RestoreAllSnapshotsCommand internal: true. Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41782 2013-10-14 08:27:09,609 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [3101ee70] Locking VM(id = 6e80ca4b-cad4-48b6-b307-212073fa7e7f) without compensation.
41783 2013-10-14 08:27:09,614 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] START, SetVmStatusVDSCommand( vmId = 6e80ca4b-cad4-48b6-b307-212073fa7e7f, status = ImageLocked), log id: 4272eea1
41784 2013-10-14 08:27:09,638 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (QuartzScheduler_Worker-35) [3101ee70] FINISH, SetVmStatusVDSCommand, log id: 4272eea1
[..]
41844 2013-10-14 08:27:20,773 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e to DataBase
41845 2013-10-14 08:27:20,934 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-35) [56f8bb24] Adding task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
41846 2013-10-14 08:27:21,159 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (QuartzScheduler_Worker-35) [56f8bb24] Lock freed to object EngineLock [exclusiveLocks= key: 6e80ca4b-cad4-48b6-b307-212073fa7e7f value: VM
41847 , sharedLocks= ]
41848 2013-10-14 08:27:21,160 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-35) [56f8bb24] BaseAsyncTask::StartPollingTask: Starting to poll task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e.
41849 2013-10-14 08:27:21,160 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (QuartzScheduler_Worker-35) [56f8bb24] EntityMultiAsyncTasks::StartPollingTask: Current Action Type for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f is RestoreAllSnapshots (determined by task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e)
[..]
41914 2013-10-14 08:27:40,362 INFO [org.ovirt.engine.core.bll.AddPermissionCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Running command: AddPermissionCommand internal: true.
Entities affected : ID: 6e80ca4b-cad4-48b6-b307-212073fa7e7f Type: VM
41915 2013-10-14 08:27:40,443 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [18d6e8d1] Succceeded giving user 50f2b32a-9508-4add-9683-86efda5b1832 permission to Vm 6e80ca4b-cad4-48b6-b307-212073fa7e7f
[..]
41958 2013-10-14 08:27:56,920 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) SPMAsyncTask::PollTask: Polling task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
41959 2013-10-14 08:27:56,920 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-50) BaseAsyncTask::OnTaskEndSuccess: Task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e (Parent Command RestoreAllSnapshots, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
42005 2013-10-14 08:28:05,955 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e
42006 2013-10-14 08:28:05,955 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) START, SPMClearTaskVDSCommand( storagePoolId = 2385c9b1-6dcb-4c38-aa8b-87444673a461, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4275c38e
[..]
42015 2013-10-14 08:28:22,690 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
[..]
42028 2013-10-14 08:28:30,044 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] FINISH, CreateSnapshotVDSCommand, return: 4ae5ec67-330e-4b44-ba91-2bbe972ad87e, log id: 68485573 [..] 42039 2013-10-14 08:28:46,232 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) START, HSMClearTaskVDSCommand(HostName = RHVH-4_ucs1b1, HostId = 303d94f6-1611-11e3-bf5e-005056ac1129, taskId=6bdd87f9-95b1-459c-8f4b-6d0206f51a8e), log id: 4c458325 [..] 42041 2013-10-14 08:28:46,391 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-49) FINISH, HSMClearTaskVDSCommand, log id: 4c458325 42042 2013-10-14 08:28:46,391 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-49) FINISH, SPMClearTaskVDSCommand, log id: 4275c38e 42043 2013-10-14 08:28:46,425 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-49) BaseAsyncTask::RemoveTaskFromDB: Removed task 6bdd87f9-95b1-459c-8f4b-6d0206f51a8e from DataBase 42044 2013-10-14 08:28:46,425 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) EntityAsyncTask::HandleEndActionResult [within thread]: Removing EntityMultiAsyncTasks object for entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f 42045 2013-10-14 08:28:46,426 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (ajp-/127.0.0.1:8702-4) [5402bccd] EntityMultiAsyncTasks::AttachTask: Attaching task 023b0a69-ff26-4a9f-922c-d1c5c895126c to entity 6e80ca4b-cad4-48b6-b307-212073fa7e7f. 
42046 2013-10-14 08:28:46,436 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskUtils] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c to DataBase
42047 2013-10-14 08:28:46,480 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (ajp-/127.0.0.1:8702-4) [5402bccd] Adding task 023b0a69-ff26-4a9f-922c-d1c5c895126c (Parent Command RunVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
42050 2013-10-14 08:28:46,738 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (ajp-/127.0.0.1:8702-4) [5402bccd] BaseAsyncTask::StartPollingTask: Starting to poll task 023b0a69-ff26-4a9f-922c-d1c5c895126c.
42051 2013-10-14 08:28:46,752 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-4) [5402bccd] Command org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand throw exception: javax.ejb.EJBTransactionRolledbackException
[..]
42169 Caused by: java.lang.NullPointerException
42170 at org.ovirt.engine.core.bll.EntityAsyncTask.ConcreteStartPollingTask(EntityAsyncTask.java:67) [engine-bll.jar:]
42171 at org.ovirt.engine.core.bll.SPMAsyncTask.StartPollingTask(SPMAsyncTask.java:127) [engine-bll.jar:]
42172 at org.ovirt.engine.core.bll.AsyncTaskManager.StartPollingTask(AsyncTaskManager.java:469) [engine-bll.jar:]
42173 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1215) [engine-bll.jar:]
42174 at org.ovirt.engine.core.bll.CommandBase.startPollingAsyncTasks(CommandBase.java:1220) [engine-bll.jar:]
42175 at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1008) [engine-bll.jar:]
42176 at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:291) [engine-bll.jar:]
42177 at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:351) [engine-bll.jar:]
42178 at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:597) [engine-bll.jar:]

Again, full engine logs to follow.
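This second variant boils down to the engine accepting a RunVm request while the stateless restore still holds the VM's images locked. A minimal, hypothetical sketch of a status guard that would reject the request up front - the enum values and canRunVm are illustrative, not the actual engine validation:

```java
// Hypothetical sketch: refuse to run a pool VM while its stateless restore
// is still in progress (images locked), instead of racing the restore thread.
public class RunVmGuard {
    enum VmStatus { DOWN, IMAGE_LOCKED, WAIT_FOR_LAUNCH, UP }

    // Validate the requested start against the VM's current status.
    static String canRunVm(VmStatus status) {
        if (status == VmStatus.IMAGE_LOCKED) {
            return "refused: stateless restore still in progress";
        }
        if (status != VmStatus.DOWN) {
            return "refused: VM is not down";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(canRunVm(VmStatus.DOWN));         // ok
        System.out.println(canRunVm(VmStatus.IMAGE_LOCKED)); // refused
    }
}
```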
Tomas, do you agree with this?

Thanks again,

Lee

----------------------------------------------------------------------

Following comment by lyarwood on October 16 at 16:45:33, 2013

Created attachment 813063 [details]
SFDC#00946299 - Example engine.log showing another method of hitting the same NPE.

----------------------------------------------------------------------

Following comment by byount on October 16 at 23:31:57, 2013

SFDC#00966220 reported by State Street on RHEV 3.1.5. Is this issue known to happen on that version of RHEV also? Just as a note, they are expected to update to 3.2 in the next few weeks.

----------------------------------------------------------------------

Following comment by tjelinek on October 17 at 06:20:29, 2013

> Tracing this through a little more I see that this version of the race is
> caused by the engine allowing a user to start a stateless guest before the
> thread restoring the stateless guest has completed

Yes, this is correct. It is the same bug as the original, so the provided fix should solve it.

----------------------------------------------------------------------

Following comment by tjelinek on October 17 at 06:23:19, 2013

> SFDC#00966220 reported by State Street on RHEV 3.1.5. Is this issue known to
> happen on that version of RHEV also?

Not sure whether it has been reported on that version, but looking at the code, 3.1.5 is vulnerable in the same way and the same exception can occur.

----------------------------------------------------------------------

Following comment by lyarwood on October 17 at 09:50:38, 2013

(In reply to Bryan Yount from comment #14)
> SFDC#00966220 reported by State Street on RHEV 3.1.5. Is this issue known to
> happen on that version of RHEV also? Just as a note, they are expected to
> update to 3.2 in the next few weeks.

Just to make it clear, both State Street in SFDC#00966220 and the Universitat de València in SFDC#00946299 are running 3.1.5.
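The log excerpt above shows the shape of the race: at 08:28:46,425 one thread (pool-4-thread-49) removes the EntityMultiAsyncTasks object for the entity, and one millisecond later another thread (ajp-/127.0.0.1:8702-4) attaches a new task and starts polling, hitting the NPE in EntityAsyncTask.ConcreteStartPollingTask. The following is a minimal, self-contained sketch of that check-then-act interleaving; the class and map names are illustrative only, not the real oVirt engine classes:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race reported in this bug: one thread removes the
// per-entity task container while another thread is about to poll it, so the
// lookup returns null and dereferencing it throws a NullPointerException.
public class PollingRaceSketch {
    // Stand-in for the engine's per-entity task bookkeeping.
    public static final Map<UUID, StringBuilder> tasksByEntity = new ConcurrentHashMap<>();

    // Thread 1 (pool-4-thread-49 in the log): end-of-action handling removes
    // the container for the entity ("Removing EntityMultiAsyncTasks object").
    static void handleEndActionResult(UUID entity) {
        tasksByEntity.remove(entity);
    }

    // Thread 2 (ajp-/127.0.0.1:8702-4 in the log): starts polling, assuming
    // the container still exists. No null check -> NPE once the removal above
    // has already happened.
    static void startPollingTask(UUID entity) {
        tasksByEntity.get(entity).append("polling");
    }

    public static void main(String[] args) {
        UUID vm = UUID.randomUUID();
        tasksByEntity.put(vm, new StringBuilder());

        handleEndActionResult(vm);   // 08:28:46,425 container removed
        try {
            startPollingTask(vm);    // 08:28:46,426 attach + poll -> NPE
        } catch (NullPointerException npe) {
            System.out.println("NullPointerException reproduced");
        }
    }
}
```

The interleaving is shown sequentially here to make it deterministic; in the engine the two steps run on separate threads, which is why the failure only appears under the timing conditions described in the reproduction steps.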
Both are being asked to move to 3.2.z to get a fix for this, and we will likely be asked to provide a hotfix to both customers until 3.2.5 is released in December.
Steps to Reproduce by Tomas:
1. Have a pool with a bunch of VMs (e.g. 20), all of them prestarted.
2. Set VmPoolMonitorIntervalInMinutes to a small value (e.g. 1) and VmPoolMonitorBatchSize to a big one (e.g. 20) (both in vdc_options).
3. Wait until all the VMs start.
4. Select half of them and stop them.
5. If it does not fail, wait until they start again and then stop half of them again. Repeat until it fails.
This bug is currently attached to errata RHBA-2013:16431. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Verified in rhevm-3.2.5-0.48.el6ev.noarch (sf22). Verified according to the steps in comment 1:

1. Set VmPoolMonitorBatchSize=20 and VmPoolMonitorIntervalInMinutes=1.
2a. Create 2 prestarted pools, 10 VMs in each.
2b. Create 1 prestarted pool with 20 VMs.
3. Wait until all VMs are up and then power off/shut down approx. half of them at once.
4. Repeat step 3 with different VMs and a slightly different number of them.

Results: All stopped VMs were automatically started up again. No errors were observed during the whole process, no NPE in engine.log/server.log, and no errors or warnings in events.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1831.html