Description of problem:

The creation of the snapshot for the stateless VM seems to succeed, but right afterwards RunVm (endAction) throws an NPE and CreateVmVDSCommand is never reached. I don't fully understand what is going on, but it is apparently related to ExecutionContext.

2016-07-15 11:20:19,049 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (ajp-/127.0.0.1:8702-204) [27fc4fcb] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-00000000009a', ignoreFailoverLimit='false', storageDomainId='cce0c8f0-cc8c-4d6c-a96b-998046d7b725', imageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6', imageSizeInBytes='21474836480', volumeFormat='COW', newImageId='e2a4529a-f2dc-4730-b3b2-ed54df99f52d', newImageDescription='', imageInitialSizeInBytes='0', imageId='3d8defe6-7fd8-47fc-b8e8-4c3477874385', sourceImageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6'}), log id: 1bad7115
2016-07-15 11:20:40,370 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-42) [27fc4fcb] START, GetImageInfoVDSCommand( GetImageInfoVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-00000000009a', ignoreFailoverLimit='false', storageDomainId='cce0c8f0-cc8c-4d6c-a96b-998046d7b725', imageGroupId='d08aaecc-f7a4-4f85-9237-cf00dd4029a6', imageId='e2a4529a-f2dc-4730-b3b2-ed54df99f52d'}), log id: f63cdc9
2016-07-15 11:20:40,782 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-42) [27fc4fcb] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@dd255f12, log id: f63cdc9
2016-07-15 11:20:41,007 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (ajp-/127.0.0.1:8702-809) [ebec59] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{runAsync='true', vmId='4504655c-fa6d-4de4-b6df-7752bf19f8e6'}), log id: 3a996a07
2016-07-15 11:20:41,007 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (ajp-/127.0.0.1:8702-809) [ebec59] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 3a996a07
2016-07-15 11:20:41,023 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-42) [] Trying to release exclusive lock which does not exist, lock key: '30316f1a-efa5-403a-8d6e-0575c5d53487VM'
2016-07-15 11:20:41,038 INFO [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-6-thread-42) [] Lock freed to object 'EngineLock:{exclusiveLocks='[30316f1a-efa5-403a-8d6e-0575c5d53487=<VM, ACTION_TYPE_FAILED_SNAPSHOT_IS_BEING_TAKEN_FOR_VM$VmName VDPool1-237>]', sharedLocks='null'}'
2016-07-15 11:20:41,062 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-42) [] [within thread]: endAction for action type RunVm threw an exception.: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.RunVmCommand.createContextForRunStatelessVm(RunVmCommand.java:1146) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.endSuccessfully(RunVmCommand.java:1124) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:651) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:595) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1984) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:524) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.DecoratedCommand.endAction(DecoratedCommand.java:17) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:360) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:89) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_101]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_101]

According to the backtrace, the NPE is generated here:

    Job job = JobRepositoryFactory.getJobRepository().getJobWithSteps(step.getJobId());

From what I gathered, it is quite likely that 'step' is the culprit. So could

    Step step = getExecutionContext().getStep();

be null? I don't properly understand exactly how this works or why this would happen. I will add more information once I dig deeper into it.

Version-Release number of selected component (if applicable):
rhevm-3.6.5.3-0.1.el6.noarch

How reproducible:
0% - a simple reproduction using the API does not reproduce this. Likely some race condition/locking is involved.

Actual results:
NPE

Expected results:
No NPE

Additional info:
This is quite a big environment, with many VMs running and returning to the pool all the time. Creating stateless snapshots and destroying images are constant operations throughout the logs.
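To illustrate the suspected failure mode (this is a simplified sketch, not the actual engine code): if the execution context's step has been cleared by the time endAction runs, dereferencing it throws exactly this NPE, while a null guard would tolerate the missing step. The Step and ExecutionContext classes below are hypothetical stand-ins that only mirror the names used in the report.

```java
public class StatelessVmNpeSketch {
    // Hypothetical, stripped-down stand-ins for the engine's Step and
    // ExecutionContext classes; only the members needed for the sketch.
    static class Step {
        private final String jobId;
        Step(String jobId) { this.jobId = jobId; }
        String getJobId() { return jobId; }
    }

    static class ExecutionContext {
        private Step step;
        ExecutionContext(Step step) { this.step = step; }
        Step getStep() { return step; }
        void clearStep() { step = null; }  // what a racing command may effectively do
    }

    // Unguarded lookup, mirroring the suspected failing line in
    // RunVmCommand.createContextForRunStatelessVm: NPE when step is null.
    static String jobIdUnguarded(ExecutionContext ctx) {
        return ctx.getStep().getJobId();
    }

    // Guarded lookup: tolerate a cleared step instead of failing endAction.
    static String jobIdGuarded(ExecutionContext ctx) {
        Step step = ctx.getStep();
        return step == null ? null : step.getJobId();
    }

    public static void main(String[] args) {
        ExecutionContext ctx = new ExecutionContext(new Step("job-1"));
        System.out.println(jobIdGuarded(ctx));  // prints job-1
        ctx.clearStep();  // simulate the race: another command cleared the step
        System.out.println(jobIdGuarded(ctx));  // prints null
    }
}
```

With the guard, a cleared step degrades to a missing job lookup instead of aborting the whole endAction, which matches the protection reportedly added in later versions.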
This other occurrence shows an interesting error before the NPE:

2016-07-09 18:37:38,979 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (org.ovirt.thread.pool-6-thread-3) [4a263009] Error in invocating CTOR of command 'RunVm': null
2016-07-09 18:37:38,980 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-3) [4a263009] [within thread]: endAction for action type RunVm threw an exception.: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.buildCommand(CoCoAsyncTaskHelper.java:376) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:359) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:89) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_101]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_101]
Note that there are multiple issues in this setup (which may or may not be related to this specific one). For example:

2016-07-15 11:13:50,931 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler_Worker-7) [4abb9e0a] All hosts are over-utilized

And:

2016-07-15 11:00:48,274 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (ajp-/127.0.0.1:8702-115) [11057733] Command 'SetVmTicketVDSCommand(HostName = rhevh2, SetVmTicketVDSCommandParameters:{runAsync='true', hostId='4ac07f03-9f7a-46aa-a5b7-63e63968bc61', vmId='45923bc6-6ad6-4da9-9d81-90624a6ccb07', protocol='SPICE', ticket='SVnKdIBI+LTV', validTime='120', userName='mmasmmi', userId='06e235f5-0a34-4ebe-8fe7-22b4ad969a0f', disconnectAction='LOCK_SCREEN'})' execution failed: VDSGenericException: VDSErrorException: Failed to SetVmTicketVDS, error = 'NoneType' object has no attribute 'XMLDesc', code = -32603
2016-07-15 11:00:48,274 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (ajp-/127.0.0.1:8702-115) [11057733] FINISH, SetVmTicketVDSCommand, log id: 56553492
2016-07-15 11:00:48,276 ERROR [org.ovirt.engine.core.bll.SetVmTicketCommand] (ajp-/127.0.0.1:8702-115) [11057733] Command 'org.ovirt.engine.core.bll.SetVmTicketCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SetVmTicketVDS, error = 'NoneType' object has no attribute 'XMLDesc', code = -32603 (Failed with error unexpected and code 16)

It looks like the hosts are over-utilized. I'm not sure that's related, but it's worth checking, and the vdsm logs (and host logs in general) might be helpful as well.
I suspect this is a consequence of the early phase of cinder support, because I see the following pattern whenever the NPE is thrown:
- the VM is attached to a user and run
- creation of the stateless snapshot starts
- a snapshot is started for the non-cinder disk
- a snapshot is started for the cinder disk
- the SPM async task finishes (the snapshot for the non-cinder disk is done) -> ending CreateAllSnapshotsFromVmCommand
- then the callback is triggered (the snapshot for the cinder disk is finished) -> ending CreateAllSnapshotsFromVmCommand again, and this second time we get the exception

Germano, before going any further with the investigation, can you please:
1. Confirm that the pool is based on a template with both a cinder disk and a non-cinder disk.
2. Elaborate on the problem from the user's point of view (putting aside these NPEs, which should be resolved; IIUC the user should have seen that the operation succeeded).
3. You wrote that you didn't manage to reproduce it with a simple API call, but could you please try to create a regular VM based on the pool's template and then try to run it as stateless?
(In reply to Arik from comment #6)
> 2. Elaborate on the problem from the user's point of view (putting aside
> these NPEs that should be resolved. IIUC the user should have seen that the
> operation succeeded)

No, this is incorrect: CreateVm is not called, because the only attempt to end RunVmCommand fails.
Hi Arik,

First of all, thanks for looking at this.

Wait wait wait... there is not a single Cinder disk in the whole setup. The template has just one disk, and it comes from the only storage domain they have, which is NFS.

Then I went back to my "simple API reproducer" and saw that it also ran the CreateAllCinderSnapshotsCommand! So:
1. Create a VM with a single disk (NFS/iSCSI)
2. Set it as stateless
3. Run it
RESULT: CreateAllCinderSnapshotsCommand is run.

I see createSnapshotsForDisks() has roughly two phases. First:

    List<DiskImage> images = ImagesHandler.filterImageDisks(getDisksList()...
    for (DiskImage diskImage : images) {
        ....
    }

Then it goes for the Cinder images:

    List<CinderDisk> cinderDisks = ...
    CreateAllCinderSnapshotsParameters params = new CreateAllCinderSnapshotsParameters(...);
    Future<VdcReturnValueBase> future = CommandCoordinatorUtil.executeAsyncCommand(...

I speak C and Python, but my first impression of this is that we always run the Cinder task, even with no Cinder images. It looks like an "empty" task, as it has no child commands:

[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler_Worker-31) [7a70c079] Command 'CreateAllCinderSnapshots' id: '6ce61573-bd30-4568-866e-fe5309045631' child commands '[]' executions were completed, status 'SUCCEEDED'

That shouldn't do much harm, except that when this Cinder callback is the LAST one to be called, we get an NPE which prevents RunVm from succeeding?

Well, it sounds like CreateAllCinderSnapshotsCommand shouldn't be called in the first place, right?

Does this make sense? How wrong am I? =D
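If the reading above is right, the minimal fix would be to skip scheduling the cinder sub-command when the filtered cinder-disk list is empty. A sketch of that guard follows; the Disk class, the filter, and the boolean "should we schedule" helper are simplified stand-ins and do not reflect the actual engine signatures:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SnapshotDispatchSketch {
    // Hypothetical, simplified disk model: a disk is either a cinder disk or not.
    static class Disk {
        final String id;
        final boolean cinder;
        Disk(String id, boolean cinder) { this.id = id; this.cinder = cinder; }
    }

    // Mirrors the "second phase" filtering described above.
    static List<Disk> filterCinderDisks(List<Disk> disks) {
        List<Disk> cinderDisks = new ArrayList<>();
        for (Disk d : disks) {
            if (d.cinder) cinderDisks.add(d);
        }
        return cinderDisks;
    }

    // The guard: only schedule the async cinder command when there is
    // actually a cinder disk to snapshot, so no "empty" task (and no
    // racing callback) is created for NFS/iSCSI-only VMs.
    static boolean shouldScheduleCinderSnapshots(List<Disk> disks) {
        return !filterCinderDisks(disks).isEmpty();
    }

    public static void main(String[] args) {
        List<Disk> nfsOnly = Arrays.asList(new Disk("d1", false));
        List<Disk> mixed = Arrays.asList(new Disk("d1", false), new Disk("d2", true));
        System.out.println(shouldScheduleCinderSnapshots(nfsOnly));  // prints false
        System.out.println(shouldScheduleCinderSnapshots(mixed));    // prints true
    }
}
```

With such a check in place, the executeAsyncCommand call for CreateAllCinderSnapshots would simply never run on a setup like this one, which has no cinder disks at all.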
Thanks, that makes a lot of sense; I think we are getting toward the issue here. Not having a single cinder disk in the setup explains why CreateAllCinderSnapshots doesn't fire a single CreateSingleSnapshot. Executing CreateAllCinderSnapshots at all does indeed seem like a bug.

So now I have a scenario for a race that would make more sense than what I described in comment 7:
- the VM is attached to a user and run
- creation of the stateless snapshot starts
- creation of a snapshot for the disk starts
- CreateAllCinderSnapshots is scheduled (it is an async command)
- an SPM async task is created for the snapshot of the disk
- the command coordinator infrastructure used to execute CreateAllCinderSnapshots changes the steps of the execution context of RunVmCommand somewhere (see https://gerrit.ovirt.org/#/c/53493/3)
- the SPM async task finishes, so we try to end RunVmCommand, but its step was cleared, and thus we get the NPE in RunVmCommand
- the callback monitoring CreateAllCinderSnapshots (which has nothing to actually monitor) is triggered and tries to finish the CreateAllCinderSnapshots command, but the execution job was changed in the previous step, hence the other exception we see:

ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"

Side note: this part has changed a lot on the master branch since 3.6.5; we no longer have CreateAllCinderSnapshotsCommand, and there is protection against getting an NPE in RunVmCommand when step == null.

Allon, if what I wrote above is true, then this seems more storage-related than virt. Could someone from the storage team take a look?
Liron, can you take a look please? I think this was already solved in 4.0, but you know this area much better than me.
Hi,

It seems this issue was also solved by the fix for BZ 1348079 in 3.6.7. I'll do a quick verification of it, but I'd say we should be fine with regard to this issue.
Correcting myself (yes, quickly): this is a slightly different one. Will fix.
To reproduce: attach a VM from the pool to a user and run it (add permissions for a user and run it from the User Portal).
The provided fix is only relevant to the 3.6.z branch. For the 4.0 branch this is essentially a "test only" bug. Setting to ON_QA so the QE stakeholders can decide whether to run a sanity test here (which I believe they do anyway), or just close as CURRENTRELEASE.
Tested with the following code:
----------------------------------------
rhevm-4.0.2.3-0.1.el7ev.noarch
vdsm-4.18.9-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Create a VM pool of a few VMs from a template with a disk
2. Create a user with permissions on the pool
3. Access the User Portal as the newly created user
4. Start a VM from the pool
>>>>> The VM is started successfully and no NPEs are thrown

Actual results:
The VM is started successfully and no NPEs are thrown

Moving to VERIFIED!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html