Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1109749

Summary: User fails to get attached to a prestarted pool in case messages parameter of canRunVm is null - NPE is throwed
Product: Red Hat Enterprise Virtualization Manager Reporter: Tomas Dosek <tdosek>
Component: ovirt-engineAssignee: Arik <ahadas>
Status: CLOSED CURRENTRELEASE QA Contact: Ilanit Stein <istein>
Severity: medium Docs Contact:
Priority: high    
Version: 3.3.0CC: aberezin, acathrow, ederevea, eedri, gchaplik, gwatson, iheim, lpeer, mavital, michal.skrivanek, ofrenkel, Rhev-m-bugs, tdosek, yeylon, yobshans
Target Milestone: ---Keywords: ZStream
Target Release: 3.4.1Flags: istein: needinfo+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: ovirt-engine-3.5.0_beta Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-08-05 08:25:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
verification engine.log none

Description Tomas Dosek 2014-06-16 10:09:37 UTC
Description of problem:
Engine validator canRunVm cannot handle situation of messages parameter being Null. This, under certain circumstances, blocks users from being attached to a pool VM and as a result of that they're not able to get a VM from Pool.

Version-Release number of selected component (if applicable):
3.3.3

How reproducible:
Not sure of the reproducer. In customer's usecase any pool created under 3.3 Compatibility mode exhibits this behavior.

Additional info:

014-06-02 09:05:51,603 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-41) [343c44e3] Lock freed to object EngineLock [exclusiveLocks= key: 2ce9ba7c-bc2c-4b36-afdd-44122092b16d value: USER_VM_POOL
, sharedLocks= ]
2014-06-02 09:05:51,687 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-16) [4aa9b181] Lock Acquired to object EngineLock [exclusiveLocks= key: 3486bf14-1e2a-49d8-a3b8-53e4edee3f0e value: USER_VM_POOL
, sharedLocks= ]
2014-06-02 09:05:51,932 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-16) [4aa9b181] Error during CanDoActionFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.VmPoolCommandBase.canRunPoolVm(VmPoolCommandBase.java:229) [bll.jar:]
        at org.ovirt.engine.core.bll.VmPoolCommandBase.isVmFree(VmPoolCommandBase.java:139) [bll.jar:]
        at org.ovirt.engine.core.bll.VmPoolCommandBase.canAttachNonPrestartedVmToUser(VmPoolCommandBase.java:105) [bll.jar:]
        at org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand.getNonPrestartedVmToAttach(AttachUserToVmFromPoolAndRunCommand.java:127) [bll.jar:]
        at org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand.getVmToAttach(AttachUserToVmFromPoolAndRunCommand.java:104) [bll.jar:]
        at org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand.canDoAction(AttachUserToVmFromPoolAndRunCommand.java:64) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalCanDoAction(CommandBase.java:741) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:356) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:416) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:395) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.RunAction(Backend.java:353) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor187.invoke(Unknown Source) [:1.7.0_51]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_51]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_51]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.3.1.Final-redhat-3]


Taking a look into the code the NullPointer Exception is thrown from CanDoAction method which checks availability of Pool VM in AttachUserToVmFromPoolAndRunCommand.java class

    protected boolean canDoAction() {
        boolean returnValue = true;

        synchronized (_lockObject) {
            // no available VMs:
            if (Guid.Empty.equals(getVmToAttach(getParameters().getVmPoolId())))
            {
                addCanDoActionMessage(VdcBllMessages.ACTION_TYPE_FAILED_NO_AVAILABLE_POOL_VMS);
                returnValue = false;
            }
        }

This means NPE has to be thrown out by following clause:
if (Guid.Empty.equals(getVmToAttach(getParameters().getVmPoolId())))

Following code is the one throwing NPE:

    protected static boolean canRunPoolVm(Guid vmId, ArrayList<String> messages) {
        VM vm = DbFacade.getInstance().getVmDao().get(vmId);
        if (vm == null) {
            messages.add(VdcBllMessages.ACTION_TYPE_FAILED_VM_NOT_FOUND.name());
            return false;
        }

        // TODO: This is done to keep consistency with VmDAO.getById.
        // It can probably be removed, but that requires some more research
        VmHandler.updateNetworkInterfacesFromDb(vm);

        RunVmParams runVmParams = new RunVmParams(vmId);
        runVmParams.setUseVnc(osRepository.isLinux(vm.getVmOsId()) || vm.getVmType() == VmType.Server);

        return new RunVmValidator(vm, runVmParams, false, findActiveISODomain(vm.getStoragePoolId()))
                .canRunVm(
   --------->           messages,
                        fetchStoragePool(vm.getStoragePoolId()),
                        Collections.<Guid> emptyList(),
                        null,
                        null,
                        DbFacade.getInstance().getVdsGroupDao().get(vm.getVdsGroupId()));
    }

    ---------> It looks like the messages parameter of canRunVm is null

Checking lower level code it looks like the validator expects the messages to be a list,
but does not anyhow handle if the list is null and no messages are populated in it:

    public boolean canRunVm(List<String> messages, StoragePool storagePool, List<Guid> vdsBlackList,
            List<Guid> vdsWhiteList, Guid destVds, VDSGroup vdsGroup) {

        if (vm.getStatus() == VMStatus.Paused) {
            // if the VM is paused, we should only check the VDS status
            // as the rest of the checks were already checked before
            return validate(validateVdsStatus(vm), messages);
        }

        return
                validateVmProperties(vm, runVmParam.getCustomProperties(), messages) &&
                validate(validateBootSequence(vm, runVmParam.getBootSequence(), getVmDisks(), activeIsoDomainId), messages) &&
                validate(validateDisplayType(), messages) &&
                validate(new VmValidator(vm).vmNotLocked(), messages) &&
                validate(getSnapshotValidator().vmNotDuringSnapshot(vm.getId()), messages) &&
                validate(validateVmStatusUsingMatrix(vm), messages) &&
                validate(validateStoragePoolUp(vm, storagePool, getVmImageDisks()), messages) &&
                validate(validateIsoPath(vm, runVmParam.getDiskPath(), runVmParam.getFloppyPath(), activeIsoDomainId), messages)  &&
                validate(vmDuringInitialization(vm), messages) &&
                validate(validateStatelessVm(vm, getVmDisks(), runVmParam.getRunAsStateless()), messages) &&
                validate(validateStorageDomains(vm, isInternalExecution, getVmImageDisks()), messages) &&
                validate(validateImagesForRunVm(vm, getVmImageDisks()), messages) &&
                SchedulingManager.getInstance().canSchedule(
                        vdsGroup, vm, vdsBlackList, vdsWhiteList, destVds, messages);
    }

Comment 2 Arik 2014-06-24 06:33:58 UTC
There is no flow where the passed 'messages' can be null. In 3.3.3 VmPoolCommandBase#229 is:
runVmParams.setUseVnc(osRepository.isLinux(vm.getVmOsId()) || vm.getVmType() == VmType.Server);

Since the VM can't be null (because of prior check), the only option is that osRepository is null. we suspect that the loading of the os-repository file failed.

Please:
1. Check with the customer whether he can edit/create VM.
2. Check whether the customer edited the os-repository file.

Comment 5 Arik 2014-06-24 07:38:52 UTC
ok, so we discovered that there was a compensation for AddVmAndAttachToPool..
It means we touch the VmPoolCommandBase class, thus its static fields are initialized and it happens before we initialize os-info.

So in order to reproduce this bug:
1. define pool with prestarted VMs
2. while VMs are being prestarted, reboot the engine (and hope that it will be in the middle of one of the operations)

After the reboot, you won't be able to attach users to prestarted VMs.

I'll move the os-info initialization to be executed before the compensation.

Comment 6 Arik 2014-06-24 14:02:05 UTC
(In reply to Arik from comment #5)
> ok, so we discovered that there was a compensation for AddVmAndAttachToPool..
> It means we touch the VmPoolCommandBase class, thus its static fields are
> initialized and it happens before we initialize os-info.
> 
> So in order to reproduce this bug:
> 1. define pool with prestarted VMs
> 2. while VMs are being prestarted, reboot the engine (and hope that it will
> be in the middle of one of the operations)

correction: the problematic operation is AttachUserToVmFromPoolAndRunCommand so you need to attach users to VMs from pool and restart the engine in the middle

> 
> After the reboot, you won't be able to attach users to prestarted VMs.
> 
> I'll move the os-info initialization to be executed before the compensation.

Comment 8 Arik 2014-06-26 12:24:37 UTC
(In reply to Arik from comment #6)
> correction: the problematic operation is AttachUserToVmFromPoolAndRunCommand
> so you need to attach users to VMs from pool and restart the engine in the
> middle

The compensation is create in AddPermissionsCommand so it is important that the engine will be restarted after the compensation was created in that command.

Comment 9 Ilanit Stein 2014-07-03 14:07:08 UTC
This was tested on 3.4.1 (av10):
1. Add a VM pool via webadmin.
2. Add permission 'everyone' , 'userRole' to VM pool via webadmin.
3. Connect via userPortal, and run a VM
4. Immediately,right after, restart ovirt-engine service.

engine.log show, the engine restart was not done in the desired timing,
which is between these 2 steps:

2014-07-03 14:46:38,880 INFO  [org.ovirt.engine.core.bll.AddPermissionCommand] (ajp-/127.0.0.1:8702-1) [23a06d0f] Running command: AddPermissionCommand internal: true. Entities affected :  ID: b1fd528e-804b-4b4e-86de-d1a8f53e6412 Type: V
M
2014-07-03 14:46:38,891 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-1) [23a06d0f] Succceeded giving user fdfc627c-d875-11e0-90f0-83df133b58cc permission to Vm b1fd528e-804b-4b4e-86de-d1a8f53
e6412 

I've tried to run several times, on user portal, stop VM, start VM, and engine restart, but couldn't create the desired state of having the engine restart happen between the 2 steps, mention above.

Gil, 
I recommend to move this bug to QE scale team, to run many VMs, and try to stop & start them, and run engine restart in parallel.

Comment 10 Yuri Obshansky 2014-07-08 13:15:11 UTC
We tested on RHEVM version 3.4.1-0.23.el6ev
1. Created Pool with 50 VMs
2. Added 50 users to Pool (UserRole)
3. Ran JMeter script performed 50 current threads 
which call REST APIs /api/vmpools/765e3c19-a118-49cc-bc83-3ae56016c668/allocatevm
with different users Authorizations
4. Imminently restarted Engine 
5. Analyze engine.log file

Comment 11 Ilanit Stein 2014-07-08 13:19:45 UTC
Bug was verified by test in comment #10:

Engine restart was done in the middle of add permission command,
and there was a roleback, and compensation, as expected:

 
2014-07-08 15:41:20,798 ERROR [org.ovirt.engine.core.bll.AddPermissionCommand] (ajp-/127.0.0.1:8702-12) Transaction rolled-back for command: org.ovirt.engine.core.bll.AddPermissionCommand.
2014-07-08 15:41:20,798 ERROR [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-12) Transaction rolled-back for command: org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand.

2014-07-08 15:41:27,607 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ServerService Thread Pool -- 47) Command [id=df6e8b71-3d00-473d-b1f5-968fa5f007d2]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common
.businessentities.Permissions; snapshot: 9d466422-988a-4561-a098-cec04959c99f.
2014-07-08 15:41:27,620 INFO  [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 47) Running compensation on startup for Command : org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand , Command Id : df6e8b71-3d00-
473d-b1f5-968fa5f007d2

Comment 12 Ilanit Stein 2014-07-08 13:46:00 UTC
Created attachment 916383 [details]
verification engine.log

Comment 13 Tomas Dosek 2014-08-04 12:18:46 UTC
This bugzilla was mistakenly not part of official RHEV 3.4.1 Errata announcement. Please consider this BZ as released by the relevant errata.

More information is available via following article:
https://access.redhat.com/solutions/1155243

Comment 14 Eyal Edri 2014-08-05 08:25:36 UTC
all these bugs were fixed and released as part of 3.4.1,
and weren't closed since they weren't included in any errata.
closing as current release.