Description of problem:
A VM with a dedicated host and migration policy of "Allow automatic and manual migration" fails to run if the host is in maintenance, although another host can run it.

Version-Release number of selected component (if applicable):
3.6 master

How reproducible:
always

Steps to Reproduce:
1. Have 2 up hosts in a cluster (A & B)
2. Create a VM and set it to start running on specific host A
3. Move host A to maintenance
4. Start the VM

Actual results:
The VM fails to run, with an exception in the log:

2015-07-16 14:40:04,401 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-58) [3df9f7f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[e01c6100-3cc6-4f44-9308-38f562c1012c=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2015-07-16 14:40:04,471 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (default task-58) [3df9f7f] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{runAsync='true', vmId='e01c6100-3cc6-4f44-9308-38f562c1012c'}), log id: 6c417e32
2015-07-16 14:40:04,471 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (default task-58) [3df9f7f] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 6c417e32
2015-07-16 14:40:04,544 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Running command: RunVmCommand internal: false. Entities affected : ID: e01c6100-3cc6-4f44-9308-38f562c1012c Type: VMAction group RUN_VM with role type USER
2015-07-16 14:40:04,589 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Can't find VDS to run the VM 'e01c6100-3cc6-4f44-9308-38f562c1012c' on, so this VM will not be run.
2015-07-16 14:40:04,590 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Lock freed to object 'EngineLock:{exclusiveLocks='[e01c6100-3cc6-4f44-9308-38f562c1012c=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2015-07-16 14:40:04,590 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Command 'org.ovirt.engine.core.bll.RunVmCommand' failed: null
2015-07-16 14:40:04,590 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Exception: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) [rt.jar:1.7.0_79]
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) [rt.jar:1.7.0_79]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:305) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:301) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getMonitor(RunVmCommandBase.java:336) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getBlockingQueue(RunVmCommandBase.java:326) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVm(RunVmCommandBase.java:297) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVm(RunVmCommandBase.java:291) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.runningFailed(RunVmCommandBase.java:137) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.runningFailed(RunVmCommand.java:1137) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:288) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:411) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:335) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:104) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1211) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1355) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1979) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1392) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:374) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:202) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:170) [bll.jar:]
    at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:179) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:92) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_79]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_79]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_79]
2015-07-16 14:40:04,602 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-22) [3df9f7f] Correlation ID: 3df9f7f, Job ID: 3efd481e-c00b-4680-8c25-31c681d12e43, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM f21-new (User: admin@internal).

Expected results:
The VM should run on host B.

Additional info:
The above is all the relevant log; there is no other info in the log. I verified that the VM can run on host B manually (Run Once and select the host).
This bug does not appear under version ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch.
Please note that the exception was already fixed by https://gerrit.ovirt.org/#/c/44243/, but there is still an issue with the scheduler: it doesn't return a host in getVdsToRunOn() although there is an available host to run on.
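For context on the NullPointerException in the trace above: ConcurrentHashMap rejects null keys, so when no run host was chosen the cleanup path passes a null host id into the VdsManager lookup and the map's hash() call throws. The sketch below uses hypothetical names (it is not the actual oVirt code or the Gerrit patch) just to illustrate the failure mode and the usual defensive guard:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch, hypothetical names: shows why looking up a VdsManager
// with a null host id throws, and a null-guard variant of the lookup.
public class VdsManagerLookup {
    private static final ConcurrentHashMap<String, Object> vdsManagers =
            new ConcurrentHashMap<>();

    // Defensive variant: guard against a null host id before the map lookup.
    static Object getVdsManager(String vdsId) {
        if (vdsId == null) {
            return null; // no host was chosen; nothing to look up
        }
        return vdsManagers.get(vdsId);
    }

    public static void main(String[] args) {
        boolean threw = false;
        try {
            // Reproduces the NPE from the stack trace: ConcurrentHashMap
            // computes hash(key) eagerly and null keys are not allowed.
            vdsManagers.get(null);
        } catch (NullPointerException e) {
            threw = true;
        }
        System.out.println(threw);
        System.out.println(getVdsManager(null) == null);
    }
}
```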
Problem reconstruction failed on master build. Found a different problem; here is the reconstruction scenario:
1. Have 2 hosts in the cluster, A and B. Both hosts idle with no running VMs.
2. Assign the VM dedicated hosts A and B.
3. Start the VM (the VM starts on A).
4. Put host A in maintenance.
5. The VM migrates to host B.
6. Migration failed! Host A is pending maintenance.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7. Stop the VM; host A reached maintenance status.
8. Start the VM; the VM starts on host B, as expected.

Yet migration fails, manual or automatic. Migration is not working. Reported bug 1252820.
Why did you move this to MODIFIED without a fix? I verified on latest master that this still happens.
Bug is not reproducible for Dudi Maroshi, running oVirt master build. Nor for Artyom. See comment 1.
Please describe in detail how you tested and what the result was.
In reply to comment 6.

Prepare correct build:
----------------------
cd ~/git/ovirt-engine
git checkout master
git fetch
git rebase origin/master
make clean install-dev PREFIX="$HOME/ovirt-engine"

Prepare and run correct engine:
------------------------------
$HOME/ovirt-engine/bin/engine-setup --jboss-home="${JBOSS_HOME}"
echo "ENGINE_JAVA_MODULEPATH=\"/usr/share/ovirt-engine-wildfly-overlay/modules:\${ENGINE_JAVA_MODULEPATH}\"" \
  > ~/ovirt-engine/etc/ovirt-engine/engine.conf.d/10-setup-jboss-overlay.conf
cd $HOME/ovirt-engine
./share/ovirt-engine/services/ovirt-engine/ovirt-engine.py start

Reconstruct bug:
----------------
1. Have two active hosts A and B, with no running VMs.
2. Create a new VM.
3. On the newly created VM, apply dedicated host A and host B. Apply migration policy "Allow manual and automatic migration".
4. Run the newly created VM. Assuming it runs on host A, wait for Up state.
5. Stop the newly created VM.
6. Put host A in maintenance.
7. Run the newly created VM.
8. Check that the VM runs on host B.
> 3. On newly created VM, apply dedicated host A and host B. Apply migration
> policy "Allow manual and automatic migration"

This is the problem ^^. If you check the bug description, you should set ONLY host A, then put it in maintenance; the VM should still run on host B, but it doesn't. I verified again on latest master that this happens; re-opening the bug.
Had a private bug demo with ofrenkel. The bug was reconstructed, understood, and fixed.
Omer, what leads you to the assumption that the VM should start on host B? If "Start running on" contains only host A, then the VM can't be started unless host A is available. There is nothing in the documentation that would imply otherwise; the RHEV-M guide (10.5.5, table 10.8) actually says:

"The virtual machine will start running on a particular host in the cluster. However, the Manager or an administrator can migrate the virtual machine to a different host in the cluster depending on the migration and high-availability settings of the virtual machine."
(In reply to Martin Sivák from comment #10)
> Omer, what leads you to the assumption that the VM should start on host B?
> If Start running on contains only host A, then the VM can't be started
> unless host A is available.

It's not an assumption; this is how it has always worked (be my guest to check any previous version). Since I chose "Allow manual and automatic migration", the system can choose to migrate the VM, and can also choose to start it on another host. The list of hosts is just a preference. Think of this use case: I prefer my VM to start on host A first, if possible; otherwise, on any other host.

> There is nothing in the documentation that would imply otherwise, the RHEV-M
> guide actually says:
>
> 10.5.5 table 10.8
>
> "The virtual machine will start running on a particular host in the cluster.
> However, the Manager or an administrator can migrate the virtual machine to
> a different host in the cluster depending on the migration and
> high-availability settings of the virtual machine."

The docs need to be updated.
Omer, the behaviour you describe has not been there since at least oVirt 3.3 (when the new scheduler was introduced), and the documentation actually describes the current code accurately.

Is there any use case associated with the preferred host? It looks like it can be used to do a manual scheduling override... but it is not useful or predictable at all:

- The engine might start the VM on your preferred host and migrate it away immediately because of balancing (making the preference setting useless in this case),
- or it can start your VM anywhere else, in the case you describe (and the preference setting is again useless).

I do not see any other scenario where having a preference together with the ability to choose any other host would make sense.

Dudi's patch broke the RunVmOnce (internally, it worked due to another internal bug :) and MigrateTo flows, for example, so we have to decide what the behaviour should be.
I just verified that this flow works as I described on rhevm-3.5.4.2-1.3.el6ev: created a VM, set the specific host to 'X' with "Allow manual and automatic migration", moved host 'X' to maintenance, and started the VM. The VM started on another host.

I'm not sure if there is a use case for this flow, but it's a regression. I use this flow for my needs: during development I want VMs to start on a specific host if it's available, but sometimes the host is used by another developer, and I still want the VM to start.
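The expected behaviour described above can be sketched as follows. This is a hedged illustration with hypothetical names, not the actual oVirt scheduler code: the dedicated-host list acts as a preference, so when the policy allows migration the scheduler first tries the preferred hosts and only then falls back to any other available host; when migration is not allowed, the preferred hosts are a hard constraint.

```java
import java.util.List;
import java.util.Optional;

// Sketch (hypothetical names) of "preferred host first, any host as
// fallback" scheduling, as the 3.5 behaviour described in this comment.
public class PreferredHostScheduler {
    static Optional<String> pickHost(List<String> upHosts,
                                     List<String> preferredHosts,
                                     boolean migrationAllowed) {
        // First choice: a preferred host that is actually up.
        for (String host : preferredHosts) {
            if (upHosts.contains(host)) {
                return Optional.of(host);
            }
        }
        // Fallback: any up host, but only when the policy allows
        // the VM to run somewhere other than its preferred hosts.
        if (migrationAllowed && !upHosts.isEmpty()) {
            return Optional.of(upHosts.get(0));
        }
        // Hard pinning: no preferred host available, so don't run.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Host A in maintenance, only B up, VM prefers A,
        // "Allow manual and automatic migration" -> B is chosen.
        System.out.println(pickHost(List.of("B"), List.of("A"), true)
                .orElse("none"));
        // Same situation without the migration policy -> no host.
        System.out.println(pickHost(List.of("B"), List.of("A"), false)
                .orElse("none"));
    }
}
```

Under this sketch, the bug is equivalent to the scheduler treating the preferred-host list as the hard-constraint branch even when migration is allowed.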
Hmm... interesting (because the code wasn't exactly written like that). Would you expect Run Once and Migrate To to behave the same, or should those fail when the host is not available? There was quite a lot of confusion in how the scheduling API (whitelist vs. destHostId) was used, and this needs to be cleared up.
Migrate: yes. If I allow automatic migration, then first try to migrate to my selected (preferred) host(s); if none is available, try anything else.

Run Once: I guess it's missing the migration option, but as it is right now, if the user selects Run Once and specifies host X, it should not try anything else if X is not available or fails to run the VM.

Adding needinfo on Moran so he can share his opinion, if different from mine.
(In reply to Omer Frenkel from comment #15)
> migrate - yes, if i allow automatic migration, then first try to migrate to
> my selected (preferred) host(s), if not available, try anything else

Not sure. We then probably need to differentiate between a hard-constraint list of allowed host(s) and a preference. When you only want to express a preference, then Affinity would be the right feature to kick in, no?

> run-once - i guess that its missing the migration option, but as it is right
> now - if user select run-once and specify host X, it should not try anything
> else if X is not available/fails to run the vm.

I believe that's correct.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
This issue should be split in two. We first fix this, and open a different BZ for the work on evolving the pinning/affinity of VMs to hosts.
Martin, please adapt your patch to fix the behaviour according to this bug, preserving the behaviour as is.
https://gerrit.ovirt.org/#/c/45244 is enough to fix it. Dudi, please revive the patch.
Verified on rhevm-3.6.0.2-0.1.el6.noarch:
1) Create a VM with "start on" equal to host_1
2) Put host_1 in maintenance
3) Start the VM
4) The VM starts on host_2
oVirt 3.6.0 has been released on November 4th, 2015 and should fix this issue. If problems still persist, please open a new BZ and reference this one.