Description of problem:

Background:
-----------
In a setup with 7 up hosts, 5 were put into maintenance. While VMs were migrating to the remaining 2 up hosts, the first destination host (the SPM) hit a VDSNetworkException and was initializing (recovering from a crash). As a result, the VMs started migrating to the second up destination host, and this failed as well because that host was in the Contending state. Some of the 5 hosts put into maintenance stayed stuck at "Preparing for Maintenance" forever, as there were still VMs running on them (should be resolved by the fix for bug 966503).

Problem:
--------
The VMs that are still running on the source hosts cannot be powered off.

engine.log (for the VM power off):
----------------------------------
2014-01-06 11:13:19,008 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] START, DestroyVmVDSCommand(HostName = lilach-vdsb.tlv.redhat.com, HostId = 5ada85a2-ed80-4fb0-abaf-5b329ca5f3be, vmId=afe95ae8-52cb-4618-8d7a-a60bdda82412, force=false, secondsToWait=0, gracefully=false), log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] Command DestroyVmVDS execution failed. Exception: EJBException: java.lang.NullPointerException
2014-01-06 11:13:19,011 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] FINISH, DestroyVmVDSCommand, log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (pool-4-thread-49) [282a3a91] Command org.ovirt.engine.core.bll.StopVmCommand throw Vdc Bll exception. With error message VdcBLLException: javax.ejb.EJBException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2014-01-06 11:13:19,039 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [282a3a91] Correlation ID: 282a3a91, Job ID: 73d3b5ab-8de1-47ed-a619-e523fbfadfcb, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM POOLTEST-6 (Host: lilach-vdsb.tlv.redhat.com, User: admin@internal).

Version-Release number of selected component (if applicable):
is30

Actual results:
Powering off the VM fails; StopVmCommand throws a NullPointerException and the VM stays running on the source host.

Expected results:
It should be possible to power off the VM, even when VM migration attempts fail.
Created attachment 846019 [details]
engine log

Hosts put into maintenance @ 10:58. VDSNetworkException on the SPM host while migration was running @ 11:02. Tried and failed to power off the VM @ 11:13.
Created attachment 846085 [details]
host_1 logs

Host time is 2 hours behind the RHEV-M engine time.

Created attachment 846086 [details]
host_2 logs

Host time is 2 hours behind the RHEV-M engine time.

Created attachment 846110 [details]
host_a_put_in_maint_logs

Host time is 2 hours behind the engine time.

Created attachment 846111 [details]
host_b_put_in_maint_logs

Host time is 2 hours behind the engine time.
From our investigation, it seems that the 'rerun' procedure clears the destinationVdsId of the command, but the command itself stays in the 'asyncRunningCommands' cache. So on the next async command (stop/migrate) we try to use it (reportCompleted), and it fails because destinationVdsId was already cleared. We probably need to remove the command from the cache, or make sure the code can handle a missing destinationVdsId (I favor option 1 if it can work OK).
(In reply to Omer Frenkel from comment #7)
> we probably need to remove the command from the cache, or make sure code can
> handle missing destinationVdsId (i favor option 1 if it can work ok)

I agree that it would be good to remove the command from the cache, and IMO the reportCompleted method is not the place to decrease the pending memory; that should be moved to other async callback methods. But since we understand that this bug appears now because of the recently added call to the decrease-pending-memory method in reportCompleted, and we haven't had other issues with the command remaining in the cache, I suggest we just add a null check at this point to keep the fix safe and easy to backport. I'll do some refactoring to improve that code upstream later on.
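The failure mode and the proposed null check can be sketched in a minimal, self-contained way. All class, field, and map names below are illustrative stand-ins (the real ovirt-engine code is structured differently); the sketch only shows why auto-unboxing on a cleared destinationVdsId throws the NPE, and how the suggested null check avoids it:

```java
import java.util.HashMap;
import java.util.Map;

public class AsyncCacheSketch {

    // Stand-in for a migrate command cached in 'asyncRunningCommands'.
    static class MigrateCommand {
        String destinationVdsId;

        MigrateCommand(String destinationVdsId) {
            this.destinationVdsId = destinationVdsId;
        }

        // Buggy variant: assumes destinationVdsId is always set. When it was
        // cleared by 'rerun', pendingMemory.get(null) returns null and the
        // auto-unboxing to int throws a NullPointerException.
        int decreasePendingMemoryBuggy(Map<String, Integer> pendingMemory) {
            int current = pendingMemory.get(destinationVdsId); // NPE here
            pendingMemory.put(destinationVdsId, current - 1);
            return current - 1;
        }

        // Fixed variant: the null check suggested in the comment above.
        Integer decreasePendingMemorySafe(Map<String, Integer> pendingMemory) {
            if (destinationVdsId == null) {
                return null; // rerun already cleared the destination; nothing to do
            }
            int current = pendingMemory.get(destinationVdsId);
            pendingMemory.put(destinationVdsId, current - 1);
            return current - 1;
        }
    }

    public static void main(String[] args) {
        Map<String, MigrateCommand> asyncRunningCommands = new HashMap<>();
        Map<String, Integer> pendingMemory = new HashMap<>();
        pendingMemory.put("host2", 2);

        MigrateCommand cmd = new MigrateCommand("host2");
        asyncRunningCommands.put("vm-1", cmd);

        // 'rerun' clears the destination but leaves the command in the cache.
        cmd.destinationVdsId = null;

        boolean threwNpe = false;
        try {
            asyncRunningCommands.get("vm-1").decreasePendingMemoryBuggy(pendingMemory);
        } catch (NullPointerException e) {
            threwNpe = true;
        }
        System.out.println("buggy path threw NPE: " + threwNpe);          // true
        System.out.println("safe path returns: "
                + asyncRunningCommands.get("vm-1").decreasePendingMemorySafe(pendingMemory)); // null
    }
}
```

The null check keeps the cached-but-stale command harmless without touching the cache lifecycle, which is what makes it a small, backport-friendly change compared to removing the command from the cache.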
merged to master, pending 3.4 backport
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=6e9d39a8fdab1b2e3e0f0ce8917fc877e7e221f2
Verified on ovirt-engine-3.4.0-0.7.beta2.el6.noarch, with 3 RHEL hosts:
- Run a VM on host1.
- Make sure host3's available memory is not sufficient to contain this VM.
- Migrate the VM (to any host).
- The VM starts migrating to host2 (since host3's memory is full).
- While the migration is running, kill the qemu process on host2:
  ps aux | grep qemu | grep -v grep | grep -v supervdsmServer | awk '{print $2}' | xargs -I^ kill -9 ^
- As a result:
  a. The migration fails.
  b. Another host (host3) is tried, which fails as well. The VM stays running on host1.
- Powering off the VM succeeded.
Closing as part of 3.4.0