Bug 1048790

Summary: Not possible to power off VM that failed migration.
Product: Red Hat Enterprise Virtualization Manager Reporter: Ilanit Stein <istein>
Component: ovirt-engine Assignee: Arik <ahadas>
Status: CLOSED CURRENTRELEASE QA Contact: Ilanit Stein <istein>
Severity: medium Docs Contact:
Priority: high    
Version: 3.3.0 CC: acathrow, iheim, lpeer, mavital, michal.skrivanek, ofrenkel, Rhev-m-bugs, sherold, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: ovirt-3.4.0-beta2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1058764 Environment:
Last Closed: 2014-06-12 14:04:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1058764, 1078909, 1142926    
Attachments:
Description                 Flags
engine log                  none
host_1 logs                 none
host_2 logs                 none
host_a_put_in_maint_logs    none
host_b_put_in_maint_logs    none

Description Ilanit Stein 2014-01-06 10:53:06 UTC
Description of problem:

Background:
-----------
In a setup with 7 up hosts, 5 were put into maintenance.
While VMs were migrating to the remaining 2 up hosts, the destination host (SPM) hit a VDSNetworkException and went into initializing (recovering from a crash).
As a result, the VMs started migrating to the second up destination host, and this failed as well, as that host was in contending state.
Some of the 5 hosts put into maintenance stayed stuck at preparing for maintenance forever, as there were still VMs running on them (should be resolved by the fix for bug 966503).

Problem:
-------
The VMs that are still running on source hosts cannot be powered off.

engine.log (for the VM power off):
---------------------------------
2014-01-06 11:13:19,008 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] START, DestroyVmVDSCommand(HostName = lilach-vdsb.tlv.redhat.com, HostId = 5ada85a2-ed80-4fb0-abaf-5b329ca5f3be, vmId=afe95ae8-52cb-4618-8d7a-a60bdda82412, force=false, secondsToWait=0, gracefully=false), log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] Command DestroyVmVDS execution failed. Exception: EJBException: java.lang.NullPointerException
2014-01-06 11:13:19,011 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] FINISH, DestroyVmVDSCommand, log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (pool-4-thread-49) [282a3a91] Command org.ovirt.engine.core.bll.StopVmCommand throw Vdc Bll exception. With error message VdcBLLException: javax.ejb.EJBException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2014-01-06 11:13:19,039 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [282a3a91] Correlation ID: 282a3a91, Job ID: 73d3b5ab-8de1-47ed-a619-e523fbfadfcb, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM POOLTEST-6 (Host: lilach-vdsb.tlv.redhat.com, User: admin@internal).

  
Version-Release number of selected component (if applicable):
is30

Actual results:


Expected results:
It should be possible to power off the VM, even if the VM migration attempts fail.

Comment 1 Ilanit Stein 2014-01-06 10:59:10 UTC
Created attachment 846019 [details]
engine log

Hosts put into maintenance @ 10:58. VDSNetworkException on the SPM host while migration was running @ 11:02. Tried and failed to power off the VM @ 11:13.

Comment 2 Ilanit Stein 2014-01-06 13:32:00 UTC
Created attachment 846085 [details]
host_1 logs

host time 2 hours behind rhevm time.

Comment 3 Ilanit Stein 2014-01-06 13:32:58 UTC
Created attachment 846086 [details]
host_2 logs

host time 2 hours behind rhevm time.

Comment 4 Ilanit Stein 2014-01-06 14:00:17 UTC
Created attachment 846110 [details]
host_a_put_in_maint_logs

host time is 2 hours behind engine time.

Comment 5 Ilanit Stein 2014-01-06 14:01:08 UTC
Created attachment 846111 [details]
host_b_put_in_maint_logs

host time is 2 hours behind engine time.

Comment 7 Omer Frenkel 2014-01-14 07:42:17 UTC
From our investigation, it seems that the 'rerun' procedure clears the destinationVdsId for the command, but the command is still in the 'asyncRunningCommands' cache, so on the next async command (stop/migrate) we try to use it (reportCompleted) and it fails because destinationVdsId was cleared.

we probably need to remove the command from the cache, or make sure code can handle missing destinationVdsId (i favor option 1 if it can work ok)
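
For illustration only, option 1 (removing the command from the cache on rerun) could look roughly like the sketch below. Apart from asyncRunningCommands, destinationVdsId and reportCompleted, which are named in this comment, every class and method here is a hypothetical stand-in and not the actual ovirt-engine code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: all types below are hypothetical, not ovirt-engine classes.
class StaleCommandEviction {

    /** Hypothetical migration command that remembers its destination host. */
    static class MigrateCommand {
        private UUID destinationVdsId;

        MigrateCommand(UUID destinationVdsId) {
            this.destinationVdsId = destinationVdsId;
        }

        UUID getDestinationVdsId() {
            return destinationVdsId;
        }

        // Models the described behaviour: a rerun clears the destination.
        void clearDestination() {
            destinationVdsId = null;
        }
    }

    /** Hypothetical cache of running async commands, keyed by VM id. */
    static class AsyncCommandCache {
        private final Map<UUID, MigrateCommand> asyncRunningCommands = new ConcurrentHashMap<>();

        void register(UUID vmId, MigrateCommand cmd) {
            asyncRunningCommands.put(vmId, cmd);
        }

        // Option 1: evict the command when it is rerun, so no entry with a
        // cleared destination is left behind for later async commands.
        void rerun(UUID vmId) {
            MigrateCommand cmd = asyncRunningCommands.remove(vmId);
            if (cmd != null) {
                cmd.clearDestination();
            }
        }

        // Invoked when a later async command (stop/migrate) arrives for the VM.
        Optional<String> reportCompleted(UUID vmId) {
            MigrateCommand cmd = asyncRunningCommands.get(vmId);
            if (cmd == null) {
                return Optional.empty(); // nothing cached, nothing to report
            }
            // Would throw NullPointerException if a cached command had its
            // destination cleared; eviction in rerun() prevents that state.
            return Optional.of(cmd.getDestinationVdsId().toString());
        }
    }

    public static void main(String[] args) {
        UUID vmId = UUID.randomUUID();
        AsyncCommandCache cache = new AsyncCommandCache();
        cache.register(vmId, new MigrateCommand(UUID.randomUUID()));
        cache.rerun(vmId);
        // After the rerun there is no stale entry, so this is empty, not an NPE.
        System.out.println(cache.reportCompleted(vmId).isPresent()); // prints false
    }
}
```

Whether the real command has to be re-registered after a rerun is not covered here; the sketch only shows that a later callback cannot reach an entry whose destination was cleared.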

Comment 9 Arik 2014-01-26 11:53:57 UTC
(In reply to Omer Frenkel from comment #7)
> we probably need to remove the command from the cache, or make sure code can
> handle missing destinationVdsId (i favor option 1 if it can work ok)

I agree that it would be good to remove the command from the cache, and imo the reportCompleted method is not the place to decrease the pending memory; that should be moved to other async callback methods.

But since we understand that this bug happens now because of the added call to the decrease-pending-memory method in reportCompleted, and we didn't have other issues with the command being in the cache, I suggest we just add a null check at this point to keep it safe and easy to backport, and I'll do some refactoring to improve that code upstream later on.
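
A minimal sketch of the null-check idea, for illustration only. It assumes pending memory is tracked per destination host. PendingMemory, MigrateCommand and their methods are hypothetical stand-ins; only destinationVdsId and reportCompleted are names from this bug, and this is not the merged patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch only: all types below are hypothetical, not ovirt-engine classes.
class NullSafeReportCompleted {

    /** Hypothetical tracker of memory (MB) reserved on hosts for incoming VMs. */
    static class PendingMemory {
        private final Map<UUID, Integer> reservedMb = new HashMap<>();

        void reserve(UUID hostId, int mb) {
            reservedMb.merge(hostId, mb, Integer::sum);
        }

        void release(UUID hostId, int mb) {
            // Unknown hosts are left untouched; the value never drops below zero.
            reservedMb.computeIfPresent(hostId, (id, current) -> Math.max(0, current - mb));
        }

        int reservedFor(UUID hostId) {
            return reservedMb.getOrDefault(hostId, 0);
        }
    }

    /** Hypothetical migration command. */
    static class MigrateCommand {
        private final PendingMemory pending;
        private final int vmMemoryMb;
        private UUID destinationVdsId;

        MigrateCommand(PendingMemory pending, UUID destinationVdsId, int vmMemoryMb) {
            this.pending = pending;
            this.destinationVdsId = destinationVdsId;
            this.vmMemoryMb = vmMemoryMb;
        }

        // Models only the part of 'rerun' relevant here: the destination is cleared.
        void rerun() {
            destinationVdsId = null;
        }

        /** Returns true if pending memory was decreased, false if it was skipped. */
        boolean reportCompleted() {
            if (destinationVdsId == null) {
                // The null check: no destination left, so nothing to decrease.
                return false;
            }
            pending.release(destinationVdsId, vmMemoryMb);
            return true;
        }
    }

    public static void main(String[] args) {
        PendingMemory pending = new PendingMemory();
        UUID host = UUID.randomUUID();
        pending.reserve(host, 1024);

        MigrateCommand cmd = new MigrateCommand(pending, host, 1024);
        cmd.rerun();
        // The callback is now a no-op instead of failing on the cleared id.
        System.out.println(cmd.reportCompleted()); // prints false
    }
}
```

The guard keeps the change local to the callback, which matches the "safe and easy to backport" reasoning above; the stale cache entry itself is left in place.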

Comment 11 Michal Skrivanek 2014-01-28 13:51:56 UTC
merged to master, pending 3.4 backport

Comment 13 Ilanit Stein 2014-02-18 12:46:13 UTC
Verified on ovirt-engine-3.4.0-0.7.beta2.el6.noarch.

Setup: 3 RHEL hosts.
- Run a VM on host1.
- Make sure host3's available memory is not sufficient to hold this VM.
- Migrate the VM (to any host).
- The VM starts to migrate to host2 (since host3's memory is full).
- While the migration is running, kill the qemu process on host2 with:
ps aux | grep qemu | grep -v grep | grep -v supervdsmServer | awk '{print $2}' | xargs -I^ kill -9 ^
- As a result: a. the migration fails; b. another host (host3) is tried, which fails as well. The VM stays running on host1.
- Powering off the VM succeeded.

Comment 15 Itamar Heim 2014-06-12 14:04:30 UTC
Closing as part of 3.4.0