Description of problem:

Background:
-----------
In a setup with 7 up hosts, 5 were put into maintenance. While VMs were migrating to the remaining 2 up hosts, the first destination host (the SPM) hit a VDSNetworkException and was initializing (recovering from a crash). As a result, the VMs started migrating to the second up destination host, and this failed as well because that host was in the Contending state. Some of the 5 hosts put into maintenance stayed stuck at "Preparing for Maintenance" forever, as there were still VMs running on them (should be resolved by the fix for bug 966503).

Problem:
--------
The VMs that are still running on the source hosts cannot be powered off.

engine.log (for the VM power off):
----------------------------------
2014-01-06 11:13:19,008 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] START, DestroyVmVDSCommand(HostName = lilach-vdsb.tlv.redhat.com, HostId = 5ada85a2-ed80-4fb0-abaf-5b329ca5f3be, vmId=afe95ae8-52cb-4618-8d7a-a60bdda82412, force=false, secondsToWait=0, gracefully=false), log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] Command DestroyVmVDS execution failed. Exception: EJBException: java.lang.NullPointerException
2014-01-06 11:13:19,011 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (pool-4-thread-49) [282a3a91] FINISH, DestroyVmVDSCommand, log id: 6cc2e424
2014-01-06 11:13:19,011 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (pool-4-thread-49) [282a3a91] Command org.ovirt.engine.core.bll.StopVmCommand throw Vdc Bll exception. With error message VdcBLLException: javax.ejb.EJBException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2014-01-06 11:13:19,039 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-49) [282a3a91] Correlation ID: 282a3a91, Job ID: 73d3b5ab-8de1-47ed-a619-e523fbfadfcb, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM POOLTEST-6 (Host: lilach-vdsb.tlv.redhat.com, User: admin@internal).

Version-Release number of selected component (if applicable):
is30

Actual results:
Powering off the VM fails; StopVmCommand throws a NullPointerException and the VM stays running on the source host.

Expected results:
It should be possible to power off the VM, even when VM migration attempts fail.
Created attachment 846019 [details]
engine log

Hosts put into maintenance @ 10:58. VDSNetworkException on the SPM host while migration was running @ 11:02. Tried and failed to power off the VM @ 11:13.
Created attachment 846085 [details]
host_1 logs

Host time is 2 hours behind the RHEV-M engine time.

Created attachment 846086 [details]
host_2 logs

Host time is 2 hours behind the RHEV-M engine time.

Created attachment 846110 [details]
host_a_put_in_maint_logs

Host time is 2 hours behind the engine time.

Created attachment 846111 [details]
host_b_put_in_maint_logs

Host time is 2 hours behind the engine time.
From our investigation, it seems that the 'rerun' procedure clears the destinationVdsId of the command, but the command itself stays in the 'asyncRunningCommands' cache. So on the next async command (stop/migrate) we try to use it (reportCompleted), and it fails because destinationVdsId was already cleared. We probably need to remove the command from the cache, or make sure the code can handle a missing destinationVdsId (I favor option 1 if it can work OK).
(In reply to Omer Frenkel from comment #7)
> we probably need to remove the command from the cache, or make sure code can
> handle missing destinationVdsId (i favor option 1 if it can work ok)

I agree that it would be good to remove the command from the cache, and IMO the reportCompleted method is not the place to decrease the pending memory; that should be moved to other async callback methods. But since we understand that this bug appears now because of the recently added call to the decrease-pending-memory method in reportCompleted, and we haven't had other issues with the command remaining in the cache, I suggest we just add a null check at this point to keep the fix safe and easy to backport. I'll do some refactoring to improve that code upstream later on.
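The failure mode and the proposed null check can be sketched in a minimal, self-contained way. All class, field, and map names below are illustrative stand-ins (the real ovirt-engine code is structured differently); the sketch only shows why auto-unboxing on a cleared destinationVdsId throws the NPE, and how the suggested null check avoids it:

```java
import java.util.HashMap;
import java.util.Map;

public class AsyncCacheSketch {

    // Stand-in for a migrate command cached in 'asyncRunningCommands'.
    static class MigrateCommand {
        String destinationVdsId;

        MigrateCommand(String destinationVdsId) {
            this.destinationVdsId = destinationVdsId;
        }

        // Buggy variant: assumes destinationVdsId is always set. When it was
        // cleared by 'rerun', pendingMemory.get(null) returns null and the
        // auto-unboxing to int throws a NullPointerException.
        int decreasePendingMemoryBuggy(Map<String, Integer> pendingMemory) {
            int current = pendingMemory.get(destinationVdsId); // NPE here
            pendingMemory.put(destinationVdsId, current - 1);
            return current - 1;
        }

        // Fixed variant: the null check suggested in the comment above.
        Integer decreasePendingMemorySafe(Map<String, Integer> pendingMemory) {
            if (destinationVdsId == null) {
                return null; // rerun already cleared the destination; nothing to do
            }
            int current = pendingMemory.get(destinationVdsId);
            pendingMemory.put(destinationVdsId, current - 1);
            return current - 1;
        }
    }

    public static void main(String[] args) {
        Map<String, MigrateCommand> asyncRunningCommands = new HashMap<>();
        Map<String, Integer> pendingMemory = new HashMap<>();
        pendingMemory.put("host2", 2);

        MigrateCommand cmd = new MigrateCommand("host2");
        asyncRunningCommands.put("vm-1", cmd);

        // 'rerun' clears the destination but leaves the command in the cache.
        cmd.destinationVdsId = null;

        boolean threwNpe = false;
        try {
            asyncRunningCommands.get("vm-1").decreasePendingMemoryBuggy(pendingMemory);
        } catch (NullPointerException e) {
            threwNpe = true;
        }
        System.out.println("buggy path threw NPE: " + threwNpe);          // true
        System.out.println("safe path returns: "
                + asyncRunningCommands.get("vm-1").decreasePendingMemorySafe(pendingMemory)); // null
    }
}
```

The null check keeps the cached-but-stale command harmless without touching the cache lifecycle, which is what makes it a small, backport-friendly change compared to removing the command from the cache.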
merged to master, pending 3.4 backport
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=6e9d39a8fdab1b2e3e0f0ce8917fc877e7e221f2
Verified on ovirt-engine-3.4.0-0.7.beta2.el6.noarch, with 3 RHEL hosts:
- Run a VM on host1.
- Make sure host3's available memory is not sufficient to contain this VM.
- Migrate the VM (to any host).
- The VM starts migrating to host2 (since host3's memory is full).
- While the migration is running, kill the qemu process on host2:
  ps aux | grep qemu | grep -v grep | grep -v supervdsmServer | awk '{print $2}' | xargs -I^ kill -9 ^
- As a result:
  a. The migration fails.
  b. Another host (host3) is tried, which fails as well. The VM stays running on host1.
- Powering off the VM succeeded.
Closing as part of 3.4.0