Bug 697277

Summary: Backend: wrong error message when migration fails
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yaniv Kaul <ykaul>
Component: ovirt-engine
Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE
QA Contact: Pavel Novotny <pnovotny>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.2.7
CC: bsettle, iheim, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---
Flags: sherold: Triaged+
Target Release: 3.5.0
Hardware: All
OS: Windows
Whiteboard: virt
Fixed In Version: vt3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-17 08:30:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 860222, 1142923, 1156165

Description Yaniv Kaul 2011-04-17 12:17:18 UTC
Description of problem:
I've migrated a VM from host A to host B.
Unfortunately, the migration failed (due to a timeout).
This has caused the VM on the destination to exit:
Thread-146826::DEBUG::2011-04-17 14:55:44,912::vm::1434::vds.vmlog.514e0257-3f28-4f39-9a44-d2b786146675::Changed state to Down: Migration failed

The event on the RHEVM is:
VM <vmname> is down. Exit message Migration failed.

While technically it is somewhat correct (the destination is down), in reality the VM should still be up (and hopefully running!) on the source!
The message, as is, is quite confusing and alarming.

Version-Release number of selected component (if applicable):
2.2.7

How reproducible:


Steps to Reproduce:
1. Cause the migration to fail due to timeout (make the VM do a lot of I/O, for example, or memory scanning; see the sketch under Additional info below).
2.
3.
  
Actual results:


Expected results:


Additional info:
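A minimal guest-side load generator along the lines of step 1 above. This is only an illustrative sketch (the script name, buffer size and temp file path are made up, not part of any product): it keeps dirtying memory pages and doing a bit of disk I/O so the migration's dirty-page transfer never converges and eventually hits the timeout.

#!/usr/bin/env python
# dirty_pages.py - hypothetical guest-side load generator (illustrative only).
# Keeps rewriting a large buffer so the migration never converges
# and eventually hits the configured migration timeout.
import os
import time

BUF_SIZE = 512 * 1024 * 1024   # 512 MiB of guest RAM to keep dirty
PAGE = 4096                    # touch one byte per page

buf = bytearray(BUF_SIZE)
while True:
    # Mark every page of the buffer dirty.
    for off in range(0, BUF_SIZE, PAGE):
        buf[off] = (buf[off] + 1) % 256
    # Add some disk I/O on top, as suggested in the reproducer.
    with open('/tmp/migration-load.tmp', 'wb') as f:
        f.write(os.urandom(1024 * 1024))
    time.sleep(0.01)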

Comment 1 lpeer 2011-04-21 10:41:13 UTC
When a VM runs on a host, RHEVM collects statistics from that host and uses them to learn the status of the VM.

At the moment, for Down VMs, VDSM returns one of two exit statuses, Normal or ERROR, and in the error case it adds a string with the exit message.

To have a dedicated message for a failed migration, RHEVM and VDSM either need to support another exit code, or commit to never changing the message text so that RHEVM can base logic on its content.
I personally prefer a new exit code, ERROR_ON_MIGRATION.

Anyway, moving this to an RFE to support either of the two options.
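To make the idea concrete, here is a rough sketch of the exit-code option. The constant values and the helper function are hypothetical, not actual VDSM or engine code; the point is only that the engine would branch on a status code instead of parsing the free-form message.

# Hypothetical sketch of the proposed exit statuses (values are illustrative).
NORMAL = 0                # VM shut down cleanly
ERROR = 1                 # VM went down because of an error
ERROR_ON_MIGRATION = 2    # proposed: the VM object on the destination was
                          # torn down because an incoming migration failed

def event_for_down_vm(vm_name, exit_status, exit_message):
    # Pick the event text from the exit status, not from the message string.
    if exit_status == ERROR_ON_MIGRATION:
        # The VM is expected to still be running on the source host.
        return "Migration of VM %s failed: %s" % (vm_name, exit_message)
    if exit_status == ERROR:
        return "VM %s is down. Exit message: %s" % (vm_name, exit_message)
    return "VM %s powered off." % vm_name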

Comment 2 Simon Grinberg 2011-05-11 13:56:05 UTC
(In reply to comment #0)

> While technically it is somewhat correct (the destination is down), in reality
> the VM should still be up (and hopefully running!) on the source!
> The message, as is, is quite confusing and alarming.
> 

Kaul, as I understand it, this is the case, i.e. the source machine is up and running as it should be. The problem is the event that declares the machine went down, while the status in the GUI returns to Up - right?

(In reply to comment #1)
> When a VM runs on a host, RHEVM collects statistics from that host and uses
> them to learn the status of the VM.
> 
> At the moment, for Down VMs, VDSM returns one of two exit statuses, Normal or
> ERROR, and in the error case it adds a string with the exit message.
> 
> To have a dedicated message for a failed migration, RHEVM and VDSM either need
> to support another exit code, or commit to never changing the message text so
> that RHEVM can base logic on its content.
> I personally prefer a new exit code, ERROR_ON_MIGRATION.
> 
> Anyway, moving this to an RFE to support either of the two options.

Livnat, if Kaul's answer to my question is positive, the problem is that RHEV Manager collects the status from both the destination and the source hosts, and when it encounters the Down report from the destination it logs it, even though the VM itself is up and running on the source. In that case it is a bug and not an RFE. The backend should cross-check this report from the destination against the status on the source and conclude that the migration failed - unless you are not maintaining a thread that follows the migration from start to finish.
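A sketch of that cross-check, purely illustrative (the report shape and field names are made up, not the actual backend data model): when the destination reports the VM as Down during a migration, confirm against the source before emitting a "VM is down" event.

# Hypothetical reconciliation logic; field names are illustrative only.
def classify_down_report(dest_report, source_report, migration_in_progress):
    # dest_report / source_report are per-host VM status dicts as the backend
    # might collect them, e.g. {'status': 'Down', 'exit_message': 'Migration failed'}.
    exit_message = dest_report.get('exit_message', '')
    if migration_in_progress and source_report.get('status') == 'Up':
        # The destination side went down but the source kept the VM running:
        # report a failed migration, not a VM that went down.
        return "Migration failed (%s); VM is still running on the source host." % exit_message
    # No running copy on the source, so the VM really is down.
    return "VM is down. Exit message: %s" % exit_message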

Comment 8 Michal Skrivanek 2014-01-30 09:55:39 UTC
While waiting for the engine-side refactoring, comment #1 makes sense to implement independently. Once we have a differentiation in the VM's exit code, the engine-side change would be trivial to do.

Comment 9 Francesco Romani 2014-01-30 10:05:30 UTC
Seems related: https://bugzilla.redhat.com/show_bug.cgi?id=557125 , considering the amendments suggested by Federico and implemented in the last patchset:
http://gerrit.ovirt.org/#/c/22631/5/

Comment 10 Francesco Romani 2014-09-09 13:31:56 UTC
This independent fix: http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=68aba2b12b90a997cee0f1e0221eb6f48eb8fd35

for this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1104195

should have solved this problem as well. Moving to MODIFIED and abandoning my patch.

Comment 11 Eyal Edri 2014-09-10 20:21:58 UTC
Fixed in vt3, moving to ON_QA.
If you believe this bug isn't released in vt3, please report to rhev-integ.

Comment 12 Pavel Novotny 2014-10-08 16:41:21 UTC
Verified in rhevm-3.5.0-0.14.beta.el6ev.noarch (vt5).

Verification steps follow the reproducer:
1. Have a VM with high CPU & IO load.
2. Start migration of the VM from host A to host B.

Result:
Migration is cancelled due to timeout. The VM event message says:
"Migration failed due to Error: Migration not in progress (VM: user-vm01, Source: A, Destination: B)."
The VM then remains running on the source host A.

Comment 13 Omer Frenkel 2015-02-17 08:30:16 UTC
RHEV-M 3.5.0 has been released