
Bug 499835

Summary: detect failed virtual machine migrations in a cluster
Product: Red Hat Enterprise Linux 5
Reporter: Lon Hohberger <lhh>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: low
Version: 5.3
CC: cluster-maint, cward, djansa, edamato, federico.simoncelli, henry.robertson, hklein, jcastillo, jruemker, mrappa, pep, rbinkhor, samuel.kielek, tao, ywong
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: rgmanager-2.0.52-1.27.el5
Doc Type: Bug Fix
Last Closed: 2010-03-30 08:49:03 UTC
Bug Depends On: 412911
Attachments: Fix rgmanager behavior based on expectation that migration is a synchronous operation (flags: none)

Description Lon Hohberger 2009-05-08 14:04:41 UTC
Description of problem:

So, as documented, virtual machine management tools do not guarantee
that a migration has completed by the time the initiating command exits:

[from virsh man page]
Most virsh commands act asynchronously, so just because the virsh 
program returned, doesn’t mean the action is complete.  This is
important, as many operations on domains, like create and shutdown,
can take considerable time (30 seconds or more) to bring the machine
into a fully compliant state.  If you want to know when one of these
actions has finished you must poll through virsh list periodically.
[end quote]

Therefore, rgmanager's migration code works like this:

* initiate migrate
* if initiation was successful, mark state as 'migrating'
  to the target host.
* target host polls VM tool sets (xm in 5.3, virsh in 5.4)
* when target host sees the vm running and in the right 
  state, mark the VM as 'started' on the target host
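
The flow above can be sketched roughly as follows. This is an illustration only, not rgmanager's code: the function name, the callbacks, and the `max_polls` limit are all hypothetical (the limit exists only so the sketch terminates), and the polling, which the target host does in the description above, is collapsed into a single function for brevity.

```python
import time
from typing import Callable


def migrate_async(start_migration: Callable[[], bool],
                  vm_running_on_target: Callable[[], bool],
                  set_state: Callable[[str], None],
                  poll_interval: float = 1.0,
                  max_polls: int = 30) -> bool:
    """Initiate a migration, then poll until the VM is seen running on the target.

    All parameters are hypothetical stand-ins for cluster operations.
    """
    if not start_migration():
        # Initiation failed outright; the service state is left untouched.
        return False
    set_state("migrating")
    for _ in range(max_polls):
        if vm_running_on_target():
            set_state("started")
            return True
        time.sleep(poll_interval)
    # The VM never appeared on the target. The state is still 'migrating',
    # which is the situation this bug is about.
    return False
```

Note that when the polling gives up, the state is left at 'migrating': nothing in this flow distinguishes a slow migration from one that has died.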

This works well in principle.  In practice, however, a number of
problems can arise because xm, for example, will return success
even though the migration itself can still fail.  Currently, no
resource checking is performed prior to a migrate operation using
'xm', so a migration can fail halfway through without our knowing.

So, this RFE is to enhance rgmanager to not subvirt (get it?)
or duplicate the work on resource-checking in libvirt, but
rather to simply detect when a migration operation ceased to
function (thereby leaving a virtual machine inoperable) for
reasons outside of cluster software's control.  After we detect
the correct state, we can subsequently attempt to recover the
virtual machine.

Preliminary Examples:

* vm stuck in non-running state on target
  * destroy vm on source
  * destroy vm on target
  * restart vm on target
* vm nonexistent
  * destroy vm on source
  * destroy vm on target
  * restart vm on target
* vm running on source and paused on target but no signs of
  migration occurring exist any more
  * destroy vm on target
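
One way to read the preliminary examples above is as a small decision table. The sketch below is an interpretation, not the implemented behavior: the function name, the state strings ("running", "paused", None for nonexistent), and the `migration_active` flag are assumptions made for illustration. The third example is checked first because a paused VM on the target would otherwise also match the "non-running on target" case.

```python
from typing import List, Optional


def recovery_actions(source: Optional[str],
                     target: Optional[str],
                     migration_active: bool = False) -> List[str]:
    """Map observed VM states on source and target to recovery steps.

    States are hypothetical labels; None means the VM does not exist there.
    """
    if migration_active:
        # Migration still appears to be in progress; nothing to recover yet.
        return []
    full_restart = ["destroy on source", "destroy on target", "restart on target"]
    if source == "running" and target == "paused":
        # Source copy is intact; only the stale target copy needs removing.
        return ["destroy on target"]
    if source is None and target is None:
        # VM nonexistent anywhere.
        return full_restart
    if target is not None and target != "running":
        # VM stuck in a non-running state on the target.
        return full_restart
    return []
```

For example, `recovery_actions("running", "paused")` yields only the destroy on the target, leaving the healthy source copy alone.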

Comment 1 Lon Hohberger 2009-07-29 14:59:28 UTC
Talking with the virt developers, the 'virsh migrate' operation is supposed to be synchronous; the man page is wrong on this point:

https://bugzilla.redhat.com/show_bug.cgi?id=514532

Ergo, if 'virsh migrate' succeeds but the VM is still on the source host, the migration failed and action can be taken immediately to remedy the situation.

Comment 2 Lon Hohberger 2009-09-28 15:46:55 UTC
Pursuant to comment #1: when using virsh, we can detect the failed migrations easily as 'virsh migrate' is a synchronous operation.

Ergo, if 'virsh migrate' fails, we can simply do a 'status check' followed by a destroy if required on src/target.

Comment 5 Lon Hohberger 2009-10-15 20:02:37 UTC
If 'xm migrate' is expected to be synchronous, we can simplify this operation immensely:

- migrate vm
- if successful, flip state

This works with 'virsh'.
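
A minimal sketch of the simplified two-step flow, assuming the migrate command blocks until the migration has finished. The helper names (`run`, `migrate_sync`, `set_owner`) are hypothetical, and the attached patch may well differ. The command runner is injectable so the logic can be exercised without a hypervisor.

```python
import subprocess
from typing import Callable, Sequence


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.call(list(cmd))


def migrate_sync(domain: str,
                 dest_uri: str,
                 set_owner: Callable[[str], None],
                 run_cmd: Callable[[Sequence[str]], int] = run) -> bool:
    """Migrate synchronously; flip ownership only if the command succeeded."""
    # Assumes 'virsh migrate' does not return until the migration is done.
    rc = run_cmd(["virsh", "migrate", "--live", domain, dest_uri])
    if rc != 0:
        return False
    # Hypothetical callback that records the VM as started on the target.
    set_owner(dest_uri)
    return True
```

Because the state is flipped only after the command returns success, no separate polling step on the target is needed.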

Comment 6 Lon Hohberger 2009-10-15 20:03:53 UTC
Created attachment 364985 [details]
Fix rgmanager behavior based on expectation that migration is a synchronous operation

Comment 7 Lon Hohberger 2009-10-15 20:30:20 UTC
After muddling through the Xen xend and xm code, it appears that 'xm' is expected to be sync.  Effectively, 'xm suspend' and 'xm migrate' call the same backend utility: xc_save.  They pass in a different file descriptor.  Naturally, one is over the network and the other is a file on disk where the memory is dumped.

Comment 8 Lon Hohberger 2009-10-16 13:42:51 UTC
Talking with engineers here who work on Xen, xm migrate is also synchronous.

Comment 9 Lon Hohberger 2009-11-11 16:14:51 UTC
So, at a minimum, we need:

- sync migrate patch (provided)

- test-after-migrate patch: if 'virsh migrate' or 'xm migrate' fails, recheck status of the VM locally.  If it's still in a good state, then we need to NOT return a failure and/or return a non-fatal error so rgmanager does not mark the VM as 'failed' (which would require a restart)
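
The test-after-migrate idea can be sketched like this. The result names and both callbacks are hypothetical, and the real patch's return codes are not shown in this report, so treat this as an outline of the logic only.

```python
from enum import Enum
from typing import Callable


class MigrateResult(Enum):
    OK = "ok"              # migration succeeded
    NONFATAL = "nonfatal"  # migration failed, VM still healthy locally
    FAILED = "failed"      # migration failed and VM is not healthy locally


def migrate_then_test(migrate: Callable[[], bool],
                      local_vm_healthy: Callable[[], bool]) -> MigrateResult:
    """Attempt a migration; on failure, recheck the VM on the local host."""
    if migrate():
        return MigrateResult.OK
    if local_vm_healthy():
        # VM is still in a good state here, so the service should not be
        # marked 'failed' (which would require a restart).
        return MigrateResult.NONFATAL
    return MigrateResult.FAILED
```

The local status check runs only after a failed migrate, so a successful migration pays no extra cost.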

Comment 10 Lon Hohberger 2009-11-23 15:41:05 UTC
*** Bug 315131 has been marked as a duplicate of this bug. ***

Comment 11 Harald Klein 2009-12-09 15:33:15 UTC
Hi Lon,

do we have an estimate when a fix will be available?

best regards,
Hari

Comment 12 Lon Hohberger 2009-12-14 21:18:29 UTC
Within the above outlined constraints for what needs to be done, I should have a package today.

Comment 16 Chris Ward 2010-02-11 10:11:42 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 19 errata-xmlrpc 2010-03-30 08:49:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0280.html