Bug 505340

Summary: VM migration and subsequent cluster.conf update can cause the VM to restart
Product: Red Hat Enterprise Linux 5
Reporter: Lon Hohberger <lhh>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.4
CC: cfeist, cluster-maint, cward, djansa, edamato, samuel.kielek, sghosh, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: rgmanager-2.0.51-1.el5
Doc Type: Bug Fix
Last Closed: 2009-09-02 11:03:46 UTC
Type: ---
Bug Blocks: 505479    
Attachments:
Description Flags
Fix none

Description Lon Hohberger 2009-06-11 14:56:35 UTC
Description of problem:

If you perform the following steps:
* Migrate a virtual machine to another host in the cluster.
* Update the cluster configuration.

There is a nonzero chance that each previously migrated virtual machine will be restarted.  This occurs because of the way rgmanager performs migration.  When we migrate, we assign the virtual machine to the new owner in rgmanager's internal state and execute 'xm migrate' (or virsh migrate).  The new owner in rgmanager's view then periodically checks for the completion of the migration using a status check.

This is fine; however, internally the check uses the RG_STATUS operation.  If this operation reports a 'bad' status (i.e. the VM has not completed migration yet), the VM is flagged with RF_NEEDSTOP.

In rgmanager 2.0.46-1.el5.3_3, we added the RG_STATUS_INQUIRY operation internally -- which is a 'quick' status check which never records anything in the resource tree.  This operation should be used for the VM during migration completion checks, but isn't.
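
The difference between the two operations can be illustrated with a toy model. This is only a sketch, not rgmanager's code: the names RG_STATUS, RG_STATUS_INQUIRY and RF_NEEDSTOP come from this report, while the constant values, the Resource class and both helper functions are hypothetical, and "restart on reconfiguration if RF_NEEDSTOP is set" is a simplification of the behavior described above.

```python
from dataclasses import dataclass

# Hypothetical values; only the names appear in this bug report.
RF_NEEDSTOP = 0x1
RG_STATUS = "status"
RG_STATUS_INQUIRY = "status_inquiry"


@dataclass
class Resource:
    """Toy stand-in for a node in the resource tree."""
    name: str
    flags: int = 0


def run_check(res: Resource, op: str, healthy: bool) -> bool:
    """Run a status-type operation and return the result.

    Only RG_STATUS records a failure in the resource; the inquiry
    variant reports the result and leaves the resource untouched.
    """
    if not healthy and op == RG_STATUS:
        res.flags |= RF_NEEDSTOP
    return healthy


def restarted_on_reconfig(res: Resource) -> bool:
    """Simplified model: a flagged resource is restarted on a config update."""
    return bool(res.flags & RF_NEEDSTOP)


# Both VMs are polled while their migration is still in flight.
vm_a = Resource("vm_a")
run_check(vm_a, RG_STATUS, healthy=False)          # behavior described above
vm_b = Resource("vm_b")
run_check(vm_b, RG_STATUS_INQUIRY, healthy=False)  # intended behavior

print(restarted_on_reconfig(vm_a), restarted_on_reconfig(vm_b))
```

In this model only vm_a carries a stale flag into the next configuration update, which matches the restart described in this report.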

Version-Release number of selected component (if applicable): 2.0.46-1.el5.3_3, 2.0.50-1.el5

How reproducible: Sometimes.  If the VM migration completes prior to the status check being performed, the status check will not fail and the RF_NEEDSTOP flag will not be set.

For example, if you have a migration that takes 10 seconds, there's approximately a 1 in 3 chance that the VM will be restarted on a subsequent reconfiguration (the VM check interval is 30 seconds by default).
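
The 1 in 3 estimate follows from a simple ratio. The sketch below assumes that the status check fires at a uniformly random point relative to the start of the migration, which is an assumption made for illustration; the function name is hypothetical.

```python
def restart_probability(migration_seconds: float,
                        check_interval_seconds: float = 30.0) -> float:
    """Chance that a periodic check lands inside the migration window.

    Assumes the check phase is uniformly random relative to the start
    of the migration, so the chance is the fraction of one interval
    that the migration occupies, capped at 1.
    """
    if check_interval_seconds <= 0:
        raise ValueError("check interval must be positive")
    window = max(0.0, migration_seconds)
    return min(1.0, window / check_interval_seconds)


# A 10 second migration against the default 30 second interval.
print(round(restart_probability(10), 2))
```

Under the same assumption, any migration that takes at least one full check interval is certain to be caught by a status check.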

Comment 2 Lon Hohberger 2009-06-11 15:27:39 UTC
Created attachment 347426 [details]
Fix

* Ensures we use RG_STATUS_INQUIRY when determining if migration has completed,
* Ensures that subsequent RG_STATUS checks clear any NEEDSTOP flags to prevent this from happening in the future, and
* Fixes erroneous migration_mapping noise
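
The first two points can be sketched in a toy model. This does not reproduce the attached patch: the flag value, the Resource class and both function names are hypothetical, and only the flag and operation names are taken from this report.

```python
RF_NEEDSTOP = 0x1  # hypothetical value; only the name is from this report


class Resource:
    """Toy stand-in for a VM entry in the resource tree."""

    def __init__(self, name: str, flags: int = 0) -> None:
        self.name = name
        self.flags = flags


def migration_complete(res: Resource, vm_running_here: bool) -> bool:
    """Poll with inquiry semantics: report the result, record nothing."""
    return vm_running_here


def status_check(res: Resource, healthy: bool) -> bool:
    """Regular status check: a good result clears a stale flag."""
    if healthy:
        res.flags &= ~RF_NEEDSTOP
    else:
        res.flags |= RF_NEEDSTOP
    return healthy


# A VM that picked up a stale flag before the fix was applied.
vm = Resource("vm_a", flags=RF_NEEDSTOP)
migration_complete(vm, vm_running_here=False)  # leaves the flags untouched
status_check(vm, healthy=True)                 # clears the stale flag
print(vm.flags)
```

Clearing the flag on a good status means a resource that was flagged earlier recovers on its next successful check instead of waiting for a restart.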

Comment 4 Lon Hohberger 2009-06-11 17:41:49 UTC
I have reproduced this problem.

Comment 5 Lon Hohberger 2009-06-11 20:11:30 UTC
The patch above worked for me.  I tested by:

* Creating a faux VM service,
* Starting a migration of it and ensuring that the actual migration of my fake VM service took >5 minutes, and
* Updating cluster.conf after the migration completed.

As expected, without the patch, the VM was restarted.

With the patch, the update to cluster.conf did not cause a VM restart.  Additionally, I tested standard service failure (by creating a service with an IP address and removing the IP address forcefully) to ensure the patch did not break normal service recovery.

RHEL5 branch:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=47f11a83f75e8b6ea332dc64bf74bce793b0388e

RHEL54 branch:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=1d101c6b92cad6266d41ad935e2dc8e1d06d732d

Comment 6 Lon Hohberger 2009-06-11 20:19:09 UTC
I also tested:

* Begin migrating faux VM service
* Update cluster.conf with new config version -before- migration completes

Note that users should not update the configuration of services or virtual machines while they are in transition.

Comment 11 Chris Ward 2009-07-03 18:45:38 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or directed to your customer or partner representative.

Comment 12 Lon Hohberger 2009-07-30 21:49:31 UTC
For updating cluster.conf, all I did was (literally) change the config version and run 'ccs_tool update'.
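
A version-only change like this can be scripted. The following is a minimal sketch, assuming the usual layout in which config_version is an attribute of the root cluster element; the function name and the sample document are hypothetical, and the script only rewrites XML text without calling ccs_tool itself.

```python
import xml.etree.ElementTree as ET


def bump_config_version(xml_text: str) -> str:
    """Return the cluster.conf text with config_version incremented by one."""
    root = ET.fromstring(xml_text)
    if root.tag != "cluster":
        raise ValueError("expected <cluster> as the root element")
    current = root.get("config_version")
    if current is None:
        raise ValueError("root element has no config_version attribute")
    root.set("config_version", str(int(current) + 1))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed configuration used only for illustration.
sample = '<cluster name="example" config_version="41"><rm/></cluster>'
updated = bump_config_version(sample)
print(updated)
```

After writing the result back to the configuration file, the new version would be propagated with 'ccs_tool update' as described above.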

Comment 15 errata-xmlrpc 2009-09-02 11:03:46 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html