Red Hat Bugzilla – Bug 505340
VM migration and subsequent cluster.conf update can cause the VM to restart
Last modified: 2016-04-26 11:26:30 EDT
Description of problem:
If you perform the following steps:
* Migrate a virtual machine to another host in the cluster.
* Update the cluster configuration.
There is a nonzero chance that each previously migrated virtual machine will be restarted. This occurs because of the way rgmanager performs migration. When we migrate, we assign the virtual machine to the new owner in rgmanager's internal state and execute 'xm migrate' (or virsh migrate). The new owner in rgmanager's view then periodically checks for the completion of the migration using a status check.
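This is fine; however, internally the check uses the RG_STATUS operation. With this operation, if the status is 'bad' (i.e. the VM has not completed migration yet), the VM is flagged with RF_NEEDSTOP. That flag is still set when the cluster configuration is next updated, which is what leads to the restart.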
In rgmanager 2.0.46-1.el5.3_3, we added the RG_STATUS_INQUIRY operation internally -- which is a 'quick' status check which never records anything in the resource tree. This operation should be used for the VM during migration completion checks, but isn't.
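The difference between the two operations can be illustrated with a toy model. This is only a sketch in Python, not rgmanager's actual C code: the Resource class, the numeric flag value, and all function names are hypothetical. Only the names RG_STATUS, RG_STATUS_INQUIRY and RF_NEEDSTOP come from this report, and the restart-on-reconfiguration rule is an assumption inferred from the description above.

```python
RF_NEEDSTOP = 0x1  # hypothetical bit value; only the flag name is from the report


class Resource:
    """Toy stand-in for a VM entry in the resource tree."""

    def __init__(self, name):
        self.name = name
        self.flags = 0
        self.migration_done = False


def rg_status(res):
    """Model of RG_STATUS: a 'bad' result is recorded on the resource."""
    ok = res.migration_done
    if not ok:
        res.flags |= RF_NEEDSTOP
    return ok


def rg_status_inquiry(res):
    """Model of RG_STATUS_INQUIRY: report the status, record nothing."""
    return res.migration_done


def restarted_on_reconfig(res):
    """Assumption: a config update restarts resources still flagged NEEDSTOP."""
    return bool(res.flags & RF_NEEDSTOP)


def simulate(check):
    """Poll once mid-migration with `check`, finish migrating, then reconfigure."""
    vm = Resource("vm1")
    check(vm)                 # status poll while the migration is in flight
    vm.migration_done = True  # migration completes afterwards
    return restarted_on_reconfig(vm)
```

In this model, simulate(rg_status) leaves the stale flag behind and reports a restart, while simulate(rg_status_inquiry) does not.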
Version-Release number of selected component (if applicable): 2.0.46-1.el5.3_3, 2.0.50-1.el5
How reproducible: Sometimes. If the VM migration completes prior to the status check being performed, the status check will not fail and the RF_NEEDSTOP flag will not be set.
For example, if you have a migration that takes 10 seconds, there's approximately a 1 in 3 chance that the VM will be restarted on a subsequent reconfiguration (the VM check interval is 30 seconds by default).
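The 1 in 3 estimate follows from a simple ratio. As a rough sketch, assuming the migration starts at a uniformly random point within the periodic check cycle (the function name and this uniform-phase assumption are illustrative, not taken from rgmanager):

```python
def restart_probability(migration_secs, check_interval_secs=30.0):
    """Chance that a periodic status check lands while migration is in flight.

    Assumes the migration start is uniformly distributed within the
    check cycle, so the chance is the fraction of the cycle that the
    migration covers, capped at 1.
    """
    if migration_secs <= 0:
        return 0.0
    return min(1.0, migration_secs / check_interval_secs)
```

Under these assumptions a 10-second migration gives 10/30, and any migration lasting longer than the check interval is always caught by at least one check.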
Created attachment 347426
This patch:
* Ensures we use RG_STATUS_INQUIRY when determining if migration has completed,
* Ensures that subsequent RG_STATUS checks clear any NEEDSTOP flags to prevent this from happening in the future, and
* Fixes erroneous migration_mapping noise
I have reproduced this problem.
The patch above worked for me. I tested by:
* Creating a faux VM service,
* Starting to migrate it and ensuring that the actual migration for my fake VM service took more than 5 minutes, and
* Updating cluster.conf after the migration completed.
As expected, without the patch, the VM was restarted.
With the patch, the update to cluster.conf did not cause a VM restart. Additionally, I tested standard service failure (by creating a service with an IP address and removing the IP address forcefully) to ensure the patch did not break normal service recovery.
I also tested:
* Begin migrating faux VM service
* Update cluster.conf with new config version -before- migration completes
Note that users should not update the configurations of services or virtual machines while they are in transition.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
For updating cluster.conf, all I did was (literally) change the config version and run 'ccs_tool update'.
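For anyone repeating the test, the version bump could be scripted roughly as follows. This is a minimal sketch, assuming the version is stored in the config_version attribute of the top-level cluster element; the helper name is hypothetical, and re-serializing with ElementTree drops comments and the XML declaration, so editing the file by hand is just as valid.

```python
import xml.etree.ElementTree as ET


def bump_config_version(xml_text):
    """Return cluster.conf text with config_version incremented by one.

    Assumes the root element carries an integer config_version attribute.
    """
    root = ET.fromstring(xml_text)
    current = int(root.get("config_version"))
    root.set("config_version", str(current + 1))
    return ET.tostring(root, encoding="unicode")
```

After writing the result back to cluster.conf, the update is pushed with 'ccs_tool update' as described above.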
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.