Red Hat Bugzilla – Bug 468691
Virtual Services guest can start on 2 nodes at same time
Last modified: 2010-10-23 01:25:46 EDT
When managing a Xen virtual machine using rgmanager, it is possible for the guest to be started on two different cluster nodes at once if xend dies, likely leading to corruption of the guest filesystems. For example, if the vm is running on node 1 and xend dies for whatever reason on that node, at the next status check the vm resource agent will think the guest has died. However, xend dying does not mean the guest itself has died, because the qemu-dm process can still be running. When rgmanager goes to start the vm on node 2, it thinks it has first successfully stopped the guest on node 1 when it really hasn't.
The reason the vm agent thinks the guest has died is that the xm tools do not work when xend is not running. So when it runs a status check it does:
xm list $OCF_RESKEY_name &> /dev/null
which will fail whether the guest is running or not. So now rgmanager goes to stop the guest which includes
xm stop <guest>
status || return 0
In this case we want status to return non-zero indicating the guest has successfully stopped. However 'xm list' is going to fail in every case even if the guest is still running, so we incorrectly return success to rgmanager. Now it follows the recovery policy which will eventually start it on the second node for both relocate and restart policies. At this point the guest is running on two different nodes.
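A minimal sketch of the false success described above, runnable without Xen. The names `XEND_UP`, `GUEST_RUNNING` and `xm_list` are hypothetical stand-ins for the real tools, not part of vm.sh; only the `status || return 0` line mirrors the agent logic quoted in this report.

```shell
# Hypothetical stand-ins for the node's real state.
XEND_UP=0        # 0 = xend is dead, 1 = xend is running
GUEST_RUNNING=1  # the guest (qemu-dm) is still alive

xm_list() {
    # Behaves like "xm list <guest>": fails whenever xend is down,
    # whether or not the guest is running.
    [ "$XEND_UP" -eq 1 ] && [ "$GUEST_RUNNING" -eq 1 ]
}

status() {
    xm_list > /dev/null 2>&1
}

stop() {
    # "xm stop <guest>" would go here; it cannot work without xend.
    status || return 0   # status failed, so the guest is assumed stopped
    return 1
}

stop
echo "stop returned $?, guest running: $GUEST_RUNNING"
```

With xend down, the sketch prints `stop returned 0, guest running: 1`: the stop is reported as successful while the guest is still alive.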
The attached patch corrects this problem by changing the above status check to
status || /usr/sbin/xend status && return 0
Now if the first status check fails we make sure xend is actually running. If it is then the xm list info we got back is legitimate and the guest is actually stopped. Otherwise it will proceed with destroying the guest and repeating the loop until we hit the 60 second timeout or the guest dies.
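One caveat if the check is typed on a single line exactly as shown: in the shell, `||` and `&&` have equal precedence and are evaluated left to right, so `a || b && return 0` behaves like `{ a || b; } && return 0` and would also return when `a` succeeds. A hedged sketch of the intended logic with explicit grouping follows; `xend_status`, `XEND_UP` and `GUEST_RUNNING` are hypothetical stubs used only so the example runs without Xen.

```shell
# Stubs (hypothetical): XEND_UP and GUEST_RUNNING simulate the node's state.
XEND_UP=0
GUEST_RUNNING=1

status()      { [ "$XEND_UP" -eq 1 ] && [ "$GUEST_RUNNING" -eq 1 ]; }
xend_status() { [ "$XEND_UP" -eq 1 ]; }   # stands in for /usr/sbin/xend status

guest_stopped() {
    # A failed status check counts as "stopped" only while xend is up.
    status || { xend_status && return 0; }
    return 1
}

if guest_stopped; then
    echo "guest confirmed stopped"
else
    echo "guest not confirmed stopped"
fi
```

With xend down, the failed status check is no longer trusted, so the stop loop would carry on toward its timeout instead of reporting success.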
Steps to Reproduce:
1) Configure a clustered vm like so
<failoverdomain name="it230349" nofailback="0" ordered="0" restricted="0">
<failoverdomainnode name="cluster2-2.gsslab.rdu.redhat.com" priority="1"/>
<failoverdomainnode name="cluster2-3.gsslab.rdu.redhat.com" priority="1"/>
</failoverdomain>
<vm autostart="1" exclusive="0" domain="it230349" name="machine4" path="/var/lib/xen/images/" recovery="restart"/>
2) Start the vm on any node
3) On that node, stop xend
# service xend stop
Actual Results:
Rgmanager thinks the guest has died, tries to run a stop on it and gets a success return code. It now follows the recovery policy, which eventually results in the guest running on another node.
Expected Results:
Rgmanager detects a problem and tries to stop the guest in order to follow the recovery policy. Since there is no way to stop a guest once xend is dead, the service should go into a failed state requiring administrator intervention, similar to a filesystem that cannot be unmounted.
Additional info:
The vm resource does have an attribute called 'hardrecovery' which causes the cluster node to reboot if the guest fails to stop. However, this option is not useful in this case because we are not actually getting a failure on stopping. Just wanted to point that out since it was my initial idea for how to solve this problem.
Created attachment 321613
I propose to change the "status" function of vm.sh instead of the "stop" function.
rgmanager detects the problem when running the "status" check. If xend is down only for a short period of time (like service xend restart) and rgmanager runs a status check during that time, the xm commands will probably succeed again by the time the "stop" sequence runs afterwards.
As a result the VM will shut down without real need to do so, thereby causing a service interruption of the VM's services.
Of course this approach leaves a blind spot on rgmanager -- it can't detect the status issue anymore. The ideal solution would be for rgmanager to monitor xend as well and to know that a VM's service state can't be determined if xend is down. Therefore rgmanager should try to reactivate xend, and if rgmanager fails to do so the node should fence itself.
--- vm.sh.rhel53 2009-02-03 15:09:32.000000000 +0100
+++ vm.sh 2009-02-03 15:20:52.000000000 +0100
@@ -455,6 +455,10 @@
+ # RSI added - gs 20090203
+ # start
+ xm info &> /dev/null || return 0
+ # end
xm list $OCF_RESKEY_name &> /dev/null
if [ $? -eq 0 ]; then
Can anybody please state the current status of this bug?
It would be really great to have this one fixed as it leads - if one hits it - to data corruption in the DomU!!
Isn't that vm.sh patch, which fixes the stop behaviour, appropriate?
Patching 'status' to return success when it shouldn't will break migration detection. Setting the VM state to 'failed' when xend is dead is appropriate.
I do not agree.
xend is only required to manage (create/destroy/shutdown/migrate) VMs, but not to run them.
xend may be down (or dead) while the existing VMs keep on running like a charm. And starting xend in that situation just brings back the management capabilities.
You're correct, the VM will keep running. The VM will keep running even when rgmanager marks it as 'failed' because xend is dead, too.
This bug was filed because a VM ended up actually running on two hosts in the cluster - potential data corruption. The VM continued to run on the machine where xend was dead, but rgmanager restarted the VM somewhere else because the 'stop' phase was falsely succeeding despite xend being dead.
Had the stop phase failed as it should have, recovery (and therefore, the VM running in two places) would have been prevented. This is the purpose of the 'failed' state.
Assuming the VM is alive with xend down is incorrect - in the best case, the VM is up but xend is dead (a partially broken case), and in the worst case, both xend and the VM are dead (very broken).
Rgmanager relies on xend (and/or libvirtd ; see bug 412911) for status information. So, a couple of suggestions:
(1) Don't kill xend. If it crashes, it's a bug in xend. File a bugzilla. Include logs from /var/log/xen.
(2) If you do have a reason to kill xend (maybe upgrading to the latest version?), freeze (clusvcadm -Z) the VM(s) first, perform your upgrade, then unfreeze (clusvcadm -U) the VM(s). Otherwise be prepared to manually restore the VM states as noted below.
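As a rough sketch of suggestion (2): the freeze, restart and unfreeze sequence could be wrapped as below. The `vm:<name>` service naming, the `restart_xend_frozen` helper and the `RUN` dry-run switch are assumptions made for illustration; by default the sketch only prints the commands it would run.

```shell
# RUN defaults to "echo", so the commands are only printed.
# Export RUN= (empty) to actually execute them on a cluster node.
RUN=${RUN-echo}

restart_xend_frozen() {
    vm="$1"                                  # guest name, e.g. machine4
    $RUN clusvcadm -Z "vm:$vm" || return 1   # freeze before touching xend
    $RUN service xend restart
    rc=$?
    $RUN clusvcadm -U "vm:$vm" || return 1   # unfreeze even if the restart failed
    return $rc
}

restart_xend_frozen machine4
```

Unfreezing happens regardless of the restart result so that the VM is not left frozen by accident; the helper still returns the restart's exit code.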
In the case of a crashed xend you can fix rgmanager's view of the VM state by simply:
- disable (clusvcadm -d) the VMs which are in the failed state. This will clear the failed state irrespective of the return value of the vm.sh script. Since xend is not running, we know nothing will actually happen to the VM
- restart xend
- enable the VM(s). Since the VM(s) is/are already running, rgmanager will just change its internal state but not actually do anything.
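The three recovery steps above could be scripted roughly as follows. This is a sketch under assumptions: the `vm:<name>` service naming, the `recover_failed_vm` helper and the `RUN` dry-run switch are illustrative and not taken from this report, and the commands are only printed unless `RUN` is set to an empty value.

```shell
# RUN defaults to "echo", so the commands are only printed.
# Export RUN= (empty) to actually execute them on a cluster node.
RUN=${RUN-echo}

recover_failed_vm() {
    vm="$1"                                  # guest name, e.g. machine4
    $RUN clusvcadm -d "vm:$vm" || return 1   # clears the failed state
    $RUN service xend restart  || return 1   # bring xend back
    $RUN clusvcadm -e "vm:$vm"               # rgmanager adopts the running guest
}

recover_failed_vm machine4
```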
Thanks Lon, I am convinced on the "failed" state now.
Some background information:
We do not just kill xend. "xenconsoled" tends to die once in a while (and there is a bugzilla on that, which will probably be solved in 5.7 or so...), so we restart the xend service to bring xenconsoled up again.
This did interfere with the cluster status checks, we experienced multiple startups, and I would not use the term "possible data corruption" on that, as it is quite likely. So I am quite concerned not to experience that mess again.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.