Bug 468691
| Field | Value |
|---|---|
| Summary | Virtual Services guest can start on 2 nodes at same time |
| Product | Red Hat Enterprise Linux 5 |
| Component | rgmanager |
| Version | 5.4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Target Milestone | rc |
| Reporter | Shane Bradley <sbradley> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | clasohm, cluster-maint, cmarthal, cward, edamato, grimme, jrooth, schlegel, tao |
| Doc Type | Bug Fix |
| Last Closed | 2009-09-02 11:03:51 UTC |
Description
Shane Bradley
2008-10-27 14:00:19 UTC
Created attachment 321613 [details]
rgmanager patch
I propose to change the "status" function of vm.sh instead of the "stop" function. rgmanager detects the problem when running the "status" check. If xend is down only for a short period of time (e.g. during a `service xend restart`) and rgmanager happens to run a status check in that window, the check fails, but the "stop" sequence run afterwards will probably succeed again. As a result, the VM is shut down without any real need, causing an unnecessary interruption of the services the VM provides. Of course this approach leaves a blind spot in rgmanager: it can no longer detect the status issue. The ideal solution would be for rgmanager to monitor xend as well and to know that a VM's service state cannot be determined while xend is down. rgmanager should then try to reactivate xend, and if that fails the node should fence itself.

Patch example:

```diff
--- vm.sh.rhel53	2009-02-03 15:09:32.000000000 +0100
+++ vm.sh	2009-02-03 15:20:52.000000000 +0100
@@ -455,6 +455,10 @@
 #
 status()
 {
+	# RSI added - gs 20090203
+	# start
+	xm info &> /dev/null || return 0
+	# end
 	xm list $OCF_RESKEY_name &> /dev/null
 	if [ $? -eq 0 ]; then
 		return 0
```

Can anybody please state the current status of this bug? It would be really great to have this one fixed, as hitting it leads to data corruption in the DomU. Isn't the vm.sh patch, fixing the stop behaviour, appropriate?

Patching 'status' to return success when it shouldn't will break migration detection. Setting the VM state to 'failed' when xend is dead is appropriate.

I do not agree. xend is only required to manage (create/destroy/shutdown/migrate) VMs, not to run them. xend may be down (or dead) while the existing VMs keep on running like a charm, and starting xend in that situation just brings back the management capabilities.

You're correct, the VM will keep running. The VM will also keep running when rgmanager marks it as 'failed' because xend is dead. This bug was filed because a VM ended up actually running on two hosts in the cluster, with potential data corruption. The VM continued to run on the machine where xend was dead, but rgmanager restarted the VM somewhere else because the 'stop' phase was falsely succeeding despite xend being dead. Had the stop phase failed as it should have, recovery (and therefore the VM running in two places) would have been prevented. This is the purpose of the 'failed' state. Assuming the VM is alive with xend down is incorrect: in the best case, the VM is up but xend is dead (a partially broken case), and in the worst case, both xend and the VM are dead (very broken). rgmanager relies on xend (and/or libvirtd; see bug 412911) for status information.

So, a couple of suggestions:

(1) Don't kill xend. If it crashes, it's a bug in xend. File a bugzilla and include logs from /var/log/xen.

(2) If you do have a reason to stop xend (maybe upgrading to the latest version?), freeze (clusvcadm -Z) the VM(s) first, perform your upgrade, then unfreeze (clusvcadm -U) the VM(s). Otherwise, be prepared to manually restore the VM states as noted below.

In the case of a crashed xend, you can fix rgmanager's view of the VM state as follows:

- Disable (clusvcadm -d) the VMs which are in the failed state. This clears the failed state irrespective of the return value of the vm.sh script; since xend is not running, nothing will actually happen to the VM.
- Restart xend.
- Enable (clusvcadm -e) the VM(s). Since the VM(s) is/are already running, rgmanager will just update its internal state and not actually do anything.
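For reference, the two procedures above might look like the following shell sketch. The service name `vm:guest1` and domain name `guest1` are placeholders, not names from this bug; note that a freeze only suppresses rgmanager's handling of the service, it does not touch the running VM.

```sh
# Sketch of the procedures described above; "vm:guest1"/"guest1" are
# placeholder names for the rgmanager VM service and Xen domain.

# (2) Planned xend restart or upgrade: freeze the VM service first so
# rgmanager performs no status/stop actions while xend is down.
clusvcadm -Z vm:guest1      # freeze the VM service
service xend restart        # xend (and xenconsoled) come back up
clusvcadm -U vm:guest1      # unfreeze; normal status checks resume

# Recovery after xend crashed and rgmanager marked the VM 'failed':
clusvcadm -d vm:guest1      # disable: clears the 'failed' state; with
                            # xend down, the VM itself is untouched
service xend restart        # bring xend back
clusvcadm -e vm:guest1      # enable: the VM is already running, so
                            # rgmanager only updates its internal state
xm list guest1              # sanity check that the domain is still up
```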
Thanks Lon, I am convinced about the 'failed' state now. Some background information: we do not just kill xend. xenconsoled tends to die once in a while (there is a bugzilla on that; it will probably be solved in 5.7 or so), so we restart the xend service to bring xenconsoled up again. This interfered with the cluster status checks and we experienced multiple startups of the same VM. I would not use the term "possible data corruption" for that, as corruption is quite likely, so I am quite concerned not to experience that mess again.

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results and, if available, update the Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html