Bug 521817
Summary: groupd dlm:clvmd stuck in LEAVE_STOP_WAIT

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat Enterprise Linux 5 | Reporter | Eduardo Damato <edamato> |
| Component | cman | Assignee | David Teigland <teigland> |
| Status | CLOSED ERRATA | QA Contact | Cluster QE <mspqa-list> |
| Severity | urgent | Docs Contact | |
| Priority | urgent | | |
| Version | 5.3 | CC | burghardt, ccaulfie, cluster-maint, cward, edamato, jkortus, jwest, rom, schlegel, tao, tdunnon, tscherf |
| Target Milestone | rc | Keywords | ZStream |
| Target Release | --- | | |
| Hardware | All | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | cman-2.0.115-23.el5.src.rpm | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2010-03-30 08:39:07 UTC | Type | --- |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 567316, 567317 | | |
| Attachments | | | |
Description
Eduardo Damato
2009-09-08 12:39:22 UTC
The cluster data from comment 4 and comment 5 show that this is a failure case that is not handled correctly by groupd. It is unrelated to zero group ids.

- node 1: clvmd is killed
- node 1: clvmd calls dlm_release_lockspace to leave the lockspace
- node 1: dlm_controld calls into groupd to leave the group
- node 1: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd sends and receives a stopped message for node 1's leave event
- node 2: groupd waits for node 1's stopped message to complete the leave event
- node 1: fails before it sends the stopped message for the clvmd/dlm leave
- node 2: groupd sees node 1's failure, but cannot "adjust" the leave event that is in progress from node 1, so the groupd group remains stuck trying to process the leave event to completion (waiting for the stopped message from node 1 that will never arrive, because node 1 failed)

The fix will require adding code to groupd to detect this situation of a node failing while leave events from that node are in progress, and then "fixing up" those leave events so they can be completed. Code already exists to detect and handle a slightly different situation, in which node 2 receives a cpg nodedown event for node 1. In this bug, node 1 has already left the cpg before failing, so there is no cpg event.

Created attachment 377314 [details]
patch to work around

This patch seems to work in my tests, which approximate the situation being seen.

Pushed to the RHEL55 branch: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b

Will there be a hotfix for RHEL 5.4.z? This bug is nasty; it has just hit us for the second time, and it looks like there is no way to fix it without doing a clean start of the cluster.

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug to NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug for each request and escalate through your support representative.

groupd is no longer stuck in LEAVE_STOP_WAIT or any other state. Tested with the reproducer in http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

<dct_> jkortus, in your test, all the groups (fence, dlm, gfs) had entered the LEAVE state. The patch in question just automatically completes the leave for them when the node fails.
<dct_> from the info I collected, it appears that the failed node had shut down "far enough" before it failed that it shouldn't need to be fenced or recovered... but it would be good to repeat and verify that with some more info.