Description of problem:

The clvmd lockspace obtains group id ZERO, which is refused by dlm_controld. When this happens, all lvm commands hang forever. It is not possible to kill clvmd (even with SIGKILL; clvmd was in S state) and a full cluster reboot was needed. This is related to https://bugzilla.redhat.com/show_bug.cgi?id=493207, which should prevent id ZERO from being issued.

Version-Release number of selected component (if applicable):

RHEL5.4:
cman-2.0.115-1.el5-x86_64
cmirror-1.1.39-2.el5-x86_64
kernel-2.6.18-164.el5-x86_64
kmod-cmirror-0.1.22-1.el5-x86_64
lvm2-2.02.46-8.el5-x86_64
lvm2-cluster-2.02.46-8.el5-x86_64
openais-0.80.6-8.el5-x86_64

How reproducible:
Not reproducible

Steps to Reproduce:
Not reproducible

Actual results:

The clvmd lockspace gets assigned id ZERO and dlm_controld refuses it.

node1:
+ group_tool -v
type   level name    id       state          node id        local_done
fence  0     default 00010001 none
[1 2]
dlm    1     clvmd   00000000 JOIN_STOP_WAIT 1 100020001 1
[1 2]

node2:
+ group_tool -v
type   level name    id       state           node id        local_done
fence  0     default 00010001 none
[1 2]
dlm    1     clvmd   00000000 LEAVE_STOP_WAIT 1 100010002 1
[1 2]

This is related to the bug below, which is corrected in the version of cman being used:
https://bugzilla.redhat.com/show_bug.cgi?id=493207

When the situation above happens, clvmd is never able to fully join the lockspace and it stays in the state above forever. All lvm commands on the system hang, and the following messages from dlm_controld are seen:

Sep 7 18:41:36 node1 dlm_controld[4293]: replace zero id for clvmd with 2746226290
Sep 7 18:41:37 node1 clvmd: Cluster LVM daemon started - connected to CMAN
Sep 7 18:46:12 node2 dlm_controld[4324]: replace zero id for clvmd with 2746226290
Sep 7 18:46:13 node2 clvmd: Cluster LVM daemon started - connected to CMAN

Expected results:

clvmd is not assigned id ZERO; it joins with a correct id and fully contacts dlm_controld without being refused.
Additional info:

Not sure whether this is a groupd bug or a clvmd bug; setting the component to cman, following BZ#493207.
The cluster data from comment 4 and comment 5 show that this is a failure case that is not handled correctly by groupd. It is unrelated to zero group ids.

- node 1: clvmd is killed
- node 1: clvmd calls dlm_release_lockspace to leave the lockspace
- node 1: dlm_controld calls into groupd to leave the group
- node 1: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd sends and receives the stopped message for node 1's leave event
- node 2: groupd waits for node 1's stopped message to complete the leave event
- node 1: fails before it sends the stopped message for the clvmd/dlm leave
- node 2: groupd sees the node 1 failure, but cannot "adjust" the leave event that is in progress from node 1, so the groupd group remains stuck trying to process the leave event to completion (waiting for the stopped message from node 1 that will never arrive because node 1 failed)

The fix will require adding code to groupd to detect this situation of a node failing while leave events from that node are in progress, and then "fixing up" those leave events so they can be completed. Code to detect and handle this already exists, but it covers a slightly different situation in which node 2 receives a cpg nodedown event for node 1. In this bug, node 1 has already left the cpg before failing, so there is no cpg event.
Created attachment 377314 [details]
patch to work around

This patch seems to work in my tests, which approximate the situation being seen.
Pushed to the RHEL55 branch:
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b
Will there be a hotfix for RHEL 5.4.z? This bug is nasty; it has just hit us for the second time, and it looks like there is no way to fix it without doing a clean restart of the cluster.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug to NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug for each request and escalate through your support representative.
groupd is no longer stuck in LEAVE_STOP_WAIT or any other state. Tested with the reproducer in
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
<dct_> jkortus, in your test, all the groups (fence, dlm, gfs) had entered the LEAVE state. the patch in question just automatically completes the leave for them when the node fails
<dct_> from the info I collected, it appears that the failed node had shut down "far enough" before it failed that it shouldn't need fencing or recovery... but it would be good to repeat and verify that with some more info