Bug 521817

Summary: groupd dlm:clvmd stuck in LEAVE_STOP_WAIT
Product: Red Hat Enterprise Linux 5
Reporter: Eduardo Damato <edamato>
Component: cman
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 5.3
CC: burghardt, ccaulfie, cluster-maint, cward, edamato, jkortus, jwest, rom, schlegel, tao, tdunnon, tscherf
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: cman-2.0.115-23.el5.src.rpm
Doc Type: Bug Fix
Last Closed: 2010-03-30 08:39:07 UTC
Bug Blocks: 567316, 567317
Attachments: patch to work around (flags: none)

Description Eduardo Damato 2009-09-08 12:39:22 UTC
Description of problem:

The clvmd lockspace obtains groupd id ZERO, which is refused by dlm_controld. When this happens, all LVM commands hang forever. It was not possible to kill clvmd (even with SIGKILL; clvmd was in S state) and a full cluster reboot was needed. This is related to

https://bugzilla.redhat.com/show_bug.cgi?id=493207

which should prevent id ZERO from being issued.

Version-Release number of selected component (if applicable):

RHEL5.4:

cman-2.0.115-1.el5-x86_64
cmirror-1.1.39-2.el5-x86_64
kernel-2.6.18-164.el5-x86_64
kmod-cmirror-0.1.22-1.el5-x86_64
lvm2-2.02.46-8.el5-x86_64
lvm2-cluster-2.02.46-8.el5-x86_64
openais-0.80.6-8.el5-x86_64

How reproducible:

Not reproducible

Steps to Reproduce:

Not reproducible
  
Actual results:

clvmd lockspace gets assigned id ZERO and dlm_controld refuses it. 

node1:

+ group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 none
[1 2]
dlm              1     clvmd    00000000 JOIN_STOP_WAIT 1 100020001 1
[1 2]

node2:

+ group_tool -v
type             level name       id       state node id local_done
fence            0     default    00010001 none
[1 2]
dlm              1     clvmd      00000000 LEAVE_STOP_WAIT 1 100010002 1
[1 2]

This is related to the bug below, which is fixed in the version of cman being used:

https://bugzilla.redhat.com/show_bug.cgi?id=493207

When the situation above happens, clvmd is never able to fully join the lockspace and stays in that state forever.

All LVM commands on the system hang, and the following messages from dlm_controld are seen:

Sep  7 18:41:36 node1 dlm_controld[4293]: replace zero id for clvmd with 2746226290
Sep  7 18:41:37 node1 clvmd: Cluster LVM daemon started - connected to CMAN

Sep  7 18:46:12 node2 dlm_controld[4324]: replace zero id for clvmd with 2746226290
Sep  7 18:46:13 node2 clvmd: Cluster LVM daemon started - connected to CMAN
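
For reference, the "replace zero id" behaviour referenced above (BZ#493207) amounts to substituting a non-zero value derived from the lockspace name whenever dlm_controld is handed an id of 0. The following is only a minimal C sketch of that idea, not the actual dlm_controld code: the function names (name_to_id, fixup_zero_id) and the choice of an FNV-1a hash are assumptions made purely for illustration.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: hash a lockspace name into a non-zero 32-bit id.
   dlm_controld may use a different scheme; name_to_id and fixup_zero_id
   are hypothetical names. */
static uint32_t name_to_id(const char *name)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */

    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 16777619u;                 /* FNV-1a prime */
    }
    return h ? h : 1;                   /* never hand back zero */
}

static uint32_t fixup_zero_id(const char *ls_name, uint32_t id)
{
    if (id)
        return id;
    id = name_to_id(ls_name);
    printf("replace zero id for %s with %u\n", ls_name, (unsigned int)id);
    return id;
}

int main(void)
{
    /* e.g. the clvmd lockspace arriving from groupd with id 0 */
    uint32_t id = fixup_zero_id("clvmd", 0);

    printf("using lockspace id %08x\n", (unsigned int)id);
    return 0;
}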

Expected results:

clvmd is not assigned id ZERO, joins the lockspace with a valid id, and registers with dlm_controld without being refused.

Additional info:

Not sure whether this is a groupd bug or a clvmd bug; setting the component to cman, following BZ#493207.

Comment 7 David Teigland 2009-09-08 15:17:00 UTC
The cluster data from comment 4 and comment 5 show that this is a failure case that is not handled correctly by groupd.  It is unrelated to zero group ids.

- node 1: clvmd is killed
- node 1: clvmd calls dlm_release_lockspace to leave the lockspace
- node 1: dlm_controld calls into groupd to leave the group
- node 1: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd sends and receives stopped message for node 1's leave event
- node 2: groupd waiting for node 1's stopped msg to complete the leave event
- node 1: fails before sending the stopped message for the clvmd/dlm leave
- node 2: groupd sees the node 1 failure, but cannot "adjust" the leave event that is in progress from node 1, so the groupd group remains stuck trying to process the leave event to completion (waiting for the stopped message from node 1, which will never arrive because node 1 failed)

The fix for this will require adding code to groupd to detect this situation of a node failing while leave events from that node were in progress, and then "fixing up" those leave events so they can be completed.  There is code already to detect and handle this, but it is for a slightly different situation in which node 2 receives a cpg nodedown event for node 1.  In this bug node 1 has already left the cpg before failing, so there is no cpg event.
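
To make the shape of that fix concrete, here is a minimal, self-contained C sketch of the idea. It is not the groupd source: the structure and helper names below (struct group_event, mark_node_stopped, node_failed, finish_leave) are hypothetical, and the real daemon tracks far more state. The only point is the fix-up in node_failed(): when the node that owns an in-progress leave event dies, its never-to-arrive "stopped" message is treated as delivered so the remaining nodes can complete the event instead of waiting forever.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 16

/* Hypothetical, simplified view of a leave event in progress. */
struct group_event {
    int leaving_nodeid;                 /* node whose leave is being processed */
    int waiting_stopped[MAX_NODES];     /* nodes we still expect "stopped" from */
    int waiting_count;
};

static void mark_node_stopped(struct group_event *ev, int nodeid)
{
    for (int i = 0; i < ev->waiting_count; i++) {
        if (ev->waiting_stopped[i] == nodeid) {
            /* swap-remove the node from the waiting list */
            ev->waiting_stopped[i] = ev->waiting_stopped[--ev->waiting_count];
            return;
        }
    }
}

static bool all_nodes_stopped(const struct group_event *ev)
{
    return ev->waiting_count == 0;
}

static void finish_leave(struct group_event *ev)
{
    printf("leave of node %d completed\n", ev->leaving_nodeid);
}

/* Called when membership reports that a node has died.  Without this
   fix-up, a leave event owned by the dead node waits forever for a
   stopped message that can no longer arrive. */
static void node_failed(struct group_event *ev, int nodeid)
{
    if (ev->leaving_nodeid != nodeid)
        return;
    mark_node_stopped(ev, nodeid);      /* pretend its stopped msg arrived */
    if (all_nodes_stopped(ev))
        finish_leave(ev);
}

int main(void)
{
    /* node 1 is leaving; both node 1 and node 2 must report "stopped" */
    struct group_event ev = { .leaving_nodeid = 1,
                              .waiting_stopped = { 1, 2 },
                              .waiting_count = 2 };

    mark_node_stopped(&ev, 2);  /* node 2's stopped message is received */
    node_failed(&ev, 1);        /* node 1 dies before sending its own */
    return 0;
}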

Comment 13 David Teigland 2009-12-09 22:10:04 UTC
Created attachment 377314 [details]
patch to work around

This patch seems to work in my tests that approximate the situation being seen.

Comment 17 Gunther Schlegel 2010-01-25 13:25:59 UTC
Will there be a hotfix for RHEL 5.4.z?

This bug is nasty, it has just hit us for the second time, and it looks like there is no way to fix it without doing a clean start of the cluster?

Comment 22 Chris Ward 2010-02-11 10:11:03 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report your results back
here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 25 Jaroslav Kortus 2010-03-08 17:31:54 UTC
groupd is no longer stuck in LEAVE_STOP_WAIT or any other state. Tested with reproducer in http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b

Comment 27 errata-xmlrpc 2010-03-30 08:39:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

Comment 33 David Teigland 2010-06-02 20:48:10 UTC
<dct_> jkortus, in your test, all the groups (fence, dlm, gfs) had entered the LEAVE state.  the patch in question just automatically completes the leave for them when the node fails
<dct_> from the info I collected, it appears that the failed node has shut down "far enough" before it fails that it shouldn't need fencing or recovery... but it would be good to repeat and verify that with some more info