Description of problem:

The clvmd lockspace obtains group id ZERO, which is refused by dlm_controld. When this happens, all lvm commands hang forever. It is not possible to kill clvmd (even with SIGKILL; clvmd was in S state) and a full cluster reboot was needed. This is related to https://bugzilla.redhat.com/show_bug.cgi?id=493207, which should prevent id ZERO from being issued.

Version-Release number of selected component (if applicable):

RHEL5.4:
cman-2.0.115-1.el5-x86_64
cmirror-1.1.39-2.el5-x86_64
kernel-2.6.18-164.el5-x86_64
kmod-cmirror-0.1.22-1.el5-x86_64
lvm2-2.02.46-8.el5-x86_64
lvm2-cluster-2.02.46-8.el5-x86_64
openais-0.80.6-8.el5-x86_64

How reproducible:
Not reproducible

Steps to Reproduce:
Not reproducible

Actual results:

The clvmd lockspace gets assigned id ZERO and dlm_controld refuses it.

node1:
+ group_tool -v
type   level name    id       state          node id        local_done
fence  0     default 00010001 none
[1 2]
dlm    1     clvmd   00000000 JOIN_STOP_WAIT 1 100020001 1
[1 2]

node2:
+ group_tool -v
type   level name    id       state           node id        local_done
fence  0     default 00010001 none
[1 2]
dlm    1     clvmd   00000000 LEAVE_STOP_WAIT 1 100010002 1
[1 2]

This is related to the bug below, which is corrected in the version of cman being used:
https://bugzilla.redhat.com/show_bug.cgi?id=493207

When the situation above happens, clvmd is never able to fully join the lockspace and it stays in the state above forever. All lvm commands on the system hang, and the following messages from dlm_controld are seen:

Sep 7 18:41:36 node1 dlm_controld[4293]: replace zero id for clvmd with 2746226290
Sep 7 18:41:37 node1 clvmd: Cluster LVM daemon started - connected to CMAN
Sep 7 18:46:12 node2 dlm_controld[4324]: replace zero id for clvmd with 2746226290
Sep 7 18:46:13 node2 clvmd: Cluster LVM daemon started - connected to CMAN

Expected results:

clvmd is not assigned id ZERO; it joins with a correct id and fully contacts dlm_controld without being refused.
Additional info:

Not sure whether this is a groupd bug or a clvmd bug; setting the component to cman, following BZ#493207.
The cluster data from comment 4 and comment 5 show that this is a failure case that is not handled correctly by groupd. It is unrelated to zero group ids.

- node 1: clvmd is killed
- node 1: clvmd calls dlm_release_lockspace to leave the lockspace
- node 1: dlm_controld calls into groupd to leave the group
- node 1: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd begins processing the leave event for node 1's clvmd lockspace
- node 2: groupd sends and receives the stopped message for node 1's leave event
- node 2: groupd waits for node 1's stopped message to complete the leave event
- node 1: fails before it sends the stopped message for the clvmd/dlm leave
- node 2: groupd sees the node 1 failure, but cannot "adjust" the leave event that is in progress from node 1, so the groupd group remains stuck trying to process the leave event to completion (waiting for the stopped message from node 1 that will never arrive because node 1 failed)

The fix will require adding code to groupd to detect this situation of a node failing while leave events from that node are in progress, and then "fixing up" those leave events so they can be completed. Code to detect and handle this already exists, but it covers a slightly different situation in which node 2 receives a cpg nodedown event for node 1. In this bug, node 1 has already left the cpg before failing, so there is no cpg event.
Created attachment 377314 [details]
patch to work around

This patch seems to work in my tests, which approximate the situation being seen.
Pushed to the RHEL55 branch:
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b
Will there be a hotfix for RHEL 5.4.z? This bug is nasty; it has just hit us for the second time, and it looks like there is no way to fix it without doing a clean restart of the cluster.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug to NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug for each request and escalate through your support representative.
groupd is no longer stuck in LEAVE_STOP_WAIT or any other state. Tested with the reproducer in
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=d360c0537aa734205e49939de92c763696ef477b
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
<dct_> jkortus, in your test, all the groups (fence, dlm, gfs) had entered the LEAVE state. the patch in question just automatically completes the leave for them when the node fails
<dct_> from the info I collected, it appears that the failed node had shut down "far enough" before it failed that it shouldn't need fencing or recovery... but it would be good to repeat and verify that with some more info