Bug 219385 - cman_tool leave returns ebusy during cluster shutdown
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Chris Feist
QA Contact: Cluster QE
Keywords: Reopened
Duplicates: 217449
Reported: 2006-12-12 16:17 EST by Kiersten (Kerri) Anderson
Modified: 2009-09-17 22:40 EDT (History)
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-09-28 17:14:26 EDT


Attachments
Patch to fix (747 bytes, patch)
2006-12-13 08:34 EST, Christine Caulfield
Description Kiersten (Kerri) Anderson 2006-12-12 16:17:09 EST
Description of problem:
When doing service cman stop on all nodes in the cluster, some of the nodes
report the following error message:
   
Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
[FAILED]


Version-Release number of selected component (if applicable):


How reproducible:
Happens on at least one node of the 23-node Xen cluster while running the tests.

Steps to Reproduce:
1. Run service cman stop in parallel on all nodes of a running cluster.
  
Actual results:


Expected results:


Additional info:

cman_tool -d leave does not provide any further information.  If you wait long
enough, then the command will succeed on the node.
Comment 1 Nate Straz 2006-12-12 16:47:12 EST
I am also seeing this on various clusters.  I was able to reproduce this with
whiplash (after adding a sleep after service cman stop). It appears to me that
openais is in the middle of a configuration change when the "cman_tool leave"
fails. 
Comment 2 Christine Caulfield 2006-12-13 04:04:44 EST
Sorry, maybe I wasn't clear. You need to add the -d to 'cman_tool join' so it
will debug log the whole daemon.
Comment 3 Christine Caulfield 2006-12-13 08:34:40 EST
Created attachment 143511
Patch to fix

ahem.

Try this trivial patch
Comment 4 Christine Caulfield 2006-12-13 08:42:07 EST
Stupid typo would cause cman to make the wrong shutdown decision if a daemon
disconnected whilst shutdown was in progress.

I've not committed this to RHEL50 as it's not marked as a blocker (is that
right?). But you can see it's a trivial and obvious fix.

HEAD:
Checking in cman/daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.56; previous revision: 1.55
done

RHEL5:
Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.1; previous revision: 1.55
done
Comment 5 Christine Caulfield 2006-12-13 09:21:11 EST
and now on RHEL50:

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.4.1; previous revision: 1.55
done
Comment 6 Kiersten (Kerri) Anderson 2006-12-13 10:44:22 EST
Still getting the problem after applying the patch.  Discussion from irc channel:
<pjc> it probably thinks there are stil fence domains active
* deepthot_ has quit (Quit: Leaving)
<pjc> so maybe it's fence_tool stop taking some time ?
<kanderso> 1166023930 start default 19 members 6 16 22 17 23 8 19 15 14 9 21 20
5 11 13 7 18 2 10 12 3 4 
<kanderso> 1166023930 do_recovery stop 18 start 19 finish 18
<kanderso> 1166023930 add node 1 to list 3
<kanderso> 1166023942 finish default 19
<kanderso> 1166023942 stop default
<kanderso> 1166023947 no to cman shutdown
<kanderso> 1166023947 terminate default
<dct> ah, yep, as I suspected -- terminate is happening after the shutdown callback
<dct> apparently the wait (-w) on fence_tool leave isn't waiting long enough
<dct> we'd probably be fine letting fenced reply with yes if all it needs is the
terminate
Comment 7 David Teigland 2006-12-13 14:07:24 EST
groupd's function that returns info for group status queries was
mistakenly setting the "member" status to 0 when a node was leaving.
This led fence_tool to believe that the local node was no longer
a member (i.e. had finished leaving) when in fact the leave wasn't
complete yet.

HEAD
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.56; previous revision: 1.55

RHEL5
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.51.2.5; previous revision: 1.51.2.4

RHEL50
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.51.4.5; previous revision: 1.51.4.4
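The effect of the membership-status bug described in this comment can be illustrated with a small model. This is only a sketch, not groupd or fence_tool code: the state names, the two flag functions, and the polling helper are all hypothetical, and the timeline is made up. It shows why a waiter that polls for "member == 0" returns too early when a leaving node is already reported as a non-member.

```python
from enum import Enum


class LocalState(Enum):
    JOINED = "joined"
    LEAVING = "leaving"  # leave requested, but not finished yet
    GONE = "gone"        # leave fully complete


def member_flag_buggy(state):
    # Models the bug: a node that is still leaving is reported as a non-member.
    return 1 if state is LocalState.JOINED else 0


def member_flag_fixed(state):
    # Models the fix: the node counts as a member until the leave completes.
    return 0 if state is LocalState.GONE else 1


def polls_until_leave_reported(member_flag, timeline):
    """Return how many status polls a waiter needs before it sees member == 0."""
    for polls, state in enumerate(timeline, start=1):
        if member_flag(state) == 0:
            return polls
    return None


# Hypothetical state of the local node at each successive poll.
timeline = [LocalState.JOINED, LocalState.LEAVING, LocalState.LEAVING, LocalState.GONE]

early = polls_until_leave_reported(member_flag_buggy, timeline)
correct = polls_until_leave_reported(member_flag_fixed, timeline)
print(early, correct)
```

In this model the buggy flag lets the waiter return while the node is still in the LEAVING state, which matches the sequence in comment 6 where "terminate default" arrives after cman has already asked for shutdown.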
Comment 8 David Teigland 2006-12-14 19:03:17 EST
*** Bug 217449 has been marked as a duplicate of this bug. ***
Comment 9 David Teigland 2006-12-18 12:30:58 EST
"Switch from CMAN_DISPATCH_ONE loop to CMAN_DISPATCH_ALL to resolve
 delayed cman shutdown callbacks."

This change was applied to groupd, fenced, dlm_controld and ccsd on
HEAD, RHEL5 and RHEL50.  It was the latest change we made to address
the busy error from cman_tool leave.  The problem was that sometimes
the delivery/dispatch of cman's shutdown callback to one of the daemons
was being delayed by 30 sec or more.  This would exceed cman's
5 sec timeout for getting a shutdown reply, so cman would assume a nak.
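The timeout behaviour described in this comment can be sketched as follows. This is a simplified model rather than cman's actual code: the function name, the dictionary layout, and the delay values are assumptions made for illustration. Only the 5 second timeout and the "late reply counts as a nak" rule come from the comment.

```python
SHUTDOWN_REPLY_TIMEOUT = 5.0  # seconds, as stated in the comment


def shutdown_allowed(replies, timeout=SHUTDOWN_REPLY_TIMEOUT):
    """Decide whether shutdown may proceed.

    replies maps a daemon name to (seconds until its reply arrives, answer).
    A reply that arrives after the timeout is treated as "no" (a nak).
    Returns (allowed, name of the first daemon that blocked shutdown or None).
    """
    for daemon, (delay, says_yes) in replies.items():
        if delay > timeout or not says_yes:
            return False, daemon
    return True, None


# Hypothetical reply timings: every daemon agrees, but one callback is delayed.
prompt = {"groupd": (0.2, True), "fenced": (0.5, True)}
delayed = {"groupd": (0.2, True), "fenced": (30.0, True)}

print(shutdown_allowed(prompt))
print(shutdown_allowed(delayed))
```

The second case is the interesting one: fenced would have answered yes, but because its reply arrives after the timeout the leave is refused, which the caller sees as "Device or resource busy".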
Comment 10 Steven Dake 2006-12-18 17:28:49 EST
The actual problem is that CMAN_DISPATCH_ONE doesn't work properly: it
dispatches all pending callbacks.  The previous code in these daemons was
designed to handle only one callback at a time, so when a single dispatch
delivered multiple callbacks, some of them would be lost.

On a side note, CMAN_DISPATCH_ONE should probably be changed to dispatch exactly
one callback.
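One possible reading of this failure mode is sketched below. It does not use libcman: FakeConnection, the two daemon classes, and the event names are hypothetical stand-ins. The model assumes a daemon that keeps only the most recent callback in a single slot and acts on it after the dispatch call returns, so an earlier callback delivered in the same dispatch is overwritten.

```python
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for a connection holding queued callbacks."""

    def __init__(self, events):
        self.pending = deque(events)

    def dispatch(self, callback):
        # Models the behaviour described in the comment: every pending
        # callback is delivered in one call, even if the caller expected one.
        delivered = 0
        while self.pending:
            callback(self.pending.popleft())
            delivered += 1
        return delivered


class SingleSlotDaemon:
    """Remembers only the latest callback and handles it after dispatch."""

    def __init__(self):
        self.last_event = None
        self.handled = []

    def on_event(self, event):
        self.last_event = event  # overwrites any earlier, unhandled event

    def poll(self, conn):
        if conn.dispatch(self.on_event):
            self.handled.append(self.last_event)


class QueueingDaemon:
    """Handles every callback that a dispatch call delivers."""

    def __init__(self):
        self.handled = []

    def poll(self, conn):
        conn.dispatch(self.handled.append)


events = ["shutdown_request", "config_change"]  # hypothetical event names

single = SingleSlotDaemon()
single.poll(FakeConnection(events))

queued = QueueingDaemon()
queued.poll(FakeConnection(events))

print(single.handled, queued.handled)
```

In this model the single-slot daemon never sees the shutdown request, so it never replies to it, while the queueing daemon handles both events.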
Comment 11 RHEL Product and Program Management 2007-02-07 20:41:52 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
Comment 12 Paul Kennedy 2007-09-27 18:03:29 EDT
Encountered problem when trying to restart cluster with Conga. Tried stopping
and restarting cluster manually via 'service cman stop' and 'service cman start'
and noticed the following message:

[root@tng3-1 ~]# service cman stop
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]

Tried 'service cman stop' again and it worked. 

After that, started cluster with 'service cman start' and it worked.

Can stop and start cluster with Conga; however, restarting with Conga does not
appear to work, possibly because of this problem.


Additional information:

[root@tng3-5 ~]# uname -a
Linux tng3-5 2.6.18-48.el5 #1 SMP Mon Sep 17 17:26:31 EDT 2007 i686 i686 i386
GNU/Linux

cman-2.0.70-1.el5
openais-0.80.3-5.el5
Comment 13 Steven Dake 2007-09-27 18:26:01 EDT
bug in comment #12 is a feature request.

Please open a new bz for that request.

Currently the only way to restart the cluster after service cman stop is to
reboot the node.  the higher level dlm and gfs cannot be stopped during runtime
and restarted without a reboot.  Prob assign to dave.
Comment 14 Christine Caulfield 2007-09-28 03:44:05 EDT
A clean "service cman stop/start" should be fine.

If cman won't shut down immediately, that's because there are services still
running. Maybe they are in the process of shutting down and need more time; that
seems to be the case if 'service cman stop' works almost straight away afterwards.

In which case this is an init script problem: it just needs to wait until the
services have really finished.
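The "wait until the services have really finished" idea can be sketched as a bounded retry loop. This is not the actual init script and not the fix that was shipped: leave_with_retry and its parameters are hypothetical, and the callable passed in merely stands in for running cman_tool leave and checking whether it succeeded.

```python
import time


def leave_with_retry(try_leave, attempts=5, delay=1.0, sleep=time.sleep):
    """Call try_leave() until it returns True or the attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if try_leave():
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the remaining services time to finish
    return None


# Stand-in for the leave command: reports busy twice, then succeeds.
results = iter([False, False, True])
used = leave_with_retry(lambda: next(results), sleep=lambda seconds: None)

# A service that never stops (for example one that was never asked to) still fails.
never = leave_with_retry(lambda: False, attempts=2, sleep=lambda seconds: None)

print(used, never)
```

A retry like this only helps when the services are already shutting down. As comment 16 shows, if a service such as rgmanager was never stopped, waiting longer does not change the outcome.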
Comment 15 David Teigland 2007-09-28 10:04:55 EDT
comment #13 is utter nonsense, please ignore it
Comment 16 Paul Kennedy 2007-09-28 17:14:26 EDT
Followed up on comment #14.
Determined that not all services had been stopped: rgmanager was still running. 
