Bug 219385

Summary: cman_tool leave returns ebusy during cluster shutdown
Product: Red Hat Enterprise Linux 5
Reporter: Kiersten (Kerri) Anderson <kanderso>
Component: cman
Assignee: Chris Feist <cfeist>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.0
CC: cluster-maint, cmarthal, nstraz, rmccabe, teigland, yangcongjian, yyk
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-09-28 21:14:26 UTC
Attachments:
Description: Patch to fix
Flags: none

Description Kiersten (Kerri) Anderson 2006-12-12 21:17:09 UTC
Description of problem:
When running 'service cman stop' on all nodes in the cluster, some of the nodes
report the following error message:
   
Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
[FAILED]


Version-Release number of selected component (if applicable):


How reproducible:
Happens on at least one node of the 23-node Xen cluster while running the tests.

Steps to Reproduce:
1. Run 'service cman stop' in parallel on all nodes of a running cluster
  
Actual results:


Expected results:


Additional info:

cman_tool -d leave does not provide any further information.  If you wait long
enough, then the command will succeed on the node.

Comment 1 Nate Straz 2006-12-12 21:47:12 UTC
I am also seeing this on various clusters.  I was able to reproduce this with
whiplash (after adding a sleep after service cman stop). It appears to me that
openais is in the middle of a configuration change when the "cman_tool leave"
fails. 

Comment 2 Christine Caulfield 2006-12-13 09:04:44 UTC
Sorry, maybe I wasn't clear. You need to add the -d to 'cman_tool join' so it
will debug log the whole daemon.

Comment 3 Christine Caulfield 2006-12-13 13:34:40 UTC
Created attachment 143511 [details]
Patch to fix

ahem.

Try this trivial patch

Comment 4 Christine Caulfield 2006-12-13 13:42:07 UTC
A stupid typo caused cman to make the wrong shutdown decision if a daemon
disconnected while a shutdown was in progress.

I've not committed this to RHEL50 as it's not marked as a blocker (is that
right?). But you can see it's a trivial and obvious fix.

HEAD:
Checking in cman/daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.56; previous revision: 1.55
done

RHEL5:
Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.1; previous revision: 1.55
done


Comment 5 Christine Caulfield 2006-12-13 14:21:11 UTC
and now on RHEL50:

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.4.1; previous revision: 1.55
done


Comment 6 Kiersten (Kerri) Anderson 2006-12-13 15:44:22 UTC
Still getting the problem after applying the patch.  Discussion from the IRC channel:
<pjc> it probably thinks there are still fence domains active
<pjc> so maybe it's fence_tool stop taking some time ?
<kanderso> 1166023930 start default 19 members 6 16 22 17 23 8 19 15 14 9 21 20 5 11 13 7 18 2 10 12 3 4
<kanderso> 1166023930 do_recovery stop 18 start 19 finish 18
<kanderso> 1166023930 add node 1 to list 3
<kanderso> 1166023942 finish default 19
<kanderso> 1166023942 stop default
<kanderso> 1166023947 no to cman shutdown
<kanderso> 1166023947 terminate default
<dct> ah, yep, as I suspected -- terminate is happening after the shutdown callback
<dct> apparently the wait (-w) on fence_tool leave isn't waiting long enough
<dct> we'd probably be fine letting fenced reply with yes if all it needs is the terminate

Comment 7 David Teigland 2006-12-13 19:07:24 UTC
groupd's function that returns info for group status queries was
mistakenly setting the "member" status to 0 when a node was leaving.
This led fence_tool to believe that the local node was no longer
a member (i.e. had finished leaving) when in fact the leave wasn't
complete yet.
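
The effect described above can be modelled in a few lines. This is only a sketch of the logic, not the groupd or fence_tool code: the names NodeState, GroupStatus, query_status_buggy, query_status_fixed and leave_has_finished are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    JOINED = "joined"
    LEAVING = "leaving"  # leave requested, but not finished yet
    LEFT = "left"


@dataclass
class GroupStatus:
    group: str
    member: bool


def query_status_buggy(group: str, state: NodeState) -> GroupStatus:
    # Models the mistake: "member" drops to False as soon as the leave starts.
    return GroupStatus(group, member=(state is NodeState.JOINED))


def query_status_fixed(group: str, state: NodeState) -> GroupStatus:
    # "member" stays True until the leave has actually completed.
    return GroupStatus(group, member=(state is not NodeState.LEFT))


def leave_has_finished(status: GroupStatus) -> bool:
    # What a waiting caller (fence_tool in this report) concludes from a query.
    return not status.member
```

With the buggy query, a caller waiting for the leave to finish stops waiting while the node is still in the LEAVING state; with the fixed query it keeps waiting until the state is LEFT.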

HEAD
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.56; previous revision: 1.55

RHEL5
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.51.2.5; previous revision: 1.51.2.4

RHEL50
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v  <--  main.c
new revision: 1.51.4.5; previous revision: 1.51.4.4


Comment 8 David Teigland 2006-12-15 00:03:17 UTC
*** Bug 217449 has been marked as a duplicate of this bug. ***

Comment 9 David Teigland 2006-12-18 17:30:58 UTC
"Switch from CMAN_DISPATCH_ONE loop to CMAN_DISPATCH_ALL to resolve
 delayed cman shutdown callbacks."

This change was applied to groupd, fenced, dlm_controld and ccsd on
HEAD, RHEL5 and RHEL50.  It was the latest change we made to address
the busy error from cman_tool leave.  The problem was that the
delivery/dispatch of cman's shutdown callback to one of the daemons
was sometimes delayed by 30 seconds or more.  This would exceed cman's
5 sec timeout for getting a shutdown reply, so cman would assume a nak.
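
A rough model of that decision, assuming only what is stated above (a 5 second reply timeout, and a missing reply being treated as a nak). The function shutdown_decision and the shape of its input are hypothetical and are not taken from the cman source.

```python
SHUTDOWN_REPLY_TIMEOUT = 5.0  # seconds, the figure quoted in this comment


def shutdown_decision(replies, timeout=SHUTDOWN_REPLY_TIMEOUT):
    """Decide whether the node may leave.

    replies maps a daemon name to (seconds_until_reply, said_yes).
    A reply that arrives after the timeout counts as a nak.
    """
    for daemon, (delay, said_yes) in replies.items():
        if delay > timeout:
            return f"busy: no reply from {daemon} in time, treated as nak"
        if not said_yes:
            return f"busy: {daemon} replied no"
    return "leave"
```

In this model a daemon that would have answered yes still blocks the leave if its callback is delivered late, which matches the symptom of the command succeeding when it is retried a little later.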


Comment 10 Steven Dake 2006-12-18 22:28:49 UTC
The actual problem is that cman dispatch one doesn't work properly and
dispatches all pending callbacks.  The previous code in these daemons was
designed to handle only one callback at a time, so when one dispatch
delivered multiple callbacks, some of them would be lost.

On a side note, dispatch one should probably be changed to dispatch one callback.
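
To illustrate the failure mode described in this comment, here is a small model of a dispatch call that delivers every pending callback, together with a daemon that only has room for one. It is a sketch under those assumptions only; OneSlotDaemon, QueueingDaemon and dispatch are hypothetical names and do not correspond to the libcman API.

```python
from collections import deque


class OneSlotDaemon:
    """Written as if each dispatch delivers exactly one callback."""

    def __init__(self):
        self.pending = None
        self.handled = []

    def on_callback(self, event):
        self.pending = event  # a second callback overwrites the first

    def process(self):
        if self.pending is not None:
            self.handled.append(self.pending)
            self.pending = None


class QueueingDaemon(OneSlotDaemon):
    """Copes with any number of callbacks per dispatch."""

    def __init__(self):
        super().__init__()
        self.queue = deque()

    def on_callback(self, event):
        self.queue.append(event)

    def process(self):
        while self.queue:
            self.handled.append(self.queue.popleft())


def dispatch(pending_events, daemon):
    # Delivers all pending callbacks in one call, then lets the daemon run.
    while pending_events:
        daemon.on_callback(pending_events.popleft())
    daemon.process()
```

If a shutdown request and a state change are both pending, the one-slot daemon only ever sees the last one delivered, so the shutdown request goes unanswered.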

Comment 11 RHEL Program Management 2007-02-08 01:41:52 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.


Comment 12 Paul Kennedy 2007-09-27 22:03:29 UTC
Encountered this problem when trying to restart a cluster with Conga. Tried
stopping and restarting the cluster manually via 'service cman stop' and
'service cman start' and noticed the following message:

[root@tng3-1 ~]# service cman stop
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]

Tried 'service cman stop' again and it worked. 

After that, started cluster with 'service cman start' and it worked.

Can stop and start the cluster with Conga; however, restarting with Conga does
not appear to work, possibly because of this problem.


Additional information:

[root@tng3-5 ~]# uname -a
Linux tng3-5 2.6.18-48.el5 #1 SMP Mon Sep 17 17:26:31 EDT 2007 i686 i686 i386
GNU/Linux

cman-2.0.70-1.el5
openais-0.80.3-5.el5

Comment 13 Steven Dake 2007-09-27 22:26:01 UTC
The bug in comment #12 is a feature request.

Please open a new bz for that request.

Currently the only way to restart the cluster after service cman stop is to
reboot the node.  The higher-level dlm and gfs cannot be stopped and restarted
at runtime without a reboot.  Probably assign to Dave.

Comment 14 Christine Caulfield 2007-09-28 07:44:05 UTC
A clean "service cman stop/start" should be fine.

If cman won't shut down immediately, that's because there are services still
running. Maybe they are in the process of shutting down and need more time; that
seems to be the case if 'service cman stop' works almost straight away afterwards.

In which case this is an init script problem: it just needs to wait until the
services have really finished.
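
One way to picture "wait until the services have really finished" is a bounded retry around the leave step. The helper below is a generic sketch and not the actual init script; retry_until and its parameters are hypothetical, and the action passed in is assumed to return True once the leave succeeds.

```python
import time


def retry_until(action, attempts=10, delay=1.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded, or 0 if none did.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the remaining services time to finish
    return 0
```

In a real script the action would wrap the 'cman_tool leave' call and report whether it exited successfully; the attempt count and delay here are arbitrary illustrative values.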

Comment 15 David Teigland 2007-09-28 14:04:55 UTC
comment #13 is utter nonsense, please ignore it


Comment 16 Paul Kennedy 2007-09-28 21:14:26 UTC
Followed up on comment #14.
Determined that not all services had been stopped: rgmanager was still running.