Bug 219385
Summary: cman_tool leave returns ebusy during cluster shutdown
Product: Red Hat Enterprise Linux 5
Reporter: Kiersten (Kerri) Anderson <kanderso>
Component: cman
Assignee: Chris Feist <cfeist>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.0
CC: cluster-maint, cmarthal, nstraz, rmccabe, teigland, yangcongjian, yyk
Keywords: Reopened
Hardware: All
OS: Linux
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-09-28 21:14:26 UTC
Description
Kiersten (Kerri) Anderson
2006-12-12 21:17:09 UTC
I am also seeing this on various clusters. I was able to reproduce this with whiplash (after adding a sleep after service cman stop). It appears to me that openais is in the middle of a configuration change when the "cman_tool leave" fails.

Sorry, maybe I wasn't clear. You need to add the -d to 'cman_tool join' so it will debug log the whole daemon.

Created attachment 143511
Patch to fix
ahem.
Try this trivial patch
Stupid typo would cause cman to make the wrong shutdown decision if a daemon disconnected whilst shutdown was in progress. I've not committed this to RHEL50 as it's not marked as a blocker (is that right?). But you can see it's a trivial and obvious fix.

HEAD:
Checking in cman/daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 1.56; previous revision: 1.55
done

RHEL5:
Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 1.55.2.1; previous revision: 1.55
done

and now on RHEL50:
Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 1.55.4.1; previous revision: 1.55
done

Still getting the problem after applying the patch. Discussion from irc channel:

<pjc> it probably thinks there are stil fence domains active
* deepthot_ has quit (Quit: Leaving)
<pjc> so maybe it's fence_tool stop taking some time ?
<kanderso> 1166023930 start default 19 members 6 16 22 17 23 8 19 15 14 9 21 20 5 11 13 7 18 2 10 12 3 4
<kanderso> 1166023930 do_recovery stop 18 start 19 finish 18
<kanderso> 1166023930 add node 1 to list 3
<kanderso> 1166023942 finish default 19
<kanderso> 1166023942 stop default
<kanderso> 1166023947 no to cman shutdown
<kanderso> 1166023947 terminate default
<dct> ah, yep, as I suspected -- terminate is happening after the shutdown callback
<dct> apparently the wait (-w) on fence_tool leave isn't waiting long enough
<dct> we'd probably be fine letting fenced reply with yes if all it needs is the terminate

groupd's function that returns info for group status queries was mistakenly setting the "member" status to 0 when a node was leaving. This led fence_tool to believe that the local node was no longer a member (i.e. had finished leaving) when in fact the leave wasn't complete yet.
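To illustrate the groupd status bug described above, here is a minimal sketch. It is not the groupd code: the `State` enum, the `Group` class and all three functions are hypothetical names made up for this example. It only shows why reporting "member" as 0 while a leave is still in progress makes a waiting client conclude too early that the leave has finished.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Hypothetical lifecycle of the local node within one group."""
    JOINED = "joined"
    LEAVING = "leaving"  # leave requested, terminate not processed yet
    LEFT = "left"


@dataclass
class Group:
    name: str
    state: State


def status_buggy(group: Group) -> dict:
    # Buggy behaviour: "member" drops to 0 as soon as the leave starts.
    return {"name": group.name,
            "member": 1 if group.state is State.JOINED else 0}


def status_fixed(group: Group) -> dict:
    # Fixed behaviour: still a member until the leave has fully completed.
    return {"name": group.name,
            "member": 0 if group.state is State.LEFT else 1}


def leave_looks_complete(status: dict) -> bool:
    # What a client waiting for the leave would conclude from a status reply.
    return status["member"] == 0
```

With a group in the `LEAVING` state, `leave_looks_complete(status_buggy(group))` is true, so the waiter stops waiting before terminate has happened, while the fixed variant keeps it waiting until the state is `LEFT`.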
HEAD
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v <-- main.c
new revision: 1.56; previous revision: 1.55

RHEL5
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v <-- main.c
new revision: 1.51.2.5; previous revision: 1.51.2.4

RHEL50
Checking in main.c;
/cvs/cluster/cluster/group/daemon/main.c,v <-- main.c
new revision: 1.51.4.5; previous revision: 1.51.4.4

*** Bug 217449 has been marked as a duplicate of this bug. ***

"Switch from CMAN_DISPATCH_ONE loop to CMAN_DISPATCH_ALL to resolve delayed cman shutdown callbacks."

This change was applied to groupd, fenced, dlm_controld and ccsd on HEAD, RHEL5 and RHEL50. It was the latest change we made to address the busy error from cman_tool leave. The problem was that sometimes the delivery/dispatch of cman's shutdown callback to one of the daemons was being delayed by 30 seconds or more. This would exceed cman's 5 sec timeout for getting a shutdown reply, so cman would assume a nak.

The actual problem is that cman dispatch one doesn't work properly and dispatches all pending callbacks. The previous code in these daemons was designed to handle only one callback at a time, so multiple callbacks in one dispatch would be lost. On a side note, dispatch one should probably be changed to dispatch one callback.

A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.

Encountered problem when trying to restart cluster with Conga. Tried stopping and restarting cluster manually via 'service cman stop' and 'service cman start' and noticed the following message:

[root@tng3-1 ~]# service cman stop
Stopping cluster:
Stopping fencing... done
Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
[FAILED]

Tried 'service cman stop' again and it worked.
After that, started cluster with 'service cman start' and it worked. Can stop and start cluster with Conga; however, restarting with Conga does not appear to work, possibly because of this problem.

Additional information:

[root@tng3-5 ~]# uname -a
Linux tng3-5 2.6.18-48.el5 #1 SMP Mon Sep 17 17:26:31 EDT 2007 i686 i686 i386 GNU/Linux

cman-2.0.70-1.el5
openais-0.80.3-5.el5

bug in comment #12 is a feature request. Please open a new bz for that request. Currently the only way to restart the cluster after service cman stop is to reboot the node. The higher level dlm and gfs cannot be stopped during runtime and restarted without a reboot.

Prob assign to dave.

A clean "service cman stop/start" should be fine. If cman won't shut down immediately that's because there are services still running. Maybe they are in the process of shutting down and need more time; that seems to be the case if 'service cman stop' works almost straight away afterwards. In which case this is an init script problem: it just needs to wait until the services have really finished.

comment #13 is utter nonsense, please ignore it

Followed up on comment #14. Determined that not all services had been stopped: rgmanager was still running.
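The suggestion above that the init script "just needs to wait until the services have really finished" could look roughly like the sketch below. This is an illustration only, not the actual init script. The helper names, the attempt count and the delay are made up, and it assumes 'cman_tool leave' exits with status 0 on success. It will not help when a service such as rgmanager is still running and never stops on its own.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_leave(cmd: Sequence[str]) -> int:
    """Run the leave command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def leave_with_retry(
    cmd: Sequence[str] = ("cman_tool", "leave"),
    attempts: int = 5,        # hypothetical limit
    delay: float = 2.0,       # hypothetical pause in seconds between tries
    run: Callable[[Sequence[str]], int] = run_leave,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry the leave until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:  # assumption: exit status 0 means the leave worked
            return True
        if attempt < attempts:
            # Give the remaining services time to finish shutting down.
            sleep(delay)
    return False
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a cluster, for example `leave_with_retry(run=lambda cmd: 0)`.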