Description of problem:

rgmanager uncleanly shuts down on cman exit, leaving the rgmanager dlm group around, making it impossible to finish shutting down rgmanager and cman:

[root@node2 systemtap]# cman_tool services
type             level name       id       state
fence            0     default    00010002 none
[2]
dlm              1     rgmanager  00070001 none
[1 2]
[root@node2 systemtap]# service cman stop
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 systemtap]# service rgmanager status
clurgmgrd dead but pid file exists
[root@node2 systemtap]# cman_tool services
type             level name       id       state
dlm              1     rgmanager  00070001 none
[1 2]

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-21.el5
The workaround for this issue is to check cman_tool services before stopping cman and first stop everything in that list. After that, cman stop works fine. A minimal sketch of the workaround follows below.
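Sketch of that workaround, assuming (as in the output above) that rgmanager is the only remaining lockspace holder; any other entries reported by cman_tool services would need to be stopped the same way first:

    cman_tool services      # list the groups still joined (fence domain, dlm lockspaces)
    service rgmanager stop  # stop rgmanager cleanly so its "rgmanager" dlm lockspace is released
    cman_tool services      # confirm no dlm lockspaces remain
    service cman stop       # cman_tool leave now succeeds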
Created attachment 549095 [details]
Patch to resolve this issue

Attached Lon's patch that should resolve this.
I tested this patch and I don't think it does exactly what we want:

[root@node2 ~]# clustat
Cluster Status for adrew-test @ Tue Jan 3 16:11:51 2012
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 node1.adrew.net                 1 Online
 node2.adrew.net                 2 Online, Local, rgmanager

 Service Name                 Owner (Last)         State
 ------- ----                 ----- ------         -----
 service:script-test          node2.adrew.net      started

[root@node2 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010002 none
[2]
dlm              1     rgmanager  00070001 none
[1 2]
[root@node2 ~]# service cman stop
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 ~]# clustat
msg_receive: Broken pipe
msg_receive_simple: Broken pipe
Cluster Status for adrew-test @ Tue Jan 3 16:12:13 2012
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 node1.adrew.net                 1 Online
 node2.adrew.net                 2 Online, Local

[root@node2 ~]# service cman stop
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 ~]# cman_tool services
type             level name       id       state
dlm              1     rgmanager  00070001 none
[1 2]

So the stop order when we stop cman is:
 - fence
 - cman
 - everything else

We successfully stop fence, but then cman stop fails. That leaves us in a similar (but new) bad position. I think we may need a patch in cman instead of rgmanager.
The risk to making rgmanager not shut down when cman asks (currently, it halts and exits uncleanly) is the stop ordering. Perhaps this is what you meant.

Today:
 - fenced exits
 - we ask cman to leave
 - rgmanager halts, allowing cman to leave
 - dlm stops cman from leaving (active lockspaces)

At this point, rgmanager is dead - it has halted services, so if the cluster node hangs, there is minimal risk.

With patch:
 - fenced exits
 - we ask cman to leave
 - rgmanager refuses

At this point, rgmanager is alive and services are running. This is problematic -- because 'fenced' has exited, the node will not be fenced if it hangs.
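Written out as an illustrative shell sequence (a paraphrase of the init script's stop path as described above, not a quote of the shipped initscript):

    # Today (illustrative paraphrase of the stop path):
    fence_tool leave   # fenced exits
    cman_tool leave    # rgmanager halts its services, but the leftover
                       # "rgmanager" dlm lockspace makes the leave fail

    # With the patch (same commands, different rgmanager behaviour):
    fence_tool leave   # fenced exits
    cman_tool leave    # rgmanager refuses to stop; services keep running
                       # on a node that can no longer be fenced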
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
It might be nice to additionally fix fence_tool leave to fail (as it does on RHEL6) when there are active lockspaces.
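As a rough sketch of what such a check could look like from the shell (a hypothetical wrapper, not the actual fence_tool implementation, reusing the cman_tool services output format shown above):

    # Hypothetical wrapper emulating the RHEL6 behaviour: refuse to leave the
    # fence domain while any dlm lockspace is still listed by cman_tool services.
    if cman_tool services | awk '$1 == "dlm"' | grep -q .; then
        echo "refusing fence_tool leave: active DLM lockspaces remain" >&2
        exit 1
    fi
    fence_tool leave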
Patch applied in RHEL59 cluster.git commit 55710722d15be8f2eafdae472086182f88b2a0d5
(In reply to comment #9)
> It might be nice to additionally fix fence_tool leave to fail (as it does on
> RHEL6) when there are active lockspaces.

Right, but we still need a fix in the stable32 branch and rhel6 because nothing stops a user from issuing cman_tool leave by hand. All daemons are expected to behave and respond correctly to the request, regardless of how that request came about.
*** Bug 697656 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0026.html