Bug 769730

Summary: rgmanager uncleanly exits on cman shutdown
Product: Red Hat Enterprise Linux 5
Reporter: Adam Drew <adrew>
Component: rgmanager
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: low
Docs Contact:
Priority: medium
Version: 5.7
CC: ahecox, cluster-maint, djansa, edamato, fdinitto, jwest, mjuricek, mkelly
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rgmanager-2.0.52-34.el5
Doc Type: Bug Fix
Last Closed: 2013-01-08 07:05:13 UTC
Bug Depends On:    
Bug Blocks: 807971    
Attachments:
  Patch to resolve this issue (flags: none)

Description Adam Drew 2011-12-21 22:56:48 UTC
Description of problem:
rgmanager shuts down uncleanly when cman exits and leaves the rgmanager dlm group behind, which makes it impossible to finish shutting down rgmanager and cman:

[root@node2 systemtap]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 none        
[2]
dlm              1     rgmanager  00070001 none        
[1 2]
[root@node2 systemtap]# service cman stop
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 systemtap]# service rgmanager status
clurgmgrd dead but pid file exists
[root@node2 systemtap]# cman_tool services
type             level name       id       state       
dlm              1     rgmanager  00070001 none        
[1 2]

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-21.el5

Comment 1 Adam Drew 2011-12-21 22:59:43 UTC
The workaround for this issue is to check cman_tool services before stopping cman and first stop everything in that list. After that, service cman stop works fine.
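
As a rough illustration of that check, the sketch below parses text in the format shown by cman_tool services in the description and lists the groups that would still need to be stopped. The function names parse_services and blockers are hypothetical, and excluding the fence group is an assumption based on the transcript, where the cman init script stops fencing itself. It works on captured text only and does not call any cluster tool.

```python
# Sample text copied from the `cman_tool services` output in the description.
SAMPLE = """\
type             level name       id       state
fence            0     default    00010002 none
[2]
dlm              1     rgmanager  00070001 none
[1 2]
"""


def parse_services(output):
    """Parse `cman_tool services` text into a list of group dicts."""
    groups = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "type":
            continue  # skip blank lines and the column header
        if fields[0].startswith("["):
            # A member list such as "[1 2]" belongs to the preceding group.
            if groups:
                members = line.strip().strip("[]").split()
                groups[-1]["members"] = [int(m) for m in members]
            continue
        if len(fields) >= 5:
            groups.append({
                "type": fields[0],
                "level": fields[1],
                "name": fields[2],
                "id": fields[3],
                "state": fields[4],
                "members": [],
            })
    return groups


def blockers(groups):
    """Return groups to stop by hand before `service cman stop`.

    Assumption: the fence group is skipped because the cman init script
    stops fencing on its own, as the transcript in this report shows.
    """
    return [g for g in groups if g["type"] != "fence"]


if __name__ == "__main__":
    for group in blockers(parse_services(SAMPLE)):
        print("stop first:", group["type"], group["name"], group["members"])
```

With the sample above, the only remaining blocker is the dlm group named rgmanager, which matches what is left behind after the failed cman stop in the description.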

Comment 2 Adam Drew 2011-12-21 23:01:44 UTC
Created attachment 549095
Patch to resolve this issue

Attached Lon's patch that should resolve this.

Comment 4 Adam Drew 2012-01-03 21:14:38 UTC
I tested this patch and I don't think it does exactly what we want:

[root@node2 ~]# clustat
Cluster Status for adrew-test @ Tue Jan  3 16:11:51 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 node1.adrew.net                                                     1 Online
 node2.adrew.net                                                     2 Online, Local, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 service:script-test                                              node2.adrew.net                                                  started       
[root@node2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 none        
[2]
dlm              1     rgmanager  00070001 none        
[1 2]
[root@node2 ~]# service cman stop
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 ~]# clustat
msg_receive: Broken pipe
msg_receive_simple: Broken pipe
Cluster Status for adrew-test @ Tue Jan  3 16:12:13 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 node1.adrew.net                                                     1 Online
 node2.adrew.net                                                     2 Online, Local

[root@node2 ~]# service cman stop
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
[root@node2 ~]# cman_tool services
type             level name       id       state       
dlm              1     rgmanager  00070001 none        
[1 2]

So the stop order when we stop cman is:
fence
cman
everything else

We successfully stop fence, but then the cman stop fails. That leaves us in a similar (but new) bad position.

I think we may need a patch in cman instead of rgmanager.

Comment 6 Lon Hohberger 2012-01-17 16:18:15 UTC
The risk in making rgmanager not shut down when cman asks (currently, it halts and exits uncleanly) lies in the stop ordering.  Perhaps this is what you meant.

Today:

- fenced exits
- we ask cman to leave
  - rgmanager halts, allowing cman to leave
  - dlm stops cman from leaving (active lockspaces)

At this point, rgmanager is dead - it has halted services, so if the cluster node hangs, there is minimal risk.

With patch:

- fenced exits
- we ask cman to leave
  - rgmanager refuses

At this point, rgmanager is alive and services are running.  This is problematic -- because 'fenced' has exited, which means the node will not be fenced if it hangs.
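
The two orderings above can be compared with a toy model. This is only a sketch of the reasoning in this comment, not rgmanager or cman code: final_state and unprotected are hypothetical names, and the state dictionary is a simplification that tracks only whether fenced, rgmanager and the managed services are still running after the stop attempt.

```python
def final_state(rgmanager_refuses_leave):
    """Walk the `service cman stop` sequence described in this comment.

    rgmanager_refuses_leave=False models today's behaviour,
    rgmanager_refuses_leave=True models the behaviour with the patch.
    """
    state = {"fenced": True, "rgmanager": True, "services": True, "cman": True}

    # Step 1: fenced exits first in both cases.
    state["fenced"] = False

    # Step 2: cman is asked to leave.
    if not rgmanager_refuses_leave:
        # Today: rgmanager halts its services and exits; the dlm
        # lockspace then keeps cman from leaving.
        state["rgmanager"] = False
        state["services"] = False
    # With the patch: rgmanager refuses, so it and its services stay up.

    # In both cases cman itself is still a cluster member at the end.
    return state


def unprotected(state):
    """True when services are running but the node can no longer be fenced."""
    return state["services"] and not state["fenced"]


if __name__ == "__main__":
    print("today:      unprotected =", unprotected(final_state(False)))
    print("with patch: unprotected =", unprotected(final_state(True)))
```

In this model only the patched ordering ends with services running while fenced is gone, which is the problematic state described above.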

Comment 7 RHEL Program Management 2012-04-02 10:41:28 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 9 Ryan McCabe 2012-06-15 14:49:01 UTC
It might be nice to additionally fix fence_tool leave to fail (as it does on RHEL6) when there are active lockspaces.

Comment 10 Ryan McCabe 2012-07-11 16:42:25 UTC
Patch applied in RHEL59 cluster.git commit 55710722d15be8f2eafdae472086182f88b2a0d5

Comment 17 Fabio Massimo Di Nitto 2012-07-27 19:32:47 UTC
(In reply to comment #9)
> It might be nice to additionally fix fence_tool leave to fail (as it does on
> RHEL6) when there are active lockspaces.

Right, but we still need a fix in the stable32 branch and rhel6, because nothing stops a user from issuing cman_tool leave. All daemons are expected to behave and respond correctly to the request, regardless of how you got to that point/request.

Comment 22 Ryan McCabe 2012-10-09 16:27:20 UTC
*** Bug 697656 has been marked as a duplicate of this bug. ***

Comment 24 errata-xmlrpc 2013-01-08 07:05:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0026.html