Bug 692260

Summary: rgmanager dies and is not restartable
Product: Red Hat Enterprise Linux 6
Component: rgmanager
Version: 6.3
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Status: CLOSED DUPLICATE
Reporter: Richard Allen <ra>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, djansa
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2011-08-01 21:57:54 UTC

Attachments:
  example /var/log/messages file from the system in question
  rgmanager.log
  corosync.log
  dlm_controld.log

Description Richard Allen 2011-03-30 20:26:56 UTC
Description of problem:
rgmanager dies

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists 

Version-Release number of selected component (if applicable):
[root@syseng3-vm ~]# rpm -q rgmanager
rgmanager-3.0.12-10.el6.x86_64


How reproducible:
Start a RHEL6 HA cluster, wait a while, enjoy a dead rgmanager.

Steps to Reproduce:
1. Install RHEL6
2. Install HA addon
3. Configure Basic Cluster
4. See rgmanager die and have problems restarting
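
For reference, a minimal sketch of the commands behind steps 2-4 above (an illustration only; it assumes the stock RHEL6 High Availability yum group and init scripts, and that /etc/cluster/cluster.conf is already in place on every node -- the group name may differ depending on channel layout):

  # on each node
  yum groupinstall "High Availability"
  service cman start          # membership, fencing, dlm_controld, gfs_controld
  service rgmanager start     # resource/service manager
  chkconfig cman on
  chkconfig rgmanager on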
  
Actual results:
Dead rgmanager

Expected results:
Stable cluster

Additional info:

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local
 syseng2-vm                               2 Online
 syseng3-vm                               3 Online

There is a service running on node 2, but clustat shows no information about it.

[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11

The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107


The fix is always the same: 
[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ] 

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local, rgmanager
 syseng2-vm                               2 Online, rgmanager
 syseng3-vm                               3 Online

 Service Name                                     Owner (Last)                                     State
 ------- ----                                     ----- ------                                     -----
 service:TestDB                                   syseng2-vm                  started


Sometimes restarting rgmanager hangs and the node needs to be rebooted.

my cluster.conf:


<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
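
For reference, if this configuration is edited later (for example to raise config_version from 9), it can be validated and pushed to the other members with something along these lines; this is a sketch assuming the stock RHEL6 cluster tools:

  ccs_config_validate        # check cluster.conf against the schema
  cman_tool version -r 10    # distribute the new config version (here 9 -> 10)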

Comment 2 Lon Hohberger 2011-03-30 20:53:03 UTC
Ah -- I think you've tripped on this, which probably leads to a dlm interaction issue:

https://bugzilla.redhat.com/show_bug.cgi?id=688201

Full logs (even if you edit out hostnames or whatever) would be necessary to verify this.

Comment 3 Richard Allen 2011-03-30 21:03:47 UTC
Created attachment 488898 [details]
example /var/log/messages file from the system in question

This is a messages log from one of the nodes.  I was starting and stopping services and rebooting wildly to try to get things up :)

Comment 4 Richard Allen 2011-03-30 21:10:43 UTC
Created attachment 488899 [details]
rgmanager.log

rgmanager.log

Comment 5 Richard Allen 2011-03-30 21:11:14 UTC
Created attachment 488900 [details]
corosync.log

corosync.log

Comment 6 Richard Allen 2011-03-30 21:11:58 UTC
Created attachment 488901 [details]
dlm_controld.log

dlm_controld.log

Comment 7 Richard Allen 2011-03-30 21:12:32 UTC
All logs are uploaded as-is; only the domain names have been stripped.

Comment 8 Lon Hohberger 2011-03-31 14:31:52 UTC
Whatever you're hitting, it's far below rgmanager - rgmanager is merely a victim of the underlying problem you're having.  Here are a couple of suggestions:

* It looks like you need to add the following to /etc/sysconfig/cman:

  CMAN_QUORUM_TIMEOUT=45

(This will be the default in RHEL 6.1)
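
For illustration, the relevant fragment of /etc/sysconfig/cman would look something like this (a sketch only; keep whatever else is already in the file):

  # Wait up to 45 seconds for the cluster to become quorate at startup
  # instead of giving up immediately
  CMAN_QUORUM_TIMEOUT=45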

* This worries me:

Mar 16 21:24:34 test3-vm corosync[1633]:   [TOTEM ] Incrementing problem counter for seqid 1 iface 172.29.123.243 to [1 of 10]

You need to disable redundant ring; it is not well tested and entirely unsupported.  The nodelist fed to from dlm_controld to the kernel may or may not match up with the active ring.  That is, corosync might think a given node is "up" but the DLM may be unable to talk to it because.  When this happens, anything using the DLM (rgmanager included) will break.
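
As a quick check (a sketch; it assumes the stock corosync tools shipped with RHEL6), the ring/interface status corosync is actually using can be inspected on each node with:

  corosync-cfgtool -s    # prints the local node ID plus the address and status of each configured ring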

* You need to chkconfig --del corosync if it's not already disabled.
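
For example (a sketch; service names assume the stock RHEL6 init scripts, where the cman init script starts corosync itself):

  chkconfig --list corosync    # check whether the standalone corosync init script is enabled
  chkconfig --del corosync     # remove its runlevel links so only cman manages corosync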

* I have seen this before; it appears to be a bug in corosync triggered by network issues:

Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 37 
Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 39 37 
... (repeat several times)
Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 39 37 
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 3
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 2
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 1
Mar 17 15:01:54 syseng1-vm corosync[2619]:   [TOTEM ] A processor failed, forming new configuration.
Mar 17 15:02:00 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 2 
... (50 or 60 more)
Mar 17 15:02:19 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 2

Comment 9 Lon Hohberger 2011-03-31 14:34:14 UTC
(In reply to comment #8)

> You need to disable redundant ring; it is not well tested and entirely
> unsupported.  The nodelist fed to from dlm_controld to the kernel may or may
> not match up with the active ring.  That is, corosync might think a given node
> is "up" but the DLM may be unable to talk to it because.  When this happens,
> anything using the DLM (rgmanager included) will break.


Oops...

The nodelist given from dlm_controld to the kernel may or may not match up with the active ring.  That is, corosync might think a given node is "up" but the DLM may be unable to talk to it.  When this happens, anything using the DLM (rgmanager included) will break.

Comment 10 Lon Hohberger 2011-03-31 14:58:32 UTC
If rgmanager crashed (which is possible), there is a bug -- but the underlying issue(s) causing it are far worse.  If abrtd caught a core of rgmanager, go ahead and attach it here and I'll figure out how to fix it.

If there was no core file, you may need to add "DAEMON_COREFILE_LIMIT=unlimited" to /etc/sysconfig/rgmanager.
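
For example (a sketch; preserve anything already in the file, and note the limit only applies to rgmanager processes started after the change):

  # /etc/sysconfig/rgmanager
  # Let rgmanager dump an unlimited-size core file if it crashes
  DAEMON_COREFILE_LIMIT=unlimited

followed by a restart so the new limit takes effect:

  service rgmanager restart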

Comment 13 Lon Hohberger 2011-08-01 21:57:54 UTC

*** This bug has been marked as a duplicate of bug 725058 ***