
Bug 692260

Summary: rgmanager dies and is not restartable
Product: Red Hat Enterprise Linux 6
Reporter: Richard Allen <ra>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED DUPLICATE
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Docs Contact:
Version: 6.3
CC: cluster-maint, djansa
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-08-01 21:57:54 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (no flags set):
  example /var/log/messages file from the system in question
  rgmanager.log
  corosync.log
  dlm_controld.log

Description Richard Allen 2011-03-30 20:26:56 UTC
Description of problem:
rgmanager dies

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists 

Version-Release number of selected component (if applicable):
[root@syseng3-vm ~]# rpm -q rgmanager
rgmanager-3.0.12-10.el6.x86_64


How reproducible:
Start a RHEL 6 HA cluster, wait a while, and enjoy a dead rgmanager.

Steps to Reproduce:
1. Install RHEL6
2. Install HA addon
3. Configure Basic Cluster
4. See rgmanager die and have problems restarting
  
Actual results:
Dead rgmanager

Expected results:
Stable cluster

Additional info:

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local
 syseng2-vm                               2 Online
 syseng3-vm                               3 Online

There is a service running on node 2, but clustat shows no information about it.

[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11

The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107


The fix is always the same: 
[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ] 

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local, rgmanager
 syseng2-vm                               2 Online, rgmanager
 syseng3-vm                               3 Online

 Service Name                                     Owner (Last)                                     State
 ------- ----                                     ----- ------                                     -----
 service:TestDB                                   syseng2-vm                  started


Sometimes restarting rgmanager hangs and the node needs to be rebooted.

my cluster.conf:


<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
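
As a rough sanity check of a configuration like this, here is a sketch only, assuming the stock RHEL 6 cman tools are installed on every node (the new config_version number shown is hypothetical):

  # Validate /etc/cluster/cluster.conf against the cluster schema:
  ccs_config_validate
  # After bumping config_version in the file (e.g. to 10), propagate it cluster-wide:
  cman_tool version -r 10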

Comment 2 Lon Hohberger 2011-03-30 20:53:03 UTC
Ah -- I think you've tripped on this, which probably leads to a dlm interaction issue:

https://bugzilla.redhat.com/show_bug.cgi?id=688201

Full logs (even if you edit out hostnames or whatever) would be necessary to verify this.

Comment 3 Richard Allen 2011-03-30 21:03:47 UTC
Created attachment 488898 [details]
example /var/log/messages file from the system in question

This is a messages log from one of the nodes.  I was starting and stopping services and rebooting wildly to try to get the stuff up :)

Comment 4 Richard Allen 2011-03-30 21:10:43 UTC
Created attachment 488899 [details]
rgmanager.log

rgmanager.log

Comment 5 Richard Allen 2011-03-30 21:11:14 UTC
Created attachment 488900 [details]
corosync.log

corosync.log

Comment 6 Richard Allen 2011-03-30 21:11:58 UTC
Created attachment 488901 [details]
dlm_controld.log

dlm_controld.log

Comment 7 Richard Allen 2011-03-30 21:12:32 UTC
All logs uploaded as-is; only domain names have been stripped.

Comment 8 Lon Hohberger 2011-03-31 14:31:52 UTC
Whatever you're hitting, it's far below rgmanager - rgmanager is merely a victim of the underlying problem you're having.  Here are a couple of suggestions:

* It looks like you need to add the following to /etc/sysconfig/cman:

  CMAN_QUORUM_TIMEOUT=45

(This will be the default in RHEL 6.1)

* This worries me:

Mar 16 21:24:34 test3-vm corosync[1633]:   [TOTEM ] Incrementing problem counter for seqid 1 iface 172.29.123.243 to [1 of 10]

You need to disable redundant ring; it is not well tested and entirely unsupported.  The nodelist fed to from dlm_controld to the kernel may or may not match up with the active ring.  That is, corosync might think a given node is "up" but the DLM may be unable to talk to it because.  When this happens, anything using the DLM (rgmanager included) will break.

* You need to chkconfig --del corosync if it's not already disabled.

* I have seen this before; it appears to be a bug in corosync triggered by network issues:

Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 37 
Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 39 37 
... (repeat several times)
Mar 17 15:01:44 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 39 37 
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 3
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 2
Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 1
Mar 17 15:01:54 syseng1-vm corosync[2619]:   [TOTEM ] A processor failed, forming new configuration.
Mar 17 15:02:00 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 2 
... (50 or 60 more)
Mar 17 15:02:19 syseng1-vm corosync[2619]:   [TOTEM ] Retransmit List: 2
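
Putting the first and third suggestions above together, a minimal sketch (file paths and values as given above; the grep guard is only an assumption about avoiding a duplicate entry):

  # Make cman wait up to 45 seconds for quorum at startup:
  grep -q '^CMAN_QUORUM_TIMEOUT=' /etc/sysconfig/cman || \
      echo 'CMAN_QUORUM_TIMEOUT=45' >> /etc/sysconfig/cman
  # Corosync must not be started standalone; the cman init script starts it:
  chkconfig --del corosync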

Comment 9 Lon Hohberger 2011-03-31 14:34:14 UTC
(In reply to comment #8)

> You need to disable redundant ring; it is not well tested and entirely
> unsupported.  The nodelist fed to from dlm_controld to the kernel may or may
> not match up with the active ring.  That is, corosync might think a given node
> is "up" but the DLM may be unable to talk to it because.  When this happens,
> anything using the DLM (rgmanager included) will break.


Oops...

The nodelist given from dlm_controld to the kernel may or may not match up with the active ring.  That is, corosync might think a given node is "up" but the DLM may be unable to talk to it.  When this happens, anything using the DLM (rgmanager included) will break.

Comment 10 Lon Hohberger 2011-03-31 14:58:32 UTC
If rgmanager crashed (which is possible), there is a bug -- but the underlying issue(s) causing it are far worse.  If abrtd caught a core of rgmanager, go ahead and attach it here and I'll figure out how to fix it.

If there was no core file, you may need to add "DAEMON_COREFILE_LIMIT=unlimited" to /etc/sysconfig/rgmanager .
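
A hedged sketch of how that might look (the abrt spool path is an assumption about the RHEL 6 default):

  # Let rgmanager dump core, then restart so the limit takes effect:
  echo 'DAEMON_COREFILE_LIMIT=unlimited' >> /etc/sysconfig/rgmanager
  service rgmanager restart
  # If abrtd is running it may already have caught a dump; it usually lands
  # under /var/spool/abrt/ (older abrt releases used /var/cache/abrt):
  ls /var/spool/abrt/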

Comment 13 Lon Hohberger 2011-08-01 21:57:54 UTC

*** This bug has been marked as a duplicate of bug 725058 ***