Description of problem:
rgmanager dies

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists

Version-Release number of selected component (if applicable):
[root@syseng3-vm ~]# rpm -q rgmanager
rgmanager-3.0.12-10.el6.x86_64

How reproducible:
Start a RHEL6 HA cluster, wait a while, enjoy a dead rgmanager.

Steps to Reproduce:
1. Install RHEL6
2. Install HA addon
3. Configure basic cluster
4. See rgmanager die and have problems restarting

Actual results:
Dead rgmanager

Expected results:
Stable cluster

Additional info:

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 syseng1-vm                         1    Online, Local
 syseng2-vm                         2    Online
 syseng3-vm                         3    Online

There is a service running on node2 but clustat has no info on that.

[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11

The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107

The fix is always the same:

[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]
[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ]
[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 syseng1-vm                         1    Online, Local, rgmanager
 syseng2-vm                         2    Online, rgmanager
 syseng3-vm                         3    Online

 Service Name                       Owner (Last)           State
 ------- ----                       ----- ------           -----
 service:TestDB                     syseng2-vm             started

Sometimes restarting rgmanager hangs and the node needs to be rebooted.
my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
Ah -- I think you've tripped on this, which probably leads to a dlm interaction issue:

https://bugzilla.redhat.com/show_bug.cgi?id=688201

Full logs (even if you edit out hostnames or whatever) would be necessary to verify this.
Created attachment 488898 [details]
example /var/log/messages file from the system in question

This is a messages log from one of the nodes. I was starting and stopping services and rebooting wildly to try to get things back up :)
Created attachment 488899 [details]
rgmanager.log

Created attachment 488900 [details]
corosync.log

Created attachment 488901 [details]
dlm_controld.log
All logs are uploaded as-is; only the domain names have been stripped.
Whatever you're hitting, it's far below rgmanager - rgmanager is merely a victim of the underlying problem you're having. Here are a couple of suggestions:

* It looks like you need to add the following to /etc/sysconfig/cman:

  CMAN_QUORUM_TIMEOUT=45

  (This will be the default in RHEL 6.1.)

* This worries me:

  Mar 16 21:24:34 test3-vm corosync[1633]: [TOTEM ] Incrementing problem counter for seqid 1 iface 172.29.123.243 to [1 of 10]

  You need to disable redundant ring; it is not well tested and entirely unsupported. The nodelist fed to from dlm_controld to the kernel may or may not match up with the active ring. That is, corosync might think a given node is "up" but the DLM may be unable to talk to it because. When this happens, anything using the DLM (rgmanager included) will break.

* You need to chkconfig --del corosync if it's not already disabled.

* I have seen this before; it appears to be a bug in corosync triggered by network issues:

  Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 37
  Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 39 37
  ... (repeat several times)
  Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 39 37
  Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 3
  Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 2
  Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 1
  Mar 17 15:01:54 syseng1-vm corosync[2619]: [TOTEM ] A processor failed, forming new configuration.
  Mar 17 15:02:00 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 2
  ... (50 or 60 more)
  Mar 17 15:02:19 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 2
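[Editorial note] A minimal sketch of the /etc/sysconfig/cman change and the corosync init-script cleanup suggested in the comment above, assuming stock RHEL 6 paths and that CMAN_QUORUM_TIMEOUT is not already set in the file; adjust for your environment:

  # Give cman more time to wait for quorum at startup
  # (45 seconds, as suggested above; it becomes the default in RHEL 6.1).
  echo 'CMAN_QUORUM_TIMEOUT=45' >> /etc/sysconfig/cman

  # Make sure the standalone corosync init script is not enabled;
  # corosync should only be started by the cman init script.
  chkconfig --del corosync

  # Restart the cluster stack on this node so the new timeout takes effect.
  service rgmanager stop
  service cman restart
  service rgmanager start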
(In reply to comment #8)
> You need to disable redundant ring; it is not well tested and entirely
> unsupported. The nodelist fed to from dlm_controld to the kernel may or may
> not match up with the active ring. That is, corosync might think a given node
> is "up" but the DLM may be unable to talk to it because. When this happens,
> anything using the DLM (rgmanager included) will break.

Oops...

The nodelist given from dlm_controld to the kernel may or may not match up with the active ring. That is, corosync might think a given node is "up" but the DLM may be unable to talk to it. When this happens, anything using the DLM (rgmanager included) will break.
If rgmanager crashed (which is possible), there is a bug -- but the underlying issue(s) causing it are far worse.

If abrtd caught a core of rgmanager, go ahead and attach it here and I'll figure out how to fix it. If there was no core file, you may need to add "DAEMON_COREFILE_LIMIT=unlimited" to /etc/sysconfig/rgmanager.
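[Editorial note] A rough sketch of the core-file setup described above, assuming the standard RHEL 6 locations for the rgmanager sysconfig file and abrt's dump directory (the directory may differ on your system):

  # Allow the rgmanager daemon to dump an unlimited-size core file if it crashes.
  echo 'DAEMON_COREFILE_LIMIT=unlimited' >> /etc/sysconfig/rgmanager
  service rgmanager restart

  # If abrtd already caught a crash, look for an rgmanager core dump to attach to the bug.
  ls -l /var/spool/abrt/*/coredump 2>/dev/null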
*** This bug has been marked as a duplicate of bug 725058 ***