Description of problem:
If ccs is being used with the gulm magma plugin, it is not able to discover that lock_gulmd regains quorum in the event that gulm loses quorum and then later becomes quorate.

Version-Release number of selected component (if applicable):
RHEL4 cluster branch, Wed Jan 12 15:43:47 CST 2005

How reproducible:
always

Steps to Reproduce:
1. start ccsd on node1
2. start ccsd on node2
3. start lock_gulmd on node1
4. start lock_gulmd on node2
5. stop lock_gulmd on node1
6. start lock_gulmd on node1
7. ccs_test connect on node1

Actual results:
#
# servers are trin-01, trin-02 and trin-03. start with none running
# ccsd or lock_gulmd
#
[root@trin-01 ~]# gulm_tool getstats trin-02
Failed to connect to trin-02 (::ffff:192.168.44.172 40040) Connection refused
In src/gulm_tool.c:607 (DEVEL.1105376568) death by: Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-03
Failed to connect to trin-03 (::ffff:192.168.44.173 40040) Connection refused
In src/gulm_tool.c:607 (DEVEL.1105376568) death by: Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection refused
In src/gulm_tool.c:607 (DEVEL.1105376568) death by: Failed to connect to server
#
# start ccsd on trin-02 and trin-01
#
[root@trin-02 ~]# service ccsd start
Starting ccsd: [ OK ]
[root@trin-01 ~]# service ccsd start
Starting ccsd: [ OK ]
#
# start lock_gulmd on trin-02 and trin-01
#
[root@trin-02 ~]# service lock_gulmd start
Starting lock_gulmd: [ OK ]
[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd: [ OK ]
#
# trin-02 is the master
#
[root@trin-01 ~]# gulm_tool getstats trin-02
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1105566470051645
run time = 34
pid = 15671
verbosity = Default
failover = enabled
#
# trin-01 is a slave
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 17
pid = 21785
verbosity = Default
failover = enabled
#
# demonstrate that ccs is working
#
[root@trin-01 ~]# ccs_test connect
Connect successful. Connection descriptor = 0
#
# stop gulm server on trin-01
#
[root@trin-01 ~]# service lock_gulmd stop
Checking for Gulm Services...
Stopping lock_gulmd: [ OK ]
#
# we are stopped... quorum is lost
#
[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused
[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection refused
In src/gulm_tool.c:607 (DEVEL.1105376568) death by: Failed to connect to server
#
# make cluster quorate again
#
[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd: [ OK ]
#
# We are now quorate again
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 5
pid = 21858
verbosity = Default
failover = enabled
[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused
[root@trin-01 ~]# tail -n 100 /var/log/messages | grep ccsd
Jan 12 15:47:29 trin-01 ccsd[21682]: Starting ccsd DEVEL.1105376568:
Jan 12 15:47:29 trin-01 ccsd[21682]: Built: Jan 12 2005 15:00:10
Jan 12 15:47:29 trin-01 ccsd[21682]: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Jan 12 15:47:30 trin-01 ccsd: succeeded
Jan 12 15:47:38 trin-01 ccsd[21682]: Unable to connect to cluster infrastructure after 10 seconds.
Jan 12 15:47:48 trin-01 ccsd[21682]: Unable to connect to cluster infrastructure after 20 seconds.
Jan 12 15:47:58 trin-01 ccsd[21682]: Unable to connect to cluster infrastructure after 30 seconds.
Jan 12 15:48:09 trin-01 ccsd[21682]: Unable to connect to cluster infrastructure after 40 seconds.
Jan 12 15:48:10 trin-01 ccsd[21682]: cluster.conf (cluster name = mantis, version = 17) found.
Jan 12 15:48:14 trin-01 ccsd[21682]: Connected to cluster infrastruture via: GuLM Plugin v1.0
Jan 12 15:48:14 trin-01 ccsd[21682]: Initial status:: Quorate
Jan 12 15:48:48 trin-01 ccsd[21682]: Cluster is not quorate. Refusing connection.
Jan 12 15:48:48 trin-01 ccsd[21682]: Error while processing connect: Connection refused
Jan 12 15:49:08 trin-01 ccsd[21682]: Cluster is not quorate. Refusing connection.
Jan 12 15:49:08 trin-01 ccsd[21682]: Error while processing connect: Connection refused

Expected results:
ccsd accepts connections once gulm is quorate again

Additional info:

<?xml version="1.0"?>
<cluster name="mantis" config_version="17">
  <gulm>
    <lockserver name="trin-01.lab.msp.redhat.com"/>
    <lockserver name="trin-02.lab.msp.redhat.com"/>
    <lockserver name="trin-03.lab.msp.redhat.com"/>
  </gulm>
  <clusternodes>
    <clusternode name="trin-01.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-02.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-02"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-03.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-03"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-04.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-04"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-05.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-05"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-06.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-06"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-07.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-07"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-08.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-08"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="trin-09.lab.msp.redhat.com">
      <fence>
        <method name="default">
          <device name="mm" myname="trin-09"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_daemon clean_start="1"/>
  <fencedevices>
    <fencedevice name="mm" agent="/root/bin/mm_fence" ipaddr="void.msp.redhat.com" ipport="16661" mm_bin="/root/bin/mm_util"/>
  </fencedevices>
</cluster>

As a workaround, ccsd can be restarted to re-establish its connection after the cluster is quorate again.
CCS uses magma to receive cluster events. If magma is not telling CCS that quorum has been re-established, there is nothing CCS can do about it.
It could be that the gulm magma plugin is not pushing the state change back up to the parent; I will check this out.
I did a fudge where I had 2 of 3 masters online and listened with my magma event listener:

On node "red":
[root@red cluster]# lock_gulmd --servers "red green blue" --cluster_name foo
[root@red cluster]# cpt listen
...

On node "green":
[root@green gulm]# lock_gulmd --servers "red green blue" --cluster_name foo
[root@green gulm]# gulm_tool shutdown localhost:core
[root@green gulm]# lock_gulmd --servers "red green blue" --cluster_name foo

Output of 'cpt' on red:
Connected via: GuLM Plugin v1.0
Listening for events (group cluster::usrm)...
+++ Dump of 0x8c22020 (1 nodes)
red.lab.boston.redhat.com (id 0xffff0000284fa8c0) state Up
- red.lab.boston.redhat.com 192.168.44.40
--- Done ===
Waiting for events.
*E* Quorum formed
*E* Quorum dissolved
*E* Quorum formed

The form/dissolve/form sequence corresponds to the lock_gulmd master on green being started, shut down, and started again.
It looks like libgulm doesn't detect when lg_lock_logout is called and lock_gulmd is no longer running. However, given that once we get CE_SHUTDOWN the application must exit without making further lock calls, it's quite easy to fix this in the gulm magma plugin.
Created attachment 109734 [details] patch which changes gulm.so to kill lock FD on shutdown
Fix in CVS. State -> MODIFIED

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/magma-plugins/gulm/gulm.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.6.2.1&r2=1.6.2.2