Bug 144945 - ccsd not recognizing that gulm is quorate when quorum is lost and then re-established
Summary: ccsd not recognizing that gulm is quorate when quorum is lost and then re-established
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: magma-plugins
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-01-12 22:02 UTC by Adam "mantis" Manthei
Modified: 2009-04-16 20:16 UTC (History)
1 user

Fixed In Version: RHCS4U1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-11-22 18:24:13 UTC
Embargoed:


Attachments
patch which changes gulm.so to kill lock FD on shutdown (565 bytes, patch)
2005-01-13 19:33 UTC, Lon Hohberger

Description Adam "mantis" Manthei 2005-01-12 22:02:46 UTC
Description of problem:
If ccsd is used with the gulm magma plugin, it is not able to
discover that lock_gulmd has regained quorum in the case where gulm
loses quorum and then later becomes quorate again.
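
The check exercised with ccs_test below boils down to a small libccs
program. Here is a minimal sketch, assuming the ccs_connect() /
ccs_disconnect() entry points that ccs_test wraps (their exact return
conventions are an assumption here, not taken from this report):

/*
 * Poll ccsd until a connect succeeds.  Expected behaviour: once gulm
 * reports quorate = true again, ccs_connect() starts returning a valid
 * descriptor.  In the failure described below it keeps being refused
 * until ccsd itself is restarted.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <ccs.h>

int main(void)
{
    int desc;

    for (;;) {
        desc = ccs_connect();
        if (desc >= 0) {
            printf("Connect successful.\n Connection descriptor = %d\n", desc);
            ccs_disconnect(desc);
            return 0;
        }
        /* Assumes a negative errno on failure, as ccs_test's
         * "ccs_connect failed: Connection refused" output suggests. */
        fprintf(stderr, "ccs_connect failed: %s\n", strerror(-desc));
        sleep(5);
    }
}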

Version-Release number of selected component (if applicable):
RHEL4 cluster branch, Wed Jan 12 15:43:47 CST 2005

How reproducible:
always

Steps to Reproduce:
1. start ccsd on node1
2. start ccsd on node2
3. start lock_gulmd on node1
4. start lock_gulmd on node2
5. stop lock_gulmd on node1
6. start lock_gulmd on node1
7. ccs_test connect on node1
  
Actual results:

#
# servers are trin-01 trin-02 and trin-03.  start with none running
# ccsd or lock_gulmd
#
[root@trin-01 ~]# gulm_tool getstats trin-02
Failed to connect to trin-02 (::ffff:192.168.44.172 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-03
Failed to connect to trin-03 (::ffff:192.168.44.173 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server

#
# start ccsd on trin-02 and trin-01
#
[root@trin-02 ~]# service ccsd start 
Starting ccsd:                              [  OK  ]

[root@trin-01 ~]# service ccsd start 
Starting ccsd:                              [  OK  ]

#
# start lock_gulmd on trin-02 and trin-01
# 
[root@trin-02 ~]# service lock_gulmd start 
Starting lock_gulmd:                        [  OK  ]

[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd:                        [  OK  ]

#
# trin-02 is the master
#
[root@trin-01 ~]# gulm_tool getstats trin-02
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1105566470051645
run time = 34
pid = 15671
verbosity = Default
failover = enabled

#
# trin-01 is a slave
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 17
pid = 21785
verbosity = Default
failover = enabled

#
# demonstrate that ccs is working
#
[root@trin-01 ~]# ccs_test connect
Connect successful.
 Connection descriptor = 0

#
# stop gulm server on trin-01
#
[root@trin-01 ~]# service lock_gulmd stop
Checking for Gulm Services...
Stopping lock_gulmd:                                       [  OK  ]

#
# we are stopped... quorum is lost
#

[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused

[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server

#
# make cluster quorate again
#
[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd:                                    [  OK  ]

#
# We are now quorate again
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 5
pid = 21858
verbosity = Default
failover = enabled

[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused

[root@trin-01 ~]# tail -n 100 /var/log/messages | grep ccsd
Jan 12 15:47:29 trin-01 ccsd[21682]: Starting ccsd DEVEL.1105376568: 
Jan 12 15:47:29 trin-01 ccsd[21682]:  Built: Jan 12 2005 15:00:10 
Jan 12 15:47:29 trin-01 ccsd[21682]:  Copyright (C) Red Hat, Inc. 
2004  All rights reserved. 
Jan 12 15:47:30 trin-01 ccsd:  succeeded
Jan 12 15:47:38 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 10 seconds. 
Jan 12 15:47:48 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 20 seconds. 
Jan 12 15:47:58 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 30 seconds. 
Jan 12 15:48:09 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 40 seconds. 
Jan 12 15:48:10 trin-01 ccsd[21682]: cluster.conf (cluster name =
mantis, version = 17) found. 
Jan 12 15:48:14 trin-01 ccsd[21682]: Connected to cluster
infrastruture via: GuLM Plugin v1.0 
Jan 12 15:48:14 trin-01 ccsd[21682]: Initial status:: Quorate 
Jan 12 15:48:48 trin-01 ccsd[21682]: Cluster is not quorate.  Refusing
connection. 
Jan 12 15:48:48 trin-01 ccsd[21682]: Error while processing connect:
Connection refused 
Jan 12 15:49:08 trin-01 ccsd[21682]: Cluster is not quorate.  Refusing
connection. 
Jan 12 15:49:08 trin-01 ccsd[21682]: Error while processing connect:
Connection refused 


Expected results:
ccsd accepts connection once gulm is quorate again

Additional info:

<?xml version="1.0"?>
<cluster name="mantis" config_version="17">
        <gulm>
                <lockserver name="trin-01.lab.msp.redhat.com"/>
                <lockserver name="trin-02.lab.msp.redhat.com"/>
                <lockserver name="trin-03.lab.msp.redhat.com"/>
        </gulm>

        <clusternodes>
                <clusternode name="trin-01.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-01"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-02.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-02"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-03.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-03"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-04.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-04"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-05.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-05"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-06.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-06"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-07.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-07"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-08.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-08"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-09.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-09"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>

        <fence_daemon clean_start="1"/>

        <fencedevices>
                <fencedevice name="mm" agent="/root/bin/mm_fence"
ipaddr="void.msp.redhat.com" ipport="16661" mm_bin="/root/bin/mm_util"/>
        </fencedevices>
</cluster>


As a workaround, ccsd can be restarted to re-establish its connection
after the cluster becomes quorate again.

Comment 1 Jonathan Earl Brassow 2005-01-13 15:30:26 UTC
ccs uses magma to get cluster events.  If magma is not telling CCS that 
quorum has been reestablished, there is nothing ccs can do about it.
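
A rough sketch of the consumer side this describes: ccsd only flips its
quorum flag when the cluster layer delivers an event, so if the gulm
plugin never reports that quorum re-formed, every connect keeps being
refused. The event codes and the next_cluster_event() helper below are
placeholders rather than the real magma API (only CE_SHUTDOWN is named
later in this bug):

/*
 * The quorate flag is what gets checked before accepting connections;
 * it only changes when an event arrives from the plugin.
 */
#include <stdio.h>
#include <unistd.h>

enum cluster_event {            /* placeholder codes, not magma.h values */
    EV_NONE,
    EV_QUORUM_FORMED,           /* "*E* Quorum formed" in the cpt output */
    EV_QUORUM_DISSOLVED,        /* "*E* Quorum dissolved" */
    EV_SHUTDOWN
};

static int quorate;

/* Hypothetical stand-in for the plugin's event delivery. */
static enum cluster_event next_cluster_event(void)
{
    return EV_NONE;
}

int main(void)
{
    for (;;) {
        switch (next_cluster_event()) {
        case EV_QUORUM_FORMED:
            quorate = 1;
            printf("quorate: accepting connections\n");
            break;
        case EV_QUORUM_DISSOLVED:
        case EV_SHUTDOWN:
            quorate = 0;
            printf("not quorate: refusing connections\n");
            break;
        default:
            break;
        }
        sleep(1);
    }
    return 0;
}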

Comment 2 Lon Hohberger 2005-01-13 16:54:48 UTC
Could be gulm magma plugin not pushing the state change back up to the
parent; will check this out.

Comment 3 Lon Hohberger 2005-01-13 18:17:08 UTC
As a quick test, I brought 2 of the 3 lock servers online and listened
with my magma event listener:

On node "red":

[root@red cluster]# lock_gulmd --servers "red green blue"
--cluster_name foo
[root@red cluster]# cpt listen ...

On node "green":
[root@green gulm]# lock_gulmd --servers "red green blue"
--cluster_name foo
[root@green gulm]# gulm_tool shutdown localhost:core
[root@green gulm]# lock_gulmd --servers "red green blue"
--cluster_name foo

Output of 'cpt' on red:

Connected via: GuLM Plugin v1.0
Listening for events (group cluster::usrm)...
+++ Dump of 0x8c22020 (1 nodes)
    red.lab.boston.redhat.com (id 0xffff0000284fa8c0) state Up
     - red.lab.boston.redhat.com 192.168.79.40
--- Done
=== Waiting for events.
*E* Quorum formed
*E* Quorum dissolved
*E* Quorum formed

The form/dissolve/form events correspond to lock_gulmd being started,
shut down, and restarted on green.


Comment 4 Lon Hohberger 2005-01-13 19:30:41 UTC
It looks like libgulm doesn't detect when lg_lock_logout is called and
lock_gulmd is no longer running.

However, given that once we get CE_SHUTDOWN the application must exit
without making further lock calls, it's quite easy to fix this in the
gulm magma plugin.
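
The attached patch itself is not reproduced here; the sketch below only
illustrates the shape of the change described above, where the plugin
drops its lock-layer file descriptor on shutdown so stale state is not
carried into the next login. The gulm_priv structure and the
gulm_shutdown() name are hypothetical:

/*
 * Illustrative only: close (and invalidate) the lock FD when the
 * plugin is shut down, so that a later login after lock_gulmd is
 * restarted does not keep using a dead socket.
 */
#include <unistd.h>

struct gulm_priv {          /* hypothetical plugin-private state */
    int core_fd;            /* connection to the lock_gulmd core */
    int lock_fd;            /* connection to the lock space */
};

void gulm_shutdown(struct gulm_priv *p)
{
    if (p->lock_fd >= 0) {
        close(p->lock_fd);
        p->lock_fd = -1;
    }
}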


Comment 5 Lon Hohberger 2005-01-13 19:33:22 UTC
Created attachment 109734 [details]
patch which changes gulm.so to kill lock FD on shutdown

