Bug 144945 - ccsd not recognizing that gulm is quorate when quorum is lost and then reestablished
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: magma-plugins
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-01-12 17:02 EST by Adam "mantis" Manthei
Modified: 2009-04-16 16:16 EDT

Fixed In Version: RHCS4U1
Doc Type: Bug Fix
Last Closed: 2005-11-22 13:24:13 EST

Attachments
patch which changes gulm.so to kill lock FD on shutdown (565 bytes, patch)
2005-01-13 14:33 EST, Lon Hohberger

Description Adam "mantis" Manthei 2005-01-12 17:02:46 EST
Description of problem:
If ccs is being used with the gulm magma plugin, it is not able to
discover that lock_gulmd has regained quorum after gulm loses quorum
and then later becomes quorate again.
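
To illustrate where the refusal comes from: ccsd keeps a quorum flag that it
learns from the magma plugin and gates incoming connections on it.  A minimal
C sketch of that kind of gate follows; the names (cluster_quorate,
accept_request) are hypothetical and are not ccsd's actual symbols.

/* Hypothetical sketch of a ccsd-style quorum gate, not actual ccsd code. */
#include <errno.h>
#include <syslog.h>
#include <unistd.h>

static int cluster_quorate;   /* updated from plugin events */

int accept_request(int client_fd)
{
    if (!cluster_quorate) {
        /* Matches the log message seen below: if this flag is never
         * refreshed after quorum returns, every request is refused. */
        syslog(LOG_ERR, "Cluster is not quorate.  Refusing connection.");
        close(client_fd);
        return -ECONNREFUSED;
    }
    /* ... normal request handling would go here ... */
    return 0;
}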

Version-Release number of selected component (if applicable):
RHEL4 cluster branch, Wed Jan 12 15:43:47 CST 2005

How reproducible:
always

Steps to Reproduce:
1. start ccsd on node1
2. start ccsd on node2
3. start lock_gulmd on node1
4. start lock_gulmd on node2
5. stop lock_gulmd on node1
6. start lock_gulmd on node1
7. ccs_test connect on node1
  
Actual results:

#
# servers are trin-01 trin-02 and trin-03.  start with none running
# ccsd or lock_gulmd
#
[root@trin-01 ~]# gulm_tool getstats trin-02
Failed to connect to trin-02 (::ffff:192.168.44.172 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-03
Failed to connect to trin-03 (::ffff:192.168.44.173 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server
[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server

#
# start ccsd on trin-02 and trin-01
#
[root@trin-02 ~]# service ccsd start 
Starting ccsd:                              [  OK  ]

[root@trin-01 ~]# service ccsd start 
Starting ccsd:                              [  OK  ]

#
# start lock_gulmd on trin-02 and trin-01
# 
[root@trin-02 ~]# service lock_gulmd start 
Starting lock_gulmd:                        [  OK  ]

[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd:                        [  OK  ]

#
# trin-02 is the master
#
[root@trin-01 ~]# gulm_tool getstats trin-02
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1105566470051645
run time = 34
pid = 15671
verbosity = Default
failover = enabled

#
# trin-01 is a slave
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 17
pid = 21785
verbosity = Default
failover = enabled

#
# demonstrate that ccs is working
#
[root@trin-01 ~]# ccs_test connect
Connect successful.
 Connection descriptor = 0

#
# stop gulm server on trin-01
#
[root@trin-01 ~]# service lock_gulmd stop
Checking for Gulm Services...
Stopping lock_gulmd:                                       [  OK  ]

#
# we are stopped... quorum is lost
#

[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused

[root@trin-01 ~]# gulm_tool getstats trin-01
Failed to connect to trin-01 (::ffff:192.168.44.171 40040) Connection
refused In src/gulm_tool.c:607 (DEVEL.1105376568) death by:
Failed to connect to server

#
# make cluster quorate again
#
[root@trin-01 ~]# service lock_gulmd start
Starting lock_gulmd:                                    [  OK  ]

#
# We are now quorate again
#
[root@trin-01 ~]# gulm_tool getstats trin-01
I_am = Slave
Master = trin-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1105566470051645
run time = 5
pid = 21858
verbosity = Default
failover = enabled

[root@trin-01 ~]# ccs_test connect
ccs_connect failed: Connection refused

[root@trin-01 ~]# tail -n 100 /var/log/messages | grep ccsd
Jan 12 15:47:29 trin-01 ccsd[21682]: Starting ccsd DEVEL.1105376568: 
Jan 12 15:47:29 trin-01 ccsd[21682]:  Built: Jan 12 2005 15:00:10 
Jan 12 15:47:29 trin-01 ccsd[21682]:  Copyright (C) Red Hat, Inc. 
2004  All rights reserved. 
Jan 12 15:47:30 trin-01 ccsd:  succeeded
Jan 12 15:47:38 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 10 seconds. 
Jan 12 15:47:48 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 20 seconds. 
Jan 12 15:47:58 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 30 seconds. 
Jan 12 15:48:09 trin-01 ccsd[21682]: Unable to connect to cluster
infrastructure after 40 seconds. 
Jan 12 15:48:10 trin-01 ccsd[21682]: cluster.conf (cluster name =
mantis, version = 17) found. 
Jan 12 15:48:14 trin-01 ccsd[21682]: Connected to cluster
infrastruture via: GuLM Plugin v1.0 
Jan 12 15:48:14 trin-01 ccsd[21682]: Initial status:: Quorate 
Jan 12 15:48:48 trin-01 ccsd[21682]: Cluster is not quorate.  Refusing
connection. 
Jan 12 15:48:48 trin-01 ccsd[21682]: Error while processing connect:
Connection refused 
Jan 12 15:49:08 trin-01 ccsd[21682]: Cluster is not quorate.  Refusing
connection. 
Jan 12 15:49:08 trin-01 ccsd[21682]: Error while processing connect:
Connection refused 


Expected results:
ccsd accepts connections once gulm is quorate again

Additional info:

<?xml version="1.0"?>
<cluster name="mantis" config_version="17">
        <gulm>
                <lockserver name="trin-01.lab.msp.redhat.com"/>
                <lockserver name="trin-02.lab.msp.redhat.com"/>
                <lockserver name="trin-03.lab.msp.redhat.com"/>
        </gulm>

        <clusternodes>
                <clusternode name="trin-01.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-01"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-02.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-02"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-03.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-03"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-04.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-04"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-05.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-05"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-06.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-06"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-07.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-07"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-08.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-08"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="trin-09.lab.msp.redhat.com">
                        <fence>
                                <method name="default"> 
                                        <device name="mm"
myname="trin-09"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>

        <fence_daemon clean_start="1"/>

        <fencedevices>
                <fencedevice name="mm" agent="/root/bin/mm_fence"
ipaddr="void.msp.redhat.com" ipport="16661" mm_bin="/root/bin/mm_util"/>
        </fencedevices>
</cluster>


As a workaround, ccsd can be restarted to reestablish its connection
after the cluster becomes quorate again.
Comment 1 Jonathan Earl Brassow 2005-01-13 10:30:26 EST
ccs uses magma to get cluster events.  If magma is not telling CCS that 
quorum has been reestablished, there is nothing ccs can do about it.
Comment 2 Lon Hohberger 2005-01-13 11:54:48 EST
Could be the gulm magma plugin not pushing the state change back up to
the parent; I will check this out.
Comment 3 Lon Hohberger 2005-01-13 13:17:08 EST
I did a fudge where I had 2 of 3 masters online and listened with my
magma event listener:

On node "red":

[root@red cluster]# lock_gulmd --servers "red green blue" --cluster_name foo
[root@red cluster]# cpt listen ...

On node "green":
[root@green gulm]# lock_gulmd --servers "red green blue" --cluster_name foo
[root@green gulm]# gulm_tool shutdown localhost:core
[root@green gulm]# lock_gulmd --servers "red green blue" --cluster_name foo

Output of 'cpt' on red:

Connected via: GuLM Plugin v1.0
Listening for events (group cluster::usrm)...
+++ Dump of 0x8c22020 (1 nodes)
    red.lab.boston.redhat.com (id 0xffff0000284fa8c0) state Up
     - red.lab.boston.redhat.com 192.168.79.40
--- Done
=== Waiting for events.
*E* Quorum formed
*E* Quorum dissolved
*E* Quorum formed

The form/dissolve/form correspond to lock_gulmd master being started
on green.
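
To make the dependence on plugin events concrete, here is a self-contained C
sketch of the quorum flag a consumer such as ccsd keeps.  It only simulates
the "Quorum formed / Quorum dissolved / Quorum formed" sequence shown by the
cpt listener above; the enum and names are made up for illustration and are
not the magma API.

/* Simulated event consumer: if the plugin never delivers the second
 * "quorum formed" event, the consumer stays inquorate forever. */
#include <stdio.h>

enum ev { EV_QUORUM_FORMED, EV_QUORUM_DISSOLVED, EV_DONE };

int main(void)
{
    /* Event stream as seen by the cpt listener above. */
    enum ev events[] = { EV_QUORUM_FORMED, EV_QUORUM_DISSOLVED,
                         EV_QUORUM_FORMED, EV_DONE };
    int quorate = 0;

    for (unsigned i = 0; events[i] != EV_DONE; i++) {
        if (events[i] == EV_QUORUM_FORMED)
            quorate = 1;    /* consumer may accept connections again */
        else
            quorate = 0;    /* refuse connections while inquorate    */
        printf("event %u -> %s\n", i, quorate ? "quorate" : "inquorate");
    }
    return 0;
}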
Comment 4 Lon Hohberger 2005-01-13 14:30:41 EST
It looks like libgulm doesn't detect when lg_lock_logout is called and
lock_gulmd is no longer running.

However, given that once we get CE_SHUTDOWN the application must exit
without making any further lock calls, it's quite easy to fix this in
the gulm magma plugin.
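
For reference, the general shape of that fix (an illustrative guess, not the
attached 565-byte patch; the struct and field names are made up): when the
shutdown event is observed, the plugin can simply drop its stale lock
descriptor so that a later login re-establishes it.

/* Hypothetical sketch of the fix idea: kill the lock FD on shutdown. */
#include <unistd.h>

struct gulm_priv {
    int core_fd;   /* connection carrying membership/quorum events */
    int lock_fd;   /* connection carrying lock traffic             */
};

void gulm_handle_shutdown(struct gulm_priv *p)
{
    /* Once CE_SHUTDOWN is delivered the consumer must not make further
     * lock calls, so it is safe to close the lock FD here and let a
     * subsequent login reopen it. */
    if (p->lock_fd >= 0) {
        close(p->lock_fd);
        p->lock_fd = -1;
    }
}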
Comment 5 Lon Hohberger 2005-01-13 14:33:22 EST
Created attachment 109734 [details]
patch which changes gulm.so to kill lock FD on shutdown
