Bug 171246

Summary: shot master gets marked expired and isn't ever let back into the cluster
Product: [Retired] Red Hat Cluster Suite
Component: gulm
Version: 4
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Chris Feist <cfeist>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, nstraz
Hardware: All
OS: Linux
Doc Type: Bug Fix
Fixed In Version: RHBA-2006-0238
Last Closed: 2006-03-09 19:52:44 UTC
Bug Blocks: 164915

Description Corey Marthaler 2005-10-19 21:30:28 UTC
Description of problem:
Three node cluster with an IP service running:
        Gulm Status
        ===========
        link-10: Master
        link-11: Slave
        link-12: Slave
Node facing the revolver (shot): link-10

After link-10 is shot, link-11 becomes the new master but never marks link-10
unexpired, so link-10 is stuck at Pending when it attempts to rejoin.


[root@link-10 ~]# gulm_tool getstats $(hostname)
I_am = Pending
quorum_has = 1
quorum_needs = 2
rank = 0
quorate = false
GenerationID = 0
run time = 1551
pid = 3725
verbosity = Default
failover = enabled

[root@link-11 ~]# gulm_tool getstats $(hostname)
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1129656435624285
run time = 1789
pid = 3724
verbosity = Default
failover = enabled

[root@link-12 ~]# gulm_tool getstats $(hostname)
I_am = Slave
Master = link-11.lab.msp.redhat.com
rank = 2
quorate = true
GenerationID = 1129656435624285
run time = 3515
pid = 3327
verbosity = Default
failover = enabled


[root@link-10 ~]# gulm_tool nodelist $(hostname):core
 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Logged in
  last state = Logged out
  mode = Pending
  missed beats = 0
  last beat = 0
  delay avg = 0
  max delay = 0


[root@link-11 ~]# gulm_tool nodelist $(hostname):core
 Name: link-12
  ip    = ::ffff:10.15.84.162
  state = Logged in
  last state = Was Logged in
  mode = Slave
  missed beats = 0
  last beat = 1129739640837508
  delay avg = 10000840
  max delay = 18446744073692569048

 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Expired
  last state = Logged in
  mode = Master
  missed beats = 3
  last beat = 1129737954184427
  delay avg = 128555690
  max delay = 18446744073686482467

 Name: link-11
  ip    = ::ffff:10.15.84.161
  state = Logged in
  last state = Was Logged in
  mode = Master
  missed beats = 0
  last beat = 1129739639422556
  delay avg = 10001019
  max delay = 18446744073690534786


link-11 (new master):
Oct 19 11:05:10 link-11 ccsd[3717]: Cluster is quorate.  Allowing connections.
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <notice> Quorum Achieved
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> Magma Event: Membership Change
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> State change: link-10 DOWN
Oct 19 11:05:11 link-11 clurgmgrd[5758]: <notice> Taking over service
10.15.84.156 from down member (null)
Oct 19 11:05:12 link-11 clurgmgrd[5758]: <notice> Service 10.15.84.156 started
Oct 19 11:05:24 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737924177196 mb:1)
Oct 19 11:05:39 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737939180812 mb:2)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737954184427 mb:3)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Client (link-10) expired
Oct 19 11:05:54 link-11 lock_gulmd_core[5957]: Gonna exec fence_node -O link-10
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Forked [5957] fence_node -O
link-10 with a 0 pause.
Oct 19 11:06:03 link-11 fence_node[5957]: Fence of "link-10" was successful
Oct 19 11:08:28 link-11 lock_gulmd_core[3724]: "Magma::6089" is logged out. fd:11
Oct 19 11:08:31 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]: "Magma::6090" is logged out. fd:11
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:40 link-11 last message repeated 2 times
Oct 19 11:08:41 link-11 lock_gulmd_core[3724]: "Magma::6101" is logged out. fd:11
Oct 19 11:08:43 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.




link-10 (old master):
Oct 19 11:05:10 link-10 ccsd[3718]: cluster.conf (cluster name = LINK_128,
version = 2) found.
Oct 19 11:05:10 link-10 ccsd[3718]: Remote copy of cluster.conf is from quorate
node.
Oct 19 11:05:10 link-10 ccsd[3718]:  Local version # : 2
Oct 19 11:05:10 link-10 ccsd[3718]:  Remote version #: 2
Oct 19 11:05:10 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_core.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: Starting lock_gulmd_core 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am running in Fail-over mode.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:1
fd:6)
Oct 19 11:05:11 link-10 hald[2538]: Timed out waiting for hotplug event 267.
Rebasing to 523
Oct 19 11:05:11 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LT.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: Starting lock_gulmd_LT 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am running in Fail-over mode.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:12 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:2
fd:7)
Oct 19 11:05:12 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LTPX.
Oct 19 11:05:12 link-10 qarshd[3722]: That's enough
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: Starting lock_gulmd_LTPX 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am running in Fail-over mode.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: This is cluster LINK_128
Oct 19 11:05:13 link-10 ccsd[3718]: Connected to cluster infrastruture via: GuLM
Plugin v1.0.1
Oct 19 11:05:13 link-10 ccsd[3718]: Initial status:: Inquorate
Oct 19 11:05:14 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:16 link-10 qarshd[4706]: Talking to peer 10.15.80.3:34440
Oct 19 11:05:17 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change


Version-Release number of selected component (if applicable):
Starting lock_gulmd_core 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004
Red Hat, Inc.  All rights reserved.

Comment 1 Corey Marthaler 2005-10-25 16:27:48 UTC
Just a note that this was seen while doing regression testing for the
2.6.9-22.0.1 kernel.

Comment 3 Chris Feist 2005-12-14 21:03:52 UTC
It appears this bug is caused by the main lock_gulmd process not receiving the
SIGCHLD event, and therefore never running waitpid().  I've modified the code to
call waitpid whether or not a SIGCHLD signal has been sent.
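
For illustration only, here is a minimal C sketch of that reaping pattern; it is
not the actual lock_gulmd source, and the names (sigchld_seen,
handle_finished_child, reap_children) are hypothetical. The point is that the
main loop calls a non-blocking waitpid() on every pass instead of only when a
SIGCHLD flag has been observed:

/*
 * Illustrative sketch only -- not the actual lock_gulmd code.  It shows
 * the pattern described in comment 3: reap finished children with a
 * non-blocking waitpid() on every pass of the main loop, instead of
 * only when a SIGCHLD flag has been seen.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t sigchld_seen = 0;   /* hypothetical flag */

static void sigchld_handler(int sig)
{
    (void)sig;
    sigchld_seen = 1;   /* can be missed or coalesced under load */
}

/* Hypothetical hook: e.g. note that the forked fence_node finished, so
 * the expired node can be marked fenced and allowed to log back in. */
static void handle_finished_child(pid_t pid, int status)
{
    printf("child %d exited, status %d\n", (int)pid, WEXITSTATUS(status));
}

static void reap_children(void)
{
    int status;
    pid_t pid;

    /* Reap everything that has exited, without blocking. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        handle_finished_child(pid, status);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigaction(SIGCHLD, &sa, NULL);

    for (;;) {
        /* ... poll sockets, heartbeats, fence requests ... */

        /* Old (buggy) behaviour: reap only when the flag was set:
         *     if (sigchld_seen) { sigchld_seen = 0; reap_children(); }
         * Fixed behaviour: always attempt a non-blocking reap. */
        reap_children();

        sleep(1);
    }
    return 0;
}

Presumably the unconditional reap ensures the fence_node exit status is always
collected, so the shot node can be let back in instead of being refused with
"Cannot login if you are expired."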

Comment 4 Corey Marthaler 2005-12-14 21:12:45 UTC
This fix ran through 85 iterations of revolver without problems.

Comment 5 Chris Feist 2006-01-17 19:32:59 UTC
*** Bug 178081 has been marked as a duplicate of this bug. ***

Comment 8 Red Hat Bugzilla 2006-03-09 19:52:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0238.html