Bug 171246

Summary: shot master gets marked expired and isn't ever let back into the cluster
Product: [Retired] Red Hat Cluster Suite
Component: gulm
Version: 4
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Chris Feist <cfeist>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, nstraz
Hardware: All
OS: Linux
Doc Type: Bug Fix
Fixed In Version: RHBA-2006-0238
Last Closed: 2006-03-09 19:52:44 UTC
Bug Blocks: 164915

Description Corey Marthaler 2005-10-19 21:30:28 UTC
Description of problem:
Three node cluster with an IP service running:
        Gulm Status
        ===========
        link-10: Master
        link-11: Slave
        link-12: Slave
Node facing the revolver (shot): link-10

After link-10 is shot, link-11 becomes the new master but never marks link-10
unexpired, so link-10 is stuck at Pending when it attempts to rejoin.


[root@link-10 ~]# gulm_tool getstats $(hostname)
I_am = Pending
quorum_has = 1
quorum_needs = 2
rank = 0
quorate = false
GenerationID = 0
run time = 1551
pid = 3725
verbosity = Default
failover = enabled

[root@link-11 ~]# gulm_tool getstats $(hostname)
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1129656435624285
run time = 1789
pid = 3724
verbosity = Default
failover = enabled

[root@link-12 ~]# gulm_tool getstats $(hostname)
I_am = Slave
Master = link-11.lab.msp.redhat.com
rank = 2
quorate = true
GenerationID = 1129656435624285
run time = 3515
pid = 3327
verbosity = Default
failover = enabled


[root@link-10 ~]# gulm_tool nodelist $(hostname):core
 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Logged in
  last state = Logged out
  mode = Pending
  missed beats = 0
  last beat = 0
  delay avg = 0
  max delay = 0


[root@link-11 ~]# gulm_tool nodelist $(hostname):core
 Name: link-12
  ip    = ::ffff:10.15.84.162
  state = Logged in
  last state = Was Logged in
  mode = Slave
  missed beats = 0
  last beat = 1129739640837508
  delay avg = 10000840
  max delay = 18446744073692569048

 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Expired
  last state = Logged in
  mode = Master
  missed beats = 3
  last beat = 1129737954184427
  delay avg = 128555690
  max delay = 18446744073686482467

 Name: link-11
  ip    = ::ffff:10.15.84.161
  state = Logged in
  last state = Was Logged in
  mode = Master
  missed beats = 0
  last beat = 1129739639422556
  delay avg = 10001019
  max delay = 18446744073690534786


link-11 (new master):
Oct 19 11:05:10 link-11 ccsd[3717]: Cluster is quorate.  Allowing connections.
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <notice> Quorum Achieved
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> Magma Event: Membership Change
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> State change: link-10 DOWN
Oct 19 11:05:11 link-11 clurgmgrd[5758]: <notice> Taking over service
10.15.84.156 from down member (null)
Oct 19 11:05:12 link-11 clurgmgrd[5758]: <notice> Service 10.15.84.156 started
Oct 19 11:05:24 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737924177196 mb:1)
Oct 19 11:05:39 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737939180812 mb:2)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737954184427 mb:3)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Client (link-10) expired
Oct 19 11:05:54 link-11 lock_gulmd_core[5957]: Gonna exec fence_node -O link-10
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Forked [5957] fence_node -O
link-10 with a 0 pause.
Oct 19 11:06:03 link-11 fence_node[5957]: Fence of "link-10" was successful
Oct 19 11:08:28 link-11 lock_gulmd_core[3724]: "Magma::6089" is logged out. fd:11
Oct 19 11:08:31 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]: "Magma::6090" is logged out. fd:11
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:40 link-11 last message repeated 2 times
Oct 19 11:08:41 link-11 lock_gulmd_core[3724]: "Magma::6101" is logged out. fd:11
Oct 19 11:08:43 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.




link-10 (old master):
Oct 19 11:05:10 link-10 ccsd[3718]: cluster.conf (cluster name = LINK_128,
version = 2) found.
Oct 19 11:05:10 link-10 ccsd[3718]: Remote copy of cluster.conf is from quorate
node.
Oct 19 11:05:10 link-10 ccsd[3718]:  Local version # : 2
Oct 19 11:05:10 link-10 ccsd[3718]:  Remote version #: 2
Oct 19 11:05:10 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_core.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: Starting lock_gulmd_core 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am running in Fail-over mode.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:1
fd:6)
Oct 19 11:05:11 link-10 hald[2538]: Timed out waiting for hotplug event 267.
Rebasing to 523
Oct 19 11:05:11 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LT.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: Starting lock_gulmd_LT 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am running in Fail-over mode.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:12 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:2
fd:7)
Oct 19 11:05:12 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LTPX.
Oct 19 11:05:12 link-10 qarshd[3722]: That's enough
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: Starting lock_gulmd_LTPX 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am running in Fail-over mode.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: This is cluster LINK_128
Oct 19 11:05:13 link-10 ccsd[3718]: Connected to cluster infrastruture via: GuLM
Plugin v1.0.1
Oct 19 11:05:13 link-10 ccsd[3718]: Initial status:: Inquorate
Oct 19 11:05:14 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:16 link-10 qarshd[4706]: Talking to peer 10.15.80.3:34440
Oct 19 11:05:17 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change


Version-Release number of selected component (if applicable):
Starting lock_gulmd_core 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004
Red Hat, Inc.  All rights reserved.

Comment 1 Corey Marthaler 2005-10-25 16:27:48 UTC
Just a note that this was seen while doing regression testing for the
2.6.9-22.0.1 kernel.

Comment 3 Chris Feist 2005-12-14 21:03:52 UTC
It appears this bug is caused by the main lock_gulmd process not receiving the
SIGCHLD event, and therefore never running waitpid().  I've modified the code to
call waitpid whether or not a SIGCHLD signal has been sent.
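
For illustration only, here is a minimal C sketch of that reaping pattern; it is
not the actual lock_gulmd source, and the names (sigchld_seen,
handle_finished_child, reap_children) are hypothetical. The point is that the
main loop calls a non-blocking waitpid() on every pass instead of only when a
SIGCHLD flag has been observed:

/*
 * Illustrative sketch only -- not the actual lock_gulmd code.  It shows
 * the pattern described in comment 3: reap finished children with a
 * non-blocking waitpid() on every pass of the main loop, instead of
 * only when a SIGCHLD flag has been seen.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t sigchld_seen = 0;   /* hypothetical flag */

static void sigchld_handler(int sig)
{
    (void)sig;
    sigchld_seen = 1;   /* can be missed or coalesced under load */
}

/* Hypothetical hook: e.g. note that the forked fence_node finished, so
 * the expired node can be marked fenced and allowed to log back in. */
static void handle_finished_child(pid_t pid, int status)
{
    printf("child %d exited, status %d\n", (int)pid, WEXITSTATUS(status));
}

static void reap_children(void)
{
    int status;
    pid_t pid;

    /* Reap everything that has exited, without blocking. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        handle_finished_child(pid, status);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigaction(SIGCHLD, &sa, NULL);

    for (;;) {
        /* ... poll sockets, heartbeats, fence requests ... */

        /* Old (buggy) behaviour: reap only when the flag was set:
         *     if (sigchld_seen) { sigchld_seen = 0; reap_children(); }
         * Fixed behaviour: always attempt a non-blocking reap. */
        reap_children();

        sleep(1);
    }
    return 0;
}

Presumably the unconditional reap ensures the fence_node exit status is always
collected, so the shot node can be let back in instead of being refused with
"Cannot login if you are expired."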

Comment 4 Corey Marthaler 2005-12-14 21:12:45 UTC
This fix ran through 85 iterations of revolver without problems.

Comment 5 Chris Feist 2006-01-17 19:32:59 UTC
*** Bug 178081 has been marked as a duplicate of this bug. ***

Comment 8 Red Hat Bugzilla 2006-03-09 19:52:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0238.html