Bug 171246 - shot master gets marked expired and isn't ever let back into the cluster
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gulm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Chris Feist
QA Contact: Cluster QE
Duplicates: 178081
Blocks: 164915
Reported: 2005-10-19 17:30 EDT by Corey Marthaler
Modified: 2009-04-16 16:02 EDT
CC: 2 users

Fixed In Version: RHBA-2006-0238
Doc Type: Bug Fix
Last Closed: 2006-03-09 14:52:44 EST

Attachments: None
Description Corey Marthaler 2005-10-19 17:30:28 EDT
Description of problem:
Three node cluster with an IP service running:
        Gulm Status
        ===========
        link-10: Master
        link-11: Slave
        link-12: Slave
Node facing the revolver (the one shot): link-10

After link-10 is shot, link-11 becomes the new master but never marks link-10
unexpired, so link-10 is stuck at Pending when it attempts to rejoin.


[root@link-10 ~]# gulm_tool getstats $(hostname)
I_am = Pending
quorum_has = 1
quorum_needs = 2
rank = 0
quorate = false
GenerationID = 0
run time = 1551
pid = 3725
verbosity = Default
failover = enabled

[root@link-11 ~]# gulm_tool getstats $(hostname)
I_am = Master
quorum_has = 2
quorum_needs = 2
rank = 1
quorate = true
GenerationID = 1129656435624285
run time = 1789
pid = 3724
verbosity = Default
failover = enabled

[root@link-12 ~]# gulm_tool getstats $(hostname)
I_am = Slave
Master = link-11.lab.msp.redhat.com
rank = 2
quorate = true
GenerationID = 1129656435624285
run time = 3515
pid = 3327
verbosity = Default
failover = enabled


[root@link-10 ~]# gulm_tool nodelist $(hostname):core
 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Logged in
  last state = Logged out
  mode = Pending
  missed beats = 0
  last beat = 0
  delay avg = 0
  max delay = 0


[root@link-11 ~]# gulm_tool nodelist $(hostname):core
 Name: link-12
  ip    = ::ffff:10.15.84.162
  state = Logged in
  last state = Was Logged in
  mode = Slave
  missed beats = 0
  last beat = 1129739640837508
  delay avg = 10000840
  max delay = 18446744073692569048

 Name: link-10
  ip    = ::ffff:10.15.84.160
  state = Expired
  last state = Logged in
  mode = Master
  missed beats = 3
  last beat = 1129737954184427
  delay avg = 128555690
  max delay = 18446744073686482467

 Name: link-11
  ip    = ::ffff:10.15.84.161
  state = Logged in
  last state = Was Logged in
  mode = Master
  missed beats = 0
  last beat = 1129739639422556
  delay avg = 10001019
  max delay = 18446744073690534786


link-11 (new master):
Oct 19 11:05:10 link-11 ccsd[3717]: Cluster is quorate.  Allowing connections.
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <notice> Quorum Achieved
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> Magma Event: Membership Change
Oct 19 11:05:10 link-11 clurgmgrd[5758]: <info> State change: link-10 DOWN
Oct 19 11:05:11 link-11 clurgmgrd[5758]: <notice> Taking over service
10.15.84.156 from down member (null)
Oct 19 11:05:12 link-11 clurgmgrd[5758]: <notice> Service 10.15.84.156 started
Oct 19 11:05:24 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737924177196 mb:1)
Oct 19 11:05:39 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737939180812 mb:2)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: link-10 missed a heartbeat
(time:1129737954184427 mb:3)
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Client (link-10) expired
Oct 19 11:05:54 link-11 lock_gulmd_core[5957]: Gonna exec fence_node -O link-10
Oct 19 11:05:54 link-11 lock_gulmd_core[3724]: Forked [5957] fence_node -O
link-10 with a 0 pause.
Oct 19 11:06:03 link-11 fence_node[5957]: Fence of "link-10" was successful
Oct 19 11:08:28 link-11 lock_gulmd_core[3724]: "Magma::6089" is logged out. fd:11
Oct 19 11:08:31 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]: "Magma::6090" is logged out. fd:11
Oct 19 11:08:34 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.
Oct 19 11:08:40 link-11 last message repeated 2 times
Oct 19 11:08:41 link-11 lock_gulmd_core[3724]: "Magma::6101" is logged out. fd:11
Oct 19 11:08:43 link-11 lock_gulmd_core[3724]:  (link-10 ::ffff:10.15.84.160)
Cannot login if you are expired.




link-10 (old master):
Oct 19 11:05:10 link-10 ccsd[3718]: cluster.conf (cluster name = LINK_128,
version = 2) found.
Oct 19 11:05:10 link-10 ccsd[3718]: Remote copy of cluster.conf is from quorate
node.
Oct 19 11:05:10 link-10 ccsd[3718]:  Local version # : 2
Oct 19 11:05:10 link-10 ccsd[3718]:  Remote version #: 2
Oct 19 11:05:10 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_core.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: Starting lock_gulmd_core 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am running in Fail-over mode.
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:10 link-10 lock_gulmd_core[3725]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:1
fd:6)
Oct 19 11:05:11 link-10 hald[2538]: Timed out waiting for hotplug event 267.
Rebasing to 523
Oct 19 11:05:11 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LT.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: Starting lock_gulmd_LT 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am running in Fail-over mode.
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:11 link-10 lock_gulmd_LT[3729]: This is cluster LINK_128
Oct 19 11:05:11 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:12 link-10 lock_gulmd_core[3725]: EOF on xdr (Magma::3719 ::1 idx:2
fd:7)
Oct 19 11:05:12 link-10 lock_gulmd_main[3723]: Forked lock_gulmd_LTPX.
Oct 19 11:05:12 link-10 qarshd[3722]: That's enough
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: Starting lock_gulmd_LTPX 1.0.4.
(built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am running in Fail-over mode.
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: I am (link-10) with ip
(::ffff:10.15.84.160)
Oct 19 11:05:12 link-10 lock_gulmd_LTPX[3733]: This is cluster LINK_128
Oct 19 11:05:13 link-10 ccsd[3718]: Connected to cluster infrastruture via: GuLM
Plugin v1.0.1
Oct 19 11:05:13 link-10 ccsd[3718]: Initial status:: Inquorate
Oct 19 11:05:14 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change
Oct 19 11:05:16 link-10 qarshd[4706]: Talking to peer 10.15.80.3:34440
Oct 19 11:05:17 link-10 lock_gulmd_core[3725]: ERROR [src/core_io.c:1058] Got
error from reply:
(link-11.lab.msp.redhat.com ::ffff:10.15.84.161) 1008:Bad State Change


Version-Release number of selected component (if applicable):
Starting lock_gulmd_core 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004
Red Hat, Inc.  All rights reserved.
Comment 1 Corey Marthaler 2005-10-25 12:27:48 EDT
Just a note that this was seen while doing regression testing for the
2.6.9-22.0.1 kernel.
Comment 3 Chris Feist 2005-12-14 16:03:52 EST
It appears this bug is caused by the main lock_gulmd process not receiving the
SIGCHLD event, and therefore never running waitpid().  I've modified the code to
call waitpid whether or not a SIGCHLD signal has been sent.
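A minimal standalone sketch of that idea (not the actual lock_gulmd code; the
short-lived child below merely stands in for the forked "fence_node -O <node>"
process):

/*
 * Illustrative sketch only: reap finished children with a non-blocking
 * waitpid() on every pass of the main loop, whether or not a SIGCHLD was
 * ever delivered.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t fence_pid = fork();
    if (fence_pid < 0)
        return 1;
    if (fence_pid == 0) {
        sleep(1);           /* child: pretend to run fence_node */
        _exit(0);           /* exit status 0 == fence succeeded */
    }

    for (;;) {
        int status;
        /* WNOHANG polls for exited children instead of relying on a
         * SIGCHLD handler having set a flag; returns 0 while the child
         * is still running, and the pid once it has exited. */
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            /* this is where the master would record that the fence
             * completed and clear the node's expired state */
            printf("child %d exited, status %d\n", (int)pid,
                   WEXITSTATUS(status));
            break;
        }
        usleep(100000);     /* stand-in for the rest of the main loop */
    }
    return 0;
}

Polling with WNOHANG means a lost or coalesced SIGCHLD can no longer leave the
fence result unreaped, which matches the failure seen here: fence_node reported
success, but link-10 was never unexpired.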
Comment 4 Corey Marthaler 2005-12-14 16:12:45 EST
This fix ran through 85 iterations of revolver without problems.
Comment 5 Chris Feist 2006-01-17 14:32:59 EST
*** Bug 178081 has been marked as a duplicate of this bug. ***
Comment 8 Red Hat Bugzilla 2006-03-09 14:52:44 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0238.html
