Bug 149119

Summary: recovery causes stuck lock_gulm while mounting
Product:      [Retired] Red Hat Cluster Suite
Component:    gulm
Version:      4
Hardware:     i686
OS:           Linux
Status:       CLOSED DUPLICATE
Severity:     medium
Priority:     medium
Reporter:     Corey Marthaler <cmarthal>
Assignee:     michael conrad tadpol tilstra <mtilstra>
QA Contact:   Cluster QE <mspqa-list>
CC:           cluster-maint
Doc Type:     Bug Fix
Last Closed:  2005-02-21 16:31:29 UTC

Description Corey Marthaler 2005-02-18 22:14:47 UTC
Description of problem:
This bug has the exact same feel as bz 143269, only for gulm
instead of dlm.

I had the following cluster up when revolver shot morph-01 and
morph-03. There was NO I/O running on any of the filesystems:
        Gulm Status
        ===========
        morph-01: Master
        morph-02: Slave
        morph-03: Slave
        morph-04: Slave
        morph-05: Slave

The two shot nodes came back up into the cluster and the first three
filesystems mounted just fine on morph-01 but the fourth mount attempt
hung.
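
When a mount wedges like this, the most useful first capture is where
the mount process is blocked in the kernel. A minimal sketch using
standard tools (nothing gulm-specific here):

# show the kernel wait channel of the stuck mount(8) process
ps -eo pid,stat,wchan:30,args | grep '[m]ount'

# dump every task's kernel stack to syslog for the full picture
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger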

All the lock numbers appear to be static each time I run getstats:
[root@morph-01 ~]# gulm_tool getstats morph-02:lt000
I_am = Master
run time = 3281
pid = 5979
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
exclusive = 18
shared = 212
deferred = 0
lvbs = 96
expired = 2
lock ops = 2671
conflicts = 41
incomming_queue = 0
conflict_queue = 3
reply_queue = 0
free_locks = 9
used_locks = 94
free_lkrqs = 9
used_lkrqs = 3
free_holders = 9
used_holders = 328
elss = 31352
highwater = 1048576
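
To make the "counters are static" check mechanical, two snapshots of
getstats can be diffed, filtering out the run-time counter since it
always advances (a rough sketch against the same lock table as above):

gulm_tool getstats morph-02:lt000 | grep -v 'run time' > /tmp/lt000.1
sleep 30
gulm_tool getstats morph-02:lt000 | grep -v 'run time' > /tmp/lt000.2
# identical output means no lock traffic at all in the last 30 seconds
diff /tmp/lt000.1 /tmp/lt000.2 && echo "lock counters are static"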

[root@morph-01 ~]# gulm_tool getstats morph-01
I_am = Slave
Master = morph-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1108757687513061
run time = 2279
pid = 5672
verbosity = Default
failover = enabled


[root@morph-01 ~]# gulm_tool nodelist morph-01
 Name: morph-03.lab.msp.redhat.com
  ip    = ::ffff:192.168.44.63
  state = Logged in
  last state = Logged out
  mode = Slave
  missed beats = 3
  last beat = 1108758754760570
  delay avg = 1020539471
  max delay = 1020539471

 Name: morph-05.lab.msp.redhat.com
  ip    = ::ffff:192.168.44.65
  state = Logged in
  last state = Was Logged in
  mode = Slave
  missed beats = 0
  last beat = 1108759010921930
  delay avg = 10000403
  max delay = 10001430

 Name: morph-04.lab.msp.redhat.com
  ip    = ::ffff:192.168.44.64
  state = Logged in
  last state = Was Logged in
  mode = Slave
  missed beats = 0
  last beat = 1108759005731193
  delay avg = 10001046
  max delay = 10948610

 Name: morph-01.lab.msp.redhat.com
  ip    = ::ffff:192.168.44.61
  state = Logged in
  last state = Logged out
  mode = Slave
  missed beats = 0
  last beat = 1108759014201613
  delay avg = 1326688380
  max delay = 1326688380

 Name: morph-02.lab.msp.redhat.com
  ip    = ::ffff:192.168.44.62
  state = Logged in
  last state = Was Logged in
  mode = Master
  missed beats = 0
  last beat = 1108759009776803
  delay avg = 10000834
  max delay = 1018471792
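
Worth noting in that nodelist: morph-03 is the only node with a
non-zero "missed beats" count. A one-liner to flag such nodes from the
output format shown above (a sketch, not a gulm_tool feature):

gulm_tool nodelist morph-01 | awk '
    /^ Name:/      { name = $2 }
    /missed beats/ { if ($4 > 0) print name, "has missed", $4, "beats" }'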



Version-Release number of selected component (if applicable):
Gulm 2.6.9-23.1 (built Feb 17 2005 11:50:58) installed

How reproducible:
Didn't try

Comment 1 Corey Marthaler 2005-02-18 22:17:58 UTC
Here's the syslog from morph-02, the node that became the new master:

Feb 18 15:30:59 morph-02 lock_gulmd_core[5975]: Failed to receive a timely heartbeat reply from Master. (t:1108758659731017 mb:1)
Feb 18 15:31:14 morph-02 lock_gulmd_core[5975]: Failed to receive a timely heartbeat reply from Master. (t:1108758674731737 mb:2)
Feb 18 15:31:29 morph-02 lock_gulmd_core[5975]: Failed to receive a timely heartbeat reply from Master. (t:1108758689732455 mb:3)
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: I see no Masters, So I
am Arbitrating until enough Slaves talk to me.
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send quorum
update to slave morph-03.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send quorum
update to slave morph-05.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send quorum
update to slave morph-04.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send quorum
update to slave morph-02.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: LastMaster
morph-01.lab.msp.redhat.com ::ffff:192.168.44.61, is being marked Expired.
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send membership update "Expired" about morph-01.lab.msp.redhat.com to slave morph-03.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send membership update "Expired" about morph-01.lab.msp.redhat.com to slave morph-05.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send membership update "Expired" about morph-01.lab.msp.redhat.com to slave morph-04.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Could not send membership update "Expired" about morph-01.lab.msp.redhat.com to slave morph-02.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[7372]: Gonna exec fence_node
-O morph-01.lab.msp.redhat.com
Feb 18 15:31:49 morph-02 lock_gulmd_core[5975]: Forked [7372]
fence_node -O morph-01.lab.msp.redhat.com with a  0 pause.
Feb 18 15:31:49 morph-02 lock_gulmd_LTPX[5983]: New Master at
morph-02.lab.msp.redhat.com ::ffff:192.168.44.62

Feb 18 15:32:03 morph-02 fence_node[7372]: Fence of
"morph-01.lab.msp.redhat.com" was successful
Feb 18 15:32:03 morph-02 lock_gulmd_core[5975]: found match on pid 7372, marking node morph-01.lab.msp.redhat.com as logged out.
Feb 18 15:32:03 morph-02 lock_gulmd_core[5975]: Member update message Fenced about morph-01.lab.msp.redhat.com to morph-03.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:03 morph-02 lock_gulmd_core[5975]: Member update message Fenced about morph-01.lab.msp.redhat.com to morph-05.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:03 morph-02 lock_gulmd_core[5975]: Member update message Fenced about morph-01.lab.msp.redhat.com to morph-04.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:03 morph-02 kernel: lock_gulm: Checking for journals for
node "morph-01.lab.msp.redhat.com"
Feb 18 15:32:04 morph-02 lock_gulmd_core[5975]: morph-03.lab.msp.redhat.com missed a heartbeat (time:1108758724751132 mb:1)
Feb 18 15:32:04 morph-02 lock_gulmd_core[5975]: morph-05.lab.msp.redhat.com missed a heartbeat (time:1108758724751132 mb:1)
Feb 18 15:32:04 morph-02 lock_gulmd_core[5975]: morph-04.lab.msp.redhat.com missed a heartbeat (time:1108758724751132 mb:1)
Feb 18 15:32:06 morph-02 lock_gulmd_core[5975]: Timeout (15000000) on fd:9 (morph-01.lab.msp.redhat.com ::ffff:192.168.44.61)
Feb 18 15:32:15 morph-02 lock_gulmd_core[5975]: Still in Arbitrating:
Have 2, need 3 for quorum.
Feb 18 15:32:15 morph-02 lock_gulmd_core[5975]: New Client: idx:7
fd:12 from morph-04.lab.msp.redhat.com ::ffff:192.168.44.64
Feb 18 15:32:15 morph-02 lock_gulmd_core[5975]: Member update message
Logged in about morph-04.lab.msp.redhat.com to
morph-03.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:15 morph-02 lock_gulmd_core[5975]: Member update message
Logged in about morph-04.lab.msp.redhat.com to
morph-05.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:15 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 idx:2 fd:7 (soff:3
connected:0x8)
Feb 18 15:32:19 morph-02 lock_gulmd_core[5975]:
morph-03.lab.msp.redhat.com missed a heartbeat (time:1108758739755874
mb:2)
Feb 18 15:32:19 morph-02 lock_gulmd_core[5975]:
morph-05.lab.msp.redhat.com missed a heartbeat (time:1108758739755874
mb:2)
Feb 18 15:32:20 morph-02 lock_gulmd_core[5975]: Now have Slave quorum,
going full Master.
Feb 18 15:32:20 morph-02 lock_gulmd_core[5975]: New Client: idx:4 fd:9
from morph-05.lab.msp.redhat.com ::ffff:192.168.44.65
Feb 18 15:32:20 morph-02 lock_gulmd_core[5975]: Member update message
Logged in about morph-05.lab.msp.redhat.com to
morph-03.lab.msp.redhat.com is lost because node is in OM
Feb 18 15:32:20 morph-02 lock_gulmd_LT000[5979]: New Client: idx 3 fd
8 from morph-02.lab.msp.redhat.com ::ffff:192.168.44.62
Feb 18 15:32:20 morph-02 lock_gulmd_LTPX[5983]: Logged into LT000 at
morph-02.lab.msp.redhat.com ::ffff:192.168.44.62
Feb 18 15:32:20 morph-02 lock_gulmd_LTPX[5983]: Finished resending to
LT000
Feb 18 15:32:20 morph-02 lock_gulmd_LT000[5979]: New Client: idx 4 fd
9 from morph-05.lab.msp.redhat.com ::ffff:192.168.44.65
Feb 18 15:32:20 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-05.lab.msp.redhat.com ::ffff:192.168.44.65 idx:5 fd:10 (soff:2
connected:0xc)
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 ccsd[5929]: Cluster is quorate.  Allowing
connections.
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Looking at journal...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=0: Busy
Feb 18 15:32:21 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Acquiring the transaction lock...
Feb 18 15:32:21 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 idx:2 fd:7 (soff:3
connected:0xc)
Feb 18 15:32:21 morph-02 lock_gulmd_LT000[5979]:
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 Gave us a
1733051136:gulm_lock_cb_state 1006:Not Allowed, closing the connection.
Feb 18 15:32:21 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 idx:2 fd:7 (soff:3
connected:0xc)
Feb 18 15:32:21 morph-02 last message repeated 3 times
Feb 18 15:32:21 morph-02 lock_gulmd_LT000[5979]: New Client: idx 6 fd
11 from morph-04.lab.msp.redhat.com ::ffff:192.168.44.64
Feb 18 15:32:34 morph-02 lock_gulmd_core[5975]:
morph-03.lab.msp.redhat.com missed a heartbeat (time:1108758754760570
mb:3)
Feb 18 15:32:34 morph-02 lock_gulmd_core[5975]: Client
(morph-03.lab.msp.redhat.com) expired
Feb 18 15:32:34 morph-02 lock_gulmd_core[7374]: Gonna exec fence_node
-O morph-03.lab.msp.redhat.com
Feb 18 15:32:34 morph-02 lock_gulmd_core[5975]: Forked [7374]
fence_node -O morph-03.lab.msp.redhat.com with a 0 pause.
Feb 18 15:32:34 morph-02 lock_gulmd_LT000[5979]:
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 Gave us a
1733051136:gulm_lock_cb_state 1006:Not Allowed, closing the connection.
Feb 18 15:32:34 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 idx:2 fd:7 (soff:3
connected:0xc)
Feb 18 15:32:56 morph-02 fence_node[7374]: Fence of
"morph-03.lab.msp.redhat.com" was successful
Feb 18 15:32:56 morph-02 lock_gulmd_core[5975]: found match on pid
7374, marking node morph-03.lab.msp.redhat.com as logged out.
Feb 18 15:32:56 morph-02 kernel: lock_gulm: Checking for journals for
node "morph-03.lab.msp.redhat.com"
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=2: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=2: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=2: Looking at journal...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=2: Looking at journal...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs3.1:
jid=2: Acquiring the transaction lock...
Feb 18 15:32:56 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 idx:2 fd:7 (soff:3
connected:0xc)
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=2: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=2: Looking at journal...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Replaying journal...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Replayed 0 of 1 blocks
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: replays = 0, skips = 1, sames = 0
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Journal replayed in 36s
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs2.1:
jid=2: Done
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Done
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=0: Busy
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=2: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs1.1:
jid=2: Done
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=2: Busy
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=2: Trying to acquire journal lock...
Feb 18 15:32:56 morph-02 kernel: GFS: fsid=morph-cluster:gfs0.1:
jid=2: Busy
Feb 18 15:33:41 morph-02 smartd[2040]: Device: /dev/hda, 1 Currently
unreadable (pending) sectors
Feb 18 15:33:41 morph-02 smartd[2040]: Device: /dev/hda, 1 Offline
uncorrectable sectors
Feb 18 15:36:54 morph-02 lock_gulmd_core[5975]: New Client: idx:8
fd:13 from morph-01.lab.msp.redhat.com ::ffff:192.168.44.61
Feb 18 15:36:54 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-01.lab.msp.redhat.com ::ffff:192.168.44.61 idx:7 fd:12 (soff:1
connected:0xe)
Feb 18 15:36:55 morph-02 lock_gulmd_LT000[5979]: New Client: idx 8 fd
13 from morph-01.lab.msp.redhat.com ::ffff:192.168.44.61
Feb 18 15:36:56 morph-02 lock_gulmd_core[5975]: New Client: idx:9
fd:14 from morph-03.lab.msp.redhat.com ::ffff:192.168.44.63
Feb 18 15:36:56 morph-02 lock_gulmd_LT000[5979]: Attached slave
morph-03.lab.msp.redhat.com ::ffff:192.168.44.63 idx:9 fd:14 (soff:0
connected:0xf)
Feb 18 15:36:57 morph-02 lock_gulmd_LT000[5979]: New Client: idx 10 fd
15 from morph-03.lab.msp.redhat.com ::ffff:192.168.44.63


Comment 2 michael conrad tadpol tilstra 2005-02-18 22:27:46 UTC
Just to note, pjc saw something like this as well.

Comment 3 michael conrad tadpol tilstra 2005-02-21 16:12:26 UTC
.......
Feb 18 15:32:21 morph-02 lock_gulmd_LT000[5979]:
morph-04.lab.msp.redhat.com ::ffff:192.168.44.64 Gave us a
1733051136:gulm_lock_cb_state 1006:Not Allowed, closing the connection.
.......

Mixing up slave and client sockets. <sarcasm>wonderful.</sarcasm>
Looks a lot like bug #148029.
I checked stuff in on the 16th that should have fixed this.  I see the code you're
using was built on the 17th.  Can you confirm whether that commit was in it?
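
For reference, the signature of the mix-up is easy to pull back out of
syslog: the repeating "Attached slave" / "Not Allowed, closing the
connection" pair from the lock table daemon (assuming the default
/var/log/messages location):

grep -E 'lock_gulmd_LT000.*(Attached slave|Not Allowed)' /var/log/messages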

Comment 4 michael conrad tadpol tilstra 2005-02-21 16:24:22 UTC
*sigh*  It is there.  It's the same patch that added the "Gave us..." message.
cruptity.

Comment 5 michael conrad tadpol tilstra 2005-02-21 16:31:29 UTC
This is bug #148029, marking as such.

*** This bug has been marked as a duplicate of 148029 ***