Description of problem:
I hit this issue running revolver on a 5-node gulm cluster (all servers) after the master (tank-01) was shot. Tank-02 became the new master and fenced tank-01. In between the "Gonna exec" fence message and the "fence was successful" message were these gulm memory errors:

Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQIAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:25 tank-02 last message repeated 2 times
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQMAAAAAAAAAAA== Sent but not recved on Slave login. Marking.

The fencing was successful and tank-01 was then brought back up and became a slave, clvmd was started, and the VG was activated, but the mount attempt, along with other I/O attempts on other nodes, then hung.

[root@tank-02 tmp]# gulm_tool nodeinfo tank-02.lab.msp.redhat.com tank-01.lab.msp.redhat.com
 Name: tank-01.lab.msp.redhat.com
  ip = ::ffff:192.168.44.91
  state = Logged in
  last state = Logged out
  mode = Slave
  missed beats = 0
  last beat = 1108410169870761
  delay avg = 10000498
  max delay = 1944937822

[root@tank-02 tmp]# gulm_tool getstats tank-01
I_am = Slave
Master = tank-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1108402164594869
run time = 6113
pid = 3505
verbosity = Default
failover = enabled

The lock numbers were still dynamic, so some I/O activity was still continuing.

[root@tank-02 tmp]# gulm_tool getstats tank-02:lt000
I_am = Master
run time = 8054
pid = 5688
verbosity = Default
id = 0
partitions = 1
out_queue = 4
drpb_queue = 0
exclusive = 2561
shared = 1288
deferred = 0
lvbs = 20902
expired = 1796
lock ops = 7117444
conflicts = 23
incomming_queue = 0
conflict_queue = 7
reply_queue = 3
free_locks = 1
used_locks = 10730
free_lkrqs = 4
used_lkrqs = 16
free_holders = 1
used_holders = 26547
elss = 2712900
highwater = 1048576

[root@tank-02 tmp]# gulm_tool getstats tank-02:lt000
I_am = Master
run time = 8055
pid = 5688
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
exclusive = 2610
shared = 1285
deferred = 0
lvbs = 20902
expired = 1796
lock ops = 7118024
conflicts = 23
incomming_queue = 0
conflict_queue = 7
reply_queue = 4
free_locks = 0
used_locks = 10776
free_lkrqs = 7
used_lkrqs = 13
free_holders = 0
used_holders = 26593
elss = 2720820
highwater = 1048576

Version-Release number of selected component (if applicable):
[root@tank-05 ~]# gulm_tool -V
gulm_tool 1.0-pre18 (built Feb 14 2005 10:36:01)
Copyright (C) 2004 Red Hat, Inc. All rights reserved.

How reproducible:
Didn't try
Here is the full syslog from tank-02, who became the new master:

Feb 14 12:58:28 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403908700228 mb:1)
Feb 14 12:58:43 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403923700947 mb:2)
Feb 14 12:58:58 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403938701667 mb:3)
Feb 14 12:59:13 tank-02 lock_gulmd_core[5684]: Timeout (15000000) on fd:9 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: I see no Masters, So I am Arbitrating until enough Slaves talk to me.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-04.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-02.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-05.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-03.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: LastMaster tank-01.lab.msp.redhat.com ::ffff:192.168.44.91, is being marked Expired.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-04.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-02.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-05.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-03.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[6365]: Gonna exec fence_node tank-01.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Forked [6365] fence_node tank-01.lab.msp.redhat.com with a 0 pause.
Feb 14 12:59:16 tank-02 lock_gulmd_LTPX[5692]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:16 tank-02 ccsd[5638]: Cluster is not quorate. Refusing connection.
Feb 14 12:59:16 tank-02 ccsd[5638]: Error while processing connect: Connection refused
Feb 14 12:59:16 tank-02 fence_node[6365]: cannot connect to ccs -111
Feb 14 12:59:16 tank-02 fence_node[6365]: Fence of "tank-01.lab.msp.redhat.com" was unsuccessful
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Fence failed. [6365] Exit code:1 Running it again.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Forked [6366] fence_node tank-01.lab.msp.redhat.com with a 5 pause.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Still in Arbitrating: Have 2, need 3 for quorum.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: New Client: idx:7 fd:12 from tank-04.lab.msp.redhat.com ::ffff:192.168.44.94
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-04.lab.msp.redhat.com to tank-05.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-04.lab.msp.redhat.com to tank-03.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Now have Slave quorum, going full Master.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: New Client: idx:8 fd:13 from tank-05.lab.msp.redhat.com ::ffff:192.168.44.95
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-05.lab.msp.redhat.com to tank-03.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0x8)
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 3 fd 8 from tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 4 fd 9 from tank-05.lab.msp.redhat.com ::ffff:192.168.44.95
Feb 14 12:59:21 tank-02 lock_gulmd_LTPX[5692]: Logged into LT000 at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:21 tank-02 lock_gulmd_LTPX[5692]: Finished resending to LT000
Feb 14 12:59:21 tank-02 ccsd[5638]: Cluster is quorate. Allowing connections.
Feb 14 12:59:21 tank-02 lock_gulmd_core[5684]: New Client: idx:9 fd:14 from tank-03.lab.msp.redhat.com ::ffff:192.168.44.93
Feb 14 12:59:21 tank-02 lock_gulmd_core[6366]: Gonna exec fence_node tank-01.lab.msp.redhat.com
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-05.lab.msp.redhat.com ::ffff:192.168.44.95 idx:5 fd:10 (soff:2 connected:0xc)
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 6 fd 11 from tank-04.lab.msp.redhat.com ::ffff:192.168.44.94
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 7 fd 12 from tank-03.lab.msp.redhat.com ::ffff:192.168.44.93
Feb 14 12:59:22 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-03.lab.msp.redhat.com ::ffff:192.168.44.93 idx:8 fd:13 (soff:1 connected:0xe)
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Trying to acquire journal lock...
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Looking at journal...
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Acquiring the transaction lock...
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQIAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:25 tank-02 last message repeated 2 times
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQMAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:33 tank-02 lock_gulmd_core[5684]: Timeout (15000000) on fd:9 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:59:34 tank-02 fence_node[6366]: Fence of "tank-01.lab.msp.redhat.com" was successful
Feb 14 12:59:34 tank-02 lock_gulmd_core[5684]: found match on pid 6366, marking node tank-01.lab.msp.redhat.com as logged out.
Feb 14 12:59:34 tank-02 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 13:01:01 tank-02 crond(pam_unix)[6368]: session opened for user root by (uid=0)
Feb 14 13:01:01 tank-02 crond(pam_unix)[6368]: session closed for user root
Feb 14 13:01:49 tank-02 lock_gulmd_core[5684]: New Client: idx:4 fd:9 from tank-01.lab.msp.redhat.com ::ffff:192.168.44.91
Feb 14 13:01:50 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-01.lab.msp.redhat.com ::ffff:192.168.44.91 idx:9 fd:14 (soff:0 connected:0xf)
Feb 14 13:01:50 tank-02 lock_gulmd_LT000[5688]: New Client: idx 10 fd 15 from tank-01.lab.msp.redhat.com ::ffff:192.168.44.91
I cleaned up the "Bad code!" message so it actually prints the code that it pulled off the socket. Thinking about this, if the slave's LT broke its socket while the master LT was trying to read a packet, this error could be produced, so the logs from tank-04 would be nice. Also note that the slave LT on tank-04 attaches itself five times: once before the ERRORs, and four times after ("last message repeated 2 times"). This could possibly be a messed-up slave.
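For illustration only, here is a minimal sketch of the kind of logging change described above: instead of a bare "Bad code!" error, report the opcode that was actually read from the socket. All names and opcode values here are hypothetical, not the real lock_io.c symbols.

```c
#include <stdint.h>
#include <syslog.h>

/* Hypothetical opcodes for illustration; the real gulm protocol codes differ. */
#define GULM_OP_LOGIN_REQ   0x01
#define GULM_OP_LOCK_REQ    0x02
#define GULM_OP_LOCK_REPLY  0x03

static int handle_packet(uint32_t code)
{
    switch (code) {
    case GULM_OP_LOGIN_REQ:
    case GULM_OP_LOCK_REQ:
    case GULM_OP_LOCK_REPLY:
        /* ... dispatch to the normal handlers ... */
        return 0;
    default:
        /* Log the offending value instead of just "Bad code!" so a garbled
         * or mis-routed packet can actually be identified from syslog. */
        syslog(LOG_ERR, "ERROR Bad code! got 0x%x off the socket", code);
        return -1;
    }
}
```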
Here are the logs from tank-04 (note: the clock on tank-04 is 11 minutes slow compared to tank-02).

Feb 14 12:47:47 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403267347502 mb:1)
Feb 14 12:48:02 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403282348222 mb:2)
Feb 14 12:48:17 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403297348942 mb:3)
Feb 14 12:48:32 tank-04 lock_gulmd_core[3475]: Timeout (15000000) on fd:6 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:48:37 tank-04 lock_gulmd_core[3475]: Found Master at tank-02.lab.msp.redhat.com, so I'm a Slave.
Feb 14 12:48:37 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403317369795 mb:1)
Feb 14 12:48:37 tank-04 lock_gulmd_LTPX[3483]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:37 tank-04 lock_gulmd_LTPX[3483]: Errors trying to login to LT000: (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92) 1006:Not Allowed
Feb 14 12:48:37 tank-04 lock_gulmd_LT000[3479]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:37 tank-04 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 12:48:38 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:38 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:48:38 tank-04 lock_gulmd_LTPX[3483]: Logged into LT000 at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:38 tank-04 lock_gulmd_LTPX[3483]: Finished resending to LT000
Feb 14 12:48:38 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:48:39 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:39 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:41 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:41 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:51 tank-04 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 12:48:51 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:51 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:51:06 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:51:07 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:51:07 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
This was reproduced again, after another master ended up shot by revolver.
Damn weird. Both sides are claiming that the other is sending garbage. Looking into this deeper; will be adding something that may shed some light.
I think I found something that might be the cause. It is because the lock servers are also mounters. What I think is happening is that the sockets going out to the slave LTs and the LTPXs are somehow getting mixed up, so I added some extra code to check for this (see the sketch below). If that is the problem, this should fix it. So, MODIFIED.
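As a rough illustration of the kind of guard being described (not the actual gulm patch; all names here are made up), one could tag each connection with the role it logged in as and refuse to treat a client/LTPX socket as a slave-LT socket:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical connection roles; the real gulm structures differ. */
enum conn_role {
    ROLE_UNKNOWN = 0,
    ROLE_SLAVE_LT,   /* another lock server's LT, receiving replicated state */
    ROLE_CLIENT      /* an LTPX / mounter making lock requests */
};

struct connection {
    int fd;
    enum conn_role role;   /* set once, at login time */
};

static int send_to_slave(struct connection *c, const void *buf, size_t len)
{
    if (c->role != ROLE_SLAVE_LT) {
        /* Guard against exactly the suspected mix-up: a client socket being
         * used where a slave-LT socket was expected. */
        fprintf(stderr, "refusing slave update on fd %d: role %d is not a slave LT\n",
                c->fd, c->role);
        return -1;
    }
    /* ... write the slave update onto c->fd ... */
    (void)buf; (void)len;
    return 0;
}
```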
*** Bug 149119 has been marked as a duplicate of this bug. ***
It is the problem, but that fix didn't catch it. So, back to ASSIGNED.
Added another check to avoid mixing up clients and slaves.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-466.html