Description of problem:
I hit this issue running revolver on a 5-node gulm cluster (all servers) after the master (tank-01) was shot. Tank-02 became the new master and fenced tank-01. In between the "Gonna exec" fence message and the "fence was successful" message were these gulm memory errors:

Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQIAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:25 tank-02 last message repeated 2 times
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQMAAAAAAAAAAA== Sent but not recved on Slave login. Marking.

The fencing was successful and tank-01 was then brought back up and became a slave, clvmd was started, and the VG was activated, but the mount attempt, along with other I/O attempts on other nodes, then hung.

[root@tank-02 tmp]# gulm_tool nodeinfo tank-02.lab.msp.redhat.com tank-01.lab.msp.redhat.com
 Name: tank-01.lab.msp.redhat.com
  ip = ::ffff:192.168.44.91
  state = Logged in
  last state = Logged out
  mode = Slave
  missed beats = 0
  last beat = 1108410169870761
  delay avg = 10000498
  max delay = 1944937822

[root@tank-02 tmp]# gulm_tool getstats tank-01
I_am = Slave
Master = tank-02.lab.msp.redhat.com
rank = 0
quorate = true
GenerationID = 1108402164594869
run time = 6113
pid = 3505
verbosity = Default
failover = enabled

The lock numbers were still dynamic, so some I/O activity was still continuing.

[root@tank-02 tmp]# gulm_tool getstats tank-02:lt000
I_am = Master
run time = 8054
pid = 5688
verbosity = Default
id = 0
partitions = 1
out_queue = 4
drpb_queue = 0
exclusive = 2561
shared = 1288
deferred = 0
lvbs = 20902
expired = 1796
lock ops = 7117444
conflicts = 23
incomming_queue = 0
conflict_queue = 7
reply_queue = 3
free_locks = 1
used_locks = 10730
free_lkrqs = 4
used_lkrqs = 16
free_holders = 1
used_holders = 26547
elss = 2712900
highwater = 1048576

[root@tank-02 tmp]# gulm_tool getstats tank-02:lt000
I_am = Master
run time = 8055
pid = 5688
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
exclusive = 2610
shared = 1285
deferred = 0
lvbs = 20902
expired = 1796
lock ops = 7118024
conflicts = 23
incomming_queue = 0
conflict_queue = 7
reply_queue = 4
free_locks = 0
used_locks = 10776
free_lkrqs = 7
used_lkrqs = 13
free_holders = 0
used_holders = 26593
elss = 2720820
highwater = 1048576

Version-Release number of selected component (if applicable):
[root@tank-05 ~]# gulm_tool -V
gulm_tool 1.0-pre18 (built Feb 14 2005 10:36:01)
Copyright (C) 2004 Red Hat, Inc. All rights reserved.

How reproducible:
Didn't try
Here is the full syslog from tank-02, who became the new master:

Feb 14 12:58:28 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403908700228 mb:1)
Feb 14 12:58:43 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403923700947 mb:2)
Feb 14 12:58:58 tank-02 lock_gulmd_core[5684]: Failed to receive a timely heartbeat reply from Master. (t:1108403938701667 mb:3)
Feb 14 12:59:13 tank-02 lock_gulmd_core[5684]: Timeout (15000000) on fd:9 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: I see no Masters, So I am Arbitrating until enough Slaves talk to me.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-04.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-02.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-05.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send quorum update to slave tank-03.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: LastMaster tank-01.lab.msp.redhat.com ::ffff:192.168.44.91, is being marked Expired.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-04.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-02.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-05.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Could not send membership update "Expired" about tank-01.lab.msp.redhat.com to slave tank-03.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[6365]: Gonna exec fence_node tank-01.lab.msp.redhat.com
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Forked [6365] fence_node tank-01.lab.msp.redhat.com with a 0 pause.
Feb 14 12:59:16 tank-02 lock_gulmd_LTPX[5692]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:16 tank-02 ccsd[5638]: Cluster is not quorate. Refusing connection.
Feb 14 12:59:16 tank-02 ccsd[5638]: Error while processing connect: Connection refused
Feb 14 12:59:16 tank-02 fence_node[6365]: cannot connect to ccs -111
Feb 14 12:59:16 tank-02 fence_node[6365]: Fence of "tank-01.lab.msp.redhat.com" was unsuccessful
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Fence failed. [6365] Exit code:1 Running it again.
Feb 14 12:59:16 tank-02 lock_gulmd_core[5684]: Forked [6366] fence_node tank-01.lab.msp.redhat.com with a 5 pause.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Still in Arbitrating: Have 2, need 3 for quorum.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: New Client: idx:7 fd:12 from tank-04.lab.msp.redhat.com ::ffff:192.168.44.94
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-04.lab.msp.redhat.com to tank-05.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-04.lab.msp.redhat.com to tank-03.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Now have Slave quorum, going full Master.
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: New Client: idx:8 fd:13 from tank-05.lab.msp.redhat.com ::ffff:192.168.44.95
Feb 14 12:59:20 tank-02 lock_gulmd_core[5684]: Member update message Logged in about tank-05.lab.msp.redhat.com to tank-03.lab.msp.redhat.com is lost because node is in OM
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0x8)
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 3 fd 8 from tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 4 fd 9 from tank-05.lab.msp.redhat.com ::ffff:192.168.44.95
Feb 14 12:59:21 tank-02 lock_gulmd_LTPX[5692]: Logged into LT000 at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:59:21 tank-02 lock_gulmd_LTPX[5692]: Finished resending to LT000
Feb 14 12:59:21 tank-02 ccsd[5638]: Cluster is quorate. Allowing connections.
Feb 14 12:59:21 tank-02 lock_gulmd_core[5684]: New Client: idx:9 fd:14 from tank-03.lab.msp.redhat.com ::ffff:192.168.44.93
Feb 14 12:59:21 tank-02 lock_gulmd_core[6366]: Gonna exec fence_node tank-01.lab.msp.redhat.com
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-05.lab.msp.redhat.com ::ffff:192.168.44.95 idx:5 fd:10 (soff:2 connected:0xc)
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 6 fd 11 from tank-04.lab.msp.redhat.com ::ffff:192.168.44.94
Feb 14 12:59:21 tank-02 lock_gulmd_LT000[5688]: New Client: idx 7 fd 12 from tank-03.lab.msp.redhat.com ::ffff:192.168.44.93
Feb 14 12:59:22 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-03.lab.msp.redhat.com ::ffff:192.168.44.93 idx:8 fd:13 (soff:1 connected:0xe)
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Trying to acquire journal lock...
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Looking at journal...
Feb 14 12:59:22 tank-02 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=0: Acquiring the transaction lock...
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQIAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:1790] Bad code!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: ERROR [src/lock_io.c:2357] Out Of Memory!
Feb 14 12:59:23 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:25 tank-02 last message repeated 2 times
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: R0ZTIEoEZ2ZzMAAJTQMAAAAAAAAAAA== Sent but not recved on Slave login. Marking.
Feb 14 12:59:25 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-04.lab.msp.redhat.com ::ffff:192.168.44.94 idx:2 fd:7 (soff:3 connected:0xe)
Feb 14 12:59:33 tank-02 lock_gulmd_core[5684]: Timeout (15000000) on fd:9 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:59:34 tank-02 fence_node[6366]: Fence of "tank-01.lab.msp.redhat.com" was successful
Feb 14 12:59:34 tank-02 lock_gulmd_core[5684]: found match on pid 6366, marking node tank-01.lab.msp.redhat.com as logged out.
Feb 14 12:59:34 tank-02 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 13:01:01 tank-02 crond(pam_unix)[6368]: session opened for user root by (uid=0)
Feb 14 13:01:01 tank-02 crond(pam_unix)[6368]: session closed for user root
Feb 14 13:01:49 tank-02 lock_gulmd_core[5684]: New Client: idx:4 fd:9 from tank-01.lab.msp.redhat.com ::ffff:192.168.44.91
Feb 14 13:01:50 tank-02 lock_gulmd_LT000[5688]: Attached slave tank-01.lab.msp.redhat.com ::ffff:192.168.44.91 idx:9 fd:14 (soff:0 connected:0xf)
Feb 14 13:01:50 tank-02 lock_gulmd_LT000[5688]: New Client: idx 10 fd 15 from tank-01.lab.msp.redhat.com ::ffff:192.168.44.91
I cleaned up the "Bad code!" message so it actually prints the code that it pulled off the socket. Thinking about this, if the slave's LT broke its socket while the master LT was trying to read a packet, this error could be produced, so the logs from tank-04 would be nice. Also note that the slave LT on tank-04 attaches itself five times: once before the ERRORs, and four times after ("last message repeated 2 times"). This could possibly be a messed-up slave.
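For illustration only, here is a minimal sketch of the kind of logging change described above: instead of a bare "Bad code!" error, report the opcode that was actually read from the socket. All names and opcode values here are hypothetical, not the real lock_io.c symbols.

```c
#include <stdint.h>
#include <syslog.h>

/* Hypothetical opcodes for illustration; the real gulm protocol codes differ. */
#define GULM_OP_LOGIN_REQ   0x01
#define GULM_OP_LOCK_REQ    0x02
#define GULM_OP_LOCK_REPLY  0x03

static int handle_packet(uint32_t code)
{
    switch (code) {
    case GULM_OP_LOGIN_REQ:
    case GULM_OP_LOCK_REQ:
    case GULM_OP_LOCK_REPLY:
        /* ... dispatch to the normal handlers ... */
        return 0;
    default:
        /* Log the offending value instead of just "Bad code!" so a garbled
         * or mis-routed packet can actually be identified from syslog. */
        syslog(LOG_ERR, "ERROR Bad code! got 0x%x off the socket", code);
        return -1;
    }
}
```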
Here are the logs from tank-04 (note: the clock on tank-04 is 11 minutes slow compared to tank-02).

Feb 14 12:47:47 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403267347502 mb:1)
Feb 14 12:48:02 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403282348222 mb:2)
Feb 14 12:48:17 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403297348942 mb:3)
Feb 14 12:48:32 tank-04 lock_gulmd_core[3475]: Timeout (15000000) on fd:6 (tank-01.lab.msp.redhat.com ::ffff:192.168.44.91)
Feb 14 12:48:37 tank-04 lock_gulmd_core[3475]: Found Master at tank-02.lab.msp.redhat.com, so I'm a Slave.
Feb 14 12:48:37 tank-04 lock_gulmd_core[3475]: Failed to receive a timely heartbeat reply from Master. (t:1108403317369795 mb:1)
Feb 14 12:48:37 tank-04 lock_gulmd_LTPX[3483]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:37 tank-04 lock_gulmd_LTPX[3483]: Errors trying to login to LT000: (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92) 1006:Not Allowed
Feb 14 12:48:37 tank-04 lock_gulmd_LT000[3479]: New Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:37 tank-04 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 12:48:38 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:38 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:48:38 tank-04 lock_gulmd_LTPX[3483]: Logged into LT000 at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:38 tank-04 lock_gulmd_LTPX[3483]: Finished resending to LT000
Feb 14 12:48:38 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:48:39 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:39 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:40 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:41 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:41 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Error on xdr (tank-02.lab.msp.redhat.com ::ffff:192.168.44.92): -42:0:Success
Feb 14 12:48:42 tank-04 lock_gulmd_LT000[3479]: Logged into Master at tank-02.lab.msp.redhat.com ::ffff:192.168.44.92
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:42 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:48:51 tank-04 kernel: lock_gulm: Checking for journals for node "tank-01.lab.msp.redhat.com"
Feb 14 12:48:51 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:48:51 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
Feb 14 12:51:06 tank-04 ccsd[3451]: Cluster is quorate. Allowing connections.
Feb 14 12:51:07 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Trying to acquire journal lock...
Feb 14 12:51:07 tank-04 kernel: GFS: fsid=tank-cluster:gfs0.3: jid=0: Busy
This was reproduced again, after another master ended up shot by revolver.
Damn weird. Both sides are claiming that the other is sending garbage. Looking into this deeper; will be adding something that may shed some light.
I think I found something that might be the cause. It is because the lock servers are also mounters. What I think is happening is that the sockets going out to the slave LTs and the LTPXs are somehow getting mixed up, so I added some extra code to check for this (see the sketch below). If that is the problem, this should fix it. So, MODIFIED.
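As a rough illustration of the kind of guard being described (not the actual gulm patch; all names here are made up), one could tag each connection with the role it logged in as and refuse to treat a client/LTPX socket as a slave-LT socket:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical connection roles; the real gulm structures differ. */
enum conn_role {
    ROLE_UNKNOWN = 0,
    ROLE_SLAVE_LT,   /* another lock server's LT, receiving replicated state */
    ROLE_CLIENT      /* an LTPX / mounter making lock requests */
};

struct connection {
    int fd;
    enum conn_role role;   /* set once, at login time */
};

static int send_to_slave(struct connection *c, const void *buf, size_t len)
{
    if (c->role != ROLE_SLAVE_LT) {
        /* Guard against exactly the suspected mix-up: a client socket being
         * used where a slave-LT socket was expected. */
        fprintf(stderr, "refusing slave update on fd %d: role %d is not a slave LT\n",
                c->fd, c->role);
        return -1;
    }
    /* ... write the slave update onto c->fd ... */
    (void)buf; (void)len;
    return 0;
}
```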
*** Bug 149119 has been marked as a duplicate of this bug. ***
It is the problem, but that fix didn't catch it. So, back to ASSIGNED.
Added another check to avoid mixing up clients and slaves.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-466.html