Bug 387081 - node told to leave cluster due to inconsistent view ends up panicking
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman-kernel
Version: 4
Platform: All Linux
Priority: urgent  Severity: high
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2007-11-16 11:00 EST by Corey Marthaler
Modified: 2009-04-16 15:46 EDT

Doc Type: Bug Fix
Last Closed: 2007-11-21 10:24:04 EST

Attachments
here's the tcp dump during the failures (228.75 KB, application/octet-stream)
2007-11-16 12:44 EST, Corey Marthaler

Description Corey Marthaler 2007-11-16 11:00:53 EST
Description of problem:
I was running revolver on the latest 4.6 builds on my 6-node cluster, and after
one node was shot and brought back up, another node was told to leave the
cluster due to an inconsistent view and panicked.

One of the nodes telling grant-01 to leave:
Nov 16 09:22:42 link-02 kernel: CMAN: Started transition, generation 13
Nov 16 09:22:42 link-02 kernel: CMAN: node grant-03 rejoining
Nov 16 09:22:43 link-02 kernel: CMAN: Finished transition, generation 13
Nov 16 09:22:58 link-02 kernel: CMAN: Started transition, generation 15
Nov 16 09:23:04 link-02 kernel: CMAN: node grant-01 has been removed from the
cluster : Inconsistent cluster view
Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 16
Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 17
Nov 16 09:23:15 link-02 kernel: CMAN: Completed transition, generation 17
Nov 16 09:23:45 link-02 fenced[7690]: grant-01 not a cluster member after 30 sec
post_fail_delay
Nov 16 09:23:45 link-02 fenced[7690]: fencing node "grant-01"
Nov 16 09:23:45 link-02 fenced[7690]: fence "grant-01" success
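
For context: fenced's post_fail_delay is the grace period in which a failed
node may rejoin before being fenced, which is why the fencing above fires 30
seconds after grant-01 drops out of the membership. A minimal userspace sketch
of that decision, with is_cluster_member() and fence_node() as hypothetical
stubs rather than the real fenced internals:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stubs standing in for real cluster queries -- hypothetical, not
 * the actual fenced code or API. */
static int is_cluster_member(const char *node)
{
	(void)node;
	return 0;		/* pretend the node never rejoins */
}

static int fence_node(const char *node)
{
	printf("fence \"%s\" success\n", node);
	return 0;
}

/* Wait up to post_fail_delay seconds for a failed node to rejoin;
 * fence it only if it is still missing when the delay expires. */
static int maybe_fence(const char *node, int post_fail_delay)
{
	time_t deadline = time(NULL) + post_fail_delay;

	while (time(NULL) < deadline) {
		if (is_cluster_member(node))
			return 0;	/* rejoined in time, no fencing */
		sleep(1);
	}
	printf("%s not a cluster member after %d sec post_fail_delay\n",
	       node, post_fail_delay);
	printf("fencing node \"%s\"\n", node);
	return fence_node(node);
}

int main(void)
{
	return maybe_fence("grant-01", 30);
}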



[root@link-02 sbin]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    6   M   link-02
   2    1    6   M   grant-02
   3    1    6   M   grant-03
   4    1    6   M   link-07
   5    1    6   X   grant-01
   6    1    6   M   link-08
[root@link-02 sbin]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 4 6 2 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 4 6 2 3]

DLM Lock Space:  "LINK_1280"                         9   5 run       -
[1 4 6 2]

DLM Lock Space:  "LINK_1281"                        11   7 run       -
[1 4 6 2]

DLM Lock Space:  "LINK_1282"                        13   9 run       -
[1 4 6 2]

GFS Mount Group: "LINK_1280"                        10   6 run       -
[1 4 6 2]

GFS Mount Group: "LINK_1281"                        12   8 run       -
[1 4 6 2]

GFS Mount Group: "LINK_1282"                        14  10 run       -
[1 4 6 2]




Grant-01:
Nov 16 09:22:42 grant-01 kernel: CMAN: node grant-03 rejoining
Nov 16 09:22:42 grant-01 kernel: CMAN: Initiating transition, generation 13
Nov 16 09:22:43 grant-01 kernel: CMAN: Completed transition, generation 13
Nov 16 09:22:58 grant-01 kernel: CMAN: Initiating transition, generation 15
WARNING: dlm_emergency_shutdown
dlm: process_cluster_request invalid lockspace 100000b from 2 req 1
WARNING: dlm_emergency_shutdown
SM: 00000002 sm_stop: SG still joined
SM: 01000003 sm_stop: SG still joined
SM: 0200000a sm_stop: SG still joined
dlm: dlm_lock: no lockspace
2 resend 34004c lq 1 flg 200008 node -1/-1 "       8
LINK_1282 resend 3c0261 lq 1 flg 200000 node -1/-1 "       5
LINK_1282 resent 3 requests
LINK_1282 recover event 36 finished
LINK_1280 move flags 0,0,1 ids 22,36,36
LINK_1280 process held requests
LINK_1280 processed 0 requests
LINK_1280 resend marked requests
LINK_1280 resend e0317 lq 1 flg 200000 node -1/-1 "       7
LINK_1280 resent 1 requests
LINK_1280 recover event 36 finished
LINK_1281 mark waiting requests
LINK_1281 mark 18004b lq 1 nodeid -1
LINK_1281 marked 1 requests
LINK_1281 purge locks of departed nodes
LINK_1281 purged 1 locks
LINK_1281 update remastered resources
LINK_1281 updated 1 resources
LINK_1281 rebuild locks
LINK_1281 rebuilt 1 locks
LINK_1281 recover event 36 done
LINK_1281 move flags 0,0,1 ids 28,36,36
LINK_1281 process held requests
LINK_1281 processed 0 requests
LINK_1281 resend marked requests
LINK_1281 resend 18004b lq 1 flg 200000 node -1/-1 "       2
LINK_1281 resent 1 requests
LINK_1281 recover event 36 finished
,2797006
8924 remove 7,2797006
8924 ex punlock 0
8924 en plock 7,279700c
8924 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1
8924 ex plock 0
8920 en punlock 7,279700b
8920 remove 7,279700b
8920 ex punlock 0
8920 en plock 7,2797006
8920 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1
8924 en punlock 7,279700c
8924 remove 7,279700c
8924 ex punlock 0
8924 en plock 7,279700b
8924 req 7,279700b ex 0-7fffffffffffffff lkf 2000 wait 1
8924 ex plock 0
8924 en punlock 7,279700b
8924 remove 7,279700b
8924 ex punlock 0
8924 en plock 7,2797006
8913 ex plock 0
8907 ex plock 0
8920 ex plock 0
8908 ex plock 0
8913 en punlock 7,279701b
8920 en punlock 7,2797006
8907 en punlock 7,27970d8
8920 remove 7,2797006
8908 en punlock 7,27a7190
8908 remove 7,27a7190
8920 ex punlock 0
8920 en plock 7,279700c
8920 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1
8908 ex punlock 0
8908 en plock 7,27970ce
8924 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1
8913 remove 7,279701b
8913 ex punlock 0
8913 en plock 7,2797017
8913 req 7,2797017 ex 0-7fffffffffffffff lkf 2000 wait 1
8912 req 7,279701b ex 0-7fffffffffffffff lkf 2000 wait 1
8920 ex plock 0
8907 remove 7,27970d8
8907 ex punlock 0
8907 en plock 7,4223afe
8908 req 7,27970ce ex 0-7fffffffffffffff lkf 2000 wait 1
8913 ex plock 0
8907 req 7,4223afe ex 0-7fffffffffffffff lkf 2000 wait 1
8908 ex plock 0
8907 ex plock 0
8924 ex plock 0
8912 ex plock 0
8913 en punlock 7,2797017
8908 en punlock 7,27970ce
8920 en punlock 7,279700c
8907 en punlock 7,4223afe
8908 remove 7,27970ce
8920 remove 7,279700c
8908 ex punlock 0
8908 en plock 7,4223ada
8920 ex punlock 0
8920 en plock 7,2797006
8913 remove 7,2797017
8913 ex punlock 0
8913 en plock 7,279701b
8907 remove 7,4223afe
8907 ex punlock 0
8907 en plock 7,27a7190
8908 req 7,4223ada ex 0-7fffffffffffffff lkf 2000 wait 1
8907 req 7,27a7190 ex 0-7fffffffffffffff lkf 2000 wait 1
8908 ex plock 0
8907 ex plock 0
8942 en plock 7,2797006
8942 req 7,2797006 ex 9580a8-12756d54 lkf 2000 wait 1
8942 ex plock 0
8912 en punlock 7,279701b
8908 en punlock 7,4223ada
8907 en punlock 7,27a7190
8908 remove 7,4223ada
8924 en punlock 7,2797006
8907 remove 7,27a7190
8924 remove 7,2797006
8908 ex punlock 0
8908 en plock 7,27a71a8
8907 ex punlock 0
8907 en plock 7,27970ce
8924 ex punlock 0
8924 en plock 7,279700c
8924 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1
8920 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1
8912 remove 7,279701b
8912 ex punlock 0
8912 en plock 7,279701d
8912 req 7,279701d ex 0-7fffffffffffffff lkf 2000 wait 1
8913 req 7,279701b ex 0-7fffffffffffffff lkf 2000 wait 1
8912 ex plock 0
8924 ex plock 0
8920 ex plock 0
8908 req 7,27a71a8 ex 0-7fffffffffffffff lkf 2000 wait 1
8907 req 7,27970ce ex 0-7fffffffffffffff lkf 2000 wait 1
8908 ex plock 0
8907 ex plock 0
8913 ex plock 0
8912 en punlock 7,279701d
8924 en punlock 7,279700c
8908 en punlock 7,27a71a8
8924 remove 7,279700c
8907 en punlock 7,27970ce
8924 ex punlock 0
8924 en plock 7,2797006
8912 remove 7,279701d
8912 ex punlock 0
8912 en plock 7,2797019
8912 req 7,2797019 ex 0-7fffffffffffffff lkf 2000 wait 1
8912 ex plock 0
8908 remove 7,27a71a8
8908 ex punlock 0
8907 remove 7,27970ce
8907 ex punlock 0
8907 en plock 7,4223ada
8907 req 7,4223ada ex 0-7fffffffffffffff lkf 2000 wait 1
8907 ex plock 0
8913 en punlock 7,279701b
8920 en punlock 7,2797006
8920 remove 7,2797006
8907 en punlock 7,4223ada
8920 ex punlock 0
8920 en plock 7,2797008
8920 req 7,2797008 ex 0-7fffffffffffffff lkf 2000 wait 1
8924 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1
8913 remove 7,279701b
8920 ex plock 0
8913 ex punlock 0
8913 en plock 7,279701d
8924 ex plock 0
8913 req 7,279701d ex 0-7fffffffffffffff lkf 2000 wait 1
8913 ex plock 0
8907 remove 7,4223ada
8907 ex punlock 0
8942 en punlock 7,2797006
8942 remove 7,2797006
8942 ex punlock 0
8912 en punlock 7,2797019
8912 remove 7,2797019
8912 ex punlock 0
8912 en plock 7,279701f
8912 req 7,279701f ex 0-7fffffffffffffff lkf 2000 wait 1
8912 ex plock 0
8920 en punlock 7,2797008
8920 remove 7,2797008
8920 ex punlock 0
8920 en plock 7,2797006
8942 en plock 7,2797006

lock_dlm:  Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4297038051
LINK_1281: num=11,2797006 err=-22 cur=0 req=5 lkf=4

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:432
invalid operand: 0000 [1]
CPU 0
Modules linked in: lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U)
lock_harness(U) dlm(U) cman(U) qlad
Pid: 8942, comm: doio Not tainted 2.6.9-67.EL
RIP: 0010:[<ffffffffa0349327>] <ffffffffa0349327>{:lock_dlm:do_dlm_lock+363}
RSP: 0018:00000101e2133c28  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003f733
RDX: 00000000ffffff01 RSI: 000000000003f733 RDI: ffffffff8043d300
RBP: 00000101ed1196c0 R08: 00000000000927bf R09: 00000000000927c0
R10: 0000000000000246 R11: 0000ffff8045c520 R12: 0000010210106c00
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000012756d54
FS:  0000002a95562b00(0000) GS:ffffffff80554580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000007fbfffb8e0 CR3: 0000000000101000 CR4: 00000000000006e0
Process doio (pid: 8942, threadinfo 00000101e2132000, task 00000101e28dcce0)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020
       3630303739373220 0000000000000018 00000101ed1196c0 0000010169a3fc00
       00000101ed119718 0000010210106c00
Call Trace:<ffffffffa03493a3>{:lock_dlm:do_dlm_lock_sync+85}
<ffffffffa034babf>{:lock_dlm:lock_resource+127}
       <ffffffffa034d13c>{:lock_dlm:lm_dlm_plock+601}
<ffffffff801355dd>{default_wake_function+0}
       <ffffffffa03493ab>{:lock_dlm:do_dlm_lock_sync+93}
<ffffffffa02ddb2f>{:gfs:gfs_lm_plock+45}
       <ffffffffa02ea053>{:gfs:gfs_lock+196} <ffffffff801aa772>{fcntl_setlk+311}
       <ffffffff8035fb2c>{thread_return+0} <ffffffff801a5eb5>{sys_fcntl+1163}
       <ffffffff80110a92>{system_call+126}

Code: 0f 0b b6 ed 34 a0 ff ff ff ff b0 01 48 c7 c7 bb ed 34 a0 31
RIP <ffffffffa0349327>{:lock_dlm:do_dlm_lock+363} RSP <00000101e2133c28>
 <0>Kernel panic - not syncing: Oops
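
For context: err=-22 is -EINVAL, which is what a DLM request gets back once
dlm_emergency_shutdown has torn the lockspace down ("dlm: dlm_lock: no
lockspace" above), and lock_dlm treats any error on the lock path as fatal,
hence the assertion and the panic. A paraphrased userspace sketch of that
failure mode, not the actual gfs-kernel lock.c source:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the lock_dlm failure mode, not the actual gfs-kernel
 * source: any nonzero return from the DLM lock path trips the
 * "!error" assertion and panics the kernel. */

static int dlm_lock_stub(int lockspace_alive)
{
	return lockspace_alive ? 0 : -22;	/* "no lockspace" -> -EINVAL */
}

static void do_dlm_lock(int lockspace_alive)
{
	int error = dlm_lock_stub(lockspace_alive);

	if (error) {
		fprintf(stderr,
			"num=11,2797006 err=%d cur=0 req=5 lkf=4\n", error);
		abort();  /* kernel equivalent: assertion "!error" -> BUG() */
	}
}

int main(void)
{
	do_dlm_lock(1);	/* healthy lockspace: request proceeds */
	do_dlm_lock(0);	/* emergency shutdown raced in: "panic" */
	return 0;
}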




Comment 1 Christine Caulfield 2007-11-16 11:36:44 EST
This is one of those unfortunate situations that are almost impossible to
debug retrospectively.

The inconsistent view error is really there to catch bugs; the nodes should
never get out of step with their cluster view, but odd things can happen (I
suspect) if network packets get horribly delayed during a transition.

As we can assume that the nodes had a consistent view going into the
transition, it must be something in that process that has screwed up; either
that or a corrupted packet somewhere (or somehow) has caused this. Without a
tcpdump of the transition it's impossible to say more, I'm afraid :(
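
For illustration, the check being described works conceptually like this:
during a transition every node's idea of the membership is compared, and a
node whose view disagrees is told to leave. A minimal sketch of the concept
only, not cman-kernel's actual implementation:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

struct view {
	int members[MAX_NODES];	/* member node ids, unused slots stay zero */
};

static bool views_match(const struct view *a, const struct view *b)
{
	return memcmp(a->members, b->members, sizeof(a->members)) == 0;
}

static void check_node(const char *name, const struct view *coordinator,
		       const struct view *reported)
{
	if (!views_match(coordinator, reported))
		printf("CMAN: node %s has been removed from the "
		       "cluster : Inconsistent cluster view\n", name);
}

int main(void)
{
	struct view coordinator = { { 1, 2, 3, 4, 5, 6 } };
	struct view grant_01    = { { 1, 2, 4, 5, 6 } };  /* missed an update */

	check_node("grant-01", &coordinator, &grant_01);
	return 0;
}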

Comment 2 Corey Marthaler 2007-11-16 11:43:51 EST
This is reproducible with the following:

2.6.9-67.EL
cman-kernel-2.6.9-54.1
dlm-kernel-2.6.9-52.2
Comment 3 Corey Marthaler 2007-11-16 12:44:18 EST
Created attachment 261521 [details]
here's the tcp dump during the failures
Comment 4 Christine Caulfield 2007-11-19 04:35:29 EST
Did you verify that this is caused by the new packages with the "missing
messages" patch?
Comment 5 Christine Caulfield 2007-11-19 10:34:53 EST
OK, I've found the problem. Rip that patch out for 4.6 and I'll redo it for 4.7.
Comment 6 Christine Caulfield 2007-11-20 06:03:11 EST
A fix for the fix, on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.31; previous revision: 1.42.2.30
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.29; previous revision: 1.44.2.28
done
Comment 7 Kiersten (Kerri) Anderson 2007-11-20 11:18:15 EST
Original change backed out of the release, so moving this to be a 4.7 request.
Comment 8 Christine Caulfield 2007-11-21 04:22:05 EST
This BZ might as well be closed then. The bug was caused by an inadequate patch
for bz#373671. If that patch doesn't exist anywhere then neither does this bug.

I'll copy the checkin record above into that BZ and set this to MODIFIED. Corey
can close it if he's happy with that.
Comment 9 Corey Marthaler 2007-11-21 10:24:04 EST
Closing...
