Red Hat Bugzilla – Bug 148016
"if a node is kicked out of the cluster things go ape-shit"
Last modified: 2009-04-16 16:30:15 EDT
Description of problem:
Filing this as a new bug as per Patrick's request, with requested Summary ^_^.
I had a 9 node cluster (9 nodes in cluster.conf, only 6 were actually running).
One node oopsed and panicked due to bug #148014, leaving me with 5 of 9
running. A while later, another node was removed from the cluster, bringing the
running count to 4 of 9 (loss of quorum!). I do not know why that happened,
perhaps it has something to do with bug #139738?
Eventually the removed node panicked.
lock_dlm: Assertion failed on line 410 of file
lock_dlm: assertion: "!error"
lock_dlm: time = 205020903
trin1.gfs: num=2,17 err=-22 cur=-1 req=3 lkf=0
------------[ cut here ]------------
kernel BUG at
invalid operand: 0000 [#1]
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_gulm(U)
lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) sunrpc md5 ipv6 button battery
ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod
EIP: 0060:[<e01f2eaf>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-5.EL)
EIP is at do_dlm_lock+0x149/0x163 [lock_dlm]
eax: 00000001 ebx: ffffffea ecx: e01f8556 edx: dded7cf4
esi: e01f2ece edi: c1466e00 ebp: df7aee80 esp: dded7cf0
ds: 007b es: 007b ss: 0068
Process df (pid: 3435, threadinfo=dded7000 task=ca0bf970)
Stack: e01f8556 20202020 32202020 20202020 20202020 20202020 37312020 00000018
c1725000 df7aee80 00000003 00000000 df7aee80 e01f2f78 00000003 e01fbc40
e01a9000 e03e547e 00000000 dcba3624 e01a9000 00000000 00000003 e03d7f67
[<e01f2f78>] lm_dlm_lock+0x42/0x4b [lock_dlm]
[<e03e547e>] gfs_lm_lock+0x34/0x4b [gfs]
[<e03d7f67>] gfs_glock_xmote_th+0x1ab/0x1e9 [gfs]
[<e03d6cbd>] rq_promote+0x1af/0x28f [gfs]
[<e03d70ed>] run_queue+0x91/0xc1 [gfs]
[<e03d8e6e>] gfs_glock_nq+0x11e/0x1b4 [gfs]
[<e03d98dc>] gfs_glock_nq_init+0x13/0x26 [gfs]
[<e03fc91c>] gfs_rindex_hold+0x2a/0xc5 [gfs]
[<e03ff9c7>] gfs_stat_gfs+0x16/0x4e [gfs]
[<e03f56b1>] gfs_statfs+0x25/0xc6 [gfs]
Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 b1 86 1f e0
e8 93 d1 f2 df 83 c4 38 68 56 85 1f e0 e8 86 d1 f2 df <0f> 0b 9a 01 e8 83 1f e0
68 58 85 1f e0 e8 d2 c5 f2 df 83 c4 20
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
A while later another node in the cluster panicked with the same BUG, bringing
the running node count to 3 of 9:
Version-Release number of selected component (if applicable):
Haven't tried yet
Steps to Reproduce:
1. Start cluster Friday. (Don't bother with any load!)
2. Go home for the weekend
3. Come back to the office on Monday.
Created attachment 111066 [details]
logs and console from crashed cluster
Assigned to Dave in the first instance, because the first death is in
gfs-kernel/dlm/lock.c - though I suspect this bug may bounce around a
bit before being closed.
trin-05 is where the assertion failed and it's evident from its
log file that the cluster was shut down, which causes the lockspaces
to be shut down, which causes the assertion failure. Same problem
we've seen before.
(In reply to comment #3)
> trin-05 is where the assertion failed and it's evident from its
> log file that the cluster was shut down, which causes the lockspaces
> to be shut down, which causes the assertion failure. Same problem
> we've seen before.
I didn't shut down the cluster. Does the cluster for some reason shut itself
down? If so, why and when? The node leaving the cluster is probably bug #139738.
This bug addresses the separate issue that Patrick referred to in bug #139738
comment #10, where "all hell breaks loose".
Patrick had this to say as well:
This bug is not the same as bug #139738 even though it is (in most
circumstances) caused by it. There are potentially other causes of this error.
So, even if we close bug #139738 this bug is not fixed.
The problem is that /if/ cman gets kicked out of the cluster then these errors
occur. bug #139738 is the fact that cman /does/ get kicked out of the cluster.
The reason I wanted a separate bug opened for this problem is so that it
doesn't get lost when bug #139738 goes away.
It's likely that inadvisable commands such as "cman_tool kill" or
"cman_tool leave force" would also cause these errors.
This belongs in a new bz.
*** This bug has been marked as a duplicate of 148788 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.