Description of problem:
Filing this as a new bug as per Patrick's request, with the requested Summary ^_^.

I had a 9 node cluster (9 nodes in cluster.conf, only 6 were actually running). One node oopsed and panicked due to bug #148014, leaving me with 5 of 9 running. A while later, another node was removed from the cluster, bringing the running count to 4 of 9 (loss of quorum!). I do not know why that happened; perhaps it has something to do with bug #139738? Eventually the removed node panicked.

lock_dlm:  Assertion failed on line 410 of file /usr/src/build/519783-i686/BUILD/gfs-kernel-2.6.9-22/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 205020903
trin1.gfs: num=2,17 err=-22 cur=-1 req=3 lkf=0

------------[ cut here ]------------
kernel BUG at /usr/src/build/519783-i686/BUILD/gfs-kernel-2.6.9-22/src/dlm/lock.c:410!
invalid operand: 0000 [#1]
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_gulm(U) lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) sunrpc md5 ipv6 button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e01f2eaf>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.EL)
EIP is at do_dlm_lock+0x149/0x163 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: e01f8556   edx: dded7cf4
esi: e01f2ece   edi: c1466e00   ebp: df7aee80   esp: dded7cf0
ds: 007b   es: 007b   ss: 0068
Process df (pid: 3435, threadinfo=dded7000 task=ca0bf970)
Stack: e01f8556 20202020 32202020 20202020 20202020 20202020 37312020 00000018
       c1725000 df7aee80 00000003 00000000 df7aee80 e01f2f78 00000003 e01fbc40
       e01a9000 e03e547e 00000000 dcba3624 e01a9000 00000000 00000003 e03d7f67
Call Trace:
 [<e01f2f78>] lm_dlm_lock+0x42/0x4b [lock_dlm]
 [<e03e547e>] gfs_lm_lock+0x34/0x4b [gfs]
 [<e03d7f67>] gfs_glock_xmote_th+0x1ab/0x1e9 [gfs]
 [<e03d6cbd>] rq_promote+0x1af/0x28f [gfs]
 [<e03d70ed>] run_queue+0x91/0xc1 [gfs]
 [<e03d8e6e>] gfs_glock_nq+0x11e/0x1b4 [gfs]
 [<e03d98dc>] gfs_glock_nq_init+0x13/0x26 [gfs]
 [<e03fc91c>] gfs_rindex_hold+0x2a/0xc5 [gfs]
 [<e03ff9c7>] gfs_stat_gfs+0x16/0x4e [gfs]
 [<c0145cc9>] buffered_rmqueue+0x1c4/0x1e7
 [<e03f56b1>] gfs_statfs+0x25/0xc6 [gfs]
 [<c017d02d>] __d_lookup+0x12d/0x1e6
 [<c017ae9f>] dput+0x33/0x417
 [<c0160171>] vfs_statfs+0x41/0x59
 [<c0160267>] vfs_statfs64+0xe/0x28
 [<c01725fe>] __user_walk+0x4a/0x51
 [<c0160372>] sys_statfs64+0x52/0xb2
 [<c0154199>] do_mmap_pgoff+0x55b/0x653
 [<c010d0f8>] sys_mmap2+0x7f/0xb2
 [<c0119234>] do_page_fault+0x0/0x4dc
 [<c0301bfb>] syscall_call+0x7/0xb
Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 b1 86 1f e0 e8 93 d1 f2 df 83 c4 38 68 56 85 1f e0 e8 86 d1 f2 df <0f> 0b 9a 01 e8 83 1f e0 68 58 85 1f e0 e8 d2 c5 f2 df 83 c4 20
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

A while later another node in the cluster panicked with the same BUG, bringing the running node count to 3 of 9:

Version-Release number of selected component (if applicable):
http://people.redhat.com/cfeist/cluster/RHEL4/alpha/cluster-2005-02-11-1100/cluster-i686-2005-02-11-1100.tar

How reproducible:
Haven't tried yet

Steps to Reproduce:
1. Start cluster Friday. (Don't bother with any load!)
2. Go home for the weekend
3. Come back to the office on Monday.

Actual results:

Expected results:

Additional info:
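For anyone decoding the trace: the "Assertion failed ... assertion: \"!error\"" lines are what an assert-style macro typically prints (the failed condition, file, line, a timestamp, and the lock details) before calling BUG(), which is what produces the "invalid operand" oops and the panic that follow. Below is a rough userland C illustration of that pattern only; it is not the lock_dlm source, and the macro name and fields are assumptions made for the sketch:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for the kernel-side assertion: report the failed
 * condition and location, then die (BUG() in the kernel, abort() here). */
#define LOCK_ASSERT(cond, extra_report)                                    \
    do {                                                                   \
        if (!(cond)) {                                                     \
            printf("lock_dlm: Assertion failed on line %d of file %s\n",  \
                   __LINE__, __FILE__);                                    \
            printf("lock_dlm: assertion: \"%s\"\n", #cond);                \
            printf("lock_dlm: time = %ld\n", (long)time(NULL));            \
            extra_report;                                                  \
            abort();                                                       \
        }                                                                  \
    } while (0)

int main(void)
{
    int error = -22;  /* the err=-22 (-EINVAL) seen in the panic message */

    /* Any non-zero result from the lock request trips the assertion. */
    LOCK_ASSERT(!error,
                printf("num=2,17 err=%d cur=-1 req=3 lkf=0\n", error));
    return 0;
}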
Created attachment 111066: logs and console from crashed cluster
Assigned to Dave in the first instance, because the first death is in gfs-kernel/dlm/lock.c - though I suspect this bug may bounce around a bit before being closed.
trin-05 is where the assertion failed and it's evident from its log file that the cluster was shut down, which causes the lockspaces to be shut down, which causes the assertion failure. Same problem we've seen before.
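To make that chain of events concrete: once the cluster (and with it the lockspace) has been shut down underneath a still-mounted filesystem, any lock request it keeps issuing comes back with an error, and lock_dlm treats any non-zero return as fatal. The following is a minimal userland sketch of that ordering problem only; all names here (lockspace_alive, request_lock, do_lock) are invented for illustration and do not come from the actual code:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static int lockspace_alive = 1;

/* Stand-in for a dlm_lock()-style call: once the lockspace has been
 * released, every further request fails immediately. */
static int request_lock(int lock_type, int lock_number, int requested_mode)
{
    if (!lockspace_alive)
        return -EINVAL;   /* corresponds to the err=-22 in the oops */
    return 0;             /* pretend the grant succeeded */
}

/* Stand-in for do_dlm_lock(): any error is treated as unrecoverable. */
static void do_lock(int type, int number, int req)
{
    int error = request_lock(type, number, req);

    if (error) {
        fprintf(stderr, "lock request %d,%d failed: err=%d\n",
                type, number, error);
        abort();          /* the kernel module asserts and panics here */
    }
}

int main(void)
{
    do_lock(2, 17, 3);    /* fine while the cluster is up */

    /* The node is removed from the cluster, so the lockspace is torn
     * down -- but the mounted filesystem keeps issuing lock requests. */
    lockspace_alive = 0;

    do_lock(2, 17, 3);    /* fails, and the failure is fatal */
    return 0;
}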
(In reply to comment #3)
> trin-05 is where the assertion failed and it's evident from its
> log file that the cluster was shut down, which causes the lockspaces
> to be shut down, which causes the assertion failure. Same problem
> we've seen before.

I didn't shut down the cluster. Does the cluster for some reason shut itself down? If so, why and when?

The node leaving the cluster is probably bug #139738. This bug addresses the separate issue that Patrick referred to in bug #139738 comment #10, where "all hell breaks loose".
Patrick had this to say as well:

This bug is not the same as bug #139738, even though it is (in most circumstances) caused by it. There are potentially other causes of this error, so even if we close bug #139738 this bug is not fixed. The problem is that /if/ cman gets kicked out of the cluster then these errors occur; bug #139738 is the fact that cman /does/ get kicked out of the cluster.

The reason I wanted a separate bug opened for this problem is so that it doesn't get lost when bug #139738 goes away. It's likely that (inadvisable) commands such as "cman_tool kill" or "cman_tool leave force" would also cause these errors.
This belongs in a new bz.

*** This bug has been marked as a duplicate of 148788 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.