Description of problem:
Filing this as a new bug as per Patrick's request, with the requested Summary ^_^.

I had a 9 node cluster (9 nodes in cluster.conf, only 6 were actually running). One node oopsed and panicked due to bug #148014, leaving me with 5 of 9 running. A while later, another node was removed from the cluster, bringing the running count to 4 of 9 (loss of quorum!). I do not know why that happened; perhaps it has something to do with bug #139738? Eventually the removed node panicked.

lock_dlm:  Assertion failed on line 410 of file /usr/src/build/519783-i686/BUILD/gfs-kernel-2.6.9-22/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 205020903
trin1.gfs: num=2,17 err=-22 cur=-1 req=3 lkf=0

------------[ cut here ]------------
kernel BUG at /usr/src/build/519783-i686/BUILD/gfs-kernel-2.6.9-22/src/dlm/lock.c:410!
invalid operand: 0000 [#1]
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_gulm(U) lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) sunrpc md5 ipv6 button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e01f2eaf>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.EL)
EIP is at do_dlm_lock+0x149/0x163 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: e01f8556   edx: dded7cf4
esi: e01f2ece   edi: c1466e00   ebp: df7aee80   esp: dded7cf0
ds: 007b   es: 007b   ss: 0068
Process df (pid: 3435, threadinfo=dded7000 task=ca0bf970)
Stack: e01f8556 20202020 32202020 20202020 20202020 20202020 37312020 00000018
       c1725000 df7aee80 00000003 00000000 df7aee80 e01f2f78 00000003 e01fbc40
       e01a9000 e03e547e 00000000 dcba3624 e01a9000 00000000 00000003 e03d7f67
Call Trace:
 [<e01f2f78>] lm_dlm_lock+0x42/0x4b [lock_dlm]
 [<e03e547e>] gfs_lm_lock+0x34/0x4b [gfs]
 [<e03d7f67>] gfs_glock_xmote_th+0x1ab/0x1e9 [gfs]
 [<e03d6cbd>] rq_promote+0x1af/0x28f [gfs]
 [<e03d70ed>] run_queue+0x91/0xc1 [gfs]
 [<e03d8e6e>] gfs_glock_nq+0x11e/0x1b4 [gfs]
 [<e03d98dc>] gfs_glock_nq_init+0x13/0x26 [gfs]
 [<e03fc91c>] gfs_rindex_hold+0x2a/0xc5 [gfs]
 [<e03ff9c7>] gfs_stat_gfs+0x16/0x4e [gfs]
 [<c0145cc9>] buffered_rmqueue+0x1c4/0x1e7
 [<e03f56b1>] gfs_statfs+0x25/0xc6 [gfs]
 [<c017d02d>] __d_lookup+0x12d/0x1e6
 [<c017ae9f>] dput+0x33/0x417
 [<c0160171>] vfs_statfs+0x41/0x59
 [<c0160267>] vfs_statfs64+0xe/0x28
 [<c01725fe>] __user_walk+0x4a/0x51
 [<c0160372>] sys_statfs64+0x52/0xb2
 [<c0154199>] do_mmap_pgoff+0x55b/0x653
 [<c010d0f8>] sys_mmap2+0x7f/0xb2
 [<c0119234>] do_page_fault+0x0/0x4dc
 [<c0301bfb>] syscall_call+0x7/0xb
Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 b1 86 1f e0 e8 93 d1 f2 df 83 c4 38 68 56 85 1f e0 e8 86 d1 f2 df <0f> 0b 9a 01 e8 83 1f e0 68 58 85 1f e0 e8 d2 c5 f2 df 83 c4 20
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

A while later another node in the cluster panicked with the same BUG, bringing the running node count to 3 of 9:

Version-Release number of selected component (if applicable):
http://people.redhat.com/cfeist/cluster/RHEL4/alpha/cluster-2005-02-11-1100/cluster-i686-2005-02-11-1100.tar

How reproducible:
Haven't tried yet

Steps to Reproduce:
1. Start cluster Friday. (Don't bother with any load!)
2. Go home for the weekend
3. Come back to the office on Monday.

Actual results:

Expected results:

Additional info:
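For anyone decoding the trace: the "Assertion failed ... assertion: \"!error\"" lines are what an assert-style macro typically prints (the failed condition, file, line, a timestamp, and the lock details) before calling BUG(), which is what produces the "invalid operand" oops and the panic that follow. Below is a rough userland C illustration of that pattern only; it is not the lock_dlm source, and the macro name and fields are assumptions made for the sketch:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for the kernel-side assertion: report the failed
 * condition and location, then die (BUG() in the kernel, abort() here). */
#define LOCK_ASSERT(cond, extra_report)                                    \
    do {                                                                   \
        if (!(cond)) {                                                     \
            printf("lock_dlm: Assertion failed on line %d of file %s\n",  \
                   __LINE__, __FILE__);                                    \
            printf("lock_dlm: assertion: \"%s\"\n", #cond);                \
            printf("lock_dlm: time = %ld\n", (long)time(NULL));            \
            extra_report;                                                  \
            abort();                                                       \
        }                                                                  \
    } while (0)

int main(void)
{
    int error = -22;  /* the err=-22 (-EINVAL) seen in the panic message */

    /* Any non-zero result from the lock request trips the assertion. */
    LOCK_ASSERT(!error,
                printf("num=2,17 err=%d cur=-1 req=3 lkf=0\n", error));
    return 0;
}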
Created attachment 111066: logs and console from crashed cluster
Assigned to Dave in the first instance, because the first death is in gfs-kernel/dlm/lock.c - though I suspect this bug may bounce around a bit before being closed.
trin-05 is where the assertion failed and it's evident from its log file that the cluster was shut down, which causes the lockspaces to be shut down, which causes the assertion failure. Same problem we've seen before.
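To make that chain of events concrete: once the cluster (and with it the lockspace) has been shut down underneath a still-mounted filesystem, any lock request it keeps issuing comes back with an error, and lock_dlm treats any non-zero return as fatal. The following is a minimal userland sketch of that ordering problem only; all names here (lockspace_alive, request_lock, do_lock) are invented for illustration and do not come from the actual code:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static int lockspace_alive = 1;

/* Stand-in for a dlm_lock()-style call: once the lockspace has been
 * released, every further request fails immediately. */
static int request_lock(int lock_type, int lock_number, int requested_mode)
{
    if (!lockspace_alive)
        return -EINVAL;   /* corresponds to the err=-22 in the oops */
    return 0;             /* pretend the grant succeeded */
}

/* Stand-in for do_dlm_lock(): any error is treated as unrecoverable. */
static void do_lock(int type, int number, int req)
{
    int error = request_lock(type, number, req);

    if (error) {
        fprintf(stderr, "lock request %d,%d failed: err=%d\n",
                type, number, error);
        abort();          /* the kernel module asserts and panics here */
    }
}

int main(void)
{
    do_lock(2, 17, 3);    /* fine while the cluster is up */

    /* The node is removed from the cluster, so the lockspace is torn
     * down -- but the mounted filesystem keeps issuing lock requests. */
    lockspace_alive = 0;

    do_lock(2, 17, 3);    /* fails, and the failure is fatal */
    return 0;
}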
(In reply to comment #3)
> trin-05 is where the assertion failed and it's evident from its
> log file that the cluster was shut down, which causes the lockspaces
> to be shut down, which causes the assertion failure. Same problem
> we've seen before.

I didn't shut down the cluster. Does the cluster for some reason shut itself down? If so, why and when?

The node leaving the cluster is probably bug #139738. This bug addresses the separate issue that Patrick referred to in bug #139738 comment #10, where "all hell breaks loose".
Patrick had this to say as well:

This bug is not the same as bug #139738, even though it is (in most circumstances) caused by it. There are potentially other causes of this error, so even if we close bug #139738 this bug is not fixed. The problem is that /if/ cman gets kicked out of the cluster then these errors occur; bug #139738 is the fact that cman /does/ get kicked out of the cluster.

The reason I wanted a separate bug opened for this problem is so that it doesn't get lost when bug #139738 goes away. It's likely that (inadvisable) commands such as "cman_tool kill" or "cman_tool leave force" would also cause these errors.
This belongs in a new bz.

*** This bug has been marked as a duplicate of 148788 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.