From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
This really comes down to being able to cleanly shut down live gfs file systems on demand when something gfs depends on goes away. Ken has already done this for the case where the fs storage goes away (the new "withdraw" feature). We need something similar for the case where the cluster manager goes away (or the network goes away, which would look about the same as the cluster going away). Once gfs supports that, then instead of lock_dlm panicking on an error, gfs/lock_dlm would initiate the new "network/cluster is down, so force a shutdown" feature.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. shut down cman while gfs is running

Actual Results:
You get an assertion failure in lock_dlm when the dlm starts returning errors to lock requests.

Additional info:
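For context on the failure mode, here is a compilable userspace sketch of the assert-on-error pattern described above. The names are hypothetical and this is not the actual lock_dlm source; it only illustrates why a failed DLM request brings the node down once cman is gone:

/* sketch.c -- illustration only; not the actual lock_dlm source. */
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for dlm_lock(): once cman has left the
 * cluster, every request fails (err=-22, i.e. -EINVAL, in the
 * traces below). */
static int dlm_lock_stub(void)
{
        return -EINVAL;
}

int main(void)
{
        int error = dlm_lock_stub();

        /* The pattern at issue: lock_dlm asserts "!error", so the
         * first failed request panics the whole node instead of
         * letting gfs shut the file system down cleanly. */
        assert(!error);
        return 0;
}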
*** Bug 148016 has been marked as a duplicate of this bug. ***
I realized shortly after writing this that I missed a fairly obvious simplification of the problem. When the network/cluster goes down and the node loses contact with the other nodes, it's going to be fenced. Lock_dlm sees this when it starts getting errors from the DLM (we could make it a specific error the DLM returns when the cluster is gone). Instead of panicking, lock_dlm would then just stop doing anything and stand by, waiting to be fenced.

If the fencing method is power-based, the node will be reset, and in the end it doesn't really matter that it didn't panic. But if it's SAN-based fencing, gfs will likely see the fs storage disappear, which will cause gfs to do a withdraw to shut down the fs. Lock_dlm would skip the usual withdraw steps in this situation and just return.

This means no additional feature is necessarily needed in gfs. Lock_dlm is the place where a little work would be needed to add a "standby" mode that could be entered when it gets errors, instead of asserting.
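A compilable userspace sketch of that "standby" idea, with hypothetical names (this is not a patch, just an illustration of the proposed control flow): on the first DLM error, lock_dlm latches a standby flag and hands the error back to gfs; the withdraw path then skips its usual teardown and simply returns.

/* standby_sketch.c -- illustration of the proposed standby mode. */
#include <errno.h>
#include <stdio.h>

static int standby;             /* latched once the cluster is gone */

/* Hypothetical stand-in for dlm_lock(); fails once cman is gone. */
static int dlm_lock_stub(void)
{
        return -EINVAL;
}

/* Instead of asserting on error, latch standby and return the error
 * so the caller can withdraw (SAN fencing) or just wait to be reset
 * (power fencing). */
static int do_dlm_lock_sketch(void)
{
        int error = dlm_lock_stub();

        if (error && !standby) {
                standby = 1;
                fprintf(stderr, "lock_dlm: cluster gone, standing by "
                                "to be fenced (error %d)\n", error);
        }
        return error;
}

/* The withdraw path skips its usual steps in standby mode. */
static int withdraw_sketch(void)
{
        if (standby)
                return 0;       /* nothing useful left to do */
        /* ... normal withdraw work would go here ... */
        return 0;
}

int main(void)
{
        do_dlm_lock_sketch();   /* first error latches standby */
        return withdraw_sketch();
}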
FWIW, I'm seeing this issue as well during cman/dlm recovery testing.
Reproduced this on morph-03 last night after 56 iterations of revolver.

From the SYSLOG:

Mar 16 14:15:08 morph-03 kernel: CMAN: node morph-04.lab.msp.redhat.com rejoining
Mar 16 14:15:15 morph-03 kernel: CMAN: node morph-03.lab.msp.redhat.com has been removed from the cluster : No response to messages
Mar 16 14:15:15 morph-03 kernel: CMAN: killed by NODEDOWN message
Mar 16 14:15:15 morph-03 kernel: CMAN: we are leaving the cluster. No response to messages
Mar 16 14:15:15 morph-03 kernel: 00 node -1/-1 " 5
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resent 6 requests
Mar 16 14:15:15 morph-03 kernel: gfs0 recover event 39 finished
Mar 16 14:15:15 morph-03 kernel: gfs1 move flags 0,0,1 ids 37,39,39
Mar 16 14:15:15 morph-03 kernel: gfs1 process held requests
Mar 16 14:15:17 morph-03 kernel: gfs1 processed 0 requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend marked requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:18 morph-03 kernel: gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 " 11
Mar 16 14:15:20 morph-03 kernel: gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:21 morph-03 kernel: gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 " 7

From the console:

6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4
lock_dlm: Assertion failed on line 411 of file /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 920011
gfs1: num=11,67773b2 err=-22 cur=0 req=5 lkf=4
Recursive die() failure, output suppressed
<0>Fatal exception: panic in 5 seconds
00 node -1/-1 " 5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
gfs0 resent 6 requests
gfs0 recover event 39 finished
gfs1 move flags 0,0,1 ids 37,39,39
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 " 11
gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 " 7
gfs1 resend 510092 lq 1 flg 200000 node -1/-1 " 11
gfs1 resend 5f01d1 lq 1 flg 200000 node -1/-1 " 7
gfs1 resent 7 requests
gfs1 recover event 39 finished
,240035 id 0 -1,5 0
6164 req 7,1859c8 ex 0-7fffffffffffffff lkf 2000 wait 1
6164 lk 7,1859c8 id 0 -1,5 2000
6164 lk 11,1859c8 id 5e02e1 5,0 4
5858 qc 7,1859c8 -1,5 id 50030c sts 0 0
5858 qc 11,1859c8 5,0 id 5e02e1 sts 0 0
6164 ex plock 0
5858 qc 2,240035 -1,5 id 460268 sts 0 0
6167 lk 5,240035 id 0 -1,3 0
5858 qc 5,240035 -1,3 id 5900f4 sts 0 0
6165 en punlock 7,1b4348
6165 lk 11,1b4348 id 5202f4 0,5 4
5858 qc 11,1b4348 0,5 id 5202f4 sts 0 0
6165 remove 7,1b4348
6165 un 7,1b4348 62018f 5 0
5858 qc 7,1b4348 5,5 id 62018f sts -65538 0
6165 lk 11,1b4348 id 5202f4 5,0 4
5858 qc 11,1b4348 5,0 id 5202f4 sts 0 0
6165 ex punlock 0
6165 en plock 7,21ffd5
6165 lk 11,21ffd5 id 0 -1,5 0
6112 en punlock 7,67773bd
6112 lk 11,67773bd id 803b3 0,5 4
5858 qc 11,67773bd 0,5 id 803b3 sts 0 0
6112 remove 7,67773bd
6112 un 7,67773bd 580153 5 0
5858 qc 7,67773bd 5,5 id 580153 sts -65538 0
6112 lk 11,67773bd id 803b3 5,0 4
5858 qc 11,67773bd 5,0 id 803b3 sts 0 0
6112 ex punlock 0
6112 en plock 7,6797387
5858 qc 11,21ffd5 -1,5 id 5f0165 sts 0 0
6116 en punlock 7,1495ab
6116 lk 11,1495ab id 6f0378 0,5 4
5858 qc 11,1495ab 0,5 id 6f0378 sts 0 0
6116 remove 7,1495ab
6116 un 7,1495ab 730382 5 0
5858 qc 7,1495ab 5,5 id 730382 sts -65538 0
6116 lk 11,1495ab id 6f0378 5,0 4
5858 qc 11,1495ab 5,0 id 6f0378 sts 0 0
6116 ex punlock 0
6116 en plock 7,25fee1
6116 lk 11,25fee1 id 6502d4 0,5 4
5858 qc 11,25fee1 0,5 id 6502d4 sts 0 0
6116 req 7,25fee1 ex 0-7fffffffffffffff lkf 2000 wait 1
6116 lk 7,25fee1 id 0 -1,5 2000
6116 lk 11,25fee1 id 6502d4 5,0 4
5858 qc 7,25fee1 -1,5 id 5d017f sts 0 0
5858 qc 11,25fee1 5,0 id 6502d4 sts 0 0
6116 ex plock 0
6165 req 7,21ffd5 ex 0-7fffffffffffffff lkf 2000 wait 1
6165 lk 7,21ffd5 id 0 -1,5 2000
6165 lk 11,21ffd5 id 5f0165 5,0 4
5858 qc 7,21ffd5 -1,5 id 6202ce sts 0 0
5858 qc 11,21ffd5 5,0 id 5f0165 sts 0 0
6165 ex plock 0
6168 en punlock 7,67a7370
6168 lk 11,67a7370 id 3034f 0,5 4
5858 qc 11,67a7370 0,5 id 3034f sts 0 0
6168 remove 7,67a7370
6168 un 7,67a7370 550058 5 0
5858 qc 7,67a7370 5,5 id 550058 sts -65538 0
6168 lk 11,67a7370 id 3034f 5,0 4
5858 qc 11,67a7370 5,0 id 3034f sts 0 0
6168 ex punlock 0
6168 en plock 7,67773b5
6168 lk 11,67773b5 id 201f9 0,5 4
5858 qc 11,67773b5 0,5 id 201f9 sts 0 0
6168 req 7,67773b5 ex 0-7fffffffffffffff lkf 2000 wait 1
6168 lk 7,67773b5 id 0 -1,5 2000
6168 lk 11,67773b5 id 201f9 5,0 4
5858 qc 11,67773b5 5,0 id 201f9 sts 0 0
5858 qc 7,67773b5 -1,5 id 560148 sts 0 0
6168 ex plock 0
6113 en punlock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 remove 7,67773b8
6113 un 7,67773b8 6403dd 5 0
5858 qc 7,67773b8 5,5 id 6403dd sts -65538 0
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6113 ex punlock 0
6113 en plock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 req 7,67773b8 ex 0-7fffffffffffffff lkf 2000 wait 1
6113 lk 7,67773b8 id 0 -1,5 2000
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6109 en punlock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 remove 7,67773c1
6109 un 7,67773c1 7b0163 5 0
5858 qc 7,67773c1 5,5 id 7b0163 sts -65538 0
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6109 ex punlock 0
6109 en plock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 req 7,67773c1 ex 2d6f00-2da4dd lkf 2000 wait 1
6109 lk 7,67773c1 id 0 -1,5 2000
5858 qc 7,67773b8 -1,5 id 640103 sts 0 0
6113 ex plock 0
6018 un 5,21ffbc 4b02d8 3 0
5858 qc 5,21ffbc 3,3 id 4b02d8 sts -65538 0
6170 en punlock 7,67a7374
6170 lk 11,67a7374 id 80043 0,5 4
6171 en punlock 7,67a736f
6171 lk 11,67a736f id 4029c 0,5 4
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6166 en plock 7,16c610
6166 lk 11,16c610 id 0 -1,5 0
6114 en punlock 7,24ff11
6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4
6115 en punlock 7,38fd31
6115 lk 11,38fd31 id 6c01d0 0,5 4
lock_dlm: Assertion failed on line 411 of file /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 920591
gfs0: num=11,38fd31 err=-22 cur=0 req=5 lkf=4
------------[ cut here ]------------
kernel BUG at /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c:411!
invalid operand: 0000 [#6]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<82a7b7a6>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-6.16.ELhugemem)
EIP is at do_dlm_lock+0x147/0x161 [lock_dlm]
eax: 00000001 ebx: ffffffea ecx: 7bcf7e54 edx: 82a80052
esi: 82a7b7c5 edi: 81c60c00 ebp: 7750cd80 esp: 7bcf7e50
ds: 007b es: 007b ss: 0068
Process genesis (pid: 6115, threadinfo=7bcf7000 task=7bef53b0)
Stack: 82a80052 20202020 31312020 20202020 20202020 38332020 31336466 00000018
       00000000 dead4ead 7750cd80 7750cdc0 7750cdbc 82a7b7fd 7bcf7ea0 56922600
       7bcf7f10 8074336c 82a7d171 7750cd80 0038fd31 00000000 00000011 56922600
Call Trace:
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<0211f371>] __might_sleep+0x7d/0x88
 [<0215035f>] rw_vm+0xdb/0x282
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<021507b9>] get_user_size+0x30/0x57
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<0210615b>] show_registers+0x115/0x16c
 [<021062f2>] die+0xdb/0x16b
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106730>] do_invalid_op+0xcc/0xd5
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<0220007b>] agpioc_bind_wrap+0x18/0x3a
 [<021112a6>] delay_tsc+0xb/0x13
 [<021b5b59>] __delay+0x9/0xa
 [<0220e431>] serial8250_console_write+0x16c/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0212175a>] __call_console_drivers+0x36/0x40
 [<82a7b7c5>] lock_bast+0x0/0x5 [lock_dlm]
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
00 node -1/-1 " 5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
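For anyone decoding the oops: err=-22 is -EINVAL (note ebx: ffffffea, which is -22 in two's complement), EIP is inside do_dlm_lock in lock_dlm, and the "kernel BUG at .../lock.c:411!" line is what an assert macro built on BUG() produces (BUG() raises the invalid-opcode exception reported above); the __might_sleep and "Recursive die() failure" noise afterwards is the oops path itself tripping. A compilable sketch of that kind of macro, with a hypothetical name and abort() standing in for the kernel's BUG():

/* assert_sketch.c -- hypothetical macro; abort() stands in for BUG(). */
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_ASSERT(x)                                              \
        do {                                                          \
                if (!(x)) {                                           \
                        fprintf(stderr,                               \
                                "lock_dlm: Assertion failed on line " \
                                "%d of file %s\n"                     \
                                "lock_dlm: assertion: \"%s\"\n",      \
                                __LINE__, __FILE__, #x);              \
                        abort();                                      \
                }                                                     \
        } while (0)

int main(void)
{
        int error = -22;        /* -EINVAL, as in the trace above */

        SKETCH_ASSERT(!error);  /* prints the reported message, then dies */
        return 0;
}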
Comment #4 belongs to bz 139738; I've copied over the relevant part. I'm going to close this bug because it's simply not a bug, it won't be resolved in the v4 version, and it causes endless confusion.