Bug 148788

Summary: | force shutdown when network/cluster goes down | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | David Teigland <teigland>
Component: | dlm | Assignee: | David Teigland <teigland>
Status: | CLOSED NOTABUG | QA Contact: | GFS Bugs <gfs-bugs>
Severity: | medium | Priority: | medium
Version: | 4 | CC: | amanthei, cluster-maint, cmarthal
Hardware: | All | OS: | Linux
Keywords: | FutureFeature | Doc Type: | Enhancement
Target Milestone: | --- | Target Release: | ---
Last Closed: | 2005-03-17 15:54:53 UTC | |
Description (David Teigland, 2005-02-15 17:10:35 UTC)
*** Bug 148016 has been marked as a duplicate of this bug. ***

I realized shortly after writing this that I missed a fairly obvious simplification of the problem.

When the network/cluster goes down and the node loses contact with the other nodes, it is going to be fenced. lock_dlm sees this when it starts getting errors from the DLM (we could make it a specific error the DLM returns when the cluster is gone). Instead of panicking, lock_dlm would then simply stop doing anything and stand by, waiting to be fenced.

If the fencing method is power-based, the node will be reset, and in the end it doesn't really matter that it didn't panic. If the fencing is SAN-based, GFS will likely see the filesystem storage disappear, which will cause GFS to do a withdraw to shut down the filesystem. In that situation lock_dlm would skip the usual withdraw steps and just return.

This means no additional feature is necessarily needed in GFS. lock_dlm is the place where a little work would be needed to add a "standby" mode that it could enter when it gets errors, instead of asserting (a rough sketch of this idea appears at the end of this report).

FWIW, I'm seeing this issue as well during cman/dlm recovery testing. Reproduced this on morph-03 last night after 56 iterations of revolver.

From the SYSLOG:

```
Mar 16 14:15:08 morph-03 kernel: CMAN: node morph-04.lab.msp.redhat.com rejoining
Mar 16 14:15:15 morph-03 kernel: CMAN: node morph-03.lab.msp.redhat.com has been removed from the cluster : No response to messages
Mar 16 14:15:15 morph-03 kernel: CMAN: killed by NODEDOWN message
Mar 16 14:15:15 morph-03 kernel: CMAN: we are leaving the cluster. No response to messages
Mar 16 14:15:15 morph-03 kernel: 00 node -1/-1 " 5
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
Mar 16 14:15:15 morph-03 kernel: gfs0 resent 6 requests
Mar 16 14:15:15 morph-03 kernel: gfs0 recover event 39 finished
Mar 16 14:15:15 morph-03 kernel: gfs1 move flags 0,0,1 ids 37,39,39
Mar 16 14:15:15 morph-03 kernel: gfs1 process held requests
Mar 16 14:15:17 morph-03 kernel: gfs1 processed 0 requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend marked requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:18 morph-03 kernel: gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 " 11
Mar 16 14:15:20 morph-03 kernel: gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 " 5
Mar 16 14:15:21 morph-03 kernel: gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 " 7
```

From the console:

```
6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4
lock_dlm: Assertion failed on line 411 of file /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 920011
gfs1: num=11,67773b2 err=-22 cur=0 req=5 lkf=4
Recursive die() failure, output suppressed
<0>Fatal exception: panic in 5 seconds
00 node -1/-1 " 5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
gfs0 resent 6 requests
gfs0 recover event 39 finished
gfs1 move flags 0,0,1 ids 37,39,39
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 " 11
gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 " 5
gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 " 7
gfs1 resend 510092 lq 1 flg 200000 node -1/-1 " 11
gfs1 resend 5f01d1 lq 1 flg 200000 node -1/-1 " 7
gfs1 resent 7 requests
gfs1 recover event 39 finished
,240035 id 0 -1,5 0
6164 req 7,1859c8 ex 0-7fffffffffffffff lkf 2000 wait 1
6164 lk 7,1859c8 id 0 -1,5 2000
6164 lk 11,1859c8 id 5e02e1 5,0 4
5858 qc 7,1859c8 -1,5 id 50030c sts 0 0
5858 qc 11,1859c8 5,0 id 5e02e1 sts 0 0
6164 ex plock 0
5858 qc 2,240035 -1,5 id 460268 sts 0 0
6167 lk 5,240035 id 0 -1,3 0
5858 qc 5,240035 -1,3 id 5900f4 sts 0 0
6165 en punlock 7,1b4348
6165 lk 11,1b4348 id 5202f4 0,5 4
5858 qc 11,1b4348 0,5 id 5202f4 sts 0 0
6165 remove 7,1b4348
6165 un 7,1b4348 62018f 5 0
5858 qc 7,1b4348 5,5 id 62018f sts -65538 0
6165 lk 11,1b4348 id 5202f4 5,0 4
5858 qc 11,1b4348 5,0 id 5202f4 sts 0 0
6165 ex punlock 0
6165 en plock 7,21ffd5
6165 lk 11,21ffd5 id 0 -1,5 0
6112 en punlock 7,67773bd
6112 lk 11,67773bd id 803b3 0,5 4
5858 qc 11,67773bd 0,5 id 803b3 sts 0 0
6112 remove 7,67773bd
6112 un 7,67773bd 580153 5 0
5858 qc 7,67773bd 5,5 id 580153 sts -65538 0
6112 lk 11,67773bd id 803b3 5,0 4
5858 qc 11,67773bd 5,0 id 803b3 sts 0 0
6112 ex punlock 0
6112 en plock 7,6797387
5858 qc 11,21ffd5 -1,5 id 5f0165 sts 0 0
6116 en punlock 7,1495ab
6116 lk 11,1495ab id 6f0378 0,5 4
5858 qc 11,1495ab 0,5 id 6f0378 sts 0 0
6116 remove 7,1495ab
6116 un 7,1495ab 730382 5 0
5858 qc 7,1495ab 5,5 id 730382 sts -65538 0
6116 lk 11,1495ab id 6f0378 5,0 4
5858 qc 11,1495ab 5,0 id 6f0378 sts 0 0
6116 ex punlock 0
6116 en plock 7,25fee1
6116 lk 11,25fee1 id 6502d4 0,5 4
5858 qc 11,25fee1 0,5 id 6502d4 sts 0 0
6116 req 7,25fee1 ex 0-7fffffffffffffff lkf 2000 wait 1
6116 lk 7,25fee1 id 0 -1,5 2000
6116 lk 11,25fee1 id 6502d4 5,0 4
5858 qc 7,25fee1 -1,5 id 5d017f sts 0 0
5858 qc 11,25fee1 5,0 id 6502d4 sts 0 0
6116 ex plock 0
6165 req 7,21ffd5 ex 0-7fffffffffffffff lkf 2000 wait 1
6165 lk 7,21ffd5 id 0 -1,5 2000
6165 lk 11,21ffd5 id 5f0165 5,0 4
5858 qc 7,21ffd5 -1,5 id 6202ce sts 0 0
5858 qc 11,21ffd5 5,0 id 5f0165 sts 0 0
6165 ex plock 0
6168 en punlock 7,67a7370
6168 lk 11,67a7370 id 3034f 0,5 4
5858 qc 11,67a7370 0,5 id 3034f sts 0 0
6168 remove 7,67a7370
6168 un 7,67a7370 550058 5 0
5858 qc 7,67a7370 5,5 id 550058 sts -65538 0
6168 lk 11,67a7370 id 3034f 5,0 4
5858 qc 11,67a7370 5,0 id 3034f sts 0 0
6168 ex punlock 0
6168 en plock 7,67773b5
6168 lk 11,67773b5 id 201f9 0,5 4
5858 qc 11,67773b5 0,5 id 201f9 sts 0 0
6168 req 7,67773b5 ex 0-7fffffffffffffff lkf 2000 wait 1
6168 lk 7,67773b5 id 0 -1,5 2000
6168 lk 11,67773b5 id 201f9 5,0 4
5858 qc 11,67773b5 5,0 id 201f9 sts 0 0
5858 qc 7,67773b5 -1,5 id 560148 sts 0 0
6168 ex plock 0
6113 en punlock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 remove 7,67773b8
6113 un 7,67773b8 6403dd 5 0
5858 qc 7,67773b8 5,5 id 6403dd sts -65538 0
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6113 ex punlock 0
6113 en plock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 req 7,67773b8 ex 0-7fffffffffffffff lkf 2000 wait 1
6113 lk 7,67773b8 id 0 -1,5 2000
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6109 en punlock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 remove 7,67773c1
6109 un 7,67773c1 7b0163 5 0
5858 qc 7,67773c1 5,5 id 7b0163 sts -65538 0
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6109 ex punlock 0
6109 en plock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 req 7,67773c1 ex 2d6f00-2da4dd lkf 2000 wait 1
6109 lk 7,67773c1 id 0 -1,5 2000
5858 qc 7,67773b8 -1,5 id 640103 sts 0 0
6113 ex plock 0
6018 un 5,21ffbc 4b02d8 3 0
5858 qc 5,21ffbc 3,3 id 4b02d8 sts -65538 0
6170 en punlock 7,67a7374
6170 lk 11,67a7374 id 80043 0,5 4
6171 en punlock 7,67a736f
6171 lk 11,67a736f id 4029c 0,5 4
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6166 en plock 7,16c610
6166 lk 11,16c610 id 0 -1,5 0
6114 en punlock 7,24ff11
6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4
6115 en punlock 7,38fd31
6115 lk 11,38fd31 id 6c01d0 0,5 4
lock_dlm: Assertion failed on line 411 of file /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 920591
gfs0: num=11,38fd31 err=-22 cur=0 req=5 lkf=4
------------[ cut here ]------------
kernel BUG at /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c:411!
invalid operand: 0000 [#6]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<82a7b7a6>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-6.16.ELhugemem)
EIP is at do_dlm_lock+0x147/0x161 [lock_dlm]
eax: 00000001 ebx: ffffffea ecx: 7bcf7e54 edx: 82a80052
esi: 82a7b7c5 edi: 81c60c00 ebp: 7750cd80 esp: 7bcf7e50
ds: 007b es: 007b ss: 0068
Process genesis (pid: 6115, threadinfo=7bcf7000 task=7bef53b0)
Stack: 82a80052 20202020 31312020 20202020 20202020 38332020 31336466 00000018
       00000000 dead4ead 7750cd80 7750cdc0 7750cdbc 82a7b7fd 7bcf7ea0 56922600
       7bcf7f10 8074336c 82a7d171 7750cd80 0038fd31 00000000 00000011 56922600
Call Trace:
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<0211f371>] __might_sleep+0x7d/0x88
 [<0215035f>] rw_vm+0xdb/0x282
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<021507b9>] get_user_size+0x30/0x57
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<0210615b>] show_registers+0x115/0x16c
 [<021062f2>] die+0xdb/0x16b
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106730>] do_invalid_op+0xcc/0xd5
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<0220007b>] agpioc_bind_wrap+0x18/0x3a
 [<021112a6>] delay_tsc+0xb/0x13
 [<021b5b59>] __delay+0x9/0xa
 [<0220e431>] serial8250_console_write+0x16c/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0212175a>] __call_console_drivers+0x36/0x40
 [<82a7b7c5>] lock_bast+0x0/0x5 [lock_dlm]
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
00 node -1/-1 " 5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 " 2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 " 7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 " 11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 " 7
```

Comment #4 belongs to bz 139738 -- I've copied over the relevant part.

I'm going to close this bug because it's simply not a bug, it won't be resolved in the v4 version, and it provides endless confusion.
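For illustration only, here is a minimal userspace sketch of the "standby" mode described in the report above. It is not the lock_dlm source: the dedicated error code (ECLUSTERGONE) and the helper names are invented, and the real change would live in do_dlm_lock()'s error path in the kernel module. The point is only the control flow: the first cluster-is-gone error flips the module into standby, and every later request fails quietly instead of tripping the "!error" assertion seen in the traces.

```
/*
 * Hypothetical sketch, not actual lock_dlm code.  Today, do_dlm_lock()
 * asserts "!error" and panics when the DLM starts returning errors
 * after the node has lost the cluster.  The proposal is to detect that
 * case and park instead, waiting to be fenced.  ECLUSTERGONE and
 * enter_standby() are invented names for this sketch.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ECLUSTERGONE 1000          /* imagined dedicated DLM error */

static bool standby;               /* set once the cluster is gone */

static void enter_standby(void)
{
	/* Stop issuing lock traffic and wait to be fenced.  With
	 * power-based fencing the node is reset anyway; with SAN-based
	 * fencing GFS sees the storage vanish and withdraws, and
	 * lock_dlm would skip the usual withdraw steps and just
	 * return. */
	standby = true;
	fprintf(stderr, "lock_dlm: cluster gone, standing by for fencing\n");
}

/* Called on every DLM completion; 'error' is the DLM status. */
static int handle_dlm_status(int error)
{
	if (error == -ECLUSTERGONE || standby) {
		if (!standby)
			enter_standby();
		return -1;         /* quietly fail the request */
	}

	if (error) {
		/* current behaviour: assertion "!error" -> panic */
		fprintf(stderr, "lock_dlm: assertion failed: !error (%d)\n",
			error);
		abort();
	}
	return 0;
}

int main(void)
{
	handle_dlm_status(0);              /* normal completion */
	handle_dlm_status(-ECLUSTERGONE);  /* cluster lost: stand by */
	handle_dlm_status(-22);            /* later errors swallowed too */
	return 0;
}
```

Under this scheme the -22 (EINVAL) completions seen in the console dump would be absorbed in standby mode rather than triggering the line-411 BUG in do_dlm_lock().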