Bug 148788

Summary: force shutdown when network/cluster goes down
Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: GFS Bugs <gfs-bugs>
CC: amanthei, cluster-maint, cmarthal
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Keywords: FutureFeature
Hardware: All
OS: Linux
Doc Type: Enhancement
Last Closed: 2005-03-17 15:54:53 UTC

Description David Teigland 2005-02-15 17:10:35 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
This really comes down to being able to cleanly shut down live gfs
file systems on demand when something gfs depends on goes away.

Ken has already done this for the case where the fs storage goes
away (the new "withdraw" feature).  We need something similar for
the case where the cluster manager goes away (or the network goes
away, which would look about the same as the cluster going away).

Once gfs supports that, then instead of lock_dlm panicking on an
error, gfs/lock_dlm would initiate the new "network/cluster is down
so force a shutdown" feature.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. shut down cman while gfs is running

Actual Results: You get an assertion failure in lock_dlm when the dlm
starts returning errors to lock requests.
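
For reference, the failure mode comes from an assert-and-panic pattern on
the lock error path.  Below is a rough user-space approximation of that
pattern (fprintf/abort standing in for the kernel's printk/BUG); it is
paraphrased for illustration, not copied from the lock_dlm source.

#include <stdio.h>
#include <stdlib.h>

/* approximation of the lock_dlm assertion pattern; the real code is a
 * kernel-side macro that calls BUG(), which is what produces the panic */
#define LOCK_DLM_ASSERT(x, extra)                                       \
    do {                                                                \
        if (!(x)) {                                                     \
            fprintf(stderr,                                             \
                "lock_dlm:  Assertion failed on line %d of file %s\n"   \
                "lock_dlm:  assertion:  \"%s\"\n",                      \
                __LINE__, __FILE__, #x);                                \
            extra;                                                      \
            abort();    /* kernel: BUG() -> panic */                    \
        }                                                               \
    } while (0)

int main(void)
{
    int error = -22;    /* the dlm error seen in the traces below */

    /* once the dlm starts returning errors, this fires on every request */
    LOCK_DLM_ASSERT(!error,
        fprintf(stderr, "gfs0: err=%d cur=0 req=5 lkf=4\n", error));
    return 0;
}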

Additional info:

Comment 1 David Teigland 2005-02-15 17:12:43 UTC
*** Bug 148016 has been marked as a duplicate of this bug. ***

Comment 2 David Teigland 2005-02-16 01:40:12 UTC
I realized shortly after writing this that I missed a fairly
obvious simplification of the problem.  When the network/cluster
goes down and the node loses contact with the other nodes, it's
going to be fenced.

Lock_dlm sees this when it starts getting errors from the DLM
(we could make it a specific error the dlm returns when the cluster
is gone).  Lock_dlm would then, instead of panicking, just stop
doing anything and stand by waiting to be fenced.

If the fencing method is power-based, the node will be reset and it
doesn't really matter in the end that it didn't panic.  But if it's
SAN-based fencing, then gfs will likely see the fs storage
disappear.  This will cause gfs to do a withdraw to shut down the
fs.  Lock_dlm would skip the usual withdraw steps in this situation
and just return.

This means that no additional feature is necessarily needed in gfs.
Lock_dlm is the place where there would be a little work to add a
"standby" mode that it could enter when it gets errors, instead of
asserting.
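
A minimal sketch of what that could look like, assuming a simple
per-lockspace standby flag; the names below (struct lockspace,
submit_lock, do_withdraw) are made up for illustration and are not the
actual lock_dlm interfaces.

#include <stdbool.h>
#include <stdio.h>

struct lockspace {
    const char *name;
    bool standby;    /* set once the dlm says the cluster is gone */
};

/* instead of asserting on a dlm error, flip into standby and wait for fencing */
static int submit_lock(struct lockspace *ls, int dlm_error)
{
    if (ls->standby)
        return -1;                  /* quietly refuse new work */
    if (dlm_error) {
        ls->standby = true;
        fprintf(stderr, "%s: cluster gone, standing by for fencing\n",
                ls->name);
        return -1;
    }
    return 0;                       /* normal completion */
}

/* withdraw triggered by the storage vanishing (SAN-based fencing case):
 * skip the usual withdraw steps and just return */
static int do_withdraw(struct lockspace *ls)
{
    if (ls->standby)
        return 0;
    /* ... the normal withdraw handshake would go here ... */
    return 0;
}

int main(void)
{
    struct lockspace ls = { .name = "gfs0", .standby = false };

    submit_lock(&ls, -22);    /* first dlm error switches to standby */
    submit_lock(&ls, 0);      /* later requests are refused, no assert */
    do_withdraw(&ls);         /* withdraw becomes a no-op in standby */
    return 0;
}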


Comment 3 Corey Marthaler 2005-02-18 16:31:31 UTC
FWIW, I'm seeing this issue as well during cman/dlm recovery testing.

Comment 4 Corey Marthaler 2005-03-17 15:39:11 UTC
reproduced this on morph-03 last night after 56 iterations of revolver:

From the SYSLOG:
Mar 16 14:15:08 morph-03 kernel: CMAN: node morph-04.lab.msp.redhat.com rejoining
Mar 16 14:15:15 morph-03 kernel: CMAN: node morph-03.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Mar 16 14:15:15 morph-03 kernel: CMAN: killed by NODEDOWN message
Mar 16 14:15:15 morph-03 kernel: CMAN: we are leaving the cluster. No response
to messages
Mar 16 14:15:15 morph-03 kernel: 00 node -1/-1 "       5
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 59027f lq 1 flg 200000 node -1/-1 "
      2
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 540210 lq 1 flg 200000 node -1/-1 "
      7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 "
      7
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 530258 lq 1 flg 200000 node -1/-1 "
     11
Mar 16 14:15:15 morph-03 kernel: gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 "
      7
Mar 16 14:15:15 morph-03 kernel: gfs0 resent 6 requests
Mar 16 14:15:15 morph-03 kernel: gfs0 recover event 39 finished
Mar 16 14:15:15 morph-03 kernel: gfs1 move flags 0,0,1 ids 37,39,39
Mar 16 14:15:15 morph-03 kernel: gfs1 process held requests
Mar 16 14:15:17 morph-03 kernel: gfs1 processed 0 requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend marked requests
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 "
      5
Mar 16 14:15:17 morph-03 kernel: gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 "
      5
Mar 16 14:15:18 morph-03 kernel: gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 "
     11
Mar 16 14:15:20 morph-03 kernel: gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 "
      5
Mar 16 14:15:21 morph-03 kernel: gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 "
      7



From the console:
6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4

lock_dlm:  Assertion failed on line 411 of file
/usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 920011
gfs1: num=11,67773b2 err=-22 cur=0 req=5 lkf=4

Recursive die() failure, output suppressed
 <0>Fatal exception: panic in 5 seconds
00 node -1/-1 "       5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 "       2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 "      11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 "       7
gfs0 resent 6 requests
gfs0 recover event 39 finished
gfs1 move flags 0,0,1 ids 37,39,39
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resend 5d0171 lq 1 flg 200000 node -1/-1 "       5
gfs1 resend 5b0193 lq 1 flg 200000 node -1/-1 "       5
gfs1 resend 4b0378 lq 1 flg 200000 node -1/-1 "      11
gfs1 resend 4d0050 lq 1 flg 200000 node -1/-1 "       5
gfs1 resend 4800a3 lq 1 flg 200000 node -1/-1 "       7
gfs1 resend 510092 lq 1 flg 200000 node -1/-1 "      11
gfs1 resend 5f01d1 lq 1 flg 200000 node -1/-1 "       7
gfs1 resent 7 requests
gfs1 recover event 39 finished
,240035 id 0 -1,5 0
6164 req 7,1859c8 ex 0-7fffffffffffffff lkf 2000 wait 1
6164 lk 7,1859c8 id 0 -1,5 2000
6164 lk 11,1859c8 id 5e02e1 5,0 4
5858 qc 7,1859c8 -1,5 id 50030c sts 0 0
5858 qc 11,1859c8 5,0 id 5e02e1 sts 0 0
6164 ex plock 0
5858 qc 2,240035 -1,5 id 460268 sts 0 0
6167 lk 5,240035 id 0 -1,3 0
5858 qc 5,240035 -1,3 id 5900f4 sts 0 0
6165 en punlock 7,1b4348
6165 lk 11,1b4348 id 5202f4 0,5 4
5858 qc 11,1b4348 0,5 id 5202f4 sts 0 0
6165 remove 7,1b4348
6165 un 7,1b4348 62018f 5 0
5858 qc 7,1b4348 5,5 id 62018f sts -65538 0
6165 lk 11,1b4348 id 5202f4 5,0 4
5858 qc 11,1b4348 5,0 id 5202f4 sts 0 0
6165 ex punlock 0
6165 en plock 7,21ffd5
6165 lk 11,21ffd5 id 0 -1,5 0
6112 en punlock 7,67773bd
6112 lk 11,67773bd id 803b3 0,5 4
5858 qc 11,67773bd 0,5 id 803b3 sts 0 0
6112 remove 7,67773bd
6112 un 7,67773bd 580153 5 0
5858 qc 7,67773bd 5,5 id 580153 sts -65538 0
6112 lk 11,67773bd id 803b3 5,0 4
5858 qc 11,67773bd 5,0 id 803b3 sts 0 0
6112 ex punlock 0
6112 en plock 7,6797387
5858 qc 11,21ffd5 -1,5 id 5f0165 sts 0 0
6116 en punlock 7,1495ab
6116 lk 11,1495ab id 6f0378 0,5 4
5858 qc 11,1495ab 0,5 id 6f0378 sts 0 0
6116 remove 7,1495ab
6116 un 7,1495ab 730382 5 0
5858 qc 7,1495ab 5,5 id 730382 sts -65538 0
6116 lk 11,1495ab id 6f0378 5,0 4
5858 qc 11,1495ab 5,0 id 6f0378 sts 0 0
6116 ex punlock 0
6116 en plock 7,25fee1
6116 lk 11,25fee1 id 6502d4 0,5 4
5858 qc 11,25fee1 0,5 id 6502d4 sts 0 0
6116 req 7,25fee1 ex 0-7fffffffffffffff lkf 2000 wait 1
6116 lk 7,25fee1 id 0 -1,5 2000
6116 lk 11,25fee1 id 6502d4 5,0 4
5858 qc 7,25fee1 -1,5 id 5d017f sts 0 0
5858 qc 11,25fee1 5,0 id 6502d4 sts 0 0
6116 ex plock 0
6165 req 7,21ffd5 ex 0-7fffffffffffffff lkf 2000 wait 1
6165 lk 7,21ffd5 id 0 -1,5 2000
6165 lk 11,21ffd5 id 5f0165 5,0 4
5858 qc 7,21ffd5 -1,5 id 6202ce sts 0 0
5858 qc 11,21ffd5 5,0 id 5f0165 sts 0 0
6165 ex plock 0
6168 en punlock 7,67a7370
6168 lk 11,67a7370 id 3034f 0,5 4
5858 qc 11,67a7370 0,5 id 3034f sts 0 0
6168 remove 7,67a7370
6168 un 7,67a7370 550058 5 0
5858 qc 7,67a7370 5,5 id 550058 sts -65538 0
6168 lk 11,67a7370 id 3034f 5,0 4
5858 qc 11,67a7370 5,0 id 3034f sts 0 0
6168 ex punlock 0
6168 en plock 7,67773b5
6168 lk 11,67773b5 id 201f9 0,5 4
5858 qc 11,67773b5 0,5 id 201f9 sts 0 0
6168 req 7,67773b5 ex 0-7fffffffffffffff lkf 2000 wait 1
6168 lk 7,67773b5 id 0 -1,5 2000
6168 lk 11,67773b5 id 201f9 5,0 4
5858 qc 11,67773b5 5,0 id 201f9 sts 0 0
5858 qc 7,67773b5 -1,5 id 560148 sts 0 0
6168 ex plock 0
6113 en punlock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 remove 7,67773b8
6113 un 7,67773b8 6403dd 5 0
5858 qc 7,67773b8 5,5 id 6403dd sts -65538 0
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6113 ex punlock 0
6113 en plock 7,67773b8
6113 lk 11,67773b8 id 9017c 0,5 4
5858 qc 11,67773b8 0,5 id 9017c sts 0 0
6113 req 7,67773b8 ex 0-7fffffffffffffff lkf 2000 wait 1
6113 lk 7,67773b8 id 0 -1,5 2000
6113 lk 11,67773b8 id 9017c 5,0 4
5858 qc 11,67773b8 5,0 id 9017c sts 0 0
6109 en punlock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 remove 7,67773c1
6109 un 7,67773c1 7b0163 5 0
5858 qc 7,67773c1 5,5 id 7b0163 sts -65538 0
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6109 ex punlock 0
6109 en plock 7,67773c1
6109 lk 11,67773c1 id 40224 0,5 4
5858 qc 11,67773c1 0,5 id 40224 sts 0 0
6109 req 7,67773c1 ex 2d6f00-2da4dd lkf 2000 wait 1
6109 lk 7,67773c1 id 0 -1,5 2000
5858 qc 7,67773b8 -1,5 id 640103 sts 0 0
6113 ex plock 0
6018 un 5,21ffbc 4b02d8 3 0
5858 qc 5,21ffbc 3,3 id 4b02d8 sts -65538 0
6170 en punlock 7,67a7374
6170 lk 11,67a7374 id 80043 0,5 4
6171 en punlock 7,67a736f
6171 lk 11,67a736f id 4029c 0,5 4
6109 lk 11,67773c1 id 40224 5,0 4
5858 qc 11,67773c1 5,0 id 40224 sts 0 0
6166 en plock 7,16c610
6166 lk 11,16c610 id 0 -1,5 0
6114 en punlock 7,24ff11
6114 lk 11,24ff11 id 54026a 0,5 4
6102 en punlock 7,67773c3
6102 lk 11,67773c3 id 401c1 0,5 4
6157 en punlock 7,67773b2
6157 lk 11,67773b2 id 70255 0,5 4
6115 en punlock 7,38fd31
6115 lk 11,38fd31 id 6c01d0 0,5 4

lock_dlm:  Assertion failed on line 411 of file
/usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 920591
gfs0: num=11,38fd31 err=-22 cur=0 req=5 lkf=4

------------[ cut here ]------------
kernel BUG at /usr/src/build/540056-i686/BUILD/hugemem/src/dlm/lock.c:411!
invalid operand: 0000 [#6]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U)
lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac
uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82a7b7a6>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-6.16.ELhugemem)
EIP is at do_dlm_lock+0x147/0x161 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: 7bcf7e54   edx: 82a80052
esi: 82a7b7c5   edi: 81c60c00   ebp: 7750cd80   esp: 7bcf7e50
ds: 007b   es: 007b   ss: 0068
Process genesis (pid: 6115, threadinfo=7bcf7000 task=7bef53b0)
Stack: 82a80052 20202020 31312020 20202020 20202020 38332020 31336466 00000018
       00000000 dead4ead 7750cd80 7750cdc0 7750cdbc 82a7b7fd 7bcf7ea0 56922600
       7bcf7f10 8074336c 82a7d171 7750cd80 0038fd31 00000000 00000011 56922600
Call Trace:
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
Code: <3>Debug: sleeping function called from invalid context at
include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<0211f371>] __might_sleep+0x7d/0x88
 [<0215035f>] rw_vm+0xdb/0x282
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<021507b9>] get_user_size+0x30/0x57
 [<82a7b77b>] do_dlm_lock+0x11c/0x161 [lock_dlm]
 [<0210615b>] show_registers+0x115/0x16c
 [<021062f2>] die+0xdb/0x16b
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106730>] do_invalid_op+0xcc/0xd5
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<0220007b>] agpioc_bind_wrap+0x18/0x3a
 [<021112a6>] delay_tsc+0xb/0x13
 [<021b5b59>] __delay+0x9/0xa
 [<0220e431>] serial8250_console_write+0x16c/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0220e2c5>] serial8250_console_write+0x0/0x1b2
 [<0212175a>] __call_console_drivers+0x36/0x40
 [<82a7b7c5>] lock_bast+0x0/0x5 [lock_dlm]
 [<82a7b7a6>] do_dlm_lock+0x147/0x161 [lock_dlm]
 [<82a7b7fd>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
 [<82a7d171>] lock_resource+0x70/0x93 [lock_dlm]
 [<82a7ea3c>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
 [<82b57e73>] gfs_lm_punlock+0x30/0x3f [gfs]
 [<82b63007>] gfs_lock+0xd4/0xf2 [gfs]
 [<82b62f33>] gfs_lock+0x0/0xf2 [gfs]
 [<02167c35>] fcntl_setlk+0x102/0x236
 [<02153955>] do_truncate+0x87/0xbc
 [<02164276>] do_fcntl+0x10c/0x155
 [<02164385>] sys_fcntl64+0x6c/0x7d
00 node -1/-1 "       5
gfs0 resend 59027f lq 1 flg 200000 node -1/-1 "       2
gfs0 resend 540210 lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 5c009e lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 530258 lq 1 flg 200000 node -1/-1 "      11
gfs0 resend 6301a9 lq 1 flg 200000 node -1/-1 "       7


Comment 5 David Teigland 2005-03-17 15:54:53 UTC
Comment #4 belongs to bz 139738 -- I've copied over the relevant
part.

I'm going to close this bug because it's simply not a bug, it won't
be resolved in the v4 version, and it causes endless confusion.