Bug 179672 - withdraw causes kernel panic
Status: CLOSED WONTFIX
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Ryan O'Hara
QA Contact: Cluster QE
Reported: 2006-02-01 18:23 EST by Ryan O'Hara
Modified: 2009-04-16 16:27 EDT (History)
Doc Type: Bug Fix
Last Closed: 2007-08-22 14:30:10 EDT

Attachments:
log file showing withdraw and panic. (5.91 KB, text/plain)
2006-02-01 18:23 EST, Ryan O'Hara

Description Ryan O'Hara 2006-02-01 18:23:42 EST
Description of problem:

I caused a node to withdraw from a cluster by killing its I/O path. The cluster
had four nodes, three of which were doing a large amount of I/O over fibre
channel I/O paths to a shared GFS filesystem. I killed the I/O path of one of
the nodes that was actively doing I/O by closing its port on the fibre channel
switch. The node decided to withdraw, and then the kernel panicked.

Feb  1 17:05:58 trin-05 kernel: ------------[ cut here ]------------
Feb  1 17:05:58 trin-05 kernel: kernel BUG at fs/locks.c:1799!
Feb  1 17:05:58 trin-05 kernel: invalid operand: 0000 [#1]
Feb  1 17:05:58 trin-05 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) qla2300 qla2xxx scsi_transport_fc parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sd_mod scsi_mod
Feb  1 17:05:58 trin-05 kernel: CPU:    0
Feb  1 17:05:58 trin-05 kernel: EIP:    0060:[<c018112d>]    Not tainted VLI
Feb  1 17:05:58 trin-05 kernel: EFLAGS: 00010246   (2.6.9-28.EL)
Feb  1 17:05:58 trin-05 kernel: EIP is at locks_remove_flock+0x119/0x1b4
Feb  1 17:05:58 trin-05 kernel: eax: dc793a00   ebx: d85b0ea4   ecx: d85a3d04   edx: 00000081
Feb  1 17:05:58 trin-05 kernel: esi: e0494e60   edi: d85b0dcc   ebp: da0fad80   esp: dc9a0ee4
Feb  1 17:05:58 trin-05 kernel: ds: 007b   es: 007b   ss: 0068
Feb  1 17:05:58 trin-05 kernel: Process accordion (pid: 3649, threadinfo=dc9a0000 task=da2e6cd0)
Feb  1 17:05:58 trin-05 kernel: Stack: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Feb  1 17:05:58 trin-05 kernel:        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Feb  1 17:05:58 trin-05 kernel:        00000202 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Feb  1 17:05:59 trin-05 kernel: Call Trace:
Feb  1 17:05:59 trin-05 kernel:  [<c01699aa>] __fput+0x41/0xee
Feb  1 17:05:59 trin-05 kernel:  [<c01682c6>] filp_close+0x59/0x5f
Feb  1 17:05:59 trin-05 kernel:  [<c0123390>] put_files_struct+0x56/0xbf
Feb  1 17:05:59 trin-05 kernel:  [<c0124383>] do_exit+0x2df/0x59c
Feb  1 17:05:59 trin-05 kernel:  [<c01247d8>] sys_exit_group+0x0/0xd
Feb  1 17:05:59 trin-05 kernel:  [<c03114ab>] syscall_call+0x7/0xb
Feb  1 17:05:59 trin-05 kernel:  [<c031007b>] rwsem_down_read_failed+0x137/0x204
Feb  1 17:05:59 trin-05 kernel: Code: 38 39 68 3c 75 2d 0f b6 50 40 f6 c2 02 74 09 89 d8 e8 52 d8 ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 19 e9 ff ff eb 0a <0f> 0b 07 07 14 39 32 c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Feb  1 17:05:59 trin-05 kernel:  <0>Fatal exception: panic in 5 seconds
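
As a cross-check on the trace above, the bytes at the faulting EIP (the one
bracketed as <0f> in the Code: line) can be decoded by hand. The following is
a minimal sketch, not part of the original report. It assumes the i386 BUG()
layout used by 2.6-era kernels built with verbose BUG reporting: a ud2 opcode
(0f 0b), then a 16-bit little-endian line number, then a 32-bit pointer to the
file name string. The function name decode_bug_bytes is hypothetical.

```python
import struct


def decode_bug_bytes(code_line):
    """Decode the trap record after the <..> marker in an oops Code: line.

    Assumes the i386 BUG() layout: ud2 (0f 0b), 16-bit line number,
    32-bit file-name pointer. Returns (line_number, file_pointer), or
    None if the marker is missing or the bytes do not match that layout.
    """
    tokens = code_line.split()
    # Locate the byte the kernel bracketed as the faulting instruction.
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            break
    else:
        return None
    # ud2 (2 bytes) + line (2 bytes) + pointer (4 bytes) = 8 bytes.
    raw = [tokens[i].strip("<>")] + tokens[i + 1:i + 8]
    if len(raw) < 8:
        return None
    data = bytes(int(b, 16) for b in raw)
    if data[:2] != b"\x0f\x0b":
        return None
    line, file_ptr = struct.unpack_from("<HI", data, 2)
    return line, file_ptr


# Tail of the Code: line from the trace above.
sample = "eb 0a <0f> 0b 07 07 14 39 32 c0 89 c3"
line, file_ptr = decode_bug_bytes(sample)
print(line, hex(file_ptr))  # prints: 1799 0xc0323914
```

Under that assumption, the decoded line number 1799 agrees with the
"kernel BUG at fs/locks.c:1799!" message, so the trap is the BUG() call
itself rather than a stray invalid opcode.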

Version-Release number of selected component (if applicable):

dlm-1.0.0-5
kernel-2.6.9-28.EL
GFS-6.1.4-0

How reproducible:
Always.

Steps to Reproduce:
1. Configure cluster with shared GFS storage.
2. Run large amount of I/O over I/O paths.
3. Kill one of the I/O paths by either closing its port or pulling the cable.
  
Actual results:
Kernel panic.

Expected results:
Successful withdraw.

Additional info:
The I/O paths in my cluster were fibre channel, but I'm not suggesting that this
has anything to do with that. Just a detail to explain how I killed the I/O
path. A more complete log is attached.
Comment 1 Ryan O'Hara 2006-02-01 18:23:43 EST
Created attachment 124017 [details]
log file showing withdraw and panic.
Comment 2 Akshat Aranya 2006-04-14 17:45:36 EDT
This bug happened for me over NFS with 2.6.9-34.ELsmp kernel.  I put it in the
Linux kernel bugzilla (http://bugzilla.kernel.org/show_bug.cgi?id=3986) but the
kernel maintainers are unwilling to help with a 2.6.9 kernel.  The scenario
where it occurred for me is when I tried to interrupt a parallel make that
writes over NFS.

Hardware Environment:  2 x dual-core AMD Opteron, 4GB RAM
Software Environment:  RHEL 4 SMP kernel.  Process g++ while trying to write
over NFS.

Problem Description:

Apr 14 15:10:21 hfs12 kernel: kernel BUG at fs/locks.c:1799!
Apr 14 15:10:21 hfs12 kernel: invalid operand: 0000 [#1]
Apr 14 15:10:21 hfs12 kernel: SMP
Apr 14 15:10:21 hfs12 kernel: Modules linked in: nfsd exportfs parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery ac md5 ipv6 ohci_hcd hw_random e100 mii tg3 floppy ext3 jbd raid0 sata_sil libata sd_mod scsi_mod
Apr 14 15:10:21 hfs12 kernel: CPU:    1
Apr 14 15:10:21 hfs12 kernel: EIP:    0060:[<c016dd4c>]    Not tainted VLI
Apr 14 15:10:21 hfs12 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp)
Apr 14 15:10:21 hfs12 kernel: EIP is at locks_remove_flock+0xa1/0xe1
Apr 14 15:10:21 hfs12 kernel: eax: f64efa8c   ebx: f5be620c   ecx: 00000000   edx: 00000081
Apr 14 15:10:21 hfs12 kernel: esi: 00000000   edi: f5be6164   ebp: f58c06c0   esp: f40b3f2c
Apr 14 15:10:21 hfs12 kernel: ds: 007b   es: 007b   ss: 0068
Apr 14 15:10:21 hfs12 kernel: Process g++-4.0 (pid: 14863, threadinfo=f40b3000 task=f36ef830)
Apr 14 15:10:21 hfs12 kernel: Stack: f58c06c0 f896643a f40b3f44 f8966e2a f8c3abd7 c016dca4 f40b3f6c 00000001
Apr 14 15:10:21 hfs12 kernel:        00000000 00000001 f5be60f8 f378c3c0 00003a0f f8c426ac 00000000 ffffffff
Apr 14 15:10:21 hfs12 kernel:        f6020f40 f58c06c0 00000201 00000000 00000000 00000246 00000000 f58c06c0
Apr 14 15:10:21 hfs12 kernel: Call Trace:
Apr 14 15:10:21 hfs12 kernel:  [<f896643a>] nlm_put_lockowner+0x11/0x49 [lockd]
Apr 14 15:10:21 hfs12 kernel:  [<f8966e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Apr 14 15:10:21 hfs12 kernel:  [<f8c3abd7>] nfs_lock+0x0/0xc7 [nfs]
Apr 14 15:10:21 hfs12 kernel:  [<c016dca4>] locks_remove_posix+0x130/0x137
Apr 14 15:10:21 hfs12 kernel:  [<f8c426ac>] nfs_wait_on_requests+0x7e/0xba [nfs]
Apr 14 15:10:21 hfs12 kernel:  [<c015b0c6>] __fput+0x41/0x100
Apr 14 15:10:21 hfs12 kernel:  [<c0159d21>] filp_close+0x59/0x5f
Apr 14 15:10:21 hfs12 kernel:  [<c02d2657>] syscall_call+0x7/0xb
Apr 14 15:10:21 hfs12 kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 b3 df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ce ec ff ff eb 0a <0f> 0b 07 07 6c 74 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Apr 14 15:10:21 hfs12 kernel:  <0>Fatal exception: panic in 5 seconds
Comment 5 Ryan O'Hara 2006-09-22 15:33:23 EDT
I don't think that the original bug reported here is related to comment #2. The
original problem was seen while using GFS in a clustered environment when a node
withdrew from the filesystem. The problem reported in comment #2 is occurring
without GFS being involved. While they could be related, I think it is more
likely that they are two different problems that happen to panic in the same
place. It seems like flocks are not being cleaned up properly.
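
One way to make the "same place" comparison concrete is to pull the BUG site,
the faulting symbol, and the call-trace symbols out of each oops and compare
them. The following is a minimal sketch, not part of the original thread. The
function name oops_signature is hypothetical, and the regular expressions
assume the syslog oops format shown in this report; the two sample strings are
short excerpts of the traces in the description and in comment #2.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
EIP_RE = re.compile(r"EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Call-trace entries look like "[<c01699aa>] __fput+0x41/0xee".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x")


def oops_signature(log_text):
    """Summarize an oops: BUG site, faulting symbol, call-trace symbols."""
    bug = BUG_RE.search(log_text)
    eip = EIP_RE.search(log_text)
    return {
        "bug_site": (bug.group(1), int(bug.group(2))) if bug else None,
        "eip_symbol": eip.group(1) if eip else None,
        "frames": FRAME_RE.findall(log_text),
    }


gfs_oops = """kernel BUG at fs/locks.c:1799!
EIP is at locks_remove_flock+0x119/0x1b4
 [<c01699aa>] __fput+0x41/0xee
 [<c01682c6>] filp_close+0x59/0x5f
"""
nfs_oops = """kernel BUG at fs/locks.c:1799!
EIP is at locks_remove_flock+0xa1/0xe1
 [<f896643a>] nlm_put_lockowner+0x11/0x49 [lockd]
 [<c016dca4>] locks_remove_posix+0x130/0x137
 [<c015b0c6>] __fput+0x41/0x100
"""

gfs_sig = oops_signature(gfs_oops)
nfs_sig = oops_signature(nfs_oops)
print("same BUG site:", gfs_sig["bug_site"] == nfs_sig["bug_site"])
print("frames only in NFS trace:",
      sorted(set(nfs_sig["frames"]) - set(gfs_sig["frames"])))
```

Both excerpts report the same BUG site and faulting symbol, while the lockd
frames appear only in the NFS trace. That is consistent with two different
paths reaching the same check, though it does not by itself prove the root
causes differ.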
Comment 6 David Teigland 2006-10-17 12:45:28 EDT
I'm letting Ryan decide what the status of this is.
Comment 7 Kiersten (Kerri) Anderson 2007-01-04 12:21:21 EST
Moving this to NEEDINFO; we haven't been able to recreate this one recently.
Comment 8 Ryan O'Hara 2007-08-22 14:30:10 EDT
Have not seen this problem in quite some time. Closing.
