Bug 179672

Summary: withdraw causes kernel panic
Product: [Retired] Red Hat Cluster Suite
Reporter: Ryan O'Hara <rohara>
Component: dlm
Assignee: Ryan O'Hara <rohara>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: aaranya, ccaulfie, cluster-maint, juanino, k.georgiou, rohara
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-08-22 18:30:10 UTC
Attachments:
log file showing withdraw and panic. (flags: none)

Description Ryan O'Hara 2006-02-01 23:23:42 UTC
Description of problem:

I caused a node to withdraw from a cluster by killing its I/O path. The cluster
had four nodes, three of which were doing a large amount of I/O over fibre
channel I/O paths to a shared GFS filesystem. I killed the I/O path of one of
the nodes that was actively doing I/O by closing its port on the fibre channel
switch. The node decided to withdraw and then the kernel panicked.

Feb  1 17:05:58 trin-05 kernel: ------------[ cut here ]------------
Feb  1 17:05:58 trin-05 kernel: kernel BUG at fs/locks.c:1799!
Feb  1 17:05:58 trin-05 kernel: invalid operand: 0000 [#1]
Feb  1 17:05:58 trin-05 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harne
ss(U) qla2300 qla2xxx scsi_transport_fc parport_pc lp parport autofs4 i2c_dev i2
c_core dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sd_mod scsi_mod
Feb  1 17:05:58 trin-05 kernel: CPU:    0
Feb  1 17:05:58 trin-05 kernel: EIP:    0060:[<c018112d>]    Not tainted VLI
Feb  1 17:05:58 trin-05 kernel: EFLAGS: 00010246   (2.6.9-28.EL)
Feb  1 17:05:58 trin-05 kernel: EIP is at locks_remove_flock+0x119/0x1b4
Feb  1 17:05:58 trin-05 kernel: eax: dc793a00   ebx: d85b0ea4   ecx: d85a3d04
edx: 00000081
Feb  1 17:05:58 trin-05 kernel: esi: e0494e60   edi: d85b0dcc   ebp: da0fad80
esp: dc9a0ee4
Feb  1 17:05:58 trin-05 kernel: ds: 007b   es: 007b   ss: 0068
Feb  1 17:05:58 trin-05 kernel: Process accordion (pid: 3649, threadinfo=dc9a000
0 task=da2e6cd0)
Feb  1 17:05:58 trin-05 kernel: Stack: 00000000 00000000 00000000 00000000 00000
000 00000000 00000000 00000000
Feb  1 17:05:58 trin-05 kernel:        00000000 00000000 00000000 00000000 00000
000 00000000 00000000 00000000
Feb  1 17:05:58 trin-05 kernel:        00000202 00000000 00000000 00000000 00000
000 00000000 00000000 00000000
Feb  1 17:05:59 trin-05 kernel: Call Trace:
Feb  1 17:05:59 trin-05 kernel:  [<c01699aa>] __fput+0x41/0xee
Feb  1 17:05:59 trin-05 kernel:  [<c01682c6>] filp_close+0x59/0x5f
Feb  1 17:05:59 trin-05 kernel:  [<c0123390>] put_files_struct+0x56/0xbf
Feb  1 17:05:59 trin-05 kernel:  [<c0124383>] do_exit+0x2df/0x59c
Feb  1 17:05:59 trin-05 kernel:  [<c01247d8>] sys_exit_group+0x0/0xd
Feb  1 17:05:59 trin-05 kernel:  [<c03114ab>] syscall_call+0x7/0xb
Feb  1 17:05:59 trin-05 kernel:  [<c031007b>] rwsem_down_read_failed+0x137/0x204
Feb  1 17:05:59 trin-05 kernel: Code: 38 39 68 3c 75 2d 0f b6 50 40 f6 c2 02 74
09 89 d8 e8 52 d8 ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 19 e9 ff ff
 eb 0a <0f> 0b 07 07 14 39 32 c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Feb  1 17:05:59 trin-05 kernel:  <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):

dlm-1.0.0-5
kernel-2.6.9-28.EL
GFS-6.1.4-0

How reproducible:
Always.

Steps to Reproduce:
1. Configure cluster with shared GFS storage.
2. Run a large amount of I/O over the I/O paths.
3. Kill one of the I/O paths by either closing its port or pulling the cable.
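
For step 2, a rough stand-in for the I/O load could look like the sketch below. It is not the workload used in this report (the trace only shows a process named accordion, and its behaviour is not described here). The function name locked_write_load, the choice to hold a POSIX lock while writing, and the chunk size are all assumptions made for illustration.

```python
import fcntl
import os

def locked_write_load(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes to path while holding a POSIX lock on the file.

    Returns the number of bytes written. The lock is not released
    explicitly; it is dropped when the file is closed.
    """
    buf = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as f:
        # fcntl()-style (POSIX) exclusive lock on the whole file.
        # Using a lock at all is an assumption, not part of the report.
        fcntl.lockf(f, fcntl.LOCK_EX)
        while written < total_bytes:
            n = f.write(buf[: total_bytes - written])
            f.flush()
            os.fsync(f.fileno())  # push each chunk out to storage
            written += n
    return written
```

Pointed at a file on the shared filesystem and run on several nodes at once, something like this would keep the I/O path busy and leave a lock held at the moment the path is cut.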
  
Actual results:
Kernel panic.

Expected results:
Successful withdraw.

Additional info:
The I/O paths in my cluster were fibre channel, but I'm not suggesting that
fibre channel has anything to do with the panic. It is just a detail to explain
how I killed the I/O path. A more complete log is attached.

Comment 1 Ryan O'Hara 2006-02-01 23:23:43 UTC
Created attachment 124017
log file showing withdraw and panic.

Comment 2 Akshat Aranya 2006-04-14 21:45:36 UTC
I hit this bug over NFS with the 2.6.9-34.ELsmp kernel. I filed it in the
Linux kernel bugzilla (http://bugzilla.kernel.org/show_bug.cgi?id=3986), but the
kernel maintainers are unwilling to help with a 2.6.9 kernel. It occurred when I
tried to interrupt a parallel make that writes over NFS.

Hardware Environment:  2 x dual-core AMD Opteron, 4 GB RAM
Software Environment:  RHEL 4 SMP kernel. The g++ process was trying to write
over NFS.

Problem Description:

Apr 14 15:10:21 hfs12 kernel: kernel BUG at fs/locks.c:1799!
Apr 14 15:10:21 hfs12 kernel: invalid operand: 0000 [#1]
Apr 14 15:10:21 hfs12 kernel: SMP
Apr 14 15:10:21 hfs12 kernel: Modules linked in: nfsd exportfs parport_pc lp
parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_mirror dm_mod
button battery ac md5 ipv6 ohci_hcd hw_random e100 mii tg3 floppy ext3 jbd raid0
sata_sil libata sd_mod scsi_mod
Apr 14 15:10:21 hfs12 kernel: CPU:    1
Apr 14 15:10:21 hfs12 kernel: EIP:    0060:[<c016dd4c>]    Not tainted VLI
Apr 14 15:10:21 hfs12 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp)
Apr 14 15:10:21 hfs12 kernel: EIP is at locks_remove_flock+0xa1/0xe1
Apr 14 15:10:21 hfs12 kernel: eax: f64efa8c   ebx: f5be620c   ecx: 00000000  
edx: 00000081
Apr 14 15:10:21 hfs12 kernel: esi: 00000000   edi: f5be6164   ebp: f58c06c0  
esp: f40b3f2c
Apr 14 15:10:21 hfs12 kernel: ds: 007b   es: 007b   ss: 0068
Apr 14 15:10:21 hfs12 kernel: Process g++-4.0 (pid: 14863, threadinfo=f40b3000
task=f36ef830)
Apr 14 15:10:21 hfs12 kernel: Stack: f58c06c0 f896643a f40b3f44 f8966e2a
f8c3abd7 c016dca4 f40b3f6c 00000001
Apr 14 15:10:21 hfs12 kernel:        00000000 00000001 f5be60f8 f378c3c0
00003a0f f8c426ac 00000000 ffffffff
Apr 14 15:10:21 hfs12 kernel:        f6020f40 f58c06c0 00000201 00000000
00000000 00000246 00000000 f58c06c0
Apr 14 15:10:21 hfs12 kernel: Call Trace:
Apr 14 15:10:21 hfs12 kernel:  [<f896643a>] nlm_put_lockowner+0x11/0x49 [lockd]
Apr 14 15:10:21 hfs12 kernel:  [<f8966e2a>]
nlmclnt_locks_release_private+0xb/0x14 [lockd]
Apr 14 15:10:21 hfs12 kernel:  [<f8c3abd7>] nfs_lock+0x0/0xc7 [nfs]
Apr 14 15:10:21 hfs12 kernel:  [<c016dca4>] locks_remove_posix+0x130/0x137
Apr 14 15:10:21 hfs12 kernel:  [<f8c426ac>] nfs_wait_on_requests+0x7e/0xba [nfs]
Apr 14 15:10:21 hfs12 kernel:  [<c015b0c6>] __fput+0x41/0x100
Apr 14 15:10:21 hfs12 kernel:  [<c0159d21>] filp_close+0x59/0x5f
Apr 14 15:10:21 hfs12 kernel:  [<c02d2657>] syscall_call+0x7/0xb
Apr 14 15:10:21 hfs12 kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09
89 d8 e8 b3 df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ce ec ff ff eb
0a <0f> 0b 07 07 6c 74 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Apr 14 15:10:21 hfs12 kernel:  <0>Fatal exception: panic in 5 seconds
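
To compare this trace with the one in the description, the identifying fields can be pulled out of each oops. The sketch below is only an illustration: oops_signature is a hypothetical helper, and the two sample strings are the relevant lines copied from the traces in this report with the syslog prefix removed. The offsets after the symbol are ignored because the two kernels are different builds.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
EIP_RE = re.compile(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def oops_signature(log_text):
    """Return (source_file, line, symbol) from an oops, or None if absent."""
    bug = BUG_RE.search(log_text)
    eip = EIP_RE.search(log_text)
    if not bug or not eip:
        return None
    return (bug.group(1), int(bug.group(2)), eip.group(1))

# Lines taken from the GFS trace (2.6.9-28.EL) and the NFS trace (2.6.9-34.ELsmp).
gfs = """kernel BUG at fs/locks.c:1799!
EIP is at locks_remove_flock+0x119/0x1b4"""
nfs = """kernel BUG at fs/locks.c:1799!
EIP is at locks_remove_flock+0xa1/0xe1"""

print(oops_signature(gfs) == oops_signature(nfs))
```

Both traces give the same file, line, and symbol, which is why the two reports look alike even though the filesystems differ.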

Comment 5 Ryan O'Hara 2006-09-22 19:33:23 UTC
I don't think that the original bug reported here is related to comment #2. The
original problem was seen while using GFS in a clustered environment when a
node withdrew from the filesystem. The problem reported in comment #2 is
occurring without GFS being involved. While they could be related, I think it is
more likely that they are two different problems that happen to panic in the
same place. It seems like flocks are not being cleaned up properly.
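
One way to look closer at the "same place" observation is to decode the register dumps. In both traces the Code: bytes just before the trapping instruction (<0f> 0b) test bits 0x02 and 0x20 of dl, and edx is 00000081 in both. The sketch below assumes that edx holds the lock's fl_flags at that point, and that the flag values match those in 2.6-era include/linux/fs.h. Both are assumptions written from memory and should be checked against the 2.6.9-28.EL source. decode_fl_flags and would_hit_bug are hypothetical helper names.

```python
# Flag bits as recalled from 2.6-era include/linux/fs.h.
# ASSUMPTION: verify these values against the kernel source in use.
FL_FLAGS = {
    0x01: "FL_POSIX",
    0x02: "FL_FLOCK",
    0x08: "FL_ACCESS",
    0x10: "FL_LOCKD",
    0x20: "FL_LEASE",
    0x80: "FL_SLEEP",
}

def decode_fl_flags(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(FL_FLAGS.items()) if value & bit]

def would_hit_bug(value):
    """True if the lock is neither a flock nor a lease.

    ASSUMPTION: this mirrors the two bit tests seen in the Code: bytes.
    """
    return not (value & 0x02) and not (value & 0x20)

edx = 0x81  # edx in both register dumps
print(decode_fl_flags(edx))
print(would_hit_bug(edx))
```

If the assumptions hold, both panics involve a POSIX-style lock that was still attached to the file when it was closed, rather than a flock or a lease.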


Comment 6 David Teigland 2006-10-17 16:45:28 UTC
I'm letting Ryan decide what the status of this is.


Comment 7 Kiersten (Kerri) Anderson 2007-01-04 17:21:21 UTC
Moving this to NEEDINFO, haven't been able to recreate this one recently.

Comment 8 Ryan O'Hara 2007-08-22 18:30:10 UTC
Have not seen this problem in quite some time. Closing.