Bug 471278 - Samba I/O(?) causes kernel Oops/Panic/Fatal Exception in 2.6.9-67.ELsmp
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: i386
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Anton Arapov
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-11-12 19:54 UTC by shawn oconnor
Modified: 2014-06-18 08:02 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-12-18 09:32:54 UTC
Target Upstream Version:
Embargoed:


Description shawn oconnor 2008-11-12 19:54:16 UTC
Description of problem: 

I have a RHEL 4.6 server (2.6.9-67.ELsmp) running as a NAS head. After some period of uptime, I get an Oops and panic. There is significant I/O load on this box, but it seems unlikely that load is the only factor, as the crashes are usually days apart. In the log extract I have, the smbd process is implicated, so I upgraded to Samba 3.0.28 before the most recent crash.

Version-Release number of selected component (if applicable):

RHEL 4U6, kernel 2.6.9-67.ELsmp, Samba 3.0.28


How reproducible:

No clear trigger event is obvious, although heavy I/O through Samba could be implicated. This server has been in operation for quite some time without issue, but the load has recently increased as we've consolidated our network storage.

Actual results:

Oops and panic.

Expected results:

Normal file services.

Additional info:

Nov  7 15:38:24 nas smbd[16997]: [2008/11/07 15:38:24, 0] smbd/nttrans.c:call_nt_transact_ioctl(2481)
Nov  7 15:38:24 nas smbd[16997]:   call_nt_transact_ioctl(0x9005c): Currently not implemented.
Nov  7 15:45:01 nas nss_wins[8082]: authenticated mount request from 192.168.129.100:862 for /shares/dev/DISTRIB (/shares/dev)
Nov  7 15:46:49 nas kernel: Unable to handle kernel paging request at virtual address 00100104
Nov  7 15:46:49 nas kernel:  printing eip:
Nov  7 15:46:49 nas kernel: c012adb0
Nov  7 15:46:49 nas kernel: *pde = 31639001
Nov  7 15:46:49 nas kernel: Oops: 0002 [#1]
Nov  7 15:46:49 nas kernel: SMP
Nov  7 15:46:49 nas kernel: Modules linked in: vfat fat iptable_filter ip_tables joydev nfsd exportfs lockd nfs_acl lp autofs4 i2c_dev i2c_core vmnet(U) parport_pc parport vmmon(U) sunrpc xfs_quota(U) xfs(U) dm_mirror dm_mod button battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random s2io(U) bnx2 ext3 jbd mppVhba(U) qla2400(U) qla2xxx(U) qla2xxx_conf(U) ata_piix libata aacraid(U) mppUpper(U) sg sd_mod scsi_mod
Nov  7 15:46:49 nas kernel: CPU:    3
Nov  7 15:46:49 nas kernel: EIP:    0060:[<c012adb0>]    Tainted: P      VLI
Nov  7 15:46:49 nas kernel: EFLAGS: 00010002   (2.6.9-67.ELsmp)
Nov  7 15:46:49 nas kernel: EIP is at free_uid+0x22/0x60
Nov  7 15:46:49 nas kernel: eax: 00100100   ebx: dc55f640   ecx: dc55f658   edx: 00200200
Nov  7 15:46:49 nas kernel: esi: 00000082   edi: d3f42f28   ebp: 00000000   esp: d3f42ea4
Nov  7 15:46:49 nas kernel: ds: 007b   es: 007b   ss: 0068
Nov  7 15:46:49 nas kernel: Process smbd (pid: 13508, threadinfo=d3f42000 task=df046770)
Nov  7 15:46:49 nas kernel: Stack: e3f6e4f0 db56eb88 c012b4d0 0000000a 00000000 d3f42f28 df046770 df046c64
Nov  7 15:46:49 nas kernel:        c012b557 d3f42000 df046c64 d3f42000 d3f42000 c012cbc2 c03518c0 df046c64
Nov  7 15:46:49 nas kernel:        d3f42fc4 d3f42f08 d3f42f28 d3f42fc4 df046c64 d3f42000 d3f42000 c0105bd4
Nov  7 15:46:49 nas kernel: Call Trace:
Nov  7 15:46:49 nas kernel:  [<c012b4d0>] __dequeue_signal+0xfb/0x155
Nov  7 15:46:49 nas kernel:  [<c012b557>] dequeue_signal+0x2d/0x54
Nov  7 15:46:49 nas kernel:  [<c012cbc2>] get_signal_to_deliver+0xcf/0x346
Nov  7 15:46:49 nas kernel:  [<c0105bd4>] do_signal+0x55/0xd9
Nov  7 15:46:49 nas kernel:  [<c011e7fb>] __wake_up_common+0x36/0x51
Nov  7 15:46:49 nas kernel:  [<c02d644d>] schedule+0x855/0x8f3
Nov  7 15:46:49 nas kernel:  [<c02d644d>] schedule+0x855/0x8f3
Nov  7 15:46:49 nas kernel:  [<c02d647d>] schedule+0x885/0x8f3
Nov  7 15:46:49 nas kernel:  [<c0105c80>] do_notify_resume+0x28/0x38
Nov  7 15:46:49 nas kernel:  [<c02d866a>] work_notifysig+0x13/0x15
Nov  7 15:46:50 nas kernel: Code: e8 01 c6 1a 00 89 d8 5b c3 56 85 c0 53 89 c3 74 55 9c 5e fa ba 40 cf 32 c0 e8 51 a4 09 00 85 c0 74 42 8d 4b 18 8b 43 18 8b 51 04 <89> 50 04 89 02 c7 41 04 00 02 20 00 8b 43 24 c7 43 18 00 01 10
Nov  7 15:46:50 nas kernel:  <0>Fatal exception: panic in 5 seconds
Nov  7 15:46:56 nas kernel: Kernel panic - not syncing: Fatal exception
Nov  7 17:05:49 nas syslogd 1.4.1: restart.
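
A possible reading of the register dump above, not confirmed anywhere in this report: 0x00100100 and 0x00200200 are the LIST_POISON1 and LIST_POISON2 values that list_del() writes into a removed list entry on 32-bit kernels, and the faulting address 00100104 is LIST_POISON1 + 4. That pattern usually suggests list_del() ran on an entry that had already been removed, here inside free_uid. The Python sketch below checks an oops excerpt for those values; find_list_poison is a hypothetical helper written for illustration, not part of any kernel tool.

```python
import re

# Poison values that list_del() stores in a removed entry on 32-bit Linux
# (include/linux/list.h): next = LIST_POISON1, prev = LIST_POISON2.
LIST_POISON1 = 0x00100100
LIST_POISON2 = 0x00200200


def find_list_poison(oops_text):
    """Return notes on list-poison values found in an i386 oops (hypothetical helper)."""
    notes = []
    fault = re.search(r"virtual address ([0-9a-f]{8})", oops_text)
    if fault:
        addr = int(fault.group(1), 16)
        # On i386, list_head.prev sits 4 bytes after list_head.next, so a
        # write to next->prev with next == LIST_POISON1 faults at POISON1 + 4.
        if addr == LIST_POISON1 + 4:
            notes.append("fault address = LIST_POISON1 + 4")
    # Scan the general-purpose register dump (eax, ebx, ...) for poison values.
    for reg, value in re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})", oops_text):
        if int(value, 16) == LIST_POISON1:
            notes.append(reg + " = LIST_POISON1")
        elif int(value, 16) == LIST_POISON2:
            notes.append(reg + " = LIST_POISON2")
    return notes


# Excerpt copied from the log above, with the syslog prefix stripped.
sample = """\
Unable to handle kernel paging request at virtual address 00100104
eax: 00100100   ebx: dc55f640   ecx: dc55f658   edx: 00200200
esi: 00000082   edi: d3f42f28   ebp: 00000000   esp: d3f42ea4
"""

for note in find_list_poison(sample):
    print(note)
```

If the helper reports both poison values, that points toward a double removal of a list entry (for example, a reference-counting race on the user structure) rather than random memory corruption, though only the kernel source and the committed patch can confirm this.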

Comment 6 Anton Arapov 2008-11-27 10:36:04 UTC
A patch addressing this issue was committed to kernel-2.6.9-74 for RHEL 4.7.

Shawn, please try this kernel:
http://people.redhat.com/aarapov/kernel/2.6.9-74/

And let us know if it works for you.
Thanks!

Comment 7 shawn oconnor 2008-11-27 14:55:52 UTC
The current QLogic 24xx driver (8.02.12) does not support this kernel, so this isn't an option for me. I've dropped Samba 3.0.28 in favor of 3.0.32, which is supposed to correct a race condition that could cause this, but I'm still getting lockups, now without the "Process smb" entry. I've also upgraded to 2.6.9-67.0.22smp, which is the last maintenance kernel in 4.6, and I still get lockups, but without a log entry for the panic.

Comment 8 Anton Arapov 2008-11-27 15:46:10 UTC
Okay, I will prepare a patched 2.6.9-67.

Comment 9 Anton Arapov 2008-11-27 17:23:47 UTC
Shawn, please test this kernel:
http://people.redhat.com/aarapov/kernel/2.6.9-67.test
Thanks!

Comment 10 Anton Arapov 2008-12-02 13:45:47 UTC
Shawn, have you had a chance to test the kernel?

Comment 11 shawn oconnor 2008-12-02 14:07:53 UTC
The next maintenance window is Friday, December 12th. Also, we've had three configuration changes since the last crash, so it makes sense to stand pat and see if the lockup occurs, since the "Oops" message hasn't recurred in the last two failures.

Comment 12 Anton Arapov 2008-12-17 10:50:46 UTC
Shawn, any luck with the patched kernel?

Comment 13 shawn oconnor 2008-12-17 17:04:01 UTC
We removed the XFS filesystem driver, and haven't had a crash in 20 days, running the 2.6.9-67.0.22 kernel.

Since the "oops" did not recur after the first crash, I'm not inclined to implement a fix for that issue at this time.

When Qlogic releases upgraded drivers, we will consider upgrading to a current production kernel. Of course, if it ain't broke ...

Comment 14 Anton Arapov 2008-12-18 09:32:54 UTC
Good. I'm closing this bug as WONTFIX.
Please reopen it if you hit the problem again.

Thanks, Shawn.

