Bug 191332 - kernel panic using sctp provided by red hat AS 4 Update 2
Summary: kernel panic using sctp provided by red hat AS 4 Update 2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 176344
 
Reported: 2006-05-10 21:30 UTC by William Reich
Modified: 2018-11-26 18:47 UTC (History)

Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-08 01:16:15 UTC
Target Upstream Version:
Embargoed:


Attachments
test programs - pitcher.c, catcher.c (4.70 KB, text/plain)
2006-05-10 21:30 UTC, William Reich
/var/log/messages from GRANT 5/12/06 noon-ish (37.82 KB, text/plain)
2006-05-12 16:00 UTC, William Reich
output of catcher program on GRANT - 5/12/06 noon-ish (20.14 KB, text/plain)
2006-05-12 16:01 UTC, William Reich
output of pitcher on GRANT 5/12/06 noon-ish (8.41 KB, text/plain)
2006-05-12 16:02 UTC, William Reich
sysreport of GRANT - 5/12/06 - RH AS4 64bit Update 3 (952.31 KB, application/octet-stream)
2006-05-12 19:11 UTC, William Reich
sysreport output on 32 machine 5/30/06 (618.05 KB, application/octet-stream)
2006-05-30 13:39 UTC, William Reich
another panic on 32bit platform (3.82 KB, text/plain)
2006-05-30 13:48 UTC, William Reich
fresh 64bit platform - sysreport of such (729.02 KB, application/octet-stream)
2006-05-31 19:54 UTC, William Reich
patch to fix sctp_inq_pop race (888 bytes, patch)
2006-06-23 20:24 UTC, Neil Horman
patch to fix peeloff race condition (6.39 KB, patch)
2006-11-10 18:07 UTC, Neil Horman


Links
System: Red Hat Product Errata
ID: RHBA-2007:0304
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 5
Last Updated: 2007-04-28 18:58:50 UTC

Description William Reich 2006-05-10 21:30:10 UTC
Description of problem:
While using the sctp provided in the redhat AS 4 distribution,
kernel panics are observed.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
$ rpm -qa | grep sctp
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-1.0.2-6.4E.1
lksctp-tools-devel-1.0.2-6.4E.1


How reproducible:
Use provided tools 'pitcher' and 'catcher'.

First run "./catcher", then on the same machine run
"./pitcher -addr 172.25.1.213 -port 10000"
 
Once it starts running, press CTRL+C on catcher. Repeat this many times; it will
eventually crash/panic.
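
The manual procedure above could be automated roughly as follows. This is only a sketch under assumptions: the function name `run_repro_cycles`, the `settle` delay, and stopping the sender between cycles are all illustrative choices, not part of the attached test programs.

```python
import signal
import subprocess
import time


def run_repro_cycles(catcher_cmd, pitcher_cmd, cycles, settle=1.0):
    """Start the receiver, start the sender, interrupt the receiver, repeat.

    Returns the number of cycles completed. On an affected kernel the
    machine is expected to panic before the loop finishes.
    """
    completed = 0
    for _ in range(cycles):
        catcher = subprocess.Popen(catcher_cmd)
        time.sleep(settle)                  # give the listener time to come up
        pitcher = subprocess.Popen(pitcher_cmd)
        time.sleep(settle)                  # give traffic time to start flowing
        catcher.send_signal(signal.SIGINT)  # equivalent of CTRL+C in the terminal
        try:
            catcher.wait(timeout=10)
        except subprocess.TimeoutExpired:
            catcher.kill()                  # fall back if SIGINT was ignored
            catcher.wait()
        # Assumption: the sender is stopped before the next cycle starts.
        pitcher.terminate()
        pitcher.wait()
        completed += 1
    return completed
```

A call might look like `run_repro_cycles(["./catcher"], ["./pitcher", "-addr", "<your IP>", "-port", "10000"], cycles=20)`, with the address replaced by the local machine's own.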
 

Steps to Reproduce:
1.
2.
3.
  
Actual results:
kernel panic

Expected results:
no panic

Additional info:
KERNEL: assertion (!atomic_read(&sk->sk_rmem_alloc)) 
              failed at net/ipv4/af_inet.c (149)
Unable to handle kernel NULL pointer dereference at 00000000000000fc RIP:
<ffffffffa01d4753>{:sctp:sctp_inq_pop+145}
PML4 b9262067 PGD 9101f067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: ulcm_kmem(U) swrmm(U) nfs(U) lockd(U) sctp(U) parport_pc(U)
lp(U) parport(U) autofs4(U)
i2c_dev(U) i2c_core(U) sunrpc(U) md5(U) ipv6(U) ds(U) yenta_socket(U)
pcmcia_core(U) dm_mirror(U) dm_multipath(U) dm_mod(U) button(U) battery(U) ac(U)
uhci_hcd(U) ehci_hcd(U) e1000(U) ext3(U) jbd(U) ata_piix(U) libata(U)
aic79xx(U) sd_mod(U) scsi_mod(U)
Pid: 11521, comm: catcher Tainted: PF     2.6.9-22.ELsmp-reich
RIP: 0010:[<ffffffffa01d4753>] <ffffffffa01d4753>{:sctp:sctp_inq_pop+145}
RSP: 0018:000001008791fd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000100ca378028 RCX: 0000000000000000
RDX: 0000000000000206 RSI: 0000000000000246 RDI: 00000100ca37803c
RBP: 00000100ca378000 R08: 0000000100000000 R09: 0000000000000004
R10: 00000000fffffff4 R11: ffffffffa01d0100 R12: 00000100ca378028
R13: 0000010009fc4200 R14: 0000010009fc4cf0 R15: 000001007c515700
FS:  0000002a955857a0(0000) GS:ffffffff804d3200(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000fc CR3: 0000000000101000 CR4: 00000000000006e0
Process catcher (pid: 11521, threadinfo 000001008791e000, task 000001011cd667f0)
Stack: 0000000000000000 ffffffffa01d01db 00000000fffffff4 0000000000000000
       000001007c515b80 00000100a6a41680 7fffffffffffffff ffffffffa01dc5aa
       00000100a6a41820 ffffffff802a37e0
Call Trace:<ffffffffa01d01db>{:sctp:sctp_assoc_bh_rcv+219}
<ffffffffa01dc5aa>{:sctp:sctp_backlog_rcv+17}
       <ffffffff802a37e0>{release_sock+88}
<ffffffffa01dbb58>{:sctp:sctp_accept+446}
       <ffffffff80134722>{autoremove_wake_function+0}
<ffffffff80222bad>{tty_ldisc_try+60}
       <ffffffff80134722>{autoremove_wake_function+0}
<ffffffff802e8bab>{inet_accept+40}
       <ffffffff802a1e64>{sys_accept+201}
<ffffffff801dca2d>{crypto_alloc_hmac_block+43}
       <ffffffffa01dc6a5>{:sctp:__sctp_hash_endpoint+49}
<ffffffff802a3798>{release_sock+16}
       <ffffffffa01d91bc>{:sctp:sctp_inet_listen+335}
<ffffffff80110052>{system_call+126}


Code: c6 80 fc 00 00 00 01 48 8b 40 40 48 8b b8 f8 00 00 00 48 89
RIP <ffffffffa01d4753>{:sctp:sctp_inq_pop+145} RSP <000001008791fd68>
CR2: 00000000000000fc
 <0>Kernel panic - not syncing: Oops

Comment 1 William Reich 2006-05-10 21:30:10 UTC
Created attachment 128863 [details]
test programs - pitcher.c, catcher.c

Comment 2 William Reich 2006-05-11 12:02:58 UTC
Added notes:
- this was on a redhat AS 4 64 bit system.
- we have seen this on a 2 CPU system, & a 4 CPU system.
  ( hyperthreading makes the 4 CPU system appear as 8 CPUs. No hyperthreading
     activated on the 2CPU system. )

Comment 3 William Reich 2006-05-11 12:05:06 UTC
when running the pitcher program, use your own IP address ( not 
the one in the example... )

Comment 4 Jason Baron 2006-05-11 15:23:57 UTC
Can this be reproduced on the latest released kernel...-34?

Comment 5 Neil Horman 2006-05-11 17:13:25 UTC
agreed, this is not a RHEL kernel:
Pid: 11521, comm: catcher Tainted: PF     2.6.9-22.ELsmp-reich
Can the problem be recreated on a stock RHEL kernel?
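
For readers unfamiliar with the "Tainted:" field in the Pid line, here is a small sketch that spells out the first two flag letters. The helper name `describe_taint` is hypothetical, and only the letters P, G and F are decoded; any other character is reported as not decoded rather than guessed.

```python
import re

# Meanings of the taint letters handled by this sketch.
TAINT_FLAGS = {
    "P": "a proprietary (non-GPL) module has been loaded",
    "G": "only GPL-compatible modules have been loaded",
    "F": "a module was force-loaded",
}


def describe_taint(pid_line):
    """Return a description for each taint letter found in an oops 'Pid:' line."""
    match = re.search(r"Not tainted|Tainted: (\S+)", pid_line)
    if match is None:
        raise ValueError("no taint field found")
    if match.group(1) is None:  # the line said "Not tainted"
        return []
    return [TAINT_FLAGS.get(c, "flag not decoded in this sketch")
            for c in match.group(1)]


line = "Pid: 11521, comm: catcher Tainted: PF     2.6.9-22.ELsmp-reich"
for description in describe_taint(line):
    print(description)
```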

Comment 6 William Reich 2006-05-11 17:55:04 UTC
on a machine with Update 3 of redhat AS 4 64 bit, and the "up to date" as of
5/10/06,
the issue is much more severe. The failure is 100%.

keaton:/u/cm:uname -a
Linux keaton 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64
x86_64 GNU/Linux

keaton:/u/cm:rpm -qa | grep sctp
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-1.0.2-6.4E.1
lksctp-tools-devel-1.0.2-6.4E.1
keaton:/u/cm:rpm -qi lksctp-tools-devel-1.0.2-6.4E.1
Name        : lksctp-tools-devel           Relocations: (not relocatable)
Version     : 1.0.2                             Vendor: Red Hat, Inc.
Release     : 6.4E.1                        Build Date: Thu 21 Jul 2005 09:13:23
AM EDT
Install Date: Tue 11 Apr 2006 02:56:34 PM EDT      Build Host:
crowe.devel.redhat.com
Group       : Development/Libraries         Source RPM:
lksctp-tools-1.0.2-6.4E.1.src.rpm
Size        : 144358                           License: LGPL
Signature   : DSA/SHA1, Tue 20 Sep 2005 02:19:32 PM EDT, Key ID 219180cddb42a60e
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://lksctp.sourceforge.net
Summary     : Development kit for lksctp-tools
Description :
Development kit for lksctp-tools

- Man pages
- Header files
- Static libraries
- Symlinks to dynamic libraries
- Tutorial source code
keaton:/u/cm:
keaton:/u/cm:rpm -qi lksctp-tools-1.0.2-6.4E.1
Name        : lksctp-tools                 Relocations: (not relocatable)
Version     : 1.0.2                             Vendor: Red Hat, Inc.
Release     : 6.4E.1                        Build Date: Thu 21 Jul 2005 09:13:23
AM EDT
Install Date: Tue 11 Apr 2006 02:46:45 PM EDT      Build Host:
crowe.devel.redhat.com
Group       : System Environment/Libraries   Source RPM:
lksctp-tools-1.0.2-6.4E.1.src.rpm
Size        : 160615                           License: LGPL
Signature   : DSA/SHA1, Tue 20 Sep 2005 02:19:32 PM EDT, Key ID 219180cddb42a60e
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://lksctp.sourceforge.net
Summary     : User-space access to Linux Kernel SCTP
Description :
This is the lksctp-tools package for Linux Kernel SCTP Reference
Implementation.


( from bootup in /var/log/messages :

May 11 13:44:13 keaton kernel: Linux version 2.6.9-34.ELsmp
(bhcompile.redhat.com) 
                    (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) 
                    #1 SMP Fri Feb 24 16:56:28 EST 2006
)

I have not been able to capture the backtrace. Nothing appears in
/var/log/messages at the time of the lockup/panic. I have not
been able to set up a serial console on this Update 3 box due to a lack of
resources at the location of this box.


Comment 7 Neil Horman 2006-05-11 19:35:22 UTC
I'm trying to reproduce this on an EM64T machine, using the RHEL4
2.4.21-34.ELsmp kernel for x86_64, and your reproducer causes no crash for me
after 15 repetitions of the above instructions from the initial comment.  I'm
inclined to close this as a NOTABUG, but I'll leave it open for a bit in the
event that you can provide a backtrace from your machine.

Comment 8 William Reich 2006-05-12 09:48:34 UTC
Please confirm that you are using a 2.6 kernel
on a multi-CPU machine during your investigation.
This issue has only been seen by my people on 2.6 kernels...
( REDHAT AS 4 is based on a 2.6 kernel. REDHAT AS 3 is based on a 2.4 kernel ).

Comment 9 William Reich 2006-05-12 11:12:37 UTC
as requested, here is a back trace from the
REDHAT AS 4 Update 3 ( uptodate as of 5/10/06 ) machine...
This box has 4 3.16GHz CPUs with hyperthreading activated.

It took me 7 or 8 tries before the failure occurred.

( from /var/log/messages at bootup )
May 12 06:58:50 keaton kernel: Bootdata ok (command line is ro root=LABEL=/1
nmi_watchdog=1 rhgb quiet  console=ttyS0,57600n8  console=tty0)
May 12 06:58:50 keaton kernel: Linux version 2.6.9-34.ELsmp
(bhcompile.redhat.com) (gcc version 3.4.5 20051201 (Red Hat
3.4.5-2)) #1 SMP Fri Feb 24 16:56:28 EST 2006

( from serial console upon crash )

Unable to handle kernel NULL pointer dereference at 00000000000000fc RIP:
<ffffffffa06638d7>{:sctp:sctp_inq_pop+145}
PML4 1187de067 PGD 102c8a067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: sctp mvfs(U) vnode(U) parport_pc lp parport autofs4 i2c_dev
i2c_core nfs lockd nfs_acl su
nrpc md5 ipv6 ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac joydev uhci_hcd eh
ci_hcd hw_random e1000 tg3 floppy sg ext3 jbd mptscsih mptsas mptspi mptfc
mptscsi mptbase sd_mod scsi_mod
Pid: 11855, comm: pitcher Tainted: PF     2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa06638d7>] <ffffffffa06638d7>{:sctp:sctp_inq_pop+145}
RSP: 0018:000001011e3eb398  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000010103b0c028 RCX: 0000000000000000
RDX: 000001010381e680 RSI: 0000000000000246 RDI: 0000010103b0c03c
RBP: 0000010103b0c000 R08: 000001010381e680 R09: 000001010381e680
R10: 000001010381e680 R11: 0000010103b0c000 R12: 0000010103b0c028
R13: 000001012f16b600 R14: 0000010103b0c000 R15: 0000000000000000
FS:  0000002a955857a0(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000fc CR3: 0000000000101000 CR4: 00000000000006e0
Process pitcher (pid: 11855, threadinfo 000001011e3ea000, task 000001012c936030)
Stack: 000001010381e680 ffffffffa065f268 000001010381e680 000001010381e680
       0000010104153700 000001012c700080 0000010103b0c000 ffffffffa066b7d1
       0000000000000000 ffffffffa066c54d
Call Trace:<ffffffffa065f268>{:sctp:sctp_assoc_bh_rcv+219}
<ffffffffa066b7d1>{:sctp:sctp_backlog_rcv+17}
       <ffffffffa066c54d>{:sctp:sctp_rcv+1274}
<ffffffff802c3c2c>{ip_local_deliver+298}
       <ffffffff802c4368>{ip_rcv+1002} <ffffffff802ab5b1>{netif_receive_skb+590}
       <ffffffff802ab66d>{process_backlog+136} <ffffffff802ab777>{net_rx_action+129}
       <ffffffff8013be04>{__do_softirq+88} <ffffffff8013bead>{do_softirq+49}
       <ffffffff802ab017>{dev_queue_xmit+525}
<ffffffff802c6b2d>{ip_finish_output+356}
       <ffffffff802c6f69>{ip_queue_xmit+951} <ffffffff802528f8>{loopback_xmit+960}
       <ffffffffa066b6c3>{:sctp:sctp_packet_transmit+1101}
       <ffffffffa06643f9>{:sctp:sctp_outq_flush+1457}
<ffffffff802c26db>{__ip_route_output_key+1972}
       <ffffffffa0664456>{:sctp:sctp_outq_uncork+27}
<ffffffffa065d1bc>{:sctp:sctp_cmd_interpreter+2981}
       <ffffffffa065d1fd>{:sctp:sctp_side_effects+37}
<ffffffffa065d31b>{:sctp:sctp_do_sm+119}
       <ffffffffa06601aa>{:sctp:sctp_datamsg_from_user+648}
       <ffffffffa066ae98>{:sctp:sctp_primitive_SEND+53}
<ffffffffa066925e>{:sctp:sctp_sendmsg+2224}
       <ffffffff80131d1d>{try_to_wake_up+863} <ffffffff802a220b>{sock_sendmsg+271}
       <ffffffff80228fd0>{n_tty_receive_buf+3922}
<ffffffff80134df2>{autoremove_wake_function+0}
       <ffffffff802a3b7f>{sys_sendmsg+463} <ffffffff80223eb1>{tty_ldisc_try+60}
       <ffffffff8013346f>{__wake_up+54} <ffffffff80223fe4>{tty_ldisc_deref+103}
       <ffffffff8022470b>{tty_write+702} <ffffffff80191570>{dnotify_parent+34}
       <ffffffff80177c89>{vfs_write+248} <ffffffff80177d48>{sys_write+69}
       <ffffffff801101c6>{system_call+126}

Code: c6 80 fc 00 00 00 01 48 8b 40 40 48 8b b8 f8 00 00 00 48 89
RIP <ffffffffa06638d7>{:sctp:sctp_inq_pop+145} RSP <000001011e3eb398>
CR2: 00000000000000fc
 <0>Kernel panic - not syncing: Oops

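The first instruction bytes on the "Code:" line can be checked against CR2 by hand. The sketch below assumes the "Code:" bytes start at the faulting instruction and handles only this one encoding (`c6 80 disp32 imm8`, a one-byte store of an immediate to `disp32(%rax)`); the helper name `decode_store_imm8` is hypothetical.

```python
def decode_store_imm8(code_bytes, rax):
    """Decode 'movb $imm8, disp32(%rax)' and return (target address, imm8)."""
    raw = bytes.fromhex(code_bytes.replace(" ", ""))
    opcode, modrm = raw[0], raw[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    # 0xC6 /0 is MOV r/m8, imm8; mod=10 with rm=000 means [rax + disp32].
    if opcode != 0xC6 or (mod, reg, rm) != (0b10, 0, 0):
        raise ValueError("only 'movb $imm8, disp32(%rax)' is handled")
    disp = int.from_bytes(raw[2:6], "little", signed=True)
    imm = raw[6]
    return rax + disp, imm


# First seven bytes of the Code: line, with RAX taken from the register dump.
address, value = decode_store_imm8("c6 80 fc 00 00 00 01", rax=0x0)
print(hex(address), value)
```

With RAX equal to zero the store targets address 0xfc, which matches the reported CR2, and a write fault is consistent with the "Oops: 0002" error code. That would suggest a NULL structure pointer being written through at offset 0xfc inside sctp_inq_pop.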




Comment 10 Neil Horman 2006-05-12 12:18:32 UTC
I'm aware of what RHEL4 kernels are based on, and yes, I'm using the same kernel
as you on a dual core machine.  I've only got an EM64T machine available, rather
than AMD64, but this shouldn't be silicon family specific.  Can you please
reproduce this without your tainting modules loaded (mvfs, vnode, and I think
nrpc)?  I wonder if clearcase isn't causing a memory corruption somewhere.


Comment 11 Anatoly Khusid 2006-05-12 12:47:30 UTC
>Comment #7 From Neil Horman (nhorman) 	
>I'm trying to reproduce this on an EM64T machine, using the RHEL4
>2.4.21-34.ELsmp kernel for x86_64, and your reproducer causes no crash for me
>after 15 repititions of the above instructions from the initial comment.  

As per your comment, we were under the impression that you had only tried to replicate
the problem on a 2.4.x kernel.  Are you able to reproduce it on 2.6.9-34.ELsmp?
Thank you.

PS. It is important to CTRL+C the catcher application to trigger the OS panic. 
Usually it takes only a few tries, but sometimes it takes longer.  After you
CTRL+C the catcher, try to repeat the test as quickly as possible.


Comment 12 William Reich 2006-05-12 12:56:50 UTC
ok, I found a box that I can use to change from
Update 2 to Update 3.
This will probably take a while.


Comment 13 Neil Horman 2006-05-12 13:36:15 UTC
sorry, fat fingers.  Updating too many issues at once.  I meant to say
2.6.9-34.ELsmp on an EM64T dual core box.  No reproduction after 15 successive
attempts.

Please reproduce without the tainting modules loaded.

Comment 14 William Reich 2006-05-12 15:57:52 UTC
Update 3 loaded onto box.
Panic occurs...

( This backtrace is different from the others )

Module sctp cannot be unloaded due to unsafe usage in
net/sctp/protocol.c:1171

/* above reported when sctp loaded into kernel */

/* several pitcher/catcher runs later, we panic... */

Unable to handle kernel paging request at 0000000021f4a040 RIP: 
<ffffffff801897c7>{poll_freewait+20}
PML4 9d8b067 PGD 37e18067 PMD 0 
Oops: 0000 [1] SMP 
CPU 2 
Modules linked in: sctp parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
md5 ipv6 ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac uhci_hcd ehci_hcd e1000 ext3 jbd ata_piix libata aic79xx sd_mod scsi_mod
Pid: 1, comm: init Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffff801897c7>] <ffffffff801897c7>{poll_freewait+20}
RSP: 0018:00000100d7e11dc8  EFLAGS: 00010202
RAX: 0000010037e8c7f0 RBX: 0000000021f4a010 RCX: 0000000000000104
RDX: 0000000000000000 RSI: 0000000000000e4c RDI: 00000100d7e11e78
RBP: 0000010121f4a000 R08: 000000000000000f R09: 0000000000000008
R10: 000000000000000a R11: 000000000000000a R12: 0000000000000100
R13: 0000000000000800 R14: 00000100d7e11f18 R15: 000000000000000b
FS:  0000002a958a06e0(0000) GS:ffffffff804d7c00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000021f4a040 CR3: 00000000d7e4e000 CR4: 00000000000006e0
Process init (pid: 1, threadinfo 00000100d7e10000, task 0000010037e8c7f0)
Stack: 0000000000000000 0000010123617580 0000000000000000 ffffffff80189c86 
       0000000000000000 0000000000000000 0000000000000400 0000000000000000 
       0000000000000000 0000000000000000 
Call Trace:<ffffffff80189c86>{do_select+978} <ffffffff801897f9>{__pollwait+0} 
       <ffffffff80189fde>{sys_select+820} <ffffffff801101c6>{system_call+126} 
       

Code: 48 8b 7b 30 48 8d 73 08 e8 c2 b4 fa ff 48 8b 3b e8 94 f0 fe 
RIP <ffffffff801897c7>{poll_freewait+20} RSP <00000100d7e11dc8>
CR2: 0000000021f4a040
 <0>Kernel panic - not syncing: Oops


Comment 15 William Reich 2006-05-12 16:00:01 UTC
Created attachment 128944 [details]
/var/log/messages from GRANT 5/12/06 noon-ish

Comment 16 William Reich 2006-05-12 16:01:12 UTC
Created attachment 128945 [details]
output of catcher program on GRANT - 5/12/06 noon-ish

Comment 17 William Reich 2006-05-12 16:02:27 UTC
Created attachment 128946 [details]
output of pitcher on GRANT 5/12/06 noon-ish

Comment 18 William Reich 2006-05-12 16:18:06 UTC
after lunch, I will attempt to recreate the panic
to determine if the backtrace is repeatable.


Comment 19 William Reich 2006-05-12 16:56:31 UTC
this panic occurred on the first try.
This backtrace is more familiar...

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
Unable to handle kernel NULL pointer dereference at 0000000000000004 RIP: 
<ffffffffa01d65e9>{:sctp:sctp_ulpevent_make_rcvmsg+167}
PML4 11af41067 PGD 11a33e067 PMD 0 
Oops: 0000 [1] SMP 
CPU 1 
Modules linked in: sctp parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
md5 ipv6 ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac uhci_hcd ehci_hcd e1000 ext3 jbd ata_piix libata aic79xx sd_mod scsi_mod
Pid: 4270, comm: catcher Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa01d65e9>]
<ffffffffa01d65e9>{:sctp:sctp_ulpevent_make_rcvmsg+167}
RSP: 0018:000001011aabbab8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000010121269080 RCX: 0000000000000000
RDX: 000001011b088080 RSI: 00000000000000f6 RDI: 0000010119ed2000
RBP: 0000010119f7fc80 R08: 000000006ba53ae3 R09: 0000010119f7fc80
R10: 000001011aabbc38 R11: 0000000000000220 R12: 00000101212690f0
R13: 0000010119ed2000 R14: 000001011aabbc38 R15: 0000000000000000
FS:  0000002a959b86e0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 00000000d7e18000 CR4: 00000000000006e0
Process catcher (pid: 4270, threadinfo 000001011aaba000, task 000001011fed57f0)
Stack: 000000006ba53ae3 0000010119ed3698 0000000000000001 0000010119ed2000 
       0000000000000001 ffffffffa01d864d 000001011aabbba0 0000000000000212 
       000000000000001f ffffffff80221efa 
Call Trace:<ffffffffa01d864d>{:sctp:sctp_ulpq_tail_data+21}
<ffffffff80221efa>{add_entropy_words+88} 
       <ffffffffa01cfa21>{:sctp:sctp_cmd_interpreter+1034} 
       <ffffffffa01d01fd>{:sctp:sctp_side_effects+37}
<ffffffffa01d031b>{:sctp:sctp_do_sm+119} 
       <ffffffffa01d223f>{:sctp:sctp_assoc_bh_rcv+178}
<ffffffffa01de7d1>{:sctp:sctp_backlog_rcv+17} 
       <ffffffff802a4d40>{release_sock+88}
<ffffffffa01ddd97>{:sctp:sctp_accept+446} 
       <ffffffff80134df2>{autoremove_wake_function+0}
<ffffffff80223eb1>{tty_ldisc_try+60} 
       <ffffffff80134df2>{autoremove_wake_function+0}
<ffffffff802ea22b>{inet_accept+40} 
       <ffffffff802a33c7>{sys_accept+201}
<ffffffff801dd6fd>{crypto_alloc_hmac_block+43} 
       <ffffffffa01de8cc>{:sctp:__sctp_hash_endpoint+49}
<ffffffff802a4cf8>{release_sock+16} 
       <ffffffffa01db340>{:sctp:sctp_inet_listen+335}
<ffffffff801101c6>{system_call+126} 
       

Code: 0f b7 40 04 89 c2 c1 e8 08 c1 e2 08 09 c2 66 41 89 54 24 08 
RIP <ffffffffa01d65e9>{:sctp:sctp_ulpevent_make_rcvmsg+167} RSP <000001011aabbab8>
CR2: 0000000000000004
 <0>Kernel panic - not syncing: Oops


Comment 20 William Reich 2006-05-12 17:22:14 UTC
This panic occurred on the 9th try.
In this case, both pitcher & catcher were running.
The box panicked before I could CTRL+C the catcher.

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
general protection fault: 0000 [1] SMP
CPU 2
Modules linked in: sctp parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
md5 ipv6 ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac uhci_hcd ehci_hcd e1000 ext3 jbd ata_piix libata aic79xx sd_mod scsi_mod
Pid: 12, comm: events/2 Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffff80160e8d>] <ffffffff80160e8d>{free_block+168}
RSP: 0018:0000010037c29de8  EFLAGS: 00010006
RAX: 038000fc81000000 RBX: 000001012351bc80 RCX: 000001000000e000
RDX: 0000000000000000 RSI: 0000010122388010 RDI: 0000000000000000
RBP: 0000010122388010 R08: 0000010037c28000 R09: 0000000000000080
R10: 0000000000000080 R11: 000000000000000a R12: 0000000000666667
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff80480f00
FS:  0000000000000000(0000) GS:ffffffff804d7c00(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a957de0f0 CR3: 00000000d7e4e000 CR4: 00000000000006e0
Process events/2 (pid: 12, threadinfo 0000010037c28000, task 00000100d7df27f0)
Stack: 0000010122388000 0000010122388010 0000000000666667 0000010122388000
       000001012351bc80 ffffffff80160fae 0000010005034dc0 0000010005034dc8
       000001012351bd70 ffffffff80161e74
Call Trace:<ffffffff80160fae>{drain_array_locked+99}
<ffffffff80161e74>{cache_reap+162}
       <ffffffff80161dd2>{cache_reap+0} <ffffffff80146e1e>{worker_thread+419}
       <ffffffff801333c8>{default_wake_function+0}
<ffffffff80133419>{__wake_up_common+67}
       <ffffffff801333c8>{default_wake_function+0}
<ffffffff80146c7b>{worker_thread+0}
       <ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
       <ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
 

Code: 48 8b 70 30 48 8b 56 08 48 8b 06 48 89 50 08 48 89 02 48 2b
RIP <ffffffff80160e8d>{free_block+168} RSP <0000010037c29de8>
 <0>Kernel panic - not syncing: Oops


Comment 21 William Reich 2006-05-12 17:28:36 UTC
This one occurred on the 13th run...

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ulpevent:749
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: sctp parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
md5 ipv6 ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac uhci_hcd ehci_hcd e1000 ext3 jbd ata_piix libata aic79xx sd_mod scsi_mod
Pid: 4002, comm: pitcher Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa01d5f93>]
<ffffffffa01d5f93>{:sctp:sctp_ulpevent_make_remote_error+162}
RSP: 0018:000001011afed6e8  EFLAGS: 00010293
RAX: 0000000000000000 RBX: 000001012359b080 RCX: 0000000000000000
RDX: 0000010120ed8080 RSI: 0000010120ed8118 RDI: 000001012359b118
RBP: 000001011ae02680 R08: 000001012359b080 R09: 0000010120ed8080
R10: 000001012359b080 R11: 00000000000000e4 R12: 000001011add6000
R13: 000001011add6000 R14: 0000000000006f74 R15: 0000000000000600
FS:  0000002a959b86e0(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9585cbd0 CR3: 0000000000101000 CR4: 00000000000006e0
Process pitcher (pid: 4002, threadinfo 000001011afec000, task 000001012122f030)
Stack: 000001011afed768 000001011afed768 000001011ae02680 000001011afed768
       000001011add6000 000001011ae02680 000001011add6000 ffffffffa01cb759
       0000010123cd8600 000001011afed768
Call Trace:<ffffffffa01cb759>{:sctp:sctp_sf_operr_notify+51}
<ffffffffa01d02f7>{:sctp:sctp_do_sm+83}
       <ffffffffa01d223f>{:sctp:sctp_assoc_bh_rcv+178}
<ffffffffa01de7d1>{:sctp:sctp_backlog_rcv+17}
       <ffffffffa01df54d>{:sctp:sctp_rcv+1274} <ffffffff802c5289>{ip_defrag+2800}
       <ffffffff802c3c2c>{ip_local_deliver+298} <ffffffff802c4368>{ip_rcv+1002}
       <ffffffff802ab5b1>{netif_receive_skb+590}
<ffffffff802ab66d>{process_backlog+136}
       <ffffffff802ab777>{net_rx_action+129} <ffffffff8013be04>{__do_softirq+88}
       <ffffffff8013bead>{do_softirq+49} <ffffffffa01dc2a3>{:sctp:sctp_sendmsg+2293}
       <ffffffff80131d1d>{try_to_wake_up+863} <ffffffff802a220b>{sock_sendmsg+271}
       <ffffffff80228fd0>{n_tty_receive_buf+3922}
<ffffffff80134df2>{autoremove_wake_function+0}
       <ffffffff802a3b7f>{sys_sendmsg+463} <ffffffff80223eb1>{tty_ldisc_try+60}
       <ffffffff8013346f>{__wake_up+54} <ffffffff80223fe4>{tty_ldisc_deref+103}
       <ffffffff8022470b>{tty_write+702} <ffffffff80191570>{dnotify_parent+34}
       <ffffffff80177c89>{vfs_write+248} <ffffffff80177d48>{sys_write+69}
       <ffffffff801101c6>{system_call+126}

Code: 0f 0b 53 35 1e a0 ff ff ff ff ed 02 44 89 f0 48 01 82 f8 00
RIP <ffffffffa01d5f93>{:sctp:sctp_ulpevent_make_remote_error+162} RSP
<000001011afed6e8>
 <0>Kernel panic - not syncing: Oops


Comment 22 William Reich 2006-05-12 17:30:52 UTC
This data seems to imply "randomness".
My gut translates that into "memory corruption".
And that translates into a "difficult issue to solve"...
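
One way to make the "randomness" concrete is to tally the faulting function from the final "RIP <...>" line of each oops. This is a sketch only; the helper name `faulting_functions` and the regular expression are illustrative, and the sample lines are copied from the panics posted so far.

```python
import re
from collections import Counter

# Matches e.g. "RIP <ffffffffa01d4753>{:sctp:sctp_inq_pop+145}";
# the ":module:" prefix is absent for symbols in the core kernel.
RIP_RE = re.compile(r"RIP <[0-9a-f]+>\{(?::(\w+):)?(\w+)\+\d+\}")


def faulting_functions(oops_texts):
    """Count (module, function) pairs taken from the RIP line of each oops."""
    counts = Counter()
    for text in oops_texts:
        match = RIP_RE.search(text)
        if match:
            counts[(match.group(1) or "kernel", match.group(2))] += 1
    return counts


samples = [
    "RIP <ffffffffa01d4753>{:sctp:sctp_inq_pop+145} RSP <000001008791fd68>",
    "RIP <ffffffffa06638d7>{:sctp:sctp_inq_pop+145} RSP <000001011e3eb398>",
    "RIP <ffffffff801897c7>{poll_freewait+20} RSP <00000100d7e11dc8>",
    "RIP <ffffffffa01d65e9>{:sctp:sctp_ulpevent_make_rcvmsg+167} RSP <000001011aabbab8>",
    "RIP <ffffffff80160e8d>{free_block+168} RSP <0000010037c29de8>",
    "RIP <ffffffffa01d5f93>{:sctp:sctp_ulpevent_make_remote_error+162} RSP",
]
for (module, function), count in faulting_functions(samples).items():
    print(module, function, count)
```

On these six samples the tally gives five distinct functions, two of them outside the sctp module, with only sctp_inq_pop repeating.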

Comment 23 William Reich 2006-05-12 17:36:10 UTC
Please note that this message:
"Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c"
is created when I do a 
"/sbin/modprobe sctp" ( as root )
to get the sctp module loaded after a reboot.
.
This occurs well before the catcher & pitcher are even started.

Comment 24 Neil Horman 2006-05-12 17:55:39 UTC
The "cannot be unloaded" warning is expected in all cases.  There is a flag set
in the module metadata that explicitly causes this.  You shouldn't need to worry
about that.

I definitely agree, looking over the most recent backtraces you have uploaded,
that this smells a good deal like a memory problem, given that none of the
backtraces indicate a crash in the same place (thank you for all the hard work,
by the way).  It may well be that high sctp usage just triggers the error (which
is why the backtraces frequently show sctp-laden stacks).

It would be interesting if you could run a stressor utility (such as
http://freshmeat.net/projects/stress/) to load the system and see if the hang
can be reproduced without the use of sctp.  It may also be worthwhile to run
memtest86 on the system (http://www.memtest86.com/) to see if you can detect any
memory errors.  Also, if you could send me a sysreport of your system, I can try
to further bring your setup and mine closer together, in the hopes of
reproducing here (if it's not a memory error).

Also, FYI, I'm about to head on vacation for a week, so I won't be too
responsive on this during that time.  If you need rapid responses, I suggest you
open up a support call through Red Hat's support channel to get someone looking
at this in my absence.

Comment 25 William Reich 2006-05-12 18:23:23 UTC
please recall that this panic/crash has been 
reproduced on multiple machines. Some 4 CPUs, some 2 CPUs.
This has been reproduced on 2 different 2.6 kernels.
It looks like the only constants have been the
sctp version, and the test tools.
I will investigate the stress test tool.

Comment 26 William Reich 2006-05-12 19:11:18 UTC
Created attachment 128955 [details]
sysreport of GRANT - 5/12/06 - RH AS4 64bit Update 3

Comment 27 William Reich 2006-05-12 19:13:51 UTC
"stress" is running. I'll let it run over the weekend.

Comment 28 William Reich 2006-05-15 11:11:01 UTC
"stress" ran all weekend with no panics observed.
The machine is very very slow, but no panics...
( REDHAT 4 AS 64 bit Update 3 )
I did this...
nice -n 18 stress --cpu 2 --io 2 --vm 23 --vm-hang 7 --hdd 3 &

Comment 29 Neil Horman 2006-05-22 20:16:48 UTC
Why did you nice the utility that was intended to hog your system?  That
seems counterproductive.

What about the output of memtest86?

I'm starting to look over your sysreport.  The rpm database integrity check is
showing something concerning:
prelink: /usr/bin/dbus-cleanup-sockets: at least one of file's dependencies has
changed since prelinking
S.?....T    /usr/bin/dbus-cleanup-sockets
prelink: /usr/bin/dbus-daemon-1: at least one of file's dependencies has changed
since prelinking
S.?....T    /usr/bin/dbus-daemon-1
prelink: /usr/bin/dbus-send: at least one of file's dependencies has changed
since prelinking
S.?....T    /usr/bin/dbus-send


There are hundreds of such discrepancies in the rpm-Va file from your sysreport.
 I can't tell what's causing it, but it's likely that a shared library has been
changed to a version that may be partially incompatible with many of your user
space binaries.  That would be a major discrepancy between your setup and
mine.  If possible, it may be worth doing a fresh re-install of one of your failing
machines to a point where it only has stock Red Hat installed packages on it. 
Then try to run your reproducer, and see if the problem recurs.
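
For reference, the flag column in the rpm verify lines quoted above can be read position by position. The sketch below is illustrative only (the helper name `parse_verify_line` is hypothetical): a letter or digit means that attribute differs from the package database, "." means it matched, and "?" means the check could not be performed.

```python
# Attribute checked at each position of the rpm -V flag string (SM5DLUGT,
# plus a ninth "P" position that newer rpm versions append).
ATTRIBUTES = ["size", "mode", "MD5 digest", "device", "symlink target",
              "user", "group", "mtime", "capabilities"]


def parse_verify_line(line):
    """Return (path, differing attributes, attributes that could not be checked)."""
    fields = line.split()
    flags, path = fields[0], fields[-1]
    differing, unchecked = [], []
    for flag, name in zip(flags, ATTRIBUTES):
        if flag == "?":
            unchecked.append(name)
        elif flag != ".":
            differing.append(name)
    return path, differing, unchecked


print(parse_verify_line("S.?....T    /usr/bin/dbus-send"))
```

Read this way, the three dbus entries report a changed size and modification time, with the digest check not performed, which is in line with the prelink messages printed just before each entry.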


Comment 30 William Reich 2006-05-22 20:47:48 UTC
- 'nice' was used so that 'top' and 'ps -ef' would produce
output to prove that 'stress' was actually doing something.
Without 'nice', no user input would be processed in a timely manner
while 'stress' was executing.
- the boxes are in use in support of the deliveries of our product.
Therefore, I can't strip them to re-install the OS at this time.
However, I expect to receive new removable disks within two weeks.
Your request for a fresh re-install can be processed once the new
disks arrive.


Comment 31 William Reich 2006-05-23 17:27:09 UTC
memtest86 version 3.2 has been started on machine 'grant'.
I'll let the test run overnight & I will post the results tomorrow.

Comment 32 William Reich 2006-05-24 11:33:22 UTC
memtest86 ran for 18 hours & 11 minutes. No errors reported.
The tool was booted from CD rom.
Screen "4" showed no errors. The default screen
( that contains the duration of the test ) also showed no errors.

Comment 33 Neil Horman 2006-05-24 11:55:38 UTC
I'm sorry, I don't know what to tell you.  I can't reproduce the problem here,
and I've tried now on two separate systems (an EM64T box and an AMD64 box).  Given
the above anomalies in your sysreport, and the test results that you've
provided, I think a software install discrepancy is the most likely culprit at
this point.  I would try a fresh reinstall (with an rpm -Va verification to
ensure no problematic packages have been installed) and a retest at your
earliest convenience.

Comment 34 William Reich 2006-05-24 12:09:28 UTC
based on your requests, I have a two part plan -
1) get a fresh install of REDHAT 4 AS 64 bit Update 3
2) get a fresh install of REDHAT 4 AS 32 bit Update x
At this point, the requests to create the disks have been
submitted. Now I just have to wait for the requests to be processed.
Once I get these platforms, I will rerun the tests
that produced the errors.
This could take a week.

Comment 35 William Reich 2006-05-24 12:10:54 UTC
just as an FYI, here is a ldd output of one of the test programs

$ ldd catcher
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00000036a7700000)
        libsctp.so.1 => /usr/lib64/libsctp.so.1 (0x00000036a1e00000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x00000036a2000000)
        /lib64/ld-linux-x86-64.so.2 (0x00000036a1c00000)
$ 
$ 
$ uname -a 
Linux grant 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64
x86_64 GNU/Linux
$ 


Comment 36 William Reich 2006-05-26 15:27:42 UTC
I had REDHAT 4  32 bit installed onto a clean/brand-new disk.
"uptodate" was executed. Nothing on the disk except
REDHAT-provided packages.

The pitcher / catcher test was executed.
On the 16th attempt, the machine panic'd.
The backtrace was in sctp...

This was using the 2.6.9-34.0.1.ELsmp kernel (32 bit).

When I return from the holiday weekend, I will set up a
serial console so the backtrace can be captured, and I will upload
'sysreport' info.

( My 64bit clean box should be ready next week. )

Have a good holiday weekend.

Comment 37 Neil Horman 2006-05-26 15:40:18 UTC
Thanks, you do the same. I'm still unable to produce any oops here, so send me
that backtrace and sysreport when you get a chance, and I'll look it over asap.

Comment 38 William Reich 2006-05-30 13:37:54 UTC
Here is a panic using the 32bit version of redhat 4.
This is a 2 CPU  ( 2.0 ghz ) box.
( sysreport output attached )
+++++++++++++++++++++++++
Script started on Tue 30 May 2006 09:32:58 AM EDT

omnitest@chaplin 39% uname -a
Linux chaplin 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 17:05:24 EDT 2006 i686 i686
i386 GNU/Linux
omnitest@chaplin 40% 
omnitest@chaplin 40% 
omnitest@chaplin 40% cd kh*
omnitest@chaplin 41% ls -l
total 60
-rwxr-xr-x  1 omnitest users  135 May 26 11:00 build*
-rwxr-xr-x  1 omnitest users 7116 May 26 11:01 catcher*
-rw-r--r--  1 omnitest users 2374 May 26 11:01 catcher.c
-rwxr-xr-x  1 omnitest users 7195 May 26 11:01 pitcher*
-rw-r--r--  1 omnitest users 2200 May 26 11:00 pitcher.c
-rw-r--r--  1 omnitest users 4811 May 26 11:00 tester.txt
omnitest@chaplin 42% 
omnitest@chaplin 42% 
omnitest@chaplin 42% ldd catcher
        libnsl.so.1 => /lib/libnsl.so.1 (0x00d17000)
        libsctp.so.1 => /usr/lib/libsctp.so.1 (0x001f6000)
        libc.so.6 => /lib/tls/libc.so.6 (0x001fa000)
        /lib/ld-linux.so.2 (0x001dd000)
omnitest@chaplin 43% 
omnitest@chaplin 43% 
omnitest@chaplin 43% ldd pitcher
        libnsl.so.1 => /lib/libnsl.so.1 (0x00d17000)
        libsctp.so.1 => /usr/lib/libsctp.so.1 (0x001f6000)
        libc.so.6 => /lib/tls/libc.so.6 (0x001fa000)
        /lib/ld-linux.so.2 (0x001dd000)
omnitest@chaplin 44% 
omnitest@chaplin 44% 

++++++++++++++++++++++++++++
This panic occurred on the third attempt...

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
e0bb5977
*pde = 18c95001
Oops: 0002 [#1]
SMP 
Modules linked in: sctp nfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4
i2c_dev i2c_core sunrpc dm_mirror dm_multipath dm_mod button battery ac ohci_hcd
eepro100 e1000 e100 mii floppy sg ext3 jbd mptscsih mptsas mptspi mptfc mptscsi
mptbase sd_mod scsi_mod
CPU:    1
EIP:    0060:[<e0bb5977>]    Not tainted VLI
EFLAGS: 00010282   (2.6.9-34.0.1.ELsmp) 
EIP is at sctp_packet_transmit+0x29b/0x49a [sctp]
eax: 00000000   ebx: c179d230   ecx: c16eeeec   edx: c16eeef4
esi: d9113480   edi: dda18a80   ebp: dda18780   esp: dd8edc7c
ds: 007b   es: 007b   ss: 0068
Process catcher (pid: 3673, threadinfo=dd8ed000 task=deddc7b0)
Stack: 00000003 001b9000 00000000 d9323480 6e2de76c c179d224 d82b0000 c16eee00 
       c16eeeec 00000000 d82b1418 00000000 c16eee00 e0bae8b8 00000000 d82b1438 
       1dc16684 00008002 00002710 d82b0000 c16eeeec 00000000 00000000 dd8edcd8 
Call Trace:
 [<e0bae8b8>] sctp_outq_flush+0x3d4/0x402 [sctp]
 [<e0bab3e3>] sctp_make_sack+0xde/0x124 [sctp]
 [<e0bae12c>] sctp_outq_tail+0x10c/0x114 [sctp]
 [<e0ba7579>] sctp_gen_sack+0x73/0x98 [sctp]
 [<e0ba7f00>] sctp_cmd_interpreter+0x12f/0x65f [sctp]
 [<e0ba7d55>] sctp_side_effects+0x29/0xa5 [sctp]
 [<e0ba7d21>] sctp_do_sm+0x6c/0x77 [sctp]
 [<c027b2bc>] skb_dequeue+0x40/0x46
 [<e0ba9e64>] sctp_assoc_bh_rcv+0x95/0xd2 [sctp]
 [<e0badcbc>] sctp_inq_push+0xe/0x10 [sctp]
 [<e0bb60cf>] sctp_backlog_rcv+0xb/0xe [sctp]
 [<c02793ec>] __release_sock+0x39/0x55
 [<c027996b>] release_sock+0x1f/0x4f
 [<e0bb2a47>] sctp_accept+0x9a/0xaa [sctp]
 [<c02b897e>] inet_accept+0x1d/0x92
 [<c02775f9>] sys_accept+0xa8/0x13f
 [<c02009e6>] write_chan+0x1bd/0x1d0
 [<c02d0ca2>] __cond_resched+0x14/0x39
 [<c01b6ccb>] crypto_alloc_hmac_block+0x22/0x34
 [<e0bb64af>] __sctp_hash_endpoint+0x26/0x47 [sctp]
 [<c027995b>] release_sock+0xf/0x4f
 [<e0bb46ec>] sctp_inet_listen+0x89/0x92 [sctp]
 [<c027809c>] sys_socketcall+0x10c/0x1fb
 [<c015a692>] sys_write+0x3c/0x62
 [<c02d268f>] syscall_call+0x7/0xb
Code: 8b 46 30 5a 80 38 00 74 07 89 f0 e8 62 60 ff ff 8b 4c 24 20 8b 54 24 20 8b
41 08 83 c2 08 39 d0 74 29 89 c6 8b 00 ff 4a 08 85 f6 <89> 50 04 89 41 08 c7 46
04 00 00 00 00 c7 06 00 00 00 00 c7 46 
 <0>Kernel panic - not syncing: Fatal exception in interrupt


Comment 39 William Reich 2006-05-30 13:39:12 UTC
Created attachment 130229 [details]
sysreport output on 32 machine 5/30/06

Comment 40 William Reich 2006-05-30 13:48:50 UTC
Created attachment 130230 [details]
another panic on 32bit platform

Comment 41 William Reich 2006-05-30 13:50:50 UTC
The attachment from comment 40 is a panic backtrace that
occurred on the 1st attempt, just as the pitcher was launched.
This was also on the redhat 4 32 bit platform.

Comment 42 Neil Horman 2006-05-30 17:41:05 UTC
I'm sorry, I don't know what to tell you at this point.  I'm still unable to
reproduce this problem here, on either the 64 or the 32 bit kernels.

Two things I notice:

1) Your glibc setup still appears to be broken.  From rpm -Va:
Unsatisfied dependencies for glibc-2.3.4-2.13.i686: glibc-common = 2.3.4-2.13
.......T  c /etc/rpc
S.5....T    /lib/i686/libc-2.3.4.so
..5....T    /lib/i686/libm-2.3.4.so
..5....T    /lib/i686/libpthread-0.10.so
..5....T    /lib/i686/librt-2.3.4.so
..5....T    /lib/ld-2.3.4.so
..5....T    /lib/libBrokenLocale-2.3.4.so
..5....T    /lib/libNoVersion-2.3.4.so
..5....T    /lib/libSegFault.so
..5....T    /lib/libanl-2.3.4.so
...

And so forth.  It looks like all of the shared libraries in the glibc package
have been modified after the installation.  From the installed-rpms file  I note
this:

glibc-kernheaders-2.4-9.1.98.EL-i386
glibc-utils-2.3.4-2.19-i386
glibc-common-2.3.4-2.19-i386
glibc-2.3.4-2.19-i686
glibc-2.3.4-2.13-i686
compat-glibc-headers-2.3.2-95.30-i386
glibc-devel-2.3.4-2.19-i386
glibc-profile-2.3.4-2.19-i386
glibc-headers-2.3.4-2.19-i386
compat-glibc-2.3.2-95.30-i386

You have two different versions of glibc installed, which I don't think is safe
to do.  I don't think it necessarily has much to do with this problem (given
that it's on the receive path), but at this point, since I can't reproduce, I'm
left with looking for discrepancies in our setups.

2) I note that you have the eepro100 driver loaded, yet the sysreport doesn't
show any indication of having the eepro100 device configured to load.  Why is
that driver loaded?  eepro100 and e100 (which you also load, and have configured
in the sysreport), have several overlapping pci id ranges, and IIRC, eepro100 is
no longer actively maintained (i.e. you should use e100 instead).  If you
somehow managed to load eepro100 first, and that driver has a bug which modified
an skbuff while it is in the network stack, that may lead to this problem.  Can
you determine how this driver is getting loaded, and prevent it from doing so,
so you can be sure that e100 is the driver driving the network cards?
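
For anyone reading the rpm -Va lines quoted in item 1 above: each character in the leading attribute field names one verification test that failed, and a '.' means that test passed. Below is a minimal sketch of a decoder, assuming the standard rpm attribute letters (S, M, 5, D, L, U, G, T, and P on newer rpm versions). The function names rpm_verify_flag_meaning and rpm_verify_describe are hypothetical helpers written for illustration and are not part of rpm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Meaning of one rpm -V attribute character, or NULL when the character
 * is '.' (test passed) or is not a recognised attribute letter. */
const char *rpm_verify_flag_meaning(char c)
{
    switch (c) {
    case 'S': return "size";
    case 'M': return "mode";
    case '5': return "digest";
    case 'D': return "device";
    case 'L': return "symlink target";
    case 'U': return "owner";
    case 'G': return "group";
    case 'T': return "mtime";
    case 'P': return "capabilities";
    default:  return NULL;
    }
}

/* Hypothetical helper: reads the attribute field at the start of one
 * "rpm -Va" output line (everything up to the first blank) and writes a
 * comma-separated list of the failed tests into out.
 * Returns the number of failed tests. */
int rpm_verify_describe(const char *line, char *out, size_t outlen)
{
    int failed = 0;

    assert(line != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    for (; *line != '\0' && *line != ' ' && *line != '\t'; line++) {
        const char *meaning = rpm_verify_flag_meaning(*line);
        size_t used;

        if (meaning == NULL)
            continue;               /* '.' or unknown: nothing to report */
        used = strlen(out);
        snprintf(out + used, outlen - used, "%s%s",
                 failed ? ", " : "", meaning);
        failed++;
    }
    return failed;
}
```

Read this way, "S.5....T" on /lib/i686/libc-2.3.4.so says that the size, digest and mtime all differ from what the package database recorded, while "..5....T" reports only a digest and mtime difference.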

Comment 43 William Reich 2006-05-31 17:35:35 UTC
Regarding the NIC card(s),
there were 2 cards in the machine
- a built-in NIC
- and a dual port card.
The IT guy that did the install says that he simply
picked the built-in NIC, and the installation did the rest.

Comment 44 William Reich 2006-05-31 19:52:19 UTC
my freshly built 64 bit box has become available.
This is redhat 4 AS with "uptodate" to get it to Update 3.
Just redhat provided stuff on the box...

omnitest@grant 8% uname -a
Linux grant 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 16:59:36 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux


The panic occurred on the 12th attempt.

Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: 
<ffffffffa0232680>{:sctp:sctp_packet_transmit+750}
PML4 11b31e067 PGD 11ad2f067 PMD 0 
Oops: 0002 [1] SMP 
CPU 2 
Modules linked in: sctp nfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4
i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath
dm_mod button battery ac uhci_hcd ehci_hcd e1000 ext3 jbd ata_piix libata
aic79xx sd_mod scsi_mod
Pid: 3940, comm: catcher Not tainted 2.6.9-34.0.1.ELsmp
RIP: 0010:[<ffffffffa0232680>] <ffffffffa0232680>{:sctp:sctp_packet_transmit+750}
RSP: 0018:000001011b7b59a8  EFLAGS: 00010206
RAX: 000001011ed4fd60 RBX: 000001011b319800 RCX: 000001000000e000
RDX: 0000000000000206 RSI: 0000000000000010 RDI: 0000000000000000
RBP: 00000101210ea200 R08: 0000000000000010 R09: 0000000000000004
R10: 0000000000000004 R11: 00000000000000e4 R12: 000001011ed4de30
R13: 0000000006701d6e R14: 000001011ed4fd58 R15: 000001011ed4fc00
FS:  0000002a959b86e0(0000) GS:ffffffff804d7d00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000000d7e4e000 CR4: 00000000000006e0
Process catcher (pid: 3940, threadinfo 000001011b7b4000, task 00000101235bf030)
Stack: 00000101212793a0 0000000000000020 000001011ad82080 000001011ed4de24 
       00000101232d4000 0000000000000000 0000000000000000 0000000000000000 
       00000101232d5610 000001011ed4fd58 
Call Trace:<ffffffffa022b47d>{:sctp:sctp_outq_flush+1457}
<ffffffffa0227a04>{:sctp:sctp_make_sack+243} 
       <ffffffffa022b62d>{:sctp:sctp_outq_tail+336}
<ffffffffa022360a>{:sctp:sctp_gen_sack+126} 
       <ffffffffa0223919>{:sctp:sctp_cmd_interpreter+638} 
       <ffffffffa0224281>{:sctp:sctp_side_effects+37}
<ffffffffa022439f>{:sctp:sctp_do_sm+119} 
       <ffffffffa02262c3>{:sctp:sctp_assoc_bh_rcv+178}
<ffffffffa02328ed>{:sctp:sctp_backlog_rcv+17} 
       <ffffffff802a4e5c>{release_sock+88}
<ffffffffa0231eb3>{:sctp:sctp_accept+446} 
       <ffffffff80134e96>{autoremove_wake_function+0}
<ffffffff80134e96>{autoremove_wake_function+0} 
       <ffffffff802ea34b>{inet_accept+40} <ffffffff802a34e3>{sys_accept+201} 
       <ffffffff801dd81d>{crypto_alloc_hmac_block+43}
<ffffffffa02329e8>{:sctp:__sctp_hash_endpoint+49} 
       <ffffffff802a4e14>{release_sock+16}
<ffffffffa022f45c>{:sctp:sctp_inet_listen+335} 
       <ffffffff80110236>{system_call+126} 

Code: 48 89 47 08 49 89 7e 08 48 c7 43 08 00 00 00 00 48 c7 03 00 
RIP <ffffffffa0232680>{:sctp:sctp_packet_transmit+750} RSP <000001011b7b59a8>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Oops
 ~


Comment 45 William Reich 2006-05-31 19:54:25 UTC
Created attachment 130307 [details]
fresh 64bit platform - sysreport of such

please note that at the time this sysreport was run,
SElinux was "enforcing".
However, I changed it to "disabled" before running the test
that created the panic.

Comment 46 Neil Horman 2006-06-14 11:30:07 UTC
This may have something to do with this crash:
http://sourceforge.net/mailarchive/message.php?msg_id=18516158

I'm investigating

Comment 47 Neil Horman 2006-06-23 20:24:06 UTC
Created attachment 131463 [details]
patch to fix sctp_inq_pop race

quick update:

1) The comment in comment #43 does not appear related to this bug (although it
also needs fixing in RHEL)

2) The problem, as diagnosed so far, seems due to a race in sctp_inq_pop, in
which we check a queue's empty status before we dequeue from it without holding
any lock in between.  This leads to a NULL return from skb_dequeue sometimes
since we can get in sctp_inq_pop from tasklet and process context.

3) After fixing that, I ran into another (apparently unrelated) crash that
happens when parsing a remote error frame.  I'm debugging that now.
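
To illustrate the check-then-dequeue pattern described in item 2, here is a minimal userspace sketch. It is not the attached patch and not kernel code: struct node, struct queue and every function name below are hypothetical stand-ins, and a C11 atomic_flag spinlock stands in for the queue lock. The only behaviour taken from the kernel is that the pop returns NULL on an empty queue, as skb_dequeue does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct node {
    struct node *next;
    int payload;
};

struct queue {
    atomic_flag lock;       /* tiny spinlock standing in for the queue lock */
    struct node *head;
    struct node *tail;
};

static void queue_lock(struct queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                   /* spin until the flag was previously clear */
}

static void queue_unlock(struct queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

void queue_init(struct queue *q)
{
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
}

void queue_push(struct queue *q, struct node *n)
{
    n->next = NULL;
    queue_lock(q);
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    queue_unlock(q);
}

/* Returns NULL on an empty queue, mirroring what skb_dequeue does. */
struct node *queue_pop(struct queue *q)
{
    struct node *n;

    queue_lock(q);
    n = q->head;
    if (n) {
        q->head = n->next;
        if (!q->head)
            q->tail = NULL;
        n->next = NULL;
    }
    queue_unlock(q);
    return n;
}

/*
 * Racy shape (do not use):
 *     if (q->head != NULL) {      unlocked emptiness check
 *         n = queue_pop(q);       another context may have drained q by now
 *         use(n->payload);        n can be NULL here
 *     }
 * Safe shape: the locked pop is the only emptiness test.
 */
int consume_one(struct queue *q, int *out)
{
    struct node *n = queue_pop(q);

    if (n == NULL)
        return 0;           /* queue was empty, or another context won */
    *out = n->payload;
    return 1;
}
```

The point of the safe shape is that the emptiness decision and the removal happen under the same lock, so a second context entering between them cannot leave the caller holding a NULL it did not expect.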

Comment 48 Neil Horman 2006-06-30 18:18:44 UTC
Update:

After fixing (or so I thought) the remote frame error, I hit yet another bug.
It's becoming obvious that there is a subtle sharing of skbs here that is
corrupting their data which I am not yet seeing, especially as I go back and see
that the error chunk has a completely invalid cause code.  I seem to have some
confirmation of this at the moment as I'm running a test kernel in which I copy
every received skb and discard the actual received one in sctp_rcv, and I've
been running for over an hour without a failure.  I would suspect as well that
if this is run over an interface other than the loopback interface this panic
would never occur, which would in turn suggest an invalid sharing of skbs
between tx and rx paths.
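
The copy-and-discard experiment described above can be modelled in userspace roughly as follows. This is only a sketch of the idea, not the test kernel's code: struct pkt and all of the function names are hypothetical, and the comment does not say which kernel calls were used (skb_copy and kfree_skb would be the obvious candidates, but that is an assumption). The receiver takes a private copy and drops its reference to the buffer it was handed, so later writes by the sender cannot reach the receiver's data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a packet buffer; not struct sk_buff. */
struct pkt {
    int refs;               /* number of holders of this buffer */
    size_t len;
    unsigned char *data;
};

/* Allocates a buffer holding a copy of src, with one reference. */
struct pkt *pkt_alloc(const void *src, size_t len)
{
    struct pkt *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->data = malloc(len ? len : 1);
    if (p->data == NULL) {
        free(p);
        return NULL;
    }
    if (len)
        memcpy(p->data, src, len);
    p->len = len;
    p->refs = 1;
    return p;
}

void pkt_get(struct pkt *p)
{
    p->refs++;
}

/* Drops one reference and frees the buffer when the last one goes away. */
void pkt_put(struct pkt *p)
{
    assert(p != NULL && p->refs > 0);
    if (--p->refs == 0) {
        free(p->data);
        free(p);
    }
}

/* Receive-side isolation: work on a private copy and give up the
 * reference to the buffer that was handed in. Returns NULL if the copy
 * could not be allocated. */
struct pkt *rx_private_copy(struct pkt *received)
{
    struct pkt *mine = pkt_alloc(received->data, received->len);

    pkt_put(received);
    return mine;
}
```

If the crashes stop once the receive path only ever touches its own copy, as reported above, that points at someone else still writing to the original buffer, which is what the loopback suspicion amounts to.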

Continuing to investigate

Comment 49 Neil Horman 2006-08-01 13:12:36 UTC
I've posted upstream about this problem, since it seems to also happen on the
latest upstream kernel and lksctp-tools.  I'm hoping to find others having this
problem, in an effort to get more data to correlate with this issue and help
find a root cause.

Comment 52 Neil Horman 2006-11-10 18:07:33 UTC
Created attachment 140916 [details]
patch to fix peeloff race condition

This is a variant of a patch that I had tried before (which at the time didn't
fix the problem).  But looking at my previous attempt I see that my queuing of
backlogged frames was done on skbs, not chunks as it should have been,
which in the receive path would have led to data corruption and an eventual
oops. With this variant of the patch, backlog queuing works properly, and I've
had 24 hours running with the pitcher/catcher test and no panics.  I'm putting
together a test kernel with this patch for all to try.  I'll report once it's
up.

Comment 53 Neil Horman 2006-11-10 21:14:16 UTC
new test kernel available at
http://people.redhat.com/nhorman
please test it out and report results
thanks!

Comment 54 William Reich 2006-11-16 13:18:18 UTC
my ability to test this fix has been interrupted.
The box that I previously used for this has been reassigned.
I need to purchase another entitlement to get RH installed
onto another box.

Comment 56 Neil Horman 2006-11-16 16:02:50 UTC
Ok, I'm going to assume then that my testing was sufficient, as I've run for a
few days (albeit intermittently) without another crash.  I'll post based on
that.  Please re-open another bug if the problem resurfaces after 4.5 comes out.

Comment 58 Jason Baron 2006-11-21 21:23:20 UTC
committed in stream U5 build 42.26. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 60 Jay Turner 2006-12-18 14:26:24 UTC
QE ack for RHEL4.5.

Comment 62 masanari iida 2007-02-20 11:49:38 UTC
I have a crash which is very similar to comment #20.
Do you think the 4.5 kernel will fix this panic?

<1>Unable to handle kernel NULL pointer dereference at virtual address 00000004
<1> printing eip:
<4>c0147a4c
<1>*pde = 34881001
<1>Oops: 0002 [#1]
<4>SMP
<1>Oops: 0002 [#1]
<4>SMP
<4>Modules linked in: parport_pc parport 8021q mptctl(U) sg cpqci(U) md5 ipv6 netconsole netdump i2c_dev i2c_core sunrpc joydev dm_mirror dm_mod button battery ac ehci_hcd uhci_hcd hw_random e1000(U) tg3(U) bonding(U) ext3 jbd mptspi(U) mptsas(U) ata_piix libata mptscsih(U) mptbase(U) sd_mod scsi_mod
<4>CPU:    0
<4>EIP:    0060:[<c0147a4c>]    Tainted: P      VLI
<4>EFLAGS: 00010046   (2.6.9-34.ELsmp)
<4>EIP is at cache_reap+0x110/0x1a1
<4>eax: 00000000   ebx: f7ffd240   ecx: 00000005   edx: 00000000
<4>esi: f7ffd180   edi: f3248d20   ebp: f7ffd274   esp: f7feff54
<4>ds: 007b   es: 007b   ss: 0068
<4>Process events/0 (pid: 6, threadinfo=f7fef000 task=c3765130)
<4>Stack: 00000005 c353c9a4 c353c9a0 c353c9a4 00000283 c3766000 c0130943 00000000
<4>       c014793c ffffffff ffffffff 00000001 00000000 c011e71b 00010000 00000000
<4>       c0411fa0 c353ade0 00000000 00000000 c3765130 c011e71b 00100100 00200200
<4>Call Trace:
<4> [<c0130943>] worker_thread+0x168/0x1d5
<4> [<c014793c>] cache_reap+0x0/0x1a1
<4> [<c011e71b>] default_wake_function+0x0/0xc
<4> [<c011e71b>] default_wake_function+0x0/0xc
<4> [<c01307db>] worker_thread+0x0/0x1d5
<4> [<c0133ecd>] kthread+0x73/0x9b
<4> [<c0133e5a>] kthread+0x0/0x9b
<4> [<c01041f5>] kernel_thread_helper+0x5/0xb
<4>Code: 04 24 8b be 98 00 00 00 8d 86 98 00 00 00 39 c7 74 5d 83 7f 10 00 74 08
 0f 0b f1 0a 6a 64 2e c0 8b 57 04 8d 9e c0 00 00 00 8b 07 <89> 50 04 89 02 c7 47
 04 00 02 20 00 c7 07 00 01 10 00 8b 86 a0

Comment 63 Mike Gahagan 2007-04-03 19:17:48 UTC
Patch is in -52 and the fix looks to be already verified by at least 1 customer. 

Comment 65 Red Hat Bugzilla 2007-05-08 01:16:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

