A customer is getting a kernel panic 2-3 times daily, under the following circumstances:

- About 300 TCP connections (100 megabit) using a custom proxy server.
- Panics always occur in the TCP stack, sometimes in tcp_ack(), sometimes in tcp_retrans_try_collapse().
- No dynamic allocation (apart from the stack itself) in the program.
- Enough free memory that swap isn't used.

Currently running 2.2.12-20, more unstable with more recent kernels... Both logs of the panic (some copied by hand) and an upgrade to 2.2.16-3 are forthcoming.

Hardware:

Compaq 1850 server
- RAID 1, NCR RAID controller
- 512 ECC RAM
- TLAN NIC
- SMP (2 proc, not crash as much, once a week at most, closer to two weeks)
- an (apart from the number of CPUs) UP machine (vanilla UP kernel)

AP 206
- IDE drive
- 256 RAM
- Intel NIC, Pro 100
- same version

lsmod output (from the Compaq 1850):

tlan        19892   1 (autoclean)
ncr53c8xx   52264   0 (unused)
cpqarray    15200   6

I also have the System.map and ksyms, not included here because of their length; I'll email them to whoever asks. Let me know if any other output is required.

Traces from customer follow:

---------------

We've got the 2.2.12-20 kernel. These dumps are representative of the ones we've been getting. One always crashes in tcp_retrans_try_collapse and the other in tcp_ack. The crash in tcp_ack is always on the same instruction. The crash in tcp_retrans_try_collapse is always within an instruction or two (it will crash on the 83 38 01 sometimes). Just for variety, I guess.

NOTE: I believe these dumps are reliable because I can use gdb to disassemble vmlinuz and I can find the code byte-for-byte where it should be.

WARNING: This version of ksymoops is obsolete.
WARNING: The current version can be obtained from ftp://ftp.ocs.com.au/pub/ksymoops
Options used: -V (default)
              -o /lib/modules/2.2.12-20/ (default)
              -k /proc/ksyms (default)
              -l /proc/modules (default)
              -m System.map (specified)
              -c 1 (default)

EIP: 0010:[<c01639b1>]
eax: 0031eef1 ebx: d2c29470 ecx: 00000000 edx: 0000000
esi: 00000000 edi: 00000006 ebp: d2c293c0 esp: c0225e24
ds: 0018 es: 0018 ss: 0018
Call Trace: [<c0164cf5>] [<c0169d43>] [<c0169eb2>] [<c016a14b>] [<c015cc33>]
  [<c015cf31>] [<c014f2bd>] [<c014f2bd>] [<c01184f5>] [<c010ae6b>] [<c101ab38>] [<c010b5fd>] [<c0106000>]
  [<c0108620>] [<c0109d08>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100176>]
Code: 2b 42 44 8b 4b 50 29 c1 89 c8 85 c0 7d 05 b8 01 00 00 00 50

>>EIP: c01639b1 <tcp_ack+2a1/370>
Trace: c0164cf5 <tcp_rcv_established+449/5e8>
Trace: c0169d43 <tcp_v4_do_rcv+6f/178>
Trace: c0169eb2 <tcp_v4_rcv+66/384>
Trace: c016a14b <tcp_v4_rcv+2ff/384>
Trace: c015cc33 <ip_local_deliver+223/27c>
Trace: c015cf31 <ip_rcv+2a5/2d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c0109d08 <system_call+34/38>
Code: c01639b1 <tcp_ack+2a1/370> 00000000 <_EIP>: <===
Code: c01639b1 <tcp_ack+2a1/370>  0: 2b 42 44        subl 0x44(%edx),%eax <===
Code: c01639b4 <tcp_ack+2a4/370>  3: 8b 4b 50        movl 0x50(%ebx),%ecx
Code: c01639b7 <tcp_ack+2a7/370>  6: 29 c1           subl %eax,%ecx
Code: c01639b9 <tcp_ack+2a9/370>  8: 89 c8           movl %ecx,%eax
Code: c01639bb <tcp_ack+2ab/370>  a: 85 c0           testl %eax,%eax
Code: c01639bd <tcp_ack+2ad/370>  c: 7d 05           jnl c01639c4 <tcp_ack+2b4/370>
Code: c01639bf <tcp_ack+2af/370>  e: b8 01 00 00 00  movl $0x1,%eax
Code: c01639c4 <tcp_ack+2b4/370> 13: 50              pushl %eax

5 warnings issued. Results may not be reliable.

-----

WARNING: This version of ksymoops is obsolete.
WARNING: The current version can be obtained from ftp://ftp.ocs.com.au/pub/ksymoops
Options used: -V (default)
              -o /lib/modules/2.2.12-20/ (default)
              -k /proc/ksyms (default)
              -l /proc/modules (default)
              -m System.map (specified)
              -c 1 (default)

EIP: 0010:[<c016612e>]
EFLAGS: 00010246
eax: 00000000 ebx: c8742c20 ecx: 00000000 edx: 00000000
esi: c895d7f0 edi c8742c20 ebp: c0225dd8 esp: c0225dc4
Call Trace: [<c014d4c5>] [<c0166483>] [<c015cc33>] [<c015cf31>] [<c014f2bd>] [<c01184f5>] [<c010ae6b>] [<c010ab38>]
  [<c01085fd>] [<c0106000>] [<c0108620>] [<c0109d08>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100176>]
Code: 80 79 66 00 74 11 8b 81 88 00 00 00 83 38 01 0f 95 c0 25 ff

>>EIP: c016612e <tcp_retrans_try_collapse+3a/208>
Trace: c014d4c5 <__kfree_skb+a1/a8>
Trace: c0166483 <tcp_retransmit_skb+a3/164>
Trace: c015cc33 <ip_local_deliver+223/27c>
Trace: c015cf31 <ip_rcv+2a5/2d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c01184f5 <do_bottom_half+45/64>
Trace: c0109d08 <system_call+34/38>
Code: c016612e <tcp_retrans_try_collapse+3a/208> 00000000 <_EIP>: <===
Code: c016612e <tcp_retrans_try_collapse+3a/208>  0: 80 79 66 00        cmpb $0x0,0x66(%ecx) <===
Code: c0166132 <tcp_retrans_try_collapse+3e/208>  4: 74 11              je c0166145 <tcp_retrans_try_collapse+51/208>
Code: c0166134 <tcp_retrans_try_collapse+40/208>  6: 8b 81 88 00 00 00  movl 0x88(%ecx),%eax
Code: c016613a <tcp_retrans_try_collapse+46/208>  c: 83 38 01           cmpl $0x1,(%eax)
Code: c016613d <tcp_retrans_try_collapse+49/208>  f: 0f 95 c0           setne %al
Code: c0166140 <tcp_retrans_try_collapse+4c/208> 12: 25 ff 00 00 00     andl $0xff,%eax

4 warnings issued. Results may not be reliable.
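A note on reading the two dumps above: in both, the flagged instruction reads memory through a register that the register dump shows as zero (edx in tcp_ack, ecx in tcp_retrans_try_collapse), so each access lands at a small offset from address 0. The Python sketch below only redoes that arithmetic, and also recovers each function's start address from the symbol+offset notation, which may help when matching against a gdb disassembly of vmlinuz. The helper names are made up for illustration, and it assumes the hand-copied "edx: 0000000" (seven digits) means zero.

```python
# Sketch only: redo the address arithmetic for the two flagged instructions.
# Register values and offsets are copied from the ksymoops output above;
# the helper names are hypothetical and not part of any tool.

def effective_address(base: int, displacement: int) -> int:
    """Address read by an x86 disp(%reg) operand, wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF

def function_start(eip: int, offset: int) -> int:
    """Start of the function for an 'EIP <symbol+offset/size>' entry."""
    return (eip - offset) & 0xFFFFFFFF

# (symbol, EIP, offset into symbol, base register, register value, displacement)
FAULTS = [
    # subl 0x44(%edx),%eax  with edx assumed to be zero ("edx: 0000000")
    ("tcp_ack", 0xC01639B1, 0x2A1, "edx", 0x0, 0x44),
    # cmpb $0x0,0x66(%ecx)  with ecx: 00000000
    ("tcp_retrans_try_collapse", 0xC016612E, 0x3A, "ecx", 0x0, 0x66),
]

def describe(faults):
    """Return one summary line per faulting instruction."""
    lines = []
    for symbol, eip, offset, reg, value, disp in faults:
        start = function_start(eip, offset)
        addr = effective_address(value, disp)
        lines.append(
            f"{symbol}: starts at {start:#010x}, "
            f"{reg}={value:#010x}, reads from {addr:#010x}"
        )
    return lines

if __name__ == "__main__":
    for line in describe(FAULTS):
        print(line)
```

Both computed read addresses fall far below the c0000000 range that the kernel addresses in these traces occupy, which is consistent with a structure pointer being NULL at the point of the crash. The dumps alone do not say which pointer or why.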
They *really* need to try a 2.2.16-based kernel; many TCP bugs were fixed in those.
Initially, stability goes up with 2.2.16-3. Waiting for a few days of production use to see if the problem is truly gone, or if it's just occurring less often now.
After running with 2.2.16 for one day there was no panic. However, this kernel runs too slowly for our production environment (the 2.2.16 box was several packets behind our 2.2.12 box all day). I've backed off to 2.2.14, which appears to be just as fast as the 2.2.12 kernel.
Problem appears resolved with no change in many months. Changing status to "closed".