Bug 15178

Summary: Panic in tcp stack
Product: [Retired] Red Hat Linux
Reporter: Brian Brock <bbrock>
Component: kernel
Assignee: Michael K. Johnson <johnsonm>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: medium
Version: 6.2
CC: mnovi
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2000-08-22 16:59:41 UTC
Type: ---

Description Brian Brock 2000-08-02 20:43:41 UTC
A customer is getting a kernel panic 2-3 times daily, under the following
circumstances:

about 300 TCP connections (100 megabit) using a custom proxy server.

panics always occur in the TCP stack, sometimes in tcp_ack(), sometimes in
tcp_retrans_try_collapse().

no dynamic allocation (apart from the stack itself) in the program.

enough free memory that swap isn't used.

The machine is currently running 2.2.12-20 and is reportedly less stable with
more recent kernels. Logs of the panic (some copied by hand) and an upgrade to
2.2.16-3 are forthcoming.


Hardware:

Compaq 1850 server
- RAID 1, NCR RAID controller
- 512 MB ECC RAM
- TLAN NIC
- SMP (2 processors; crashes less often: once a week at most, closer to once every two weeks)
- an otherwise identical (apart from the number of CPUs) UP machine (vanilla UP kernel)

AP 206
- IDE drive
- 256 MB RAM
- Intel PRO/100 NIC
- same version


lsmod output (from the Compaq 1850):

tlan                   19892   1 (autoclean)
ncr53c8xx              52264   0 (unused)
cpqarray               15200   6

I also have the System.map and ksyms. They are not included here because of
their length, but I'll email them to whoever asks.  Let me know if any other
output is required.

Traces from customer follow:

---------------
        We've got the 2.2.12-20 kernel.  These dumps are representative of
the ones we've been getting.  One always crashes in
tcp_retrans_try_collapse and the other in tcp_ack.  The crash in tcp_ack
is always on the same instruction.  The crash in tcp_retrans_try_collapse
is always within an instruction or two of the same place (it will sometimes
crash on the 83 38 01).  Just for variety, I guess.

NOTE: I believe these dumps are reliable because I can use gdb to
disassemble vmlinuz and I can find the code byte-for-byte where it should
be.
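
The byte-for-byte check described in the note can be sketched in Python. This
is only an illustration, not the procedure the customer used: the helper names
parse_code_bytes and find_code are hypothetical, and the image handed to
find_code is assumed to be the raw bytes of an uncompressed kernel image. The
stand-in image below is made-up filler data, not anything from this report.

```python
from typing import List


def parse_code_bytes(code_line: str) -> bytes:
    """Convert an oops 'Code:' line into the raw bytes it lists."""
    # Keep only the text after the 'Code:' label, if the label is present.
    hex_part = code_line.split("Code:", 1)[-1]
    return bytes(int(token, 16) for token in hex_part.split())


def find_code(image: bytes, code: bytes) -> List[int]:
    """Return every offset in `image` at which `code` occurs."""
    offsets = []
    pos = image.find(code)
    while pos != -1:
        offsets.append(pos)
        pos = image.find(code, pos + 1)
    return offsets


# Code bytes taken from the tcp_ack dump in this report.
tcp_ack_code = parse_code_bytes(
    "Code: 2b 42 44 8b 4b 50 29 c1 89 c8 85 c0 7d 05 b8 01 00 00 00 50"
)

# Stand-in for a kernel image (hypothetical filler, for illustration only).
fake_image = b"\x90" * 16 + tcp_ack_code + b"\x90" * 16
print(find_code(fake_image, tcp_ack_code))  # prints [16]
```

A single hit at the offset that corresponds to the reported EIP would support
the claim that the dump matches the running kernel; several hits, or none,
would suggest the Code line was miscopied or belongs to a different build.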



WARNING: This version of ksymoops is obsolete.
WARNING: The current version can be obtained from ftp://ftp.ocs.com.au/pub/ksymoops
Options used: -V (default)
              -o /lib/modules/2.2.12-20/ (default)
              -k /proc/ksyms (default)
              -l /proc/modules (default)
              -m System.map (specified)
              -c 1 (default)

EIP: 0010:[<c01639b1>]
eax: 0031eef1 ebx: d2c29470 ecx: 00000000 edx: 0000000
esi: 00000000 edi: 00000006 ebp: d2c293c0 esp: c0225e24
ds: 0018 es: 0018 ss: 0018
Call Trace: [<c0164cf5>] [<c0169d43>] [<c0169eb2>] [<c016a14b>] [<c015cc33>]
[<c015cf31>] [<c014f2bd>] [<c014f2bd>] [<c01184f5>] [<c010ae6b>] [<c101ab38>]
[<c010b5fd>] [<c0106000>] [<c0108620>] [<c0109d08>] [<c0106000>] [<c010607b>]
[<c0106000>] [<c0100176>]
Code: 2b 42 44 8b 4b 50 29 c1 89 c8 85 c0 7d 05 b8 01 00 00 00 50

>>EIP: c01639b1 <tcp_ack+2a1/370>
Trace: c0164cf5 <tcp_rcv_established+449/5e8>
Trace: c0169d43 <tcp_v4_do_rcv+6f/178>
Trace: c0169eb2 <tcp_v4_rcv+66/384>
Trace: c016a14b <tcp_v4_rcv+2ff/384>
Trace: c015cc33 <ip_local_deliver+223/27c>
Trace: c015cf31 <ip_rcv+2a5/2d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c0109d08 <system_call+34/38>
Code:  c01639b1 <tcp_ack+2a1/370>   00000000 <_EIP>: <===
Code:  c01639b1 <tcp_ack+2a1/370>   0:  2b 42 44          subl   0x44(%edx),%eax <===
Code:  c01639b4 <tcp_ack+2a4/370>   3:  8b 4b 50          movl   0x50(%ebx),%ecx
Code:  c01639b7 <tcp_ack+2a7/370>   6:  29 c1             subl   %eax,%ecx
Code:  c01639b9 <tcp_ack+2a9/370>   8:  89 c8             movl   %ecx,%eax
Code:  c01639bb <tcp_ack+2ab/370>   a:  85 c0             testl  %eax,%eax
Code:  c01639bd <tcp_ack+2ad/370>   c:  7d 05             jnl    c01639c4 <tcp_ack+2b4/370>
Code:  c01639bf <tcp_ack+2af/370>   e:  b8 01 00 00 00    movl   $0x1,%eax
Code:  c01639c4 <tcp_ack+2b4/370>  13:  50                pushl  %eax


5 warnings issued.  Results may not be reliable.
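
For readers unfamiliar with the <name+offset/size> notation in the ksymoops
output, the following Python sketch shows how the start and end address of the
function can be recovered from the reported EIP. The helper name
function_bounds is hypothetical; the addresses are the ones from the tcp_ack
dump above.

```python
import re


def function_bounds(eip: int, symbol: str):
    """Return (name, start, end) for a ksymoops '<name+offset/size>' tag.

    offset is the distance of `eip` from the start of the function and
    size is the length of the function, both in hexadecimal.
    """
    match = re.fullmatch(r"<(\w+)\+([0-9a-f]+)/([0-9a-f]+)>", symbol)
    if match is None:
        raise ValueError("not a <name+offset/size> tag: %r" % symbol)
    name = match.group(1)
    offset = int(match.group(2), 16)
    size = int(match.group(3), 16)
    start = eip - offset
    return name, start, start + size


name, start, end = function_bounds(0xC01639B1, "<tcp_ack+2a1/370>")
print(name, hex(start), hex(end))  # prints: tcp_ack 0xc0163710 0xc0163a80
```

The computed start address can be compared against the tcp_ack entry in
System.map as a further consistency check on the dump.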


-----



WARNING: This version of ksymoops is obsolete.
WARNING: The current version can be obtained from ftp://ftp.ocs.com.au/pub/ksymoops
Options used: -V (default)
              -o /lib/modules/2.2.12-20/ (default)
              -k /proc/ksyms (default)
              -l /proc/modules (default)
              -m System.map (specified)
              -c 1 (default)

EIP: 0010:[<c016612e>]
EFLAGS: 00010246
eax: 00000000 ebx: c8742c20 ecx: 00000000 edx: 00000000
esi: c895d7f0 edi: c8742c20 ebp: c0225dd8 esp: c0225dc4
Call Trace: [<c014d4c5>] [<c0166483>] [<c015cc33>] [<c015cf31>] [<c014f2bd>]
[<c01184f5>] [<c010ae6b>] [<c010ab38>] [<c01085fd>] [<c0106000>] [<c0108620>]
[<c0109d08>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100176>]
Code: 80 79 66 00 74 11 8b 81 88 00 00 00 83 38 01 0f 95 c0 25 ff

>>EIP: c016612e <tcp_retrans_try_collapse+3a/208>
Trace: c014d4c5 <__kfree_skb+a1/a8>
Trace: c0166483 <tcp_retransmit_skb+a3/164>
Trace: c015cc33 <ip_local_deliver+223/27c>
Trace: c015cf31 <ip_rcv+2a5/2d4>
Trace: c014f2bd <net_bh+179/1d4>
Trace: c01184f5 <do_bottom_half+45/64>
Trace: c0109d08 <system_call+34/38>
Code:  c016612e <tcp_retrans_try_collapse+3a/208>   00000000 <_EIP>: <===
Code:  c016612e <tcp_retrans_try_collapse+3a/208>   0:  80 79 66 00          cmpb   $0x0,0x66(%ecx) <===
Code:  c0166132 <tcp_retrans_try_collapse+3e/208>   4:  74 11                je     c0166145 <tcp_retrans_try_collapse+51/208>
Code:  c0166134 <tcp_retrans_try_collapse+40/208>   6:  8b 81 88 00 00 00    movl   0x88(%ecx),%eax
Code:  c016613a <tcp_retrans_try_collapse+46/208>   c:  83 38 01             cmpl   $0x1,(%eax)
Code:  c016613d <tcp_retrans_try_collapse+49/208>   f:  0f 95 c0             setne  %al
Code:  c0166140 <tcp_retrans_try_collapse+4c/208>  12:  25 ff 00 00 00       andl   $0xff,%eax


4 warnings issued.  Results may not be reliable.
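
As a reading aid for the two dumps, the Python sketch below computes the memory
address that each faulting instruction tries to access, using the register
values printed in the dumps. The helper names parse_registers and
effective_address are hypothetical, and only the simple disp(%reg) operand form
seen in these dumps is handled. The edx value in the first dump is printed with
seven digits (it was copied by hand), but it still parses as zero.

```python
import re


def parse_registers(text: str) -> dict:
    """Extract 32-bit register values from oops register lines."""
    pairs = re.findall(r"\b(e[a-z]{2}):?\s+([0-9a-f]+)", text)
    return {name: int(value, 16) for name, value in pairs}


def effective_address(operand: str, regs: dict) -> int:
    """Evaluate an AT&T-syntax 'disp(%reg)' memory operand."""
    match = re.fullmatch(r"(0x[0-9a-f]+)?\(%(\w+)\)", operand)
    if match is None:
        raise ValueError("unsupported operand: %r" % operand)
    disp = int(match.group(1), 16) if match.group(1) else 0
    return (regs[match.group(2)] + disp) & 0xFFFFFFFF


# First dump: subl 0x44(%edx),%eax in tcp_ack
regs_ack = parse_registers(
    "eax: 0031eef1 ebx: d2c29470 ecx: 00000000 edx: 0000000"
)
print(hex(effective_address("0x44(%edx)", regs_ack)))  # prints 0x44

# Second dump: cmpb $0x0,0x66(%ecx) in tcp_retrans_try_collapse
regs_collapse = parse_registers(
    "eax: 00000000 ebx: c8742c20 ecx: 00000000 edx: 00000000"
)
print(hex(effective_address("0x66(%ecx)", regs_collapse)))  # prints 0x66
```

Both addresses fall just above zero, which is consistent with a field being
read through a NULL structure pointer in each case. The dumps alone do not
show why the pointer was NULL.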

Comment 1 Bill Nottingham 2000-08-02 21:12:42 UTC
They *really* need to try a 2.2.16-based kernel; many
TCP bugs were fixed in those.

Comment 2 Brian Brock 2000-08-04 18:12:42 UTC
Initially, stability goes up with 2.2.16-3.

Waiting for a few days of production use to see if the problem is truly gone, or
if it's just occurring less often now.

Comment 3 Matt Novi 2000-08-04 18:52:12 UTC
After running with 2.2.16 for one day there was no panic. However, this kernel
runs too slowly for our production environment (the 2.2.16 box was several
packets behind our 2.2.12 box all day).  I've backed off to 2.2.14, which
appears to be just as fast as the 2.2.12 kernel.

Comment 4 Brian Brock 2001-06-25 14:26:02 UTC
Problem appears resolved with no change in many months.  Changing status to
"closed".