Bug 437674 - Kernel Panic in tcp_retransmit_skb
Summary: Kernel Panic in tcp_retransmit_skb
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: i386
OS: Linux
high
urgent
Target Milestone: rc
: ---
Assignee: Neil Horman
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-03-16 03:14 UTC by Nox
Modified: 2018-10-20 01:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:22:11 UTC
Target Upstream Version:
Embargoed:


Attachments
Kernel panic screenshots (134.01 KB, image/png)
2008-03-16 03:14 UTC, Nox
New panic (87.69 KB, image/png)
2008-08-18 18:10 UTC, Joseph W. Breu
full kernel panic output (117.89 KB, text/plain)
2008-08-18 21:18 UTC, Joseph W. Breu


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1024 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update 2009-05-18 14:57:26 UTC

Description Nox 2008-03-16 03:14:33 UTC
Description of problem:
Servers are going down because of kernel panics.
Both servers are in a cluster and use the same NFS share from a local machine.

Version-Release number of selected component (if applicable):
Kernel version: 2.6.9-67.0.4.ELsmp
Server: DELL PE1855

How reproducible:
Cannot reproduce this because we do not know when, or under what conditions, it appears.

Additional info:
Screenshots of the kernel panics on both machines are attached.

Comment 1 Nox 2008-03-16 03:14:33 UTC
Created attachment 298178 [details]
Kernel panic screenshots

Comment 3 Neil Horman 2008-03-19 17:51:43 UTC
Sure.  I thought of bz 319181 as well when I saw the subject header here. My
thought (as in the other bz) is that there is a race between a fast retransmit
from a duplicate ACK and the popping of the write timer, which leads to the
corruption of the write queue.  I've not been able to find it yet, but if I do
I'll post a patch here for you to try.  Have you only seen it once, or can you
reproduce it on occasion? (I know what you said above, but random reproduction
would be better than a single failure).

Comment 4 Joseph W. Breu 2008-08-18 18:10:47 UTC
Created attachment 314488 [details]
New panic

Comment 5 Joseph W. Breu 2008-08-18 18:17:35 UTC
Hello,

We have 6 servers that are seeing this bug.  We cannot duplicate it, but we have seen it happen quite often over the last 3 days on all 6 machines.

The 6 machines are:

HP DL385        2.6.9-55.0.2.ELsmp
HP DL385        2.6.9-67.0.15.ELsmp
HP DL385        2.6.9-55.0.2.ELsmp
HP DL385        2.6.9-67.ELsmp
HP DL385        2.6.9-67.ELsmp
Dell 2950 MK3   2.6.9-78.ELsmp

All machines have the Broadcom Corporation NetXtreme II BCM5708 NIC.

I've attached a PNG of the panic.

Comment 6 Neil Horman 2008-08-18 20:06:26 UTC
Are any of these servers available to try test kernels on, or have you noticed any pattern that leads to this panic?  If you can run test kernels, I can try to put something together to confirm or disprove the race I hypothesized in comment #3.  Also, if you have tcpdumps of the traffic into and out of these servers at the time of the panic, that might help correlate what's going on on these systems.

Thanks!

Comment 7 Joseph W. Breu 2008-08-18 21:16:23 UTC
Hi Neil,

Unfortunately all 6 of these boxes are in production.  We've installed netdump on all of them so we can get a good copy of the oops and a dump of the memory when it panics.

We don't have tcpdumps as these servers have a high amount of NFS I/O on them.

I've attached the latest oops.

Comment 8 Joseph W. Breu 2008-08-18 21:18:59 UTC
Created attachment 314497 [details]
full kernel panic output

Comment 9 Joseph W. Breu 2008-08-18 21:23:33 UTC
Forgot to mention - this panic is from 2.6.9-67.0.15.ELsmp

Comment 10 Joseph W. Breu 2008-08-19 13:47:38 UTC
Hi Neil,

I've contacted the customer and we can run a devel kernel on their box.  The box is running 2.6.9-67.ELsmp, 32-bit.

In the meantime I've disabled tcp_retrans_collapse through sysctl on 4 of their 6 boxes and changed the NFS mount from TCP to UDP on 2 of those 4 boxes.

Let me know where I can grab the kernel.

-breu
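For readers who want to apply the same workaround, here is a minimal sketch in C of reading and clearing the knob through procfs. It assumes the standard Linux location /proc/sys/net/ipv4/tcp_retrans_collapse. The helper names read_knob and write_knob are hypothetical and are not taken from this report; the comment above only says the setting was changed "through sysctl".

```c
#include <assert.h>
#include <stdio.h>

/* Assumed procfs location of the sysctl named in comment 10. */
#define TCP_RETRANS_COLLAPSE_PATH "/proc/sys/net/ipv4/tcp_retrans_collapse"

/* Hypothetical helper: read the integer stored at `path`; returns -1 on failure. */
int read_knob(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    int value = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: write `value` to `path`; returns 0 on success, -1 on failure.
 * Writing under /proc/sys requires root. */
int write_knob(const char *path, int value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%d\n", value) > 0;
    /* Buffered write errors only surface when the stream is closed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_knob(TCP_RETRANS_COLLAPSE_PATH, 0) as root corresponds to the workaround described above. The change does not survive a reboot unless "net.ipv4.tcp_retrans_collapse = 0" is also added to /etc/sysctl.conf.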

Comment 11 Neil Horman 2008-08-19 15:17:44 UTC
Thank you, I was just going to tell you to disable tcp_retrans_collapse.  That should skip the code that's oopsing.  I'll let you know when I have something put together.

Comment 12 Joseph W. Breu 2008-08-19 15:47:33 UTC
Hi Neil,

I found this patch while googling:

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/01/18 12:24:11-08:00 kuznet@xxxxxxxxxxxxx 
#   [TCP]: Do not try to collapse multi-packet SKBs.
#   
#   Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
# 
# net/ipv4/tcp_output.c
#   2005/01/18 12:23:36-08:00 kuznet@xxxxxxxxxxxxx +1 -0
#   [TCP]: Do not try to collapse multi-packet SKBs.
# 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c     2005-01-20 14:30:25 -08:00
+++ b/net/ipv4/tcp_output.c     2005-01-20 14:30:25 -08:00
@@ -1069,6 +1069,7 @@
           (skb->next != sk->sk_send_head) &&
           (skb->next != (struct sk_buff *)&sk->sk_write_queue) &&
           (skb_shinfo(skb)->nr_frags == 0 && skb_shinfo(skb->next)->nr_frags == 0) &&
+          (tcp_skb_pcount(skb) == 1 && tcp_skb_pcount(skb->next) == 1) &&
           (sysctl_tcp_retrans_collapse != 0))
                tcp_retrans_try_collapse(sk, skb, cur_mss);


It appears that this would resolve the issue that we are seeing.

In the interim we have turned off tcp_retrans_collapse.
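To make the effect of the added line concrete, the following is a simplified userspace model of the clauses visible in the hunk above. It is a sketch, not kernel code: struct model_skb, struct model_sock and may_collapse are hypothetical stand-ins, and the real condition in tcp_output.c begins before the lines shown in the hunk, so any earlier clauses are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff properties the check reads. */
struct model_skb {
    struct model_skb *next;
    unsigned int nr_frags;  /* models skb_shinfo(skb)->nr_frags */
    unsigned int pcount;    /* models tcp_skb_pcount(skb) */
};

/* Hypothetical stand-in for the socket state the check reads. */
struct model_sock {
    struct model_skb *send_head;  /* models sk->sk_send_head */
    struct model_skb *queue_end;  /* models the &sk->sk_write_queue sentinel */
    int retrans_collapse;         /* models sysctl_tcp_retrans_collapse */
};

/* Returns 1 when skb and skb->next pass the clauses shown in the patched hunk. */
int may_collapse(const struct model_sock *sk, const struct model_skb *skb)
{
    assert(sk != NULL && skb != NULL && skb->next != NULL);
    return skb->next != sk->send_head &&
           skb->next != sk->queue_end &&
           skb->nr_frags == 0 && skb->next->nr_frags == 0 &&
           /* Clause added by the patch: both buffers must hold exactly one packet. */
           skb->pcount == 1 && skb->next->pcount == 1 &&
           sk->retrans_collapse != 0;
}
```

In this model, a buffer whose pcount is greater than 1 is never passed on for collapsing, and setting retrans_collapse to 0 disables collapsing entirely, which is why turning the sysctl off also avoids the code path in question.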

Comment 13 Neil Horman 2008-08-19 16:44:59 UTC
Apparently we're debugging in parallel.  I was backporting that patch when you posted this :).  What arches do you need test kernels for?  I'll post them to my people page.

Comment 14 Joseph W. Breu 2008-08-19 17:17:23 UTC
Can you post it for 2.6.9-78.0.1.ELsmp?  I'll get the versions on all of the other machines up to 2.6.9-78.0.1 if the fix works.

Comment 15 Neil Horman 2008-08-20 16:38:25 UTC
x86_64 test kernel available here:
http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-78.4.EL.bz437674.x86_64.rpm

I'll build i686 shortly

Comment 16 Neil Horman 2008-08-20 19:43:47 UTC
The i686 kernel package is in the same place now too:
http://people.redhat.com/nhorman
Please test and let me know if the problem is fixed.  Thanks!

Comment 17 Joseph W. Breu 2008-08-20 20:45:01 UTC
I'll schedule with the customer to get this kernel up and test.  The workaround has fixed the issue in the interim and we haven't had a panic in 24 hours.

Comment 18 Joseph W. Breu 2008-08-21 14:09:45 UTC
Can you post the SMP i686 kernel as well?

Comment 19 Neil Horman 2008-08-21 17:37:42 UTC
I've replaced the link on my people page with the i686 SMP kernel.

Comment 20 Joseph W. Breu 2008-08-21 18:03:18 UTC
I am running the SMP kernel on one of the affected servers with tcp_retrans_collapse turned on.

Comment 21 Joseph W. Breu 2008-08-22 12:39:13 UTC
This issue has been resolved with the test kernels.  I think we can go ahead and close this ticket now.  Thanks for all the help!

Comment 22 RHEL Program Management 2008-11-26 18:29:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 26 Vivek Goyal 2008-12-10 22:11:38 UTC
Committed in 78.21.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 28 Chris Ward 2009-03-13 14:02:33 UTC
~~ Attention Partners!  ~~
RHEL 4.8Beta has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with your test result details. Include which arches were tested, the package version, and any applicable logs.

 - Red Hat QE Partner Management

Comment 30 Chris Ward 2009-03-25 12:27:36 UTC
Setting to verified based on Customer Verification results in comment #10. 

If there are any additional issues that need to be addressed, please clone this
bug and make a new request. If it is found that this bug has not really been
resolved, please reset to ASSIGNED state and describe the issues you are
encountering.

Comment 32 errata-xmlrpc 2009-05-18 19:22:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

