Bug 437674 - Kernel Panic in tcp_retransmit_skb
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: i386 Linux
Priority: high
Severity: urgent
Target Milestone: rc
Assigned To: Neil Horman
QA Contact: Martin Jenner
Keywords: OtherQA
Reported: 2008-03-15 23:14 EDT by Nox
Modified: 2010-10-22 19:16 EDT

Doc Type: Bug Fix
Last Closed: 2009-05-18 15:22:11 EDT

Attachments
Kernel panic screenshots (134.01 KB, image/png), 2008-03-15 23:14 EDT, Nox
New panic (87.69 KB, image/png), 2008-08-18 14:10 EDT, Joseph W. Breu
full kernel panic output (117.89 KB, text/plain), 2008-08-18 17:18 EDT, Joseph W. Breu

Description Nox 2008-03-15 23:14:33 EDT
Description of problem:
Servers are going down because of kernel panics.
Both servers are in a cluster and they use the same NFS share from a local machine.

Version-Release number of selected component (if applicable):
Kernel version: 2.6.9-67.0.4.ELsmp
Server: DELL PE1855

How reproducible:
We cannot reproduce this because we do not know when, or under what conditions, it occurs.

Additional info:
Screenshots of the kernel panics on both machines are attached.
Comment 1 Nox 2008-03-15 23:14:33 EDT
Created attachment 298178
Kernel panic screenshots
Comment 3 Neil Horman 2008-03-19 13:51:43 EDT
Sure.  I thought of bz 319181 as well when I saw the subject header here. My
thought (as in the other bz) is that there is a race between a fast retransmit
triggered by a duplicate ACK and the popping of the write timer, which leads to
corruption of the write queue.  I've not been able to find it yet, but if I do
I'll post a patch here for you to try.  Have you only seen it once, or can you
reproduce it on occasion? (I know what you said above, but random reproduction
would be better than a single failure).
Comment 4 Joseph W. Breu 2008-08-18 14:10:47 EDT
Created attachment 314488
New panic
Comment 5 Joseph W. Breu 2008-08-18 14:17:35 EDT
Hello,

We have 6 servers that are seeing this bug.  We cannot duplicate it, but we have seen it happen quite often over the last 3 days on all 6 machines.

The 6 machines are:

HP DL385        2.6.9-55.0.2.ELsmp
HP DL385        2.6.9-67.0.15.ELsmp
HP DL385        2.6.9-55.0.2.ELsmp
HP DL385        2.6.9-67.ELsmp
HP DL385        2.6.9-67.ELsmp
Dell 2950 MK3   2.6.9-78.ELsmp

All machines have the Broadcom Corporation NetXtreme II BCM5708 NIC.

I've attached a PNG of the panic.
Comment 6 Neil Horman 2008-08-18 16:06:26 EDT
Are any of these servers available to try test kernels on, or have you noticed any pattern that leads to this panic?  If you can run test kernels I can try to put something together to confirm or disprove the race I hypothesized in comment #3.  Also, if you have tcpdumps of the traffic into and out of these servers at the time of the panic, that might help correlate what's going on on these systems.

Thanks!
Comment 7 Joseph W. Breu 2008-08-18 17:16:23 EDT
Hi Neil,

Unfortunately all 6 of these boxes are in production.  We've installed netdump on all of them so we can get a good copy of the oops and a dump of the memory when it panics.

We don't have tcpdumps as these servers have a high amount of NFS I/O on them.

I've attached the latest oops.
Comment 8 Joseph W. Breu 2008-08-18 17:18:59 EDT
Created attachment 314497
full kernel panic output
Comment 9 Joseph W. Breu 2008-08-18 17:23:33 EDT
Forgot to mention - this panic is from 2.6.9-67.0.15.ELsmp
Comment 10 Joseph W. Breu 2008-08-19 09:47:38 EDT
Hi Neil,

I've contacted the customer and we can run a devel kernel on their box.  The box is running 2.6.9-67.ELsmp 32bit.

In the meantime I've disabled tcp_retrans_collapse through sysctl on 4 of their 6 boxes and changed the NFS mount from TCP to UDP on 2 of those 4 boxes.

Let me know where I can grab the kernel.

-breu
Comment 11 Neil Horman 2008-08-19 11:17:44 EDT
Thank you, I was just going to tell you to disable tcp_retrans_collapse.  That should skip the code that's oopsing.  I'll let you know when I have something put together.
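
For reference, the workaround discussed above amounts to writing 0 to the net.ipv4.tcp_retrans_collapse sysctl. The following is only a minimal sketch of doing that from C through procfs; the helper names set_flag_file and get_flag_file are hypothetical, and the path is the standard procfs location of that sysctl, assumed rather than taken from this report.

```c
#include <assert.h>
#include <stdio.h>

/* Standard procfs location of the net.ipv4.tcp_retrans_collapse sysctl
 * (an assumption; the report only says "through sysctl"). */
#define TCP_RETRANS_COLLAPSE_PATH "/proc/sys/net/ipv4/tcp_retrans_collapse"

/*
 * Write a 0/1 flag to a sysctl-style file.
 * Returns 0 on success, -1 on failure (for example, not root or no such path).
 * Hypothetical helper, not part of any library.
 */
int set_flag_file(const char *path, int enabled)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%d\n", enabled ? 1 : 0) > 0;
    /* The buffered write reaches the kernel at close time, so check it too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the integer stored in the file; returns -1 if it cannot be read. */
int get_flag_file(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling set_flag_file(TCP_RETRANS_COLLAPSE_PATH, 0) as root would turn collapsing off on the running system only; the setting does not survive a reboot unless it is also recorded in /etc/sysctl.conf.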
Comment 12 Joseph W. Breu 2008-08-19 11:47:33 EDT
Hi Neil,

I found this patch while googling:

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/01/18 12:24:11-08:00 kuznet@xxxxxxxxxxxxx 
#   [TCP]: Do not try to collapse multi-packet SKBs.
#   
#   Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
# 
# net/ipv4/tcp_output.c
#   2005/01/18 12:23:36-08:00 kuznet@xxxxxxxxxxxxx +1 -0
#   [TCP]: Do not try to collapse multi-packet SKBs.
# 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c     2005-01-20 14:30:25 -08:00
+++ b/net/ipv4/tcp_output.c     2005-01-20 14:30:25 -08:00
@@ -1069,6 +1069,7 @@
           (skb->next != sk->sk_send_head) &&
           (skb->next != (struct sk_buff *)&sk->sk_write_queue) &&
           (skb_shinfo(skb)->nr_frags == 0 && skb_shinfo(skb->next)->nr_frags == 0) &&
+          (tcp_skb_pcount(skb) == 1 && tcp_skb_pcount(skb->next) == 1) &&
           (sysctl_tcp_retrans_collapse != 0))
                tcp_retrans_try_collapse(sk, skb, cur_mss);


It appears that this would resolve the issue that we are seeing.

In the interim we have turned off tcp_retrans_collapse.
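
To make the effect of the added line easier to see, here is a small user-space model of the conditions visible in the quoted hunk. It is a sketch only: struct model_skb and may_collapse are hypothetical stand-ins, not the kernel's struct sk_buff or any kernel function, and any condition that precedes the quoted lines in the real source is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the fields the check reads.
 * This is NOT the kernel's struct sk_buff. */
struct model_skb {
    unsigned int nr_frags;   /* number of paged fragments */
    unsigned int pcount;     /* number of packets this buffer represents */
    struct model_skb *next;  /* next buffer on the write queue */
};

/*
 * Mirror of the conditions shown in the patch hunk:
 *  - the next buffer is neither the send head nor the queue sentinel,
 *  - neither buffer carries paged fragments,
 *  - (added by the patch) both buffers hold exactly one packet,
 *  - the tcp_retrans_collapse sysctl is enabled.
 */
bool may_collapse(const struct model_skb *skb,
                  const struct model_skb *send_head,
                  const struct model_skb *queue_end,
                  bool retrans_collapse_enabled)
{
    assert(skb != NULL && skb->next != NULL);

    return skb->next != send_head &&
           skb->next != queue_end &&
           skb->nr_frags == 0 && skb->next->nr_frags == 0 &&
           skb->pcount == 1 && skb->next->pcount == 1 &&
           retrans_collapse_enabled;
}
```

In this model, a buffer whose pcount is greater than 1 is never handed to the collapse step, which is the behaviour the patch title describes. Passing false for retrans_collapse_enabled corresponds to the sysctl workaround and skips the collapse step regardless of the other fields.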
Comment 13 Neil Horman 2008-08-19 12:44:59 EDT
Apparently we're debugging in parallel.  I was backporting that patch when you posted this :).  What arches do you need test kernels for?  I'll post them to my people page.
Comment 14 Joseph W. Breu 2008-08-19 13:17:23 EDT
Can you post it for 2.6.9-78.0.1.ELsmp?  I'll get the versions on all of the other machines up to 2.6.9-78.0.1 if the fix works.
Comment 15 Neil Horman 2008-08-20 12:38:25 EDT
x86_64 test kernel available here:
http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-78.4.EL.bz437674.x86_64.rpm

I'll build i686 shortly
Comment 16 Neil Horman 2008-08-20 15:43:47 EDT
The i686 kernel package is in the same place now too:
http://people.redhat.com/nhorman
Please test and let me know if the problem is fixed.  Thanks!
Comment 17 Joseph W. Breu 2008-08-20 16:45:01 EDT
I'll schedule with the customer to get this kernel up and test.  The workaround has fixed the issue in the interim and we haven't had a panic in 24 hours.
Comment 18 Joseph W. Breu 2008-08-21 10:09:45 EDT
Can you post the SMP i686 kernel as well?
Comment 19 Neil Horman 2008-08-21 13:37:42 EDT
I've replaced the link on my people page with the SMP version of the i686 kernel.
Comment 20 Joseph W. Breu 2008-08-21 14:03:18 EDT
I am running the SMP kernel on one of the affected servers with tcp_retrans_collapse turned on.
Comment 21 Joseph W. Breu 2008-08-22 08:39:13 EDT
This issue has been resolved with the test kernels.  I think we can go ahead and close this ticket now.  Thanks for all the help!
Comment 22 RHEL Product and Program Management 2008-11-26 13:29:32 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 26 Vivek Goyal 2008-12-10 17:11:38 EST
Committed in 78.21.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 28 Chris Ward 2009-03-13 10:02:33 EDT
~~ Attention Partners!  ~~
RHEL 4.8Beta has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with the details of your test results. Include which arches were tested, the package version, and any applicable logs.

 - Red Hat QE Partner Management
Comment 30 Chris Ward 2009-03-25 08:27:36 EDT
Setting to verified based on Customer Verification results in comment #10. 

If there are any additional issues that need to be addressed, please clone this
bug and make a new request. If it is found that this bug has not really been
resolved, please reset to ASSIGNED state and describe the issues you are
encountering.
Comment 32 errata-xmlrpc 2009-05-18 15:22:11 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html