Red Hat Bugzilla – Bug 437674
Kernel Panic in tcp_retransmit_skb
Last modified: 2010-10-22 19:16:53 EDT
Description of problem: Servers are going down because of kernel panics. Both servers are in a cluster and use the same NFS share from a local machine.
Version-Release number of selected component (if applicable): Kernel version: 2.6.9-67.0.4.ELsmp, Server: Dell PE1855
How reproducible: Cannot reproduce, because we do not know when or under what conditions it occurs.
Additional info: Screenshots of the kernel panics on both machines are attached.
Created attachment 298178 [details] Kernel panic screenshots
Sure. I thought of bz 319181 as well when I saw the subject header here. My thought (as in the other bz) is that there is a race between a fast retransmit triggered by a duplicate ack and the popping of the write timer, which leads to corruption of the write queue. I've not been able to find it yet, but if I do I'll post a patch here for you to try. Have you only seen it once, or can you reproduce it on occasion? (I know what you said above, but random reproduction would be better than a single failure.)
Created attachment 314488 [details] New panic
Hello, We have 6 servers that are seeing this bug. We cannot duplicate it, but we have seen it happen quite often over the last 3 days on all 6 machines. The 6 machines are:
HP DL385 2.6.9-55.0.2.ELsmp
HP DL385 2.6.9-67.0.15.ELsmp
HP DL385 2.6.9-55.0.2.ELsmp
HP DL385 2.6.9-67.ELsmp
HP DL385 2.6.9-67.ELsmp
Dell 2950 MK3 2.6.9-78.ELsmp
All machines have the Broadcom Corporation NetXtreme II BCM5708 NIC. I've attached a PNG of the panic.
Are any of these servers available to try test kernels on, or have you noticed any pattern that leads to this panic? If you can run test kernels I can try to put something together to confirm or disprove the race I hypothesized in comment #3. Also, if you have tcpdumps of the traffic into and out of these servers at the time of the panic, that might help correlate what's going on on these systems. Thanks!
Hi Neil, Unfortunately all 6 of these boxes are in production. We've installed netdump on all of them so we can get a good copy of the oops and a dump of the memory when it panics. We don't have tcpdumps as these servers have a high amount of NFS I/O on them. I've attached the latest oops.
Created attachment 314497 [details] full kernel panic output
Forgot to mention - this panic is from 2.6.9-67.0.15.ELsmp
Hi Neil, I've contacted the customer and we can run a devel kernel on their box. The box is running 2.6.9-67.ELsmp 32bit. In the meantime I've disabled tcp_retrans_collapse through sysctl on 4 of their 6 boxes and changed the NFS mount from TCP to UDP on 2 of those 4 boxes. Let me know where I can grab the kernel. -breu
Thank you, I was just going to tell you to disable tcp_retrans_collapse. That should skip the code that's oopsing. I'll let you know when I have something put together.
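For reference, the workaround discussed above corresponds to running `sysctl -w net.ipv4.tcp_retrans_collapse=0` as root. The same setting can be changed programmatically by writing to the procfs file behind the sysctl. The following is only a minimal sketch: the helper names `write_sysctl_int` and `read_sysctl_int` are hypothetical and were not part of this bug report; only the procfs path is the real kernel interface.

```c
#include <assert.h>
#include <stdio.h>

/* Real procfs file backing the net.ipv4.tcp_retrans_collapse sysctl. */
#define TCP_RETRANS_COLLAPSE_PATH "/proc/sys/net/ipv4/tcp_retrans_collapse"

/* Hypothetical helper: write an integer to a sysctl file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int write_sysctl_int(const char *path, int value)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)     /* write errors may only surface at close */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the integer back to confirm the setting.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_sysctl_int(const char *path, int *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = fscanf(f, "%d", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling `write_sysctl_int(TCP_RETRANS_COLLAPSE_PATH, 0)` as root would disable the collapse path on a running system. The change does not survive a reboot unless `net.ipv4.tcp_retrans_collapse = 0` is also added to /etc/sysctl.conf.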
Hi Neil, I found this patch while googling:

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/01/18 12:24:11-08:00 kuznet@xxxxxxxxxxxxx
#   [TCP]: Do not try to collapse multi-packet SKBs.
#
#   Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
#
# net/ipv4/tcp_output.c
#   2005/01/18 12:23:36-08:00 kuznet@xxxxxxxxxxxxx +1 -0
#   [TCP]: Do not try to collapse multi-packet SKBs.
#
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c 2005-01-20 14:30:25 -08:00
+++ b/net/ipv4/tcp_output.c 2005-01-20 14:30:25 -08:00
@@ -1069,6 +1069,7 @@
     (skb->next != sk->sk_send_head) &&
     (skb->next != (struct sk_buff *)&sk->sk_write_queue) &&
     (skb_shinfo(skb)->nr_frags == 0 && skb_shinfo(skb->next)->nr_frags == 0) &&
+    (tcp_skb_pcount(skb) == 1 && tcp_skb_pcount(skb->next) == 1) &&
     (sysctl_tcp_retrans_collapse != 0))
         tcp_retrans_try_collapse(sk, skb, cur_mss);

It appears that this would resolve the issue that we are seeing. In the interim we have turned off tcp_retrans_collapse.
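To make the effect of the added condition easier to see, here is a rough user-space model of the guard in the hunk above. It is a sketch under stated assumptions, not kernel code: `struct mock_skb` and `may_collapse` are hypothetical names, and the struct keeps only the fields the condition reads (`pcount` stands in for `tcp_skb_pcount()`, `nr_frags` for `skb_shinfo(skb)->nr_frags`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields the guard inspects. */
struct mock_skb {
    struct mock_skb *next;   /* next segment on the write queue */
    unsigned int nr_frags;   /* paged fragments attached to the skb */
    unsigned int pcount;     /* number of packets this skb covers */
};

/* Returns true when skb and skb->next may be merged before a retransmit.
 * send_head is the first unsent segment (may be NULL), queue_end is the
 * sentinel that terminates the write queue. */
static bool may_collapse(const struct mock_skb *skb,
                         const struct mock_skb *send_head,
                         const struct mock_skb *queue_end,
                         int retrans_collapse_enabled)
{
    assert(skb != NULL && skb->next != NULL);
    const struct mock_skb *next = skb->next;

    return next != send_head &&                        /* next was already sent */
           next != queue_end &&                        /* next is a real segment */
           skb->nr_frags == 0 && next->nr_frags == 0 && /* both are linear */
           skb->pcount == 1 && next->pcount == 1 &&     /* check added by the patch */
           retrans_collapse_enabled != 0;              /* sysctl not disabled */
}
```

In this model, two adjacent single-packet segments are still eligible for collapsing, while a segment that covers more than one packet is skipped. Setting the last argument to 0 mirrors the tcp_retrans_collapse sysctl workaround, which bypasses the collapse entirely.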
Apparently we're debugging in parallel. I was backporting that patch when you posted this :). What arches do you need test kernels for? I'll post them to my people page.
Can you post it for 2.6.9-78.0.1.ELsmp? I'll get the versions on all of the other machines up to 2.6.9-78.0.1 if the fix works.
x86_64 test kernel available here: http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-78.4.EL.bz437674.x86_64.rpm I'll build i686 shortly
The i686 kernel package is in the same place now too: http://people.redhat.com/nhorman Please test and let me know if the problem is fixed. Thanks!
I'll schedule with the customer to get this kernel up and test. The workaround has fixed the issue in the interim and we haven't had a panic in 24 hours.
Can you post the SMP i686 kernel as well?
I've replaced the link on my people page with the smp kernel i686 version
I am running the SMP kernel on one of the affected servers with tcp_retrans_collapse turned on.
This issue has been resolved with the test kernels. I think we can go ahead and close this ticket now. Thanks for all the help!
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 78.21.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
~~ Attention Partners! ~~ RHEL 4.8Beta has been released on partners.redhat.com. There should be a fix present, which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience. If you encounter any issues, please set the bug back to the ASSIGNED state and describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager. If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with the details of your test results. Include which arches were tested, the package version and any applicable logs. - Red Hat QE Partner Management
Setting to verified based on Customer Verification results in comment #10. If there are any additional issues that need to be addressed, please clone this bug and make a new request. If it is found that this bug has not really been resolved, please reset to ASSIGNED state and describe the issues you are encountering.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html