Bug 582722 - TCP socket premature timeout with FRTO and TSO
Summary: TCP socket premature timeout with FRTO and TSO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jiri Pirko
QA Contact: Hangbin Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-04-15 16:04 UTC by Matt Cross
Modified: 2015-05-05 01:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:28:17 UTC
Target Upstream Version:
Embargoed:


Attachments
Backported patch from 2.6.32 that fixes the TCP premature timeout issue (2.76 KB, patch)
2010-04-15 16:04 UTC, Matt Cross


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Matt Cross 2010-04-15 16:04:03 UTC
Created attachment 406848
Backported patch from 2.6.32 that fixes the TCP premature timeout issue

Description of problem:

A TCP session streaming data out of a machine with both FRTO and TSO enabled can time out prematurely under some conditions, and it does so without sending any FIN or RST packet.  The socket disappears on the sender side, but the receiver stays in the ESTABLISHED state indefinitely (or until it tries to send something on the connection and gets an error).

Version-Release number of selected component (if applicable):

This has been observed in the RHEL 5.1 kernel (2.6.18-53) and the RHEL 5.4 kernel (2.6.18-164).

How reproducible:

Very reproducible

Steps to Reproduce:
1. Get 2 machines.  Machine A is under test; it must have an Ethernet device that supports TSO.  Machine B will receive data from machine A.
2. Enable FRTO on machine A by doing 'echo 1 > /proc/sys/net/ipv4/tcp_frto'
3. Enable TSO on the outbound interface of machine A by doing 'ethtool -K <devname> tso on'.  Verify it is on with 'ethtool -k <devname>'.
4. On machine B, initiate a transfer of a large file from machine A using scp or ftp (I tested with scp, though I think any one-way TCP transfer will work)
5. Induce a small packet loss.  For example, do this on machine B: 'iptables -A INPUT -s <machine A's IP> -j DROP; iptables -D INPUT -s <machine A's IP> -j DROP'
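
For convenience, the commands from the steps above can be collected into a small dry-run helper. This is only a sketch: the function name build_repro_plan, the interface name eth0 and the address 192.0.2.10 are placeholders that are not part of the report, and the script only prints the commands rather than running them. The scp/ftp transfer is not included and still has to be started by hand on machine B before the iptables pair is run.

```python
import shlex


def build_repro_plan(devname, sender_ip):
    """Return the commands from the reproduction steps, grouped by machine.

    Nothing is executed here; the caller decides how (and whether) to run them.
    """
    drop_rule = ["INPUT", "-s", sender_ip, "-j", "DROP"]
    return {
        # Machine A: the sender under test.
        "A": [
            ["sh", "-c", "echo 1 > /proc/sys/net/ipv4/tcp_frto"],  # enable FRTO
            ["ethtool", "-K", devname, "tso", "on"],               # enable TSO
            ["ethtool", "-k", devname],                            # verify offload settings
        ],
        # Machine B: the receiver. Add and immediately remove a DROP rule
        # to induce a small packet loss.
        "B": [
            ["iptables", "-A", *drop_rule],
            ["iptables", "-D", *drop_rule],
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute the real interface and machine A's IP.
    for machine, commands in build_repro_plan("eth0", "192.0.2.10").items():
        for command in commands:
            print(f"[{machine}] {shlex.join(command)}")
```

The commands on both machines need root, so the plan is printed for review instead of being passed to subprocess directly.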
  
Actual results:

After 15 retransmissions, the transmitting machine abruptly closes the socket (i.e. sets the state of the TCP socket internally to CLOSED) without sending any packet with a FIN or RST to the other side.  From a user point of view, the socket will be gone on machine A, but machine B will still have a socket in the 'ESTABLISHED' state.

Expected results:

TCP should recover from a small packet loss and continue the transfer with minimal interruption.

Additional info:

I don't understand the entire mechanism, but by adding some debug messages I have established that the test at the end of tcp_write_timeout() in net/ipv4/tcp_timer.c becomes true, so it calls tcp_write_err(), which forcibly shuts down the socket.  Turning off either FRTO or TSO eliminates the problem.

I backported a change from 2.6.32 that fixes the issue in my testing.  This patch changes the test in tcp_write_timeout() and a couple other places to be based on how long this socket has been retransmitting instead of the number of retransmits.  The original code is at http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6fa12c85031485dff38ce550c24f10da23b0adaa (though I also grabbed some updates to the retransmits_timed_out() function).  A patch is attached.
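
To illustrate the difference between the two kinds of test, here is a rough userspace model in Python. It is not the attached patch and not kernel code. The constants (200 ms minimum RTO, 120 s maximum RTO) and the backoff formula are assumptions modeled on the upstream retransmits_timed_out() helper as I understand it, and the Python function names are made up for this sketch. The scenario in the usage note is hypothetical as well; it does not claim to explain how FRTO and TSO drive the retransmit counter up.

```python
# Assumed values, in milliseconds, mirroring the usual kernel defaults.
TCP_RTO_MIN_MS = 200
TCP_RTO_MAX_MS = 120_000


def timeout_budget_ms(boundary):
    """Time that `boundary` exponentially backed-off retransmits would span."""
    # Integer log2 of the max/min RTO ratio: the number of doublings
    # before the RTO is clamped at the maximum.
    linear_backoff_thresh = (TCP_RTO_MAX_MS // TCP_RTO_MIN_MS).bit_length() - 1
    if boundary <= linear_backoff_thresh:
        return ((2 << boundary) - 1) * TCP_RTO_MIN_MS
    # Past the threshold, each further retransmit adds one maximum RTO.
    return (((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN_MS
            + (boundary - linear_backoff_thresh) * TCP_RTO_MAX_MS)


def count_based_timed_out(retransmits, boundary):
    """Old-style test: give up once the retransmit counter reaches the limit."""
    return retransmits >= boundary


def time_based_timed_out(now_ms, first_retrans_ms, retransmits, boundary):
    """New-style test: give up only after retransmitting for long enough."""
    if retransmits == 0:
        return False
    return (now_ms - first_retrans_ms) >= timeout_budget_ms(boundary)


if __name__ == "__main__":
    print(timeout_budget_ms(15))                     # budget for a limit of 15
    print(count_based_timed_out(15, 15))             # counter alone hits the limit
    print(time_based_timed_out(5_000, 0, 15, 15))    # only 5 s have elapsed
```

With a limit of 15, the model gives a budget of 924600 ms. If the counter somehow reached 15 only 5 seconds after the first retransmit, the count-based test would already give up, while the time-based test would keep the connection alive.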

Comment 2 RHEL Program Management 2010-05-20 12:42:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Jarod Wilson 2010-05-25 21:12:34 UTC
in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 5 Matt Cross 2010-06-03 13:19:53 UTC
I was able to reproduce the problem in (CentOS) kernel 2.6.18-164.6.1, and can no longer reproduce it in kernel-2.6.18-200.el5.  Looks good to me.

Comment 7 Hangbin Liu 2010-12-01 05:49:17 UTC
Verified.

Followed Matt's steps.  On kernel 2.6.18-194.el5, after inducing a small packet loss:

HOST A:
[root@hp-xw6400-02 ~]# netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0 ::ffff:10.16.42.210:22      ::ffff:10.66.65.15:43507    ESTABLISHED 
tcp        0    384 ::ffff:10.16.42.210:22      ::ffff:10.66.65.15:55558    ESTABLISHED 
[root@hp-xw6400-02 x86_64]# uname -a
Linux hp-xw6400-02.lab.bos.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

HOST B:
[root@ibm-hs22-02 ~]# netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0 ::ffff:10.16.45.160:22      ::ffff:10.66.65.15:47555    ESTABLISHED 
tcp        0      0 ::ffff:10.16.45.160:22      ::ffff:10.66.65.15:47554    ESTABLISHED 
tcp        0      0 ::ffff:10.16.45.160:22      ::ffff:10.16.42.210:51840   ESTABLISHED 
[root@ibm-hs22-02 233.el5]# uname -a
Linux ibm-hs22-02.lab.bos.redhat.com 2.6.18-194.el5PAE #1 SMP Tue Mar 16 22:00:21 EDT 2010 i686 i686 i386 GNU/Linux

On kernel 2.6.18-233.el5, after inducing a small packet loss, everything worked as expected.

Comment 9 errata-xmlrpc 2011-01-13 21:28:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

