Bug 582722

Summary: TCP socket premature timeout with FRTO and TSO
Product: Red Hat Enterprise Linux 5
Reporter: Matt Cross <matt.cross>
Component: kernel
Assignee: Jiri Pirko <jpirko>
Status: CLOSED ERRATA
QA Contact: Hangbin Liu <haliu>
Severity: medium
Docs Contact:
Priority: low
Version: 5.4
CC: anton, caiqian, haliu, kzhang, rkhan
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-01-13 16:28:17 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachment description: Backported patch from 2.6.32 that fixes the TCP premature timeout issue
Attachment flags: none

Description Matt Cross 2010-04-15 12:04:03 EDT
Created attachment 406848
Backported patch from 2.6.32 that fixes the TCP premature timeout issue

Description of problem:

A TCP session streaming data out of a machine with both FRTO and TSO enabled can time out prematurely under some conditions, and it does so without sending any packet with FIN or RST set.  The socket disappears on the sender side, but the receiver remains in the ESTABLISHED state indefinitely (or until it tries to send something on the connection and gets an error).

Version-Release number of selected component (if applicable):

This has been observed in the 5.1 kernel (2.6.18-53) and 5.4 kernel (2.6.18-164)

How reproducible:

Very reproducible

Steps to Reproduce:
1. Get 2 machines: machine A is under test and must have an Ethernet device that supports TSO.  Machine B will receive data from machine A.
2. Enable FRTO on machine A by doing 'echo 1 > /proc/sys/net/ipv4/tcp_frto'
3. Enable TSO on the outbound interface of machine A by doing 'ethtool -K <devname> tso on'.  Verify it is on with 'ethtool -k <devname>'.
4. On machine B, initiate a transfer of a large file from machine A using scp or ftp (I tested with scp, though I think any one-way TCP transfer will work).
5. Induce a small packet loss.  For example, do this on machine B: 'iptables -A INPUT -s <machine A's IP> -j DROP; iptables -D INPUT -s <machine A's IP> -j DROP'
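
Before inducing the packet loss, it can help to confirm that machine A really has both features enabled. The following is a minimal Python sketch of such a check. The helper names are hypothetical, and the ethtool line formats it matches ("tcp segmentation offload: on" on older ethtool, "tcp-segmentation-offload: on" on newer ones) are assumptions that may not hold for every ethtool version.

```python
import re


def frto_enabled(proc_value: str) -> bool:
    """Interpret the contents of /proc/sys/net/ipv4/tcp_frto.

    Assumption: any non-zero value means F-RTO is enabled.
    """
    return int(proc_value.strip()) != 0


def tso_enabled(ethtool_output: str) -> bool:
    """Find the TSO line in the output of 'ethtool -k <devname>'.

    Assumption: the line reads either "tcp segmentation offload: on"
    (older ethtool) or "tcp-segmentation-offload: on" (newer ethtool).
    """
    match = re.search(
        r"^\s*tcp[ -]segmentation[ -]offload:\s*(on|off)\b",
        ethtool_output,
        re.MULTILINE | re.IGNORECASE,
    )
    return match is not None and match.group(1).lower() == "on"
```

To use it, pass in the text read from /proc/sys/net/ipv4/tcp_frto and the captured output of 'ethtool -k <devname>'. The bug as reported needs both checks to return True on machine A.
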
Actual results:

After 15 retransmissions, the transmitting machine abruptly closes the socket (i.e., sets the state of the TCP socket internally to CLOSED) without sending any packet with a FIN or RST to the other side.  From a user's point of view, the socket will be gone on machine A, but machine B will still have a socket in the 'ESTABLISHED' state.

Expected results:

TCP should recover from a small packet loss and continue the transfer with minimal interruption.

Additional info:

I don't understand the entire mechanism, but by adding some debug messages I have established that the test at the end of tcp_write_timeout() in net/ipv4/tcp_timer.c becomes true, and it calls tcp_write_err(), which forcibly shuts down the socket.  Turning off either FRTO or TSO eliminates the problem.

I backported a change from 2.6.32 that fixes the issue in my testing.  This patch changes the test in tcp_write_timeout() and a couple of other places to be based on how long the socket has been retransmitting rather than on the number of retransmits.  The original code is at http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6fa12c85031485dff38ce550c24f10da23b0adaa (though I also grabbed some updates to the retransmits_timed_out() function).  A patch is attached.
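
The difference between the two kinds of test can be illustrated with a rough Python model. This is a sketch, not the kernel code or the attached patch. The function names are illustrative, the time-based check loosely follows the shape of the upstream retransmits_timed_out() helper, and the 200 ms and 120 s RTO bounds are assumed defaults rather than values taken from this report. It does not explain how FRTO and TSO inflate the retransmit counter. It only shows why a time-based test is insensitive to that.

```python
TCP_RTO_MIN_MS = 200       # assumed default minimum RTO
TCP_RTO_MAX_MS = 120_000   # assumed default maximum RTO


def count_based_timed_out(retransmits: int, retry_until: int) -> bool:
    """Old-style test: give up once the retransmit counter reaches the limit."""
    return retransmits >= retry_until


def timeout_for_boundary_ms(boundary: int) -> int:
    """Nominal time taken by `boundary` exponentially backed-off retransmits."""
    # Integer log2 of the RTO range: the backoff doubles up to this many
    # times before it is capped at TCP_RTO_MAX_MS.
    linear_backoff_thresh = (TCP_RTO_MAX_MS // TCP_RTO_MIN_MS).bit_length() - 1
    if boundary <= linear_backoff_thresh:
        return ((2 << boundary) - 1) * TCP_RTO_MIN_MS
    return (((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN_MS
            + (boundary - linear_backoff_thresh) * TCP_RTO_MAX_MS)


def time_based_timed_out(retransmits: int, now_ms: int,
                         first_retrans_ms: int, boundary: int) -> bool:
    """New-style test: give up only after retransmitting for long enough."""
    if retransmits == 0:
        return False
    return (now_ms - first_retrans_ms) >= timeout_for_boundary_ms(boundary)
```

Under these assumptions a limit of 15 corresponds to 924.6 seconds of retransmitting. A socket whose counter has reached 15 after only 5 seconds fails the count-based test (count_based_timed_out(15, 15) is True) but survives the time-based one (time_based_timed_out(15, 5000, 0, 15) is False).
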
Comment 2 RHEL Product and Program Management 2010-05-20 08:42:06 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update release.
Comment 4 Jarod Wilson 2010-05-25 17:12:34 EDT
in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Comment 5 Matt Cross 2010-06-03 09:19:53 EDT
I was able to reproduce the problem in (CentOS) kernel 2.6.18-164.6.1, and can no longer reproduce it in kernel-2.6.18-200.el5.  Looks good to me.
Comment 7 Hangbin Liu 2010-12-01 00:49:17 EST

Followed Matt's steps and induced a small packet loss on kernel 2.6.18-194.el5:

[root@hp-xw6400-02 ~]# netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0 ::ffff:      ::ffff:    ESTABLISHED 
tcp        0    384 ::ffff:      ::ffff:    ESTABLISHED 
[root@hp-xw6400-02 x86_64]# uname -a
Linux hp-xw6400-02.lab.bos.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@ibm-hs22-02 ~]# netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0 ::ffff:      ::ffff:    ESTABLISHED 
tcp        0      0 ::ffff:      ::ffff:    ESTABLISHED 
tcp        0      0 ::ffff:      ::ffff:   ESTABLISHED 
[root@ibm-hs22-02 233.el5]# uname -a
Linux ibm-hs22-02.lab.bos.redhat.com 2.6.18-194.el5PAE #1 SMP Tue Mar 16 22:00:21 EDT 2010 i686 i686 i386 GNU/Linux

On kernel 2.6.18-233.el5, after inducing a small packet loss, everything worked well.
Comment 9 errata-xmlrpc 2011-01-13 16:28:17 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.