Bug 518651

Summary: It sometimes takes 3 seconds to connect to a TCP server
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Severity: urgent
Priority: urgent
Status: CLOSED NOTABUG
Reporter: Marc Milgram <mmilgram>
Assignee: Thomas Graf <tgraf>
QA Contact: Network QE <network-qe>
CC: dhoward, dwu, fskola, greg.walters, herrold, jwest, mgoodwin, pep, plyons, qcai, rkhan, spoyarek, tao
Target Milestone: rc
Keywords: Reopened
Doc Type: Bug Fix
Clones: 619001
Bug Blocks: 525215, 533192, 619001
Last Closed: 2011-03-09 08:46:09 UTC
Attachments:
  Test Program
  systemtap script
  systemtap log

Description Marc Milgram 2009-08-21 13:41:40 UTC
Description of problem:
If I continuously attempt to create and drop connections to a TCP server, occasionally it takes about 3 seconds to connect to the server.  This is even true over the lo interface.  Usually it takes less than 2 msec to connect.

Version-Release number of selected component (if applicable):
Reported against kernel 2.6.18-128.4.1.el5PAE (32-bit),
but also reproduced on 2.6.18-128.2.1.el5 (x86_64).

How reproducible:
Very

Steps to Reproduce:
1. Compile attached file: SocketTestDelay.c
   cc SocketTestDelay.c -o SocketTestDelay
2. Run reproducer:
   ./SocketTestDelay 127.0.0.1 22 100000 0 > output.log
3. See if it ever took a long time to connect:
   awk -F: '{if ($2>=2990) print $2; }' output.log | wc -l
  
Actual results:
At least one connection took roughly 3 seconds (>= 2990 ms) to complete.

Expected results:
All connections within a few milliseconds.

Additional info:
The last parameter of the SocketTestDelay program is a delay between connections.  If it is set to 4, there is no problem.

Comment 1 Marc Milgram 2009-08-21 13:42:24 UTC
Created attachment 358242 [details]
Test Program

Comment 2 Marc Milgram 2009-08-21 14:00:26 UTC
I ran tests on my RHEL 5.3 system at work.  I noticed that many sockets end up in the TIME_WAIT state for a while.

When I run the Xen Dom0 kernel, socket connections are always fast (under 10 msec per connection).

Under a bare-metal kernel, I occasionally have a 3 second connection delay.

So, a workaround may be to use the Xen kernel.

Comment 3 Marc Milgram 2009-08-21 14:00:52 UTC
This was reported on linux-net more than a year ago. The root cause
is a race condition deep in the inet code somewhere. The workaround
is to set the SO_LINGER option at the socket layer, to avoid "lingering"
when the socket is closed, i.e. to consistently win the race (or lose
it :) You can also set TCP_NODELAY at the IPPROTO_TCP level to speed
things up, but that's not strictly necessary to work around the
random 3 second delay problem.

I don't know of any proc/net or sysfs tweaks that will fix this - the
SO_LINGER workaround requires changes to the application.

Comment 4 Marc Milgram 2009-08-21 14:01:17 UTC
I saw a mention on a web page that this was fixed in 2.6.24.5.  I built 2.6.24.5 and found that it is still broken.

I built and tried the latest upstream kernel (2.6.30.5).  It is still broken.

Comment 5 Marc Milgram 2009-08-21 15:16:16 UTC
Taking a look at tcpdumps, occasionally the server doesn't respond to a SYN.  There is a 3 second delay, then the client retransmits the SYN, to which the server replies with a SYN-ACK (as expected).

So the real question is: why is the initial SYN being dropped?

Comment 9 Thomas Graf 2009-11-30 21:03:03 UTC
Actually, the server does respond to the SYN, but the resulting SYN-ACK is not processed correctly. It seems to disappear randomly on the client side, causing the retransmit.

I can confirm that enabling SO_LINGER works as a workaround.

Comment 14 Thomas Graf 2010-04-12 07:41:27 UTC
Since the issue suddenly disappeared in the latest upstream kernels, I started bisecting to see which change made it disappear.

I tracked it down to the following change:

commit 2bba22c50b06abe9fd0d23933b1e64d35b419262
Author: Mike Galbraith <efault>
Date:   Wed Sep 9 15:41:37 2009 +0200

    sched: Turn off child_runs_first
    
    Set child_runs_first default to off.
    
    It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
    get preempted by child tasks, reducing parallelism.
    [...]

Obviously this is not a real fix for the issue. It only shows that the issue is related to a race condition.

Comment 15 Thomas Graf 2010-04-12 08:42:59 UTC
While the issue is occurring, the accept queue appears to be overflowing. Both LISTENOVERFLOWS and LISTENDROPS are increasing, although not by the exact number of delay occurrences.

Comment 18 Thomas Graf 2010-11-16 15:34:28 UTC
The customer is happy with the workaround and no longer interested in having this resolved.

No solution was found upstream, thus closing.

Comment 19 Mark Wu 2011-01-06 07:39:46 UTC
Hi Thomas,
     We have another customer hitting this issue, and they can't accept the workaround, so I am reopening it!

     I can't reproduce it on RHEL 5, but I can reproduce it on FC14 with child_runs_first enabled. In the child_runs_first case, the issue appears to be caused by a full accept queue: the child processes forked by sshd are scheduled before sshd itself, so sshd doesn't get a chance to accept new connections from the queue. From the systemtap log, I can see the backlog grow from 0 to 129. When it reaches 129, a new SYN is dropped, and sshd then accepts all requests in the accept queue. The dropped request is retransmitted by the client and accepted the second time, because the backlog has by then drained back to zero. (A server sketch illustrating this mechanism follows the systemtap log below.)

     In the log that follows, the columns appear to be: function, timestamp (seconds), source port, listen port, current backlog, and maximum backlog.


tcp_v4_conn_request            1294297510   41898      22     125     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41899      22     126     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41900      22     127     128
reqsk_alloc                    1294297510
inet_csk_accept                1294297510      22
tcp_v4_conn_request            1294297510   41901      22     127     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41902      22     128     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41903      22     129     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41904      22     129     128
reqsk_alloc                    1294297510
tcp_v4_conn_request            1294297510   41905      22     129     128  
<comment>
This request from srcport 41905 was dropped in tcp_v4_conn_request, but I can't explain why requests 41903 and 41904 could get a response.
</comment>
inet_csk_accept                1294297510      22
inet_csk_accept                1294297510      22
inet_csk_accept                1294297510      22
inet_csk_accept                1294297510      22

...

inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
inet_csk_accept                1294297511      22
tcp_v4_conn_request            1294297513   41905      22       0     128
<comment>
41905 got a response this time.
</comment>

reqsk_alloc                    1294297513
inet_csk_accept                1294297512      22


tcp_v4_conn_request            1294297513   41906      22       0     128
reqsk_alloc                    1294297513
tcp_v4_conn_request            1294297513   41907      22       1     128
reqsk_alloc                    1294297513
tcp_v4_conn_request            1294297513   41908      22       2     128
reqsk_alloc                    1294297513
tcp_v4_conn_request            1294297513   41909      22       3     128
reqsk_alloc                    1294297513
tcp_v4_conn_request            1294297513   41910      22       4     128
reqsk_alloc                    1294297513

Comment 20 Mark Wu 2011-01-06 07:42:17 UTC
Created attachment 471995 [details]
systemtap script

Comment 21 Mark Wu 2011-01-06 07:44:26 UTC
Created attachment 471997 [details]
systemtap log

Comment 28 Thomas Graf 2011-03-09 08:46:09 UTC
The solution to this issue is to increase the backlog value on the listening socket. This is not a kernel bug.