Bug 518651
Summary: It sometimes takes 3 seconds to connect to a TCP server
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: ---
Reporter: Marc Milgram <mmilgram>
Assignee: Thomas Graf <tgraf>
QA Contact: Network QE <network-qe>
CC: dhoward, dwu, fskola, greg.walters, herrold, jwest, mgoodwin, pep, plyons, qcai, rkhan, spoyarek, tao
Keywords: Reopened
Doc Type: Bug Fix
Clone(s): 619001 (view as bug list)
Last Closed: 2011-03-09 08:46:09 UTC
Bug Blocks: 525215, 533192, 619001
Description — Marc Milgram, 2009-08-21 13:41:40 UTC
Created attachment 358242 [details]
Test Program
I ran tests on my RHEL 5.3 system at work. I notice that many sockets end up in the TIME_WAIT state for a while. When I run the Xen Dom0 kernel, socket connections are always fast (under 10 msec per connection). Under a bare-metal kernel, I occasionally see a 3-second connection delay. So a workaround may be to use the Xen kernel.

This was reported on linux-net more than a year ago. The root cause is a race condition deep in the inet code somewhere. The workaround is to set the SO_LINGER option at the socket layer, to avoid "lingering" when the socket is closed, i.e. to consistently win the race (or lose it :). You can also set TCP_NODELAY at the IPPROTO_TCP layer to speed things up, but that is not strictly necessary to work around the random 3-second delay problem. I don't know of any /proc/net or sysfs tweaks that will fix this; the SO_LINGER workaround requires changes to the application.

I saw mention on a web page that this was fixed in 2.6.24.5. I built 2.6.24.5 and found that it is still broken. I also built and tried the latest upstream kernel (2.6.30.5); it is still broken.

Looking at tcpdumps, occasionally the server does not respond to a SYN. There is a 3-second delay, then the client sends another SYN, to which the server replies (as expected). So the real question is: why is the initial SYN being dropped?

Actually, the server does respond to the SYN, but the resulting SYN-ACK is not processed correctly. It seems to disappear randomly on the client side, which triggers the retransmit. I can confirm that enabling SO_LINGER works as a workaround.

Since the issue suddenly disappeared in the latest upstream kernels, I started bisecting to find which change made it disappear. I tracked it down to the following commit:

commit 2bba22c50b06abe9fd0d23933b1e64d35b419262
Author: Mike Galbraith <efault>
Date: Wed Sep 9 15:41:37 2009 +0200

    sched: Turn off child_runs_first

    Set child_runs_first default to off.
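The SO_LINGER/TCP_NODELAY workaround described in this report can be sketched in Python roughly as follows. This is a minimal client-side sketch, not the attached test program; the function name `connect_with_linger` is illustrative.

```python
import socket
import struct

def connect_with_linger(host, port, timeout=5.0):
    """Open a TCP client connection with the SO_LINGER workaround applied.

    l_onoff=1 with l_linger=0 makes close() reset the connection
    immediately instead of leaving the socket lingering in TIME_WAIT.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_LINGER takes a struct linger { int l_onoff; int l_linger; },
    # packed here as two native ints.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    # Optional: disable Nagle's algorithm. Speeds things up, but as noted
    # above it is not required for the 3-second-delay workaround itself.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    s.settimeout(timeout)
    s.connect((host, port))
    return s
```

Note that a linger timeout of zero causes close() to send an RST rather than a normal FIN handshake, which is why the socket never enters TIME_WAIT.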
    It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs get
    preempted by child tasks, reducing parallelism. [...]

Obviously this is not a real fix for the issue; it only shows that the issue is related to a race condition.

While the issue is occurring, the accept queue appears to be overflowing. Both LISTENOVERFLOWS and LISTENDROPS are increasing, although not by the exact number of delay occurrences.

Customer is happy with the workaround and is no longer interested in having this resolved. No solution found upstream. Thus closing.

Hi Thomas, we have another customer who hit this issue and cannot accept the workaround, so I am reopening it.

I can't reproduce it on RHEL 5, but I can reproduce it on FC14 with child_runs_first enabled. In the child_runs_first case, the issue is caused by the accept queue becoming full: the child processes forked by sshd are scheduled before sshd itself, so sshd gets no chance to accept new connections from the queue. From the systemtap log, I see the backlog increase from 0 to 129. When it reaches 129, a new SYN is dropped, and then sshd accepts all requests in the accept queue. The dropped request is retransmitted by the client and is accepted the second time because the backlog has dropped back to zero.

(The columns below appear to be: function, timestamp, source port, server port, current accept-queue backlog, maximum backlog.)

tcp_v4_conn_request 1294297510 41898 22 125 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41899 22 126 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41900 22 127 128
reqsk_alloc 1294297510
inet_csk_accept 1294297510 22
tcp_v4_conn_request 1294297510 41901 22 127 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41902 22 128 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41903 22 129 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41904 22 129 128
reqsk_alloc 1294297510
tcp_v4_conn_request 1294297510 41905 22 129 128
<comment> This request, with srcport 41905, was dropped in tcp_v4_conn_request, but I can't explain why requests 41903 and 41904 got responses. </comment>
inet_csk_accept 1294297510 22
inet_csk_accept 1294297510 22
inet_csk_accept 1294297510 22
inet_csk_accept 1294297510 22
...
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
inet_csk_accept 1294297511 22
tcp_v4_conn_request 1294297513 41905 22 0 128
<comment> 41905 was responded to this time </comment>
reqsk_alloc 1294297513
inet_csk_accept 1294297512 22
tcp_v4_conn_request 1294297513 41906 22 0 128
reqsk_alloc 1294297513
tcp_v4_conn_request 1294297513 41907 22 1 128
reqsk_alloc 1294297513
tcp_v4_conn_request 1294297513 41908 22 2 128
reqsk_alloc 1294297513
tcp_v4_conn_request 1294297513 41909 22 3 128
reqsk_alloc 1294297513
tcp_v4_conn_request 1294297513 41910 22 4 128
reqsk_alloc 1294297513

Created attachment 471995 [details]
systemtap script
Created attachment 471997 [details]
systemtap log
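The LISTENOVERFLOWS/LISTENDROPS counters mentioned above are exposed on Linux as `ListenOverflows` and `ListenDrops` in the `TcpExt:` section of `/proc/net/netstat` (a header line of names followed by a line of values). A small sketch for reading them; the helper name is illustrative.

```python
def parse_listen_counters(netstat_text):
    """Extract (ListenOverflows, ListenDrops) from /proc/net/netstat text.

    The file alternates "TcpExt: <names>" header lines with
    "TcpExt: <values>" lines; zip the first such pair into a dict.
    """
    lines = netstat_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("TcpExt:") and i + 1 < len(lines):
            keys = line.split()[1:]
            vals = lines[i + 1].split()[1:]
            stats = dict(zip(keys, (int(v) for v in vals)))
            return stats.get("ListenOverflows", 0), stats.get("ListenDrops", 0)
    return (0, 0)

# On a live system one would read the real file:
#   with open("/proc/net/netstat") as f:
#       overflows, drops = parse_listen_counters(f.read())
```

Watching these two counters while reproducing the delay is how the accept-queue overflow was correlated with the dropped SYNs in this report.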
The solution to this issue is to increase the backlog value on the listening socket. This is not a kernel bug.
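That fix can be sketched in Python as follows (the function name is illustrative). Note that the effective queue length is capped by the `net.core.somaxconn` sysctl (historically 128 by default), so raising only the `listen()` argument may not be enough on its own.

```python
import socket

def make_listener(port, backlog=1024):
    """Create a listening TCP socket with an explicitly large backlog.

    The kernel clamps the accept-queue length to
    min(backlog, net.core.somaxconn), so the sysctl may also need
    raising for the larger backlog to take effect.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(backlog)
    return srv
```

With a deeper queue, a burst of connections arriving while the accepting process is not scheduled (the child_runs_first scenario above) no longer overflows the queue, so SYNs are not dropped and the client never hits the 3-second retransmit timer.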