Description of problem: The linux firewall loses track of valid TCP connections, due to premature expiry of the ip_conntrack timer. This can result in one end of the connection being kept open indefinitely, waiting for more data. This data is now dropped by the firewall, due to the ip_conntrack timer having expired. Example: A tomcat server which is supplying data to a front-end apache server typically has a limit to the number of "threads" it allows. These threads are related to the number of open TCP/IP connections. TCP/IP connections affected by this problem are kept open, and accumulate over time. Eventually the tomcat server reaches the limit of allowed threads, and stops serving requests from the apache server. This problem is caused by insufficient criteria used for "dead-peer" detection. The current algorithm is: if 3 of the "same" packets are seen in one direction, with no traffic in the other direction, the receiver is treated as a "dead-peer" and the ip_conntrack timer is reset to 5 minutes. The criteria for being the "same" is based on the TCP sequence number, but does not include: a) the TCP acknowledgement number b) the TCP window size advertisement Version-Release number of selected component (if applicable): 2.6.9-42.0.2 and earlier 2.6 series kernels. How reproducible: Partially reproducible. Requires 2 hosts, which may be VMware hosts. The procedure below reproduces the case where 3 or more non-duplicate ACK packets are sent in one direction. The case of 3 or more tcp window advertisements is visible during normal operation, but is more difficult to reproduce at will. Steps to Reproduce: =================== 1. On "server" host, run (attached): configure_server_eth0.sh 2. On "server" host, run (attached): ip_conntrack_bug_server.tcl 3. On "server" host, run (attached): ip_conntrack_bug_monitor.pl 4. On "server" host, run: tcpdump -i eth0 -S port 33000 4. On "client" host, run (attached): ip_conntrack_bug_client.tcl or run: telnet server 33000 Actual results: =============== Output from ip_conntrack_bug_monitor.pl --------------------------------------- 10:10:48.184 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:50.606 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:51.239 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:52.719 000298 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:53.719 000297 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 . 10:10:55.037 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:56.031 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:57.031 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 . 10:10:58.726 000300 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:58.735 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:10:59.724 000298 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 10:11:00.724 000297 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=52324 dport=33000 .................................until timer expires after 5 minutes Expected results: ================= Output from ip_conntrack_bug_monitor.pl --------------------------------------- 09:30:18.033 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:19.407 432000 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:19.413 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:20.407 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:21.054 000300 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:21.059 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:21.270 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:22.266 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:23.266 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 . 09:30:24.964 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:25.961 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:26.961 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 . 09:30:28.662 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:29.660 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 09:30:30.658 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128 sport=44869 dport=33000 Additional info: ================ This problem is partially resolved in the current series kernel - 2.6.17.11, in that the ACK number is now tracked. The TCP window-size advertisement is not tracked. Another symptom of this problem are dropped "FIN" packets on a firewall. In the case of a slow HTTP/1.1 client connected to an apache server with the keep-alive option, the client will eventually close the connection. However the firewall may already have lost the connection state due to this problem, and the FIN packet from the client will be dropped by the firewall. To show any "lost" tcp connections, use: netstat -na | gawk '/ESTAB/ { local=$4; gsub(".*:","",local) ; remote=$5 ; gsub(".*:","",remote); result=system("grep -q port=" remote ".\\*port=" remote " /proc/net/ip_conntrack"); print local"/"remote " = " ( result ? "not found" : "found" );}'
Created attachment 135942 [details] Tar file of scripts to reproduce the problem, plus a patch to fix it
I'm sorry, but your patch doesn't seem to correlate to your description of the problem. You claim that insufficient dead-peer detectection is the root cause of your hung services, specifically the inability to differentiate dead from live peers based on lack ack and window size seen. However, the code as currently written currently does appear track last ack seen as well as window size, and your patch removes those tracks, which seems counter to your description. Can you comment on this please, and also, can you let me know what the upstream status on this is? I've looked, and this currently is out of line from what upstream has, and given the date on this bz, I'm guessing if it was submitted, it met with some objection (although I've yet to find any discussion either way on it). Thank you!
Can you be more specific about "the code as currently written". Which version of the kernel are you looking at? What do you mean by "upstream status"? One half of the problem was already fixed in linux-2.6.17.13 (last_ack), and the other half of the problem was accepted in bug: https://bugzilla.netfilter.org/bugzilla/show_bug.cgi?id=511 I was really hoping to get a fix in a Red Hat ES4 series kernel.
"What do you mean by "upstream status"?" I mean, what have you done with this patch, the one in your tarball: --- linux-2.6.9/net/ipv4/netfilter/ip_conntrack_proto_tcp.c 2006-09-11 08:27:28.000000000 +1000 +++ linux-2.6.9/net/ipv4/netfilter/ip_conntrack_proto_tcp.c.orig 2004-10-19 07:55:29.000000000 +1000 @@ -663,16 +663,12 @@ if (*index == TCP_ACK_SET) { if (state->last_dir == dir && state->last_seq == seq - && state->last_end == end - && state->last_ack == ack - && state->last_win == win) + && state->last_end == end) state->retrans++; else { state->last_dir = dir; state->last_seq = seq; state->last_end = end; - state->last_ack = ack; - state->last_win = win; state->retrans = 0; } } --- linux-2.6.9/include/linux/netfilter_ipv4/ip_conntrack_tcp.h 2006-09-11 09:28:19.000000000 +1000 +++ linux-2.6.9/include/linux/netfilter_ipv4/ip_conntrack_tcp.h.orig 2004-10-19 07:53:51.000000000 +1000 @@ -42,8 +42,6 @@ u_int8_t last_index; /* Index of the last packet */ u_int32_t last_seq; /* Last sequence number seen in dir */ u_int32_t last_end; /* Last seq + len */ - u_int32_t last_ack; /* Last sequence number seen in other dir */ - u_int16_t last_win; /* Last window advertisement seen in dir */ }; #endif /* _IP_CONNTRACK_TCP_H */ Have you proposed it upstream? Was there any discussion about it? It appears to undo what th upstream fixes did, and does the opposite from what you describe the fix as being. Or did you just diff your fix backwards?
Yes, it seems that I did 'diff it backwards'. A similar patch was submitted to the netfilter group (see above bug-report), and was accepted at the time. It is not exactly the same patch, because the problem was partially fixed in the 2.6.17.13 kernel, where as this patch was for the RedHat ES4 2.6.9 kernel. In essence, the patch should be adding the lines with 'last_ack' and 'last_win', not removing them. Sorry for the confusion.
Created attachment 148203 [details] patch to fix problem w/o breaking abi Ok, thanks for clearing that up. unfortunately, the patch is an ABI breaker, since it adds fields to ip_ct_tcp, which is the largest member of a union that is instantiatied in the middle of struct ip_conntrack. Since struct ip_conntrack can be used by out of tree modules, we can't shift any of the data around in there. This patch should preserve ABI and provide the same fix. Please test it out and confirm that the problem is resolved by this patch. Thanks!
Created attachment 148644 [details] Test results - including tcpdump output I've tested the revised patch, and it does resolve the issue reported for multiple non-retransmitted ACK packets. I have not tested it with window-size advertisements, as I do not have a suitable test-method for this case. The test method was as follows: server = an arbitrary server (not patched) client = the kernel under test 1. On "server" host, run (attached): configure_server_eth0.sh 2. On "server" host, run (attached): ip_conntrack_bug_server.tcl 3. On "client" host, run (attached): ip_conntrack_bug_monitor.pl 4. On "client" host, run: tcpdump -i eth0 -S port 33000 5. On "client" host, run (attached): ip_conntrack_bug_client.tcl or run: telnet server 33000 Results: -------- Note in particular the packets ack'ing 2739, 2943, ...3963 Time stamp: 02:00:57.40xxxx These would have resulted in the ip_conntrack timer being reset to 300s in an unpatched kernel (this was confirmed in a separate test). The timer IS reset to 300s at 02:01:01.098 This is due to a number of SACK (selective ACK) packets being sent in quick succession. This patch does not address the issue of SACK packets, but I suspect that SACK packets should be ignored by the ip_conntrack timer.
setting back to assigned since it appears testing went well on this. I'll make sure it still applies and post it.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
committed in stream U7 build 68.3. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html