Bug 205966 - Firewall - Premature ip_conntrack timer expiry on 3+ ack or window size advertisements - (hanging tomcat threads problem)
Firewall - Premature ip_conntrack timer expiry on 3+ ack or window size adver...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
4.0
i686 Linux
medium Severity low
: ---
: ---
Assigned To: Neil Horman
Brian Brock
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2006-09-10 22:29 EDT by George Hansper
Modified: 2008-07-24 15:11 EDT (History)
5 users (show)

See Also:
Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-07-24 15:11:49 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
Tar file of scripts to reproduce the problem, plus a patch to fix it (2.25 KB, application/x-bzip2)
2006-09-10 22:29 EDT, George Hansper
no flags Details
patch to fix problem w/o breaking abi (1.60 KB, patch)
2007-02-16 11:13 EST, Neil Horman
no flags Details | Diff
Test results - including tcpdump output (9.72 KB, text/plain)
2007-02-22 23:00 EST, George Hansper
no flags Details

  None (edit)
Description George Hansper 2006-09-10 22:29:47 EDT
Description of problem:

The linux firewall loses track of valid TCP connections, due to premature
expiry of the ip_conntrack timer.

This can result in one end of the connection being kept open
indefinitely, waiting for more data. This data is now dropped by
the firewall, due to the ip_conntrack timer having expired.

Example: A tomcat server which is supplying data to a front-end apache
server typically has a limit to the number of "threads" it allows.
These threads are related to the number of open TCP/IP connections.
TCP/IP connections affected by this problem are kept open, and accumulate
over time.

Eventually the tomcat server reaches the limit of allowed threads, and
stops serving requests from the apache server.

This problem is caused by insufficient criteria used for "dead-peer" detection.
The current algorithm is: if 3 of the "same" packets are seen in one
direction, with no traffic in the other direction, the receiver is treated
as a "dead-peer" and the ip_conntrack timer is reset to 5 minutes.

The criteria for being the "same" is based on the TCP sequence number,
but does not include:
a) the TCP acknowledgement number
b) the TCP window size advertisement

Version-Release number of selected component (if applicable):
2.6.9-42.0.2 and earlier 2.6 series kernels.

How reproducible:
Partially reproducible. Requires 2 hosts, which may be VMware hosts.
The procedure below reproduces the case where 3 or more non-duplicate ACK 
packets are sent in one direction.

The case of 3 or more tcp window advertisements is visible during normal operation,
but is more difficult to reproduce at will.

Steps to Reproduce:
===================
1. On "server" host, run (attached): configure_server_eth0.sh
2. On "server" host, run (attached): ip_conntrack_bug_server.tcl
3. On "server" host, run (attached): ip_conntrack_bug_monitor.pl
4. On "server" host, run:            tcpdump -i eth0 -S port 33000
4. On "client" host, run (attached): ip_conntrack_bug_client.tcl
   or run:                           telnet server 33000
  
Actual results:
===============
Output from ip_conntrack_bug_monitor.pl
---------------------------------------
10:10:48.184 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:50.606 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:51.239 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:52.719 000298 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:53.719 000297 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
.
10:10:55.037 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:56.031 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:57.031 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
.
10:10:58.726 000300 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:58.735 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:10:59.724 000298 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
10:11:00.724 000297 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=52324 dport=33000
.................................until timer expires after 5 minutes


Expected results:
=================
Output from ip_conntrack_bug_monitor.pl
---------------------------------------
09:30:18.033 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:19.407 432000 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:19.413 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:20.407 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:21.054 000300 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:21.059 000299 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:21.270 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:22.266 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:23.266 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
.
09:30:24.964 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:25.961 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:26.961 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
.
09:30:28.662 431999 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:29.660 431998 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000
09:30:30.658 431997 ESTABLISHED src=192.168.215.1 dst=192.168.215.128
sport=44869 dport=33000



Additional info:
================
This problem is partially resolved in the current series kernel - 2.6.17.11,
in that the ACK number is now tracked. The TCP window-size advertisement is not
tracked.

Another symptom of this problem are dropped "FIN" packets on a firewall.

In the case of a slow HTTP/1.1 client connected to an apache server with the
keep-alive option,
the client will eventually close the connection.
However the firewall may already have lost the connection state due to this problem,
and the FIN packet from the client will be dropped by the firewall.

To show any "lost" tcp connections, use:

netstat -na | gawk '/ESTAB/ { local=$4; gsub(".*:","",local) ; remote=$5 ;
gsub(".*:","",remote); result=system("grep -q port=" remote ".\\*port=" remote "
 /proc/net/ip_conntrack"); print local"/"remote " = " ( result ? "not found" :
"found" );}'
Comment 1 George Hansper 2006-09-10 22:29:47 EDT
Created attachment 135942 [details]
Tar file of scripts to reproduce the problem, plus a patch to fix it
Comment 2 Neil Horman 2007-02-12 14:20:52 EST
I'm sorry, but your patch doesn't seem to correlate to your description of the
problem.  You claim that insufficient dead-peer detectection is the root cause
of your hung services, specifically the inability to differentiate dead from
live peers based on lack ack and window size seen.  However, the code as
currently written currently does appear track last ack seen as well as window
size, and your patch removes those tracks, which seems counter to your
description.  Can you comment on this please, and also, can you let me know what
the upstream status on this is?  I've looked, and this currently is out of line
from what upstream has, and given the date on this bz, I'm guessing if it was
submitted, it met with some objection (although I've yet to find any discussion
either way on it).  Thank you!
Comment 3 George Hansper 2007-02-13 16:01:14 EST
Can you be more specific about "the code as currently written".
Which version of the kernel are you looking at?

What do you mean by "upstream status"?

One half of the problem was already fixed in linux-2.6.17.13 (last_ack),
and the other half of the problem was accepted in bug:

  https://bugzilla.netfilter.org/bugzilla/show_bug.cgi?id=511

I was really hoping to get a fix in a Red Hat ES4 series kernel.
Comment 4 Neil Horman 2007-02-14 09:12:30 EST
"What do you mean by "upstream status"?"

I mean, what have you done with this patch, the one in your tarball:

--- linux-2.6.9/net/ipv4/netfilter/ip_conntrack_proto_tcp.c     2006-09-11
08:27:28.000000000 +1000
+++ linux-2.6.9/net/ipv4/netfilter/ip_conntrack_proto_tcp.c.orig       
2004-10-19 07:55:29.000000000 +1000
@@ -663,16 +663,12 @@
                if (*index == TCP_ACK_SET) {
                        if (state->last_dir == dir
                            && state->last_seq == seq
-                           && state->last_end == end
-                           && state->last_ack == ack
-                           && state->last_win == win)
+                           && state->last_end == end)
                                state->retrans++;
                        else {
                                state->last_dir = dir;
                                state->last_seq = seq;
                                state->last_end = end;
-                               state->last_ack = ack;
-                               state->last_win = win;
                                state->retrans = 0;
                        }
                }
--- linux-2.6.9/include/linux/netfilter_ipv4/ip_conntrack_tcp.h 2006-09-11
09:28:19.000000000 +1000
+++ linux-2.6.9/include/linux/netfilter_ipv4/ip_conntrack_tcp.h.orig   
2004-10-19 07:53:51.000000000 +1000
@@ -42,8 +42,6 @@
        u_int8_t        last_index;     /* Index of the last packet */
        u_int32_t       last_seq;       /* Last sequence number seen in dir */
        u_int32_t       last_end;       /* Last seq + len */
-       u_int32_t       last_ack;       /* Last sequence number seen in other dir */
-       u_int16_t       last_win;       /* Last window advertisement seen in dir */
 };

 #endif /* _IP_CONNTRACK_TCP_H */


Have you proposed it upstream?  Was there any discussion about it?  It appears
to undo what th upstream fixes did, and does the opposite from what you describe
the fix as being.  Or did you just diff your fix backwards?
Comment 5 George Hansper 2007-02-15 19:59:45 EST
Yes, it seems that I did 'diff it backwards'.

A similar patch was submitted to the netfilter group (see above bug-report),
and was accepted at the time.

It is not exactly the same patch, because the problem was partially fixed in the
2.6.17.13 kernel, where as this patch was for the RedHat ES4 2.6.9 kernel.

In essence, the patch should be adding the lines with 'last_ack' and 'last_win',
not removing them.

Sorry for the confusion.
Comment 6 Neil Horman 2007-02-16 11:13:45 EST
Created attachment 148203 [details]
patch to fix problem w/o breaking abi

Ok, thanks for clearing that up.  unfortunately, the patch is an ABI breaker,
since it adds fields to ip_ct_tcp, which is the largest member of a union that
is instantiatied in the middle of struct ip_conntrack.	Since struct
ip_conntrack can be used by out of tree modules, we can't shift any of the data
around in there.  This patch should preserve ABI and provide the same fix. 
Please test it out and confirm that the problem is resolved by this patch. 
Thanks!
Comment 7 George Hansper 2007-02-22 23:00:23 EST
Created attachment 148644 [details]
Test results - including tcpdump output

I've tested the revised patch, and it does resolve the issue reported
for multiple non-retransmitted ACK packets.

I have not tested it with window-size advertisements, as I do not
have a suitable test-method for this case.

The test method was as follows:

server = an arbitrary server (not patched)
client = the kernel under test

1. On "server" host, run (attached): configure_server_eth0.sh
2. On "server" host, run (attached): ip_conntrack_bug_server.tcl
3. On "client" host, run (attached): ip_conntrack_bug_monitor.pl
4. On "client" host, run:	     tcpdump -i eth0 -S port 33000
5. On "client" host, run (attached): ip_conntrack_bug_client.tcl
   or run:			     telnet server 33000

Results:
--------
Note in particular the packets ack'ing 2739, 2943, ...3963
Time stamp: 02:00:57.40xxxx

These would have resulted in the ip_conntrack timer being reset
to 300s in an unpatched kernel (this was confirmed in a separate test).

The timer IS reset to 300s at 02:01:01.098
This is due to a number of SACK (selective ACK) packets being sent in quick
succession.

This patch does not address the issue of SACK packets, but I suspect that
SACK packets should be ignored by the ip_conntrack timer.
Comment 8 Neil Horman 2007-08-08 15:49:22 EDT
setting back to assigned since it appears testing went well on this.  I'll make
sure it still applies and post it.
Comment 10 RHEL Product and Program Management 2007-12-06 11:25:37 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 12 Jason Baron 2007-12-07 16:53:26 EST
committed in stream U7 build 68.3. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 16 errata-xmlrpc 2008-07-24 15:11:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

Note You need to log in before you can comment on or make changes to this bug.