Description of problem:
The problem occurs in an environment where administrators log in from a RHEL4U4
workstation (kernel 2.6.9-42.ELsmp) to hundreds of RHEL4U3 servers (kernel
2.6.9-34.ELsmp) over the Internet. Sometimes an ssh session stops responding,
and the tcpdump data shows the server retransmitting data the client has already acknowledged.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Make hundreds of concurrent ssh connections.
2. Type some commands in a terminal, especially commands that generate a large
amount of terminal output.
Sometimes the terminal hangs: output from the last command stops and typed
characters get no response.
The attachment is a full-session tcpdump from the client side. We don't have
server-side tcpdump data for this connection because the problem happens rarely
and it's very difficult to capture packets on hundreds of running production
servers, but earlier captures from the server side show there are no massive
We masked the address part of client-to-server packets as '>>>' and of
server-to-client packets as spaces, and filtered out the nop and timestamp
options to make the output easier to read.
The interesting part starts at line 39502 (starred below), where the server
retransmits segment 2300959925:2300961373 to the client. Just before that, the
client had acknowledged sequence 2300966901, which is several packets past 2300961373.
03:56:41.491259 P 2300965717:2300966901(1184) ack 2379603710 win 8840
03:56:41.491430 >>> . ack 2300966901 win 16022
03:56:41.492200 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
* 03:56:41.735891 . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:41.735907 >>> . ack 2300966901 win 16022
03:56:41.739594 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
03:56:42.221765 . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:42.221784 >>> . ack 2300966901 win 16022
03:56:42.233519 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
03:56:43.193557 . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:43.193573 >>> . ack 2300966901 win 16022
When the client receives the retransmission, it repeats the ack of 2300966901 to
the server. But after a while (about 0.5 seconds here) the server
retransmits the segment again, as if it never got the ack. This behaviour
repeats for about 15 minutes, until the client side decides the connection is lost
and makes the read syscall return -1 with an ETIMEDOUT error, from strace of the ssh
04:12:47.562502 read(3, 0xbffc2ac0, 8192) = -1 ETIMEDOUT (Connection timed out)
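The pattern described above (the server resending a segment that lies entirely below an ack the client has already sent) can be searched for mechanically in the masked capture. The following is a minimal sketch, not part of the original analysis: the function name find_stale_retransmits is made up for illustration, it assumes the masked text format shown above (client lines marked with '>>>', an optional leading '* '), and it ignores sequence-number wraparound.

```shell
# Hypothetical helper: print server segments that end at or below the
# highest ack already sent by the client. Reads files given as arguments,
# or stdin when none are given.
find_stale_retransmits() {
    awk '
        # Drop the "* " marker used in this report to star a line
        { sub(/^\* /, "") }

        # Client-to-server line: remember the highest ack seen so far
        />>>/ {
            for (i = 1; i < NF; i++)
                if ($i == "ack" && $(i + 1) + 0 > maxn) {
                    maxn = $(i + 1) + 0   # numeric copy for comparisons
                    maxs = $(i + 1)       # original text for printing
                }
            next
        }

        # Server-to-client line: look for the "start:end(len)" field
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+\([0-9]+\)$/) {
                    split($i, r, /[:(]/)
                    if (maxn > 0 && r[2] + 0 <= maxn)
                        print $1, r[1] ":" r[2], "already acked up to", maxs
                }
        }
    ' "$@"
}
```

Run against a text capture (capture.txt is a placeholder name) as `find_stale_retransmits capture.txt`. Each output line is one server segment that the client had already acknowledged, which is the symptom seen here.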
During this time the network is in good condition: dozens of other ssh
sessions are working well, and we receive every retransmitted packet
from the server. Earlier manual tcpdumps show that the server also receives the
client's repeated acknowledgements.
All servers have iptables settings like this. I think the problem might be
related to the ip_conntrack module:
/sbin/iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
/sbin/iptables -A INPUT -s 192.168.6.0/24 -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
/sbin/iptables -A INPUT -j DROP
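One way to check whether this rule set is dropping packets that belong to a live ssh session is to log whatever reaches the final DROP rule. The sketch below is an assumption-laden illustration rather than the configuration actually used on these servers: the function name log_before_drop and the log prefixes are made up, and position 3 assumes the INPUT chain contains exactly the three rules listed above. Setting IPTABLES=echo previews the commands without touching the firewall.

```shell
# Path to iptables; override (for example IPTABLES=echo) for a dry run.
IPTABLES=${IPTABLES:-/sbin/iptables}

# Hypothetical helper: add LOG rules just ahead of the final DROP rule.
log_before_drop() {
    # Position 3 is where the DROP rule sits in the three-rule chain above,
    # so this LOG rule sees every ssh packet that is about to be dropped.
    "$IPTABLES" -I INPUT 3 -p tcp --dport 22 \
        -j LOG --log-prefix "ssh-drop: " --log-tcp-sequence

    # Inserted at the same position, so it lands ahead of the rule above.
    # It marks packets that connection tracking classifies as INVALID.
    "$IPTABLES" -I INPUT 3 -p tcp --dport 22 -m state --state INVALID \
        -j LOG --log-prefix "ssh-invalid: " --log-tcp-sequence
}
```

With the rules in place, packets from a hung session would show up in the kernel log with these prefixes, and `iptables -L INPUT -v -n` shows the per-rule match counters.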
Created attachment 152869
sample tcpdump output
This problem is confirmed. When I add LOG rules before the iptables DROP rule, I
see match counts and logged packets in /var/log/messages while the session is
locked up. When I then insert an ACCEPT rule before the stateful rules, the locked
Are you still experiencing the problem?
Yes, we had the customer add non-stateful iptables rules to avoid this problem.
But if any server (RHEL4 kernel, not sure of the patch level) is still running the
stateful firewall, the problem does occur. It is easy to reproduce: if you
open a dozen SSH connections to the server, perform several operations,
leave them alone for 10-30 minutes, and then go back and type a command that
produces a lot of output (for example, for i in `seq 10`; do dmesg; done), it
is very likely that the output will block until you add a non-stateful iptables
rule to accept the packets, or until the connection times out. It means the firewall
still recognizes the connection after the idle period, but suddenly drops packets
RHEL4 has entered the Extended Life Phase. There will be no more minor releases.
I'm closing this bug due to inactivity.
Please reopen and provide an explanation if you need this issue to be addressed in RHEL4. Please note that only security and critical bugfixes are considered at this point.