Description of problem:
The problem occurs in an environment where administrators log in from a RHEL4U4 workstation (kernel 2.6.9-42.ELsmp) to hundreds of RHEL4U3 servers (kernel 2.6.9-34.ELsmp) over the Internet. Sometimes an ssh session stops responding, and the tcpdump data shows anomalous retransmissions.

Version-Release number of selected component (if applicable):
kernel-2.6.9-34.ELsmp

How reproducible:
rarely

Steps to Reproduce:
1. Make hundreds of concurrent ssh connections.
2. Type some commands in the terminal, especially commands that generate a large amount of terminal output.
3.

Actual results:
Sometimes the terminal hangs: the output of the last command stops and typed characters get no response.

Expected results:
The session continues to operate.

Additional info:
The attachment is a full session tcpdump from the client side. We don't have server-side tcpdump data for this connection, because the problem happens rarely and it is very difficult to capture packets on hundreds of running production servers, but earlier captures from the server side show no massive packet loss.

We masked the address part of client-to-server packets to '>>>' and of server-to-client packets to spaces, and we filtered out the nop and timestamp options to make the output easier to read.

The interesting part starts at line 39502 (starred below), where the server retransmits packet 2300959925:2300961373 to the client. Just before that, the client had acknowledged sequence 2300966901, which is several packets past 2300961373.

  03:56:41.491259     P 2300965717:2300966901(1184) ack 2379603710 win 8840
  03:56:41.491430 >>> . ack 2300966901 win 16022
  03:56:41.492200 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
* 03:56:41.735891     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:41.735907 >>> . ack 2300966901 win 16022
  03:56:41.739594 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
  03:56:42.221765     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:42.221784 >>> . ack 2300966901 win 16022
  03:56:42.233519 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
  03:56:43.193557     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:43.193573 >>> . ack 2300966901 win 16022

When the client receives the retransmission, it repeats the ack of 2300966901 to the server. But after a while (about 0.5 seconds here) the server retransmits the same segment again, as if it never got the ack. This behaviour repeats for about 15 minutes, until the client side considers the connection lost and the read syscall returns -1 with ETIMEDOUT, as seen in an strace of the ssh process:

  04:12:47.562502 read(3, 0xbffc2ac0, 8192) = -1 ETIMEDOUT (Connection timed out)

During this time the network condition is good: dozens of other ssh sessions are working well, the client receives every retransmitted packet from the server, and earlier manual tcpdumps on the server show that the client's repeated ack packets arrive there as well.

All servers have iptables settings like the following, so I think this might be related to a problem in the ip_conntrack module:

  /sbin/iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  /sbin/iptables -A INPUT -s 192.168.6.0/24 -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
  ...
  /sbin/iptables -A INPUT -j DROP
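For anyone who wants to look for the same pattern in their own captures, here is a minimal sketch of how the stale retransmissions could be picked out automatically. It assumes the masked text format used in this report ('>>>' marks client-to-server lines, nop/timestamp options already filtered) and ignores sequence number wraparound. The names LINE, SAMPLE and stale_retransmissions are made up for this example and are not part of any tool.

```python
import re

# One line of the masked capture, for example:
#   03:56:41.735891     . 2300959925:2300961373(1448) ack 2379603710 win 8840
#   03:56:41.735907 >>> . ack 2300966901 win 16022
# A leading '*' (the marker used in this report) is tolerated.
LINE = re.compile(
    r"^\*?\s*(?P<ts>\d\d:\d\d:\d\d\.\d+)\s+(?P<dir>>>>)?\s*(?P<flags>\S+)"
    r"(?:\s+(?P<start>\d+):(?P<end>\d+)\(\d+\))?"
    r"(?:\s+ack\s+(?P<ack>\d+))?"
)

# Lines copied from the capture excerpt in this report.
SAMPLE = """\
03:56:41.491259     P 2300965717:2300966901(1184) ack 2379603710 win 8840
03:56:41.491430 >>> . ack 2300966901 win 16022
03:56:41.492200 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
* 03:56:41.735891     . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:41.735907 >>> . ack 2300966901 win 16022
03:56:41.739594 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
03:56:42.221765     . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:42.221784 >>> . ack 2300966901 win 16022
03:56:42.233519 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
03:56:43.193557     . 2300959925:2300961373(1448) ack 2379603710 win 8840
03:56:43.193573 >>> . ack 2300966901 win 16022
"""


def stale_retransmissions(lines):
    """Return (timestamp, start, end, highest_ack) for every server segment
    that lies entirely below the highest ack the client has already sent."""
    highest_ack = 0
    found = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # not a packet line in the expected format
        if m.group("dir"):
            # Client-to-server packet: remember the highest ack seen so far.
            if m.group("ack"):
                highest_ack = max(highest_ack, int(m.group("ack")))
        elif m.group("start"):
            # Server-to-client data segment.
            start, end = int(m.group("start")), int(m.group("end"))
            if end <= highest_ack:
                found.append((m.group("ts"), start, end, highest_ack))
    return found


if __name__ == "__main__":
    for ts, start, end, acked in stale_retransmissions(SAMPLE.splitlines()):
        print(f"{ts}: server resent {start}:{end}, client had already acked {acked}")
```

Running the function over the full attached capture instead of SAMPLE should list every retransmission the server sent for data the client had already acknowledged, which is the symptom described above.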
Created attachment 152869
sample tcpdump output
This problem is confirmed. When I add LOG rules before the iptables DROP rule, I see their match counters increase and the packets logged in /var/log/messages while the session is locked up. When I then insert an ACCEPT rule before the stateful rules, the locked session resumes.
Are you still experiencing the problem?
Yes. We had the customer add non-stateful iptables rules to avoid this problem, but on any server (RHEL4 kernel, not sure of the patch level) that is still running the stateful firewall, the problem does occur. It is easy to reproduce: open a dozen SSH connections to the server, perform several operations, and leave them alone for 10-30 minutes. When you come back and type a command that produces a lot of output (for example, for i in `seq 10`; do dmesg; done), it is very likely that the output will block until you either add a non-stateful iptables rule to accept the packets or the connection times out. This means the firewall still recognizes the connection after the idle period, but suddenly starts dropping its packets some time afterwards.
RHEL4 has entered the Extended Life Phase. There will be no more minor releases. I'm closing this bug due to inactivity. Please reopen and provide an explanation if you need this issue to be addressed in RHEL4. Please note that only security and critical bugfixes are considered at this point.