236872 – TCP connection breaks by ignoring acknowledge packets?

Bug 236872 - TCP connection breaks by ignoring acknowledge packets?

Summary: TCP connection breaks by ignoring acknowledge packets?

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Thomas Graf
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-04-18 04:38 UTC by Gary Shi
Modified:	2014-06-18 08:29 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-05-10 12:51:56 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
sample tcpdump output (343.19 KB, application/octet-stream) 2007-04-18 04:38 UTC, Gary Shi	no flags	Details
View All

Description Gary Shi 2007-04-18 04:38:47 UTC

Description of problem:
The situation happens in an environment that administrators login from a RHEL4U4
workstation (kernel 2.6.9-42.ELsmp) to hundres of RHEL4U3 servers (kernel
2.6.9-34.ELsmp) over the Internet. Sometimes the ssh session stops responding,
and we found some problem in tcpdump data.

Version-Release number of selected component (if applicable):
kernel-2.6.9-34.ELsmp

How reproducible:
rarely

Steps to Reproduce:
1.make hundres of concurrent ssh connections
2.type some commands in terminal, especially commands that generate large
portion of terminal output
3.
  
Actual results:
sometimes the terminal suspends, output from the last command doesn't continue
and typing any characters get no response.

Expected results:
continues operation.

Additional info:
The attachment is a full session tcpdump from the client side. We don't have
server-side tcpdump data for this connection because the problem happens rarely
and it's very difficult to capture packets on hundres of running production
servers, but earlier captures from the server side shows there're no massive
packet loss.

We masked the address part of client-to-server packet to '>>>' and
server-to-client to spaces. And we filtered nop and timestamp options to make it
easier to read.

The interesting part starts from line 39502 (starred below), where the server
retransmit packet 2300959925:2300961373 to the client. Before that, the client
just acknowledged sequence 2300966901 which is several packets after 2300961373.

  03:56:41.491259     P 2300965717:2300966901(1184) ack 2379603710 win 8840
  03:56:41.491430 >>> . ack 2300966901 win 16022
  03:56:41.492200 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
* 03:56:41.735891     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:41.735907 >>> . ack 2300966901 win 16022
  03:56:41.739594 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
  03:56:42.221765     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:42.221784 >>> . ack 2300966901 win 16022
  03:56:42.233519 >>> P 2379603710:2379603758(48) ack 2300966901 win 16022
  03:56:43.193557     . 2300959925:2300961373(1448) ack 2379603710 win 8840
  03:56:43.193573 >>> . ack 2300966901 win 16022

When the client received the retransmission, he repeats the ack of 2300966901 to
the server again. But after some while (about 0.5 seconds here) the server
retransmit it again, seems like he doesn't get the ack packet. This behaviour
repeats for about 15 minutes, until the client side think the connection is lost
and make the recv syscall return -1 with ETIMEDOUT error, from strace of the ssh
process:

04:12:47.562502 read(3, 0xbffc2ac0, 8192) = -1 ETIMEDOUT (Connection timed out)

During this, the network condition is good, there're dozens of active ssh
sessions working well at this time, and we can get every retransmission packet
from the server, the server also receives client reacknowledge packet very well
from early manual tcpdumps.

All servers have iptables settings like this, I think this might relate to some
problems in the ip_conntrack module:

/sbin/iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
/sbin/iptables -A INPUT -s 192.168.6.0/24 -p tcp -m state --state NEW -m tcp
--dport 22 -j ACCEPT
...
/sbin/iptables -A INPUT -j DROP

Comment 1 Gary Shi 2007-04-18 04:38:47 UTC

Created attachment 152869 [details]
sample tcpdump output

Comment 2 Gary Shi 2007-05-10 06:04:47 UTC

This problem is confirmed. When I add LOG rules before the iptables DROP rule, I
see match count and logged packets in /var/log/messages when the session is
locked-up. Then I insert an ACCEPT rule before stateful rules, the locked
session continues.

Comment 3 Thomas Graf 2008-06-13 20:56:37 UTC

Are you still experiencing the problem?

Comment 4 Gary Shi 2008-06-16 12:30:00 UTC

Yes, we let the customer add non-stateful iptables rules to avoid this problem.
But if any server (RHEL4 kernel, not sure the patch level) is still running the
stateful firewall, this problem do occur. It is very likely to reproduce, if we
open a dozen of SSH connections to the server, perform several operations, and
leave them alone for 10-30 minutes, then you go back to type a command that
produces a lot of outputs (for example, for i in `seq 10`; do dmesg; done), tit
is very likely the output will block, until you add a non-stateful iptables rule
to accept the packet, or the connection will timeout. It means the firewall
still recognizes the connection after the idle, but suddenly drops packets
afterwhile.

Comment 5 Thomas Graf 2012-05-10 12:51:56 UTC

RHEL4 has entered the Extended Life Phase. There will be no more minor releases.

I'm closing this bug due to inactivity.

Please reopen and provide an explanation if you need this issue to be addressed in RHEL4. Please note that only security and critical bugfixes are considered at this point.

Note You need to log in before you can comment on or make changes to this bug.