198602 – TCP connections stall, packets larger than MTU being sent.

Summary: TCP connections stall, packets larger than MTU being sent.

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-07-12 08:46 UTC by Karl Auerbach
Modified:	2015-01-04 22:27 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-11-24 22:55:04 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Ethereal trace of stalling TCP connection with oversize packets (format is libpcap) (34.71 KB, application/octet-stream) 2006-07-17 20:08 UTC, Karl Auerbach

Description Karl Auerbach 2006-07-12 08:46:18 UTC

Description of problem:

TCP connections stall after several megabytes.

I monitored the traffic via ethereal on the outgoing interface and saw that
sometimes TCP packets were sent that were larger than the interface MTU (e.g.,
a 2800-byte Ethernet frame being sent out of an interface with an MTU of 1500).

Version-Release number of selected component (if applicable):

2.6.17-1.2141_FC4, but the problem seems to depend on the kernel version at each
end of the TCP connection and on which end opened the connection (but not on
which way the data is flowing).

(Also observed with 2.6.17-1.2145_FC5.)

How reproducible:

Always

Steps to Reproduce:
1. Create a large file, e.g. bigfile
2. scp bigfile othermachine:/dev/null
3. Watch it transfer a few megabytes and then enter a mode in which packets
above MTU are sent but not acknowledged.  Retransmissions occur, but they are
usually futile.  The connection locks up.
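
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the helper names `create_big_file` and `scp_command` are hypothetical, the 64 MB default is an assumption based on the "several tens of megabytes" mentioned in this report, and `othermachine` is the placeholder host from step 2.

```python
import os


def create_big_file(path, size_mb=64, chunk_mb=1):
    """Write size_mb megabytes of random data to path and return the file size.

    Random data is used so that any compression on the path cannot shrink
    the transfer below the size at which the stall shows up.
    """
    chunk = chunk_mb * 1024 * 1024
    remaining = size_mb * 1024 * 1024
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return os.path.getsize(path)


def scp_command(path, host="othermachine", dest="/dev/null"):
    """Build the argument list for step 2 (scp bigfile othermachine:/dev/null)."""
    return ["scp", path, f"{host}:{dest}"]
```

The list returned by `scp_command("bigfile")` can be passed to `subprocess.run` while a capture is running on the outgoing interface.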
  
Actual results:

TCP connection stalls.

Expected results:

The TCP connection should keep transferring data (on my network this typically
means about 500 kbyte/second to 2 Mbyte/second).

Additional info:

I watched the connections via ethereal (I'll try to get some traces and attach
them to this report tomorrow).  Even though the interface MTU was set to 1500,
packets of over 2800 bytes were being sent (which, if they got onto the wire at
all, were subsequently lost as they tried to traverse a switch that was not
capable of handling jumbograms).

TCP MTU probing was off (net.ipv4.tcp_mtu_probing = 0)

The network was typical 10 and 100 Mbit full-duplex Ethernet with consumer-grade
switches (i.e. most can't handle packets above about 1500 bytes) and Cisco 2621
routers.  The problem also occurred when running across the Internet.

I reverted back to either 2.6.16-1.2115_FC4 or 2.6.16-1.2111_FC4smp and the
problem went away.

Sorry for being so vague.  The problem only shows up on large transfers (several
tens of megabytes at least), through a sequence of routers and switches, and
only when at least one of the ends is a release later than 2.6.16-1.2111 or
2.6.16-1.2115.

Some of the machines that exhibited the problem were running FC5
(2.6.17-1.2145_FC5).

This showed up with scp and cvs updates.  It smells like a TCP stack issue
rather than an application issue.

Courtesy of a blown circuit breaker, there was a full power cycle of all of the
equipment involved - and the problem remained.

The workaround for me was to revert to older kernels.
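
The oversize-frame observation in this description can be checked mechanically against a list of captured frame lengths. A minimal sketch, assuming untagged Ethernet II framing (14-byte header) and a capture that omits the 4-byte FCS; the function name and the sample lengths are illustrative and are not taken from the attached trace.

```python
ETH_HEADER_LEN = 14  # untagged Ethernet II header; captures normally omit the FCS


def oversize_frames(frame_lengths, mtu=1500, l2_overhead=ETH_HEADER_LEN):
    """Return (frame_number, length) pairs for frames longer than mtu + l2_overhead.

    Frame numbers start at 1, matching how ethereal numbers frames.
    """
    limit = mtu + l2_overhead
    return [(i, n) for i, n in enumerate(frame_lengths, start=1) if n > limit]


# Illustrative lengths only: a full-size frame is 1514 bytes at MTU 1500,
# so only the 2800-byte frame is flagged.
print(oversize_frames([66, 1514, 2800, 1514]))  # [(3, 2800)]
```

Note that a capture taken on the sending host shows frames as the kernel handed them to the driver, which is not necessarily what reached the wire.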

Comment 1 Karl Auerbach 2006-07-17 20:08:31 UTC

Created attachment 132571
Ethereal trace of stalling TCP connection with oversize packets (format is libpcap)

This attachment is an ethereal dump of a tcp connection that stalls because of
oversize packets.

TCP sends oversize packets starting at frame 48.

The interface MTU is 1500.  The physical interface is an Intel e1000 that can
send packets up to about 16100 bytes.  The intervening network uses a variety
of gear, most of which is not willing to accommodate packets bigger than the
typical ~1500 byte variety found on most Ethernets.

The OS version on 192.202.17.213 is 2.6.17-1.2145_FC5.

The OS version on 71.132.98.41 is 2.6.17-1.2142_FC4.

The capture was made on machine 192.202.17.213.

Comment 2 Karl Auerbach 2006-08-26 06:35:49 UTC

It is possible that this bug is not a bug.

Rather, it may be that I was being fooled by the TCP segmentation offloading
mechanisms that are supported by the Intel Pro 1000 NICs in the machines I use.
When I run "ethtool -k", it tells me that segmentation offloading is "on".

However, there does seem to be some interaction between recent TCP kernel code
and the inspectors in Cisco IOS-based firewall code.  But I haven't been able to
isolate it.
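
With TCP segmentation offload (TSO) enabled, the kernel hands the NIC one large segment and the NIC splits it into MSS-sized packets, so a capture on the sending host can show frames larger than the MTU even though the wire traffic is normal. The sketch below parses "ethtool -k" output for the TSO state and estimates how many wire packets one captured frame becomes. The function names are hypothetical, the sample output is illustrative rather than copied from the affected machines, and the 2734-byte payload assumes a 2800-byte frame with 14 + 20 + 32 bytes of Ethernet, IP, and TCP headers.

```python
def tso_enabled(ethtool_k_output):
    """Return True or False for the TSO state, or None if it is not reported.

    Accepts both the older label "tcp segmentation offload" and the
    hyphenated "tcp-segmentation-offload" used by newer ethtool versions.
    """
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        key = name.strip().lower().replace("-", " ")
        if key == "tcp segmentation offload":
            return value.strip().lower().startswith("on")
    return None


def wire_segments(payload_len, mss):
    """Number of MSS-sized packets needed to carry payload_len bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return -(-payload_len // mss)  # ceiling division


# Illustrative "ethtool -k eth0" output (assumed, not from this report).
SAMPLE = """Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
"""

print(tso_enabled(SAMPLE))        # True
print(wire_segments(2734, 1448))  # 2
```

To rule TSO out as the explanation, it can be switched off with "ethtool -K eth0 tso off" (interface name assumed) before capturing again, or the capture can be taken on a machine other than the sender.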

Comment 3 Dave Jones 2006-09-17 03:07:03 UTC

[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 4 Dave Jones 2006-10-17 00:14:38 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 5 Dave Jones 2006-11-24 22:55:04 UTC

This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.
