Bug 72857

Summary: MSS in the router does not work around CIPE dead-connection problem
Product: [Retired] Red Hat Linux
Reporter: Alexandre Oliva <aoliva>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: low
Priority: medium
Version: 8.0
Hardware: athlon
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:52 UTC
Bug Depends On: 752980, 1045207, 1045208

Description Alexandre Oliva 2002-08-28 14:39:54 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020809

Description of problem:
I have a CIPE tunnel connecting my home gateway to a machine at the university,
which I use, among many other things, to get transparent access to the news
server there.  The hardware setup is as follows:

My desktop is connected to my home gateway through 100 Mbps Ethernet, with a
switching hub in between.  It runs (null) with kernel 2.4.18-12.2.

My home gateway connects to the outside world through PPPoE ADSL.  It has a CIPE
tunnel to the server of the lab I run at the uni, and it routes all traffic to
IP addresses within the uni through this channel (except traffic to the server
itself).  It currently runs Valhalla with kernel 2.4.18-10, but I first noticed
the problem with kernel 2.4.18-5.

The lab server at the uni masquerades the packets it forwards from the CIPE
channel.  It is on a 100 Mbps network, plugged into a switch that is part of
the routing infrastructure of the institute.  It runs Valhalla, still with
kernel 2.4.18-5.  When it ran Enigma, the problem didn't show up.

There are at least two such switches between the lab server and the news server,
and possibly a router, but they don't show up in traceroute.

The news server runs some version of Red Hat Linux, probably Enigma, and probably
not with all the most recent patches, but that doesn't matter much.

The root of the problem is that the switching network drops the ICMP packets
that PMTUD depends on.  With tcpdump I see that the lab server receives
non-fragmentable packets that are too big to go through the CIPE channel, and it
replies with ICMP packets indicating the MTU of the CIPE channel, but these
packets never get back to the news server, which doesn't have any packet
filtering enabled.  Unfortunately, IS at the institute can't seem to find any
setting in the routing hardware that lets such packets through, so I have to fix
the problem elsewhere.
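
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the hop MTUs (1500 and 1400) are hypothetical values, not measurements from this network, and the function names are made up for illustration. It shows why a sender that never receives the ICMP "fragmentation needed" reply keeps retransmitting a packet that can never get through.

```python
from typing import List, Optional, Tuple


def forward(size: int, hop_mtus: List[int],
            icmp_delivered: bool) -> Tuple[bool, Optional[int]]:
    """Send one don't-fragment packet along the path.

    Returns (arrived, mtu_learned_by_sender).
    """
    for mtu in hop_mtus:
        if size > mtu:
            # The hop drops the packet and emits ICMP "fragmentation needed".
            # The sender only learns the MTU if that ICMP packet reaches it.
            return False, (mtu if icmp_delivered else None)
    return True, None


def send(size: int, hop_mtus: List[int], icmp_delivered: bool,
         max_tries: int = 5) -> Tuple[Optional[int], int]:
    """Retransmit until the packet arrives.

    Returns (number of tries used, or None if it never arrived; final size).
    """
    for attempt in range(1, max_tries + 1):
        arrived, learned = forward(size, hop_mtus, icmp_delivered)
        if arrived:
            return attempt, size
        if learned is not None:
            size = learned  # shrink to the advertised MTU and retry
    return None, size


if __name__ == "__main__":
    path = [1500, 1400]  # hypothetical hop MTUs, for illustration only
    print(send(1500, path, icmp_delivered=True))   # (2, 1400)
    print(send(1500, path, icmp_delivered=False))  # (None, 1500)
```

With the ICMP reply delivered, the second attempt succeeds at the smaller size. With it dropped, every retry uses the same oversized packet, which matches the "dead connection" symptom.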

Back when my home network ran Valhalla but the lab server ran Enigma, everything
worked just fine.  As soon as I upgraded the lab server to Valhalla, I started
getting dead connections to the news server.

Searching the internet, I found out that setting the MSS of the route to a
smaller value would fix the problem and, indeed, limiting the MSS of the route
from my home gateway to the other end of the CIPE channel to 1440 fixed it.
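
For reference, the arithmetic relating an MSS to the size of the packets it produces can be sketched as follows. This assumes IPv4 and TCP headers without options (20 bytes each); the helper names are illustrative and do not belong to any tool mentioned in this report.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload whose packet still fits in `mtu` bytes."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss: int) -> int:
    """Size of a full-sized IPv4 packet carrying `mss` bytes of TCP payload."""
    return mss + IPV4_HEADER + TCP_HEADER


if __name__ == "__main__":
    print(mss_for_mtu(1500))          # 1460, the usual Ethernet default
    print(packet_size_for_mss(1440))  # 1480
```

Under these assumptions, an MSS of 1440 corresponds to full-sized packets of 1480 bytes, so it only helps if every hop on the path can carry packets at least that large.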

That held until I upgraded my desktop to limbo2.  At that point, it started
getting dead connections again, and no amount of MSS limiting seemed to help.

I ended up limiting its own MTU to 1442, but I don't understand why the MSS
workaround, which would allow my local network to communicate more efficiently
while still fixing connections through the CIPE channel, no longer works.  Any
ideas?  Is MSS not supposed to limit the TCP segment size of all connections
going through it?  Why would upgrading one of the ends of the connection make
any difference?

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Connect one machine running Enigma (lab server) to another running, say,
Enigma (news server), such that ICMP PMTUD packets don't reach the news server.
2. Connect this machine running Enigma (lab server) to one running Valhalla
(home gateway), and set up a CIPE connection between them, such that traffic to
and from the news server goes through the CIPE channel.
3. Connect this machine running Valhalla (home gateway) to yet another one
running Valhalla (desktop).
4. Set up a server on the news server that, when connected to, sends a large
packet.
5. Connect to the server from the desktop.
6. Watch the network traffic between the news server and the lab server
(everything works).
7. Upgrade the lab server to Valhalla.
8. Connect again.
9. Watch the network traffic, and notice the large packet doesn't get through.
10. Change the route settings in the home gateway such that the route to the
news server has an MSS of 1440.
11. Connect again.
12. Watch the network traffic, and see it works.
13. Upgrade the desktop to limbo2 or (null) (perhaps limbo 1 too, I didn't try
that).
14. Connect once again.
15. Watch the network traffic, and see the packet doesn't get through.
16. Set the MTU of the desktop to 1442.
17. Connect again.
18. It works again, but the local network is less efficient.

Actual Results:  Already summarized above.

Expected Results:  I'd have hoped the MSS setting in the gateway would have
taken care of the problem.  Why did it stop working when the desktop was
upgraded to limbo2?

Additional info:

Comment 1 Alexandre Oliva 2003-01-02 18:50:10 UTC
A new bit of info just came in: the problem seems to be caused by the lack of
masquerading of ICMP "need to fragment" packets.  The IP address of the target
machine remains unchanged, instead of being masqueraded, so the sender of the
big packet has no way to tell on which connections to lower the MTU.
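
To make this concrete: an ICMP "fragmentation needed" message (type 3, code 4, as laid out in RFC 1191) carries the next-hop MTU followed by a copy of the offending packet's IP header, and the sender matches that embedded header against its own connections. The sketch below builds and parses such a message using only the standard library. All addresses and the MTU value are hypothetical, and the checksum is left at zero because nothing here validates it.

```python
import socket
import struct


def build_frag_needed(mtu: int, src: str, dst: str) -> bytes:
    """Build an ICMP type 3 / code 4 message quoting a packet from src to dst."""
    # Embedded IPv4 header of the dropped packet: version/IHL, TOS, total
    # length, ID, flags (DF set), TTL, protocol (6 = TCP), checksum, src, dst.
    inner = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 1500, 0, 0x4000, 64, 6, 0,
                        socket.inet_aton(src), socket.inet_aton(dst))
    # ICMP header: type, code, checksum (zero here), unused, next-hop MTU.
    header = struct.pack("!BBHHH", 3, 4, 0, 0, mtu)
    return header + inner + b"\x00" * 8  # plus first 8 bytes of the payload


def parse_frag_needed(icmp: bytes):
    """Return (next_hop_mtu, embedded_src, embedded_dst)."""
    icmp_type, code, _cksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if (icmp_type, code) != (3, 4):
        raise ValueError("not an ICMP fragmentation-needed message")
    inner = icmp[8:]
    return (mtu,
            socket.inet_ntoa(inner[12:16]),
            socket.inet_ntoa(inner[16:20]))


if __name__ == "__main__":
    # Hypothetical: the sender (192.0.2.10) only knows the masquerading
    # address 198.51.100.1, but the quoted header names the internal host.
    known_peers = {"198.51.100.1"}
    msg = build_frag_needed(1400, "192.0.2.10", "10.0.0.5")
    mtu, src, dst = parse_frag_needed(msg)
    print(mtu, src, dst, dst in known_peers)
```

If the quoted destination is the internal, un-masqueraded address, the sender finds no connection to that peer and cannot apply the advertised MTU, which is the situation this comment describes.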

Comment 2 Alexandre Oliva 2003-01-02 19:30:50 UTC
And iptables' TCPMSS rules seem to be able to "fix" the problem for me.  I'll
only be able to tell for sure after all PMTUs get dropped from caches all over,
but it's looking like it's exactly the right solution for the problem.  Unless a
connection is set up before the MTU is discovered, in which case there's not
much that can be done...

Comment 3 Alexandre Oliva 2003-02-24 00:59:58 UTC
I can confirm that adding a line such as this to my gateway's iptables
configuration file fixes the problem:

[0:0] -A FORWARD -o cipcb0 -p tcp -m tcp --syn -j TCPMSS --clamp-mss-to-pmtu
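
As an illustration of what such a rule does to a forwarded SYN, the sketch below rewrites the MSS option in a TCP options field so that it never exceeds the path MTU minus 40 bytes (the IPv4 figure given in the iptables documentation for --clamp-mss-to-pmtu). It is a simplified model, not the kernel's implementation: it works on the raw options bytes only and does not update the TCP checksum, which a real packet rewrite must do. The path MTU used in the example is hypothetical.

```python
import struct

TCPOPT_EOL = 0  # end of option list
TCPOPT_NOP = 1  # one-byte padding option
TCPOPT_MSS = 2  # maximum segment size, length 4


def clamp_mss(options: bytes, path_mtu: int) -> bytes:
    """Lower the MSS option to at most path_mtu - 40; never raise it."""
    limit = path_mtu - 40  # 20-byte IPv4 header + 20-byte TCP header
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # truncated option, stop scanning
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break  # malformed option, stop scanning
        if kind == TCPOPT_MSS and length == 4:
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > limit:
                struct.pack_into("!H", out, i + 2, limit)
        i += length
    return bytes(out)


if __name__ == "__main__":
    # MSS 1460, then NOP, NOP, SACK-permitted (kind 4, length 2).
    syn_options = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([1, 1, 4, 2])
    clamped = clamp_mss(syn_options, path_mtu=1400)  # hypothetical path MTU
    print(struct.unpack_from("!H", clamped, 2)[0])   # 1360
```

Because the clamp is applied to the SYN as it is forwarded, both endpoints agree on a small enough segment size from the start, and the connection no longer depends on ICMP replies reaching the sender.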


Comment 4 Bugzilla owner 2004-09-30 15:39:52 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/