Bug 141850 - GbE Network Data Corruption, IP Queues, ip_frag_queue()
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: medium, Severity: medium
Assigned To: David Miller
Brian Brock
Depends On:
Blocks:
Reported: 2004-12-03 20:04 EST by mark
Modified: 2007-11-30 17:07 EST (History)
6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-01-25 23:17:42 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

  None
Description mark 2004-12-03 20:04:46 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20040913 Firefox/0.10

Description of problem:
GbE Network Data Corruption, IP Queues, ip_frag_queue()

I'm seeing data corruption while running a network test between
a client and a server. The client is Red Hat 9 (2.4.20-31.9) and the
server is RH EL3 AS.
The server has two identical files. The client compares the two files
via NFS on UDP.
Most of the time, the client sees two equal files. Occasionally a
miscompare occurs.
The corruption happens only occasionally but is deterministic.
The corrupted bits are a recycled portion of data that looks like an
earlier portion of the file.

I've been running this test scenario for several years on different
systems and NICs.
It suddenly came up, seemingly because of the transition to GbE NICs.

My fix at the moment is to decrease the number in
/proc/sys/net/ipv4/ipfrag_time.
I do this on the client side.
I changed it from 30 seconds to 7 seconds (after trying other numbers
in between).
The rest of the text will explain why ...

A little background: The test we have does reads/writes in 4KB buffers.
Each request gets broken down into three packets, which are called
fragments in the IP layer.

In ip_frag_queue(), fragments that share the same IP ID go to the
same IP queue.
The queue stays around until all the member fragments are in.
When that happens, ip_frag_reasm() gets called, adjusts some pointers,
and ultimately passes the skb's to UDP. The queue is removed at that
point.
All the related skb's are chained off each other in the fragment list,
starting from the first packet of the triple.
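
To make that bookkeeping concrete, here is a minimal Python model of the
behavior just described. This is illustrative only -- the names and data
structure are my simplification, not the actual kernel code:

```python
# Illustrative model of the fragment-queue bookkeeping -- NOT kernel code.
# Fragments sharing the same key join one queue; once the byte range is
# complete, the queue is "reassembled" and removed, mirroring what
# ip_frag_queue() / ip_frag_reasm() do.

queues = {}  # (ip_id, saddr, daddr, proto) -> {"frags": {offset: bytes}, "total": length}

def frag_queue(key, offset, payload, more_fragments):
    q = queues.setdefault(key, {"frags": {}, "total": None})
    q["frags"][offset] = payload
    if not more_fragments:                 # the last fragment fixes the total length
        q["total"] = offset + len(payload)
    if q["total"] is not None:
        have = sum(len(p) for p in q["frags"].values())
        if have == q["total"]:             # all fragments in: "ip_frag_reasm()"
            del queues[key]
            return b"".join(p for _, p in sorted(q["frags"].items()))
    return None   # queue stays in the hash table, waiting (up to ipfrag_time)
```

A 4KB request then arrives as three calls to frag_queue() with the same
key; only the third returns the reassembled datagram. A queue whose
fragments never all arrive simply stays in the table.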

In a perfect world, all the fragments in the queue come in and the
queues themselves go in and out of the hash table just fine. But in
our world, some fragments don't make it.
The result is that some old queues holding some, but not all, of
their fragments are still hanging around in the hash table. There is
now a danger of later fragments joining up with an old queue (and its
old fragments).
How can that be? The hash is calculated based on the IP ID, source
address, destination address, and protocol type. For the testing
we're doing, we're using the same client and same server, and it's
always (NFS on) UDP. And we do this for several days. So the only
variable is the 16-bit IP ID. So obviously re-use or collision is a
big possibility.
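
The collision window can be seen with a toy version of that hash. The
mixing function below is illustrative (the real 2.4 hash, ipqhashfn,
differs in detail), but it takes the same four inputs the text lists,
and any function of those inputs repeats once the 16-bit ID wraps:

```python
# Toy hash over the inputs named above: IP ID, source address,
# destination address, protocol. Illustrative only -- the point is that
# a wrapped ID reproduces the stale queue's key exactly.
IPQ_HASHSZ = 64

def ipq_hash(ip_id, saddr, daddr, proto):
    return (ip_id ^ saddr ^ daddr ^ (proto << 8)) % IPQ_HASHSZ

saddr, daddr, proto = 0x0A000001, 0x0A000002, 17   # 10.0.0.1 -> 10.0.0.2, UDP
old_id = 5 & 0xFFFF                # ID of a stale, half-filled queue
new_id = (5 + 65536) & 0xFFFF      # same counter after a full 16-bit rollover

assert old_id == new_id            # identical IP ID on the wire...
assert ipq_hash(old_id, saddr, daddr, proto) == \
       ipq_hash(new_id, saddr, daddr, proto)
# ...so a fresh fragment lands in the same bucket AND matches the stale
# queue's (id, saddr, daddr, proto) tuple exactly.
```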

How do we get rid of those old queues a priori (i.e., before they
cause problems)?
In the 2.4.20 kernel, there is no eviction based on age.
There is eviction only based on using too much memory, over a
specified ceiling.
In other words, the old queues can stay there forever until a (bad)
collision.
However, there is one escape clause.
Each queue has a time limit, stored in the qp structure.
Right now it's 30 seconds, a historical number from when GbE was not
available.
The good news is that it's a user-tunable parameter. See
/proc/sys/net/ipv4/ipfrag_time.
I've played with it a little bit, and 7 seconds seems to eliminate the
problem.
The queue is used only between the two routines: ip_frag_queue() and
ip_frag_reasm().
Each queue only needs to be alive until all its fragments come in.
How much time does the queue really need to stay alive?

All of the above was learned by tracing the two routines mentioned
and, of course, by capturing an Ethereal trace on the wire.

At any rate, I've been running with frag time of 7 secs for about 3
hours and am not seeing the problem.
I'm going to run this over the weekend. But it's certainly a good sign.

How does all the above relate to the symptoms we've seen so far?
1. The corrupted buffer is a re-use of an earlier buffer. This relates
to the old queue with an old fragment.
    When the collision occurs between an old queue and a new packet,
the queue becomes "complete"
    and ip_frag_reasm() pushes it off to UDP to be processed.
2. The problem requires 1GbE NIC's to reproduce. This means the faster
rate of incoming packets causes a faster
    turnover of the IP ID. This raises the possibility of a collision.
3. The problem does not appear on the 2.6 kernel. The 2.6 kernel also
has the LRU eviction algorithm that
    was introduced in 2.4.21. But then again, there are countless
enhancements and bug fixes to 2.6.
4. The problem occurs in our environment (what about the rest of the
world?). Our test environment has the same server,
    same client, and same protocol running for several days. The only
variable is the 16-bit IP ID.
    Therefore, there is a great chance for the collision to occur.
    Another contributing factor is that the packets are all the same
size. Therefore, the queues all look the same,
    e.g. always split up into three fragments, and always the same
length. This raises the chances that old and new fragments fit
together.
5. There is a timing issue such that if the turnaround time is slow
enough, then the problem never appears.
    This might be related to #2 above. Slower turnaround probably
raises the chances that queues get all their fragments and do not end
up as old ghosts hanging around until they're fulfilled.
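
A back-of-envelope calculation supports points 2 and 4. The numbers
below are my assumptions, not measurements: full 1 Gb/s line rate and
one IP ID consumed per 4KB NFS-over-UDP datagram, ignoring framing
overhead. Even so, the scale of the result shows why the default
30-second queue lifetime is far too long at gigabit speeds:

```python
# Back-of-envelope: how fast can the 16-bit IP ID space wrap at GbE?
# Assumptions (illustrative, not measured): full 1 Gb/s line rate, one
# IP ID per 4 KB NFS-over-UDP datagram, framing overhead ignored.
LINE_RATE_BPS = 125_000_000        # 1 Gb/s expressed in bytes per second
DATAGRAM_SIZE = 4096               # bytes per NFS request (three fragments)

ids_per_second = LINE_RATE_BPS / DATAGRAM_SIZE    # ~30,500 IDs/s
rollover_seconds = 65536 / ids_per_second         # ~2.1 s at line rate

assert rollover_seconds < 30   # a queue living the default 30 s can span
                               # several complete ID rollovers
```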

Other notes:
1. In the 2.4.21 kernel, there is now an LRU eviction algorithm. 
    I don't know if it will totally take care of the issue I'm dealing
with.
    Patrick McHardy, owner of the patch, doesn't think it will.
    2.4.21 is used in the RH EL3 series.
2. Kicking out an old queue results in an ICMP error message between
the two systems.
    I have yet to find out the real consequence of this.
    It's an old buffer, so wouldn't the reader have already re-sent a
request because it never got the good packets?
    So even if the old queue did get its proper fragments, processing
it would have been useless anyway, because the reader probably
already got its data via the re-sent request.

Some questions:
1. Manipulating the frag time is probably OK for our test environment.
But what is the proper thing to do
    outside of a test environment?
2. Is this bug inherent in IP, such that one would say I really
should just use NFS over TCP?
    In other words, do I just ignore this bug?



Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Have a client (RH9) and a server (RH EL3 AS)
2. Have two similar files on the server
3. Mount the server's filesystem via NFS on UDP
4. Have the client read and compare the two files repeatedly.
   The test I use is a short program that just reads the two files
   via read(2).
   The reads use 4KB buffers.


    

Actual Results:  Most of the time the files are found to match.
Once in a while, they are found to mismatch.

Expected Results:  The two files should be found to be equal.

Additional info:
Comment 1 Ernie Petrides 2004-12-03 20:35:32 EST
Please provide the exact RHEL3 kernel version and network driver name
being used on the server.  Thanks.
Comment 2 mark 2004-12-06 12:16:02 EST
20041206
The server uses 2.4.21-15.ELsmp #1 and the e1000 driver.
The client uses 2.4.20-31.9smp #1 and the e1000 driver.
Comment 3 mark 2005-01-25 21:09:33 EST
I instrumented the kernel to measure the typical lifetime of a good IP
queue, and it measures about 227 microseconds (this includes
interference from instrumentation; the actual time may be shorter).
This means that 227 microseconds after the IP queue is created, all
three packets belonging to that queue come in and the queue is
removed.

I measured the time it takes for the IP ID to rollover, and it measures around 8
seconds (this includes interference from instrumentation; actual time may be
shorter). This means that if a queue is hanging around waiting for a packet to
arrive and it has been waiting for about eight seconds, there's a good chance
that the IP ID has rolled over and an incoming packet with the same packet ID
will join the queue. That is the error, because the new packet does not really
belong with the other packets in the queue -- the ID already rolled over and is
being reused. The max measured rollover time is about 144 seconds. 

Right now, we're just using the tcp option in the mount command to get around
this problem.
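
The two measurements above fit together arithmetically. As a simple
consistency check, using the numbers from this comment:

```python
# Consistency check on the numbers above: a stale queue can only be
# joined by a reused IP ID if the queue outlives the ID rollover.
IPFRAG_TIME_DEFAULT = 30   # seconds: the historical default queue lifetime
IPFRAG_TIME_TUNED   = 7    # seconds: the value that eliminated the corruption
ROLLOVER_MIN        = 8    # seconds: the shortest ID rollover measured here

assert IPFRAG_TIME_DEFAULT > ROLLOVER_MIN  # default: stale queues survive a rollover
assert IPFRAG_TIME_TUNED   < ROLLOVER_MIN  # tuned: queues die before the ID repeats
```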
Comment 4 David Miller 2005-01-25 23:17:42 EST
This is a known problem with IP fragmentation as specified by
the standards.  With a 16-bit ID tag, it rolls over quickly,
as you have seen.

There have been several attempts at a workaround for this, mostly
by folks in IBM's Linux group.  But all such attempts either cause
violations of RFCs or create more problems than they solve.

If you want reliable, non-corrupting NFS over gigabit, use TCP.
There is no way to really fix this inherent problem with IPv4
fragmentation.
