From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20040913 Firefox/0.10

Description of problem: GbE Network Data Corruption, IP Queues, ip_frag_queue()

I'm seeing data corruption while running a network test between a client and a server. The client is RedHat 9 (2.4.20-31.9) and the server is RH EL3 AS. The server has two identical files. The client compares the two files via NFS on UDP. Most of the time, the client sees two equal files. Occasionally a miscompare occurs. The corruption happens only occasionally but is deterministic. The corrupted data is a recycled portion of the stream that looks like an earlier portion of the file.

I've been running this test scenario for several years on different systems and NICs. It suddenly comes up now, seemingly because of the transition to GbE NICs. My fix at the moment is to decrease the number in /proc/sys/net/ipv4/ipfrag_time on the client side. I changed it from 30 seconds to 7 seconds (after trying other numbers in between). The rest of the text will explain why ...

A little background: the test we have does reads/writes in 4KB buffers. Each request gets broken down into three packets. Packets are called fragments in the IP layer. In ip_frag_queue(), fragments that share the same IP ID go into the same IP queue. The queue stays around until all the member fragments are in. When that happens, ip_frag_reasm() gets called, adjusts some pointers, and ultimately the skb's get passed up to UDP. The queue is removed at that point. All the related skb's hang off each other in the fragment list, starting from the first packet of the triple.

In a perfect world, all the fragments in each queue come in and the queues themselves go in and out of the hash table just fine. But in our world, some fragments don't make it. The result is that old queues holding some, but not all, of their fragments are left hanging around in the hash table.
There is now a danger of later fragments joining up with an old queue (and its old fragments). How can that be? The hash is calculated from the IP ID, source address, destination address, and protocol type. For the testing we're doing, it's the same client, the same server, and always (NFS on) UDP, and we run this for several days. So the only variable is a 16-bit IP ID, and re-use or collision is a big possibility.

How do we get rid of those old queues a priori (i.e., before they cause problems)? In the 2.4.20 kernel, there is no eviction based on age. There is eviction only when fragment memory use goes over a specified ceiling. In other words, old queues can stay there forever, until a (bad) collision.

However, there is one escape clause. Each queue has a time limit, stored in the qp structure. Right now it's 30 seconds, a historical number from when GbE was not available. The good news is that it's a user-tunable parameter: see /proc/sys/net/ipv4/ipfrag_time. I've played with it a little, and 7 seconds seems to eliminate the problem. The queue is used only between two routines, ip_frag_queue() and ip_frag_reasm(), and only needs to be alive until all its fragments come in. How much time does the queue really need to be alive?

All the above was learned by tracing those two routines and, of course, by running an Ethereal trace on the wire. At any rate, I've been running with a frag time of 7 seconds for about 3 hours and am not seeing the problem. I'm going to run this over the weekend, but it's certainly a good sign.

How does all the above relate to the symptoms we've seen so far?

1. The corrupted buffer is a re-use of an earlier buffer. This matches the old queue holding an old fragment: when the collision occurs between an old queue and a new packet, the queue becomes "complete" and ip_frag_reasm() pushes it off to UDP to be processed.

2. The problem requires 1GbE NICs to reproduce.
This means the faster rate of incoming packets causes faster turnover of the IP ID, which raises the possibility of a collision.

3. The problem does not appear on the 2.6 kernel. The 2.6 kernel also has the LRU eviction algorithm that was introduced in 2.4.21, but then again, there are countless other enhancements and bug fixes in 2.6.

4. The problem occurs in our environment (what about the rest of the world?). Our test environment has the same server, same client, and same protocol running for several days, so the only variable is the 16-bit IP ID, and there is a great chance for a collision to occur. Another contributing factor is that the packets are all the same size, so the queues all look alike: always split into three fragments, always the same length. This raises the chances that old and new fragments fit together cleanly.

5. There is a timing issue such that if the turnaround time is slow enough, the problem never appears. This might be related to #2 above: slower turnaround probably raises the chances that queues get all their fragments and do not end up as old ghosts hanging around the house until they're fulfilled.

Other notes:

1. In the 2.4.21 kernel, there is now an LRU eviction algorithm. I don't know if it will totally take care of the issue I'm dealing with; Patrick McHardy, owner of the patch, doesn't think it will. 2.4.21 is used in the RH EL3 series.

2. Kicking out an old queue results in an ICMP error message between the two systems. I have yet to find out the real consequence of this. It's an old buffer, so wouldn't the reader have already re-sent its request because it never got the good packets? Even if the old queue did eventually get its proper fragments, processing it would have been useless anyway, because the reader probably already got its data from the re-sent request.

Some questions:

1. Manipulating the frag time is probably OK for our test environment.
But what is the proper thing to do outside of a test environment?

2. Is this bug inherent in IP, such that one would say I really should just use NFS over TCP? In other words, do I just ignore this bug?

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. Have a client (RH9) and a server (RH EL3 AS)
2. Have two identical files on the server
3. Mount the server's filesystem via NFS on UDP
4. Have the client read and compare the two files repeatedly. The test I use is a short program that just reads the two files via read(2), using 4KB buffers.

Actual Results: Most of the time the files are found to match. Once in a while, they are found to mismatch.

Expected Results: The two files should always be found to be equal.

Additional info:
Please provide the exact RHEL3 kernel version and network driver name being used on the server. Thanks.
20041206 The server uses 2.4.21-15.ELsmp #1 and the e1000 driver. The client uses 2.4.20-31.9smp #1 and the e1000 driver.
I instrumented the kernel to measure the typical lifetime of a good IP queue, and it measures about 227 microseconds (this includes interference from the instrumentation; the actual time may be shorter). This means that about 227 microseconds after the IP queue is created, all three packets belonging to that queue have come in and the queue is removed.

I also measured the time it takes for the IP ID to roll over, and it measures around 8 seconds (again including instrumentation interference; the actual time may be shorter). This means that if a queue has been hanging around waiting for a packet for about eight seconds, there's a good chance the IP ID has rolled over and an incoming packet with the same IP ID will join the queue. That is the error: the new packet does not really belong with the other packets in the queue, because the ID has already rolled over and is being reused. The maximum measured rollover time is about 144 seconds.

Right now, we're just using the tcp option in the mount command to get around this problem.
This is a known problem with IP fragmentation as specified by the standards. With a 16-bit ID tag, it rolls over quickly, as you have seen. There have been several attempts at a workaround, mostly by folks in IBM's Linux group, but all such attempts either violate the RFCs or create more problems than they solve. If you want reliable, non-corrupting NFS over gigabit, use TCP. There is no way to really fix this inherent problem with IPv4 fragmentation.