Bug 198565

Summary: NFS/udp Data Corruption
Product: Fedora
Component: kernel
Version: 6
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Toby Bereznak <brez>
Assignee: Steve Dickson <steved>
QA Contact: Ben Levenson <benl>
CC: mast
Fixed In Version: 2.6.20
Doc Type: Bug Fix
Last Closed: 2007-03-22 15:16:28 UTC
Attachments:
- script to copy files and diff them repeatedly
- simple script to test with
- original binary file
- differing binary file after copy on NFS filesystem
- Image showing glitch of inserted bytes then real image data out of register
- tethereal output using nfs over TCP

Description Toby Bereznak 2006-07-11 22:30:54 UTC
Description of problem:
Data in binary files gets corrupted when transferring files
between machines via NFS over UDP while the network is under heavy load.
MTU: 1500 everywhere

Version-Release number of selected component (if applicable): kernel 2.6.15 and
above


How reproducible:


Steps to Reproduce:
Copy a binary file from/to an NFS mount over a busy Gigabit
network multiple times, and diff it with the original each time.
(to busy up the network either copy multiple files or use 'sudo ping -f -s 60000')


***Script to copy file and diff:
#!/bin/csh -f
# Script to test I/O on 2.6.15 kernel
# Give it a filename and an optional argument 's' to stop on error

set stop = 0
set file = $argv[1]
if ($#argv > 1) then
    if ($argv[2] == 's') set stop = 1
endif

while (1)
    \cp $file $file.copy
    diff $file $file.copy

    if ($status) then
        echo "ERROR AT   "`date`
        if ($stop) exit 1
    endif
end

****************
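For sites without csh, the same copy-and-diff loop can be sketched as a POSIX shell function. This is our own port, not from the report: the function name and the bounded iteration count (so it can run unattended) are additions; the stop-on-error flag 's' matches the csh script.

```shell
# Sketch: POSIX shell port of the csh test script above. The bounded
# iteration count is an addition; the original loops forever.
# Usage: nfs_copy_test FILE [ITERATIONS] [s]
nfs_copy_test() {
    file=$1
    iters=${2:-100}
    stop=${3:-}
    i=0
    while [ "$i" -lt "$iters" ]; do
        cp "$file" "$file.copy"
        # cmp -s compares byte-for-byte and stays quiet, like diff on
        # binary files
        if ! cmp -s "$file" "$file.copy"; then
            echo "ERROR AT $(date)"
            [ "$stop" = s ] && return 1
        fi
        i=$((i + 1))
    done
    rm -f "$file.copy"
}
```

Run against a file on the NFS mount under test, e.g. `nfs_copy_test /mnt/nfs/big.img 1000 s`.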

Actual results:  
Binary files differ


Expected results: 
Files should not differ


Additional info:
Was not fixed in kernel 2.6.17-1.2139_FC4

Comment 1 Toby Bereznak 2006-07-11 22:30:54 UTC
Created attachment 132270 [details]
script to copy files and diff them repeatedly

Comment 2 Steve Dickson 2006-07-24 18:22:45 UTC
Could you please post the oops output?

Comment 3 Toby Bereznak 2006-07-24 20:10:43 UTC
Created attachment 132943 [details]
simple script to test with

Comment 4 Toby Bereznak 2006-07-24 20:11:23 UTC
Created attachment 132944 [details]
original binary file

Comment 5 Toby Bereznak 2006-07-24 20:12:55 UTC
Created attachment 132945 [details]
differing binary file after copy on NFS filesystem

also I'm removing one of the 'script' attachments because they are the same

Comment 6 Steve Dickson 2006-07-25 03:41:20 UTC
Just curious... Does the same problem happen with TCP mounts?

Comment 7 Toby Bereznak 2006-07-25 16:40:02 UTC
copying files on TCP mounts works fine; we tried it

slower tho..

Comment 8 Steve Dickson 2006-07-25 19:59:15 UTC
With busy networks you really want to use TCP, since it knows how to
deal with congestion much, much better than RPC/UDP does....

Now I'm a bit surprised that TCP is not comparable to UDP, since
with UDP I'm sure you're getting tons of retransmits, which
in turn just adds even more congestion to an already busy
network... to prove this, simply run 'nfsstat -rc' using both UDP
and TCP. You will see the number of 'retrans' is much
smaller with TCP than with UDP...

Comment 9 Toby Bereznak 2006-09-14 16:53:44 UTC
This problem seems to be related to our parallel processing system.  When nodes
(~8 usually) are done processing they copy the data back to an nfs-mounted
filesystem.

We are now using mount options:
'-o soft,intr,timeo=20,retrans=20,rsize=65536,wsize=65536,nfsvers=3,tcp' 
on the client nodes.

Comment 10 Toby Bereznak 2006-10-06 23:57:07 UTC
When using a low timeout of 1 (timeo=1), this bug can typically be reproduced
in under 10 minutes.  It happens even when using TCP mounts.

Comment 11 Toby Bereznak 2006-10-09 19:07:46 UTC
(In reply to comment #7)
> copying files on TCP mounts works fine; we tried it
> 
> slower tho..

This actually FAILS, although the failures are less frequent with TCP

Comment 12 Steve Dickson 2006-12-07 01:45:49 UTC
Try turning off soft mounts... 

Comment 13 David Mastronarde 2007-01-05 18:41:14 UTC
I'm Toby's supervisor, and we thought it would help if I weighed in at this 
point, since we are feeling rather desperate.

To answer your latest question: we tried hard mounts between two machines and 
ran our standard copy test with timeo=1 to try to make it fail.  It ran 
successfully for at least an hour, whereas soft mounts would fail within 5 
minutes.

But when we switched all of our machines to hard mounts a few days ago, with 
the setting timeo=25, our users still got data corruption.  It's hard to say if 
this was any more or less frequent than with soft mounts, but one occurrence 
was quite severe with over 30 glitches in a 2 GB file.
During this time there were a number of messages like this in the server log:

Jan  4 11:24:50 simba kernel: RPC: bad TCP reclen 0x08020703 (large)
Jan  4 11:24:50 simba kernel: RPC: bad TCP reclen 0x7902dc02 (large)
Jan  4 11:24:50 simba kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Jan  4 11:24:50 simba last message repeated 3 times
Jan  4 11:25:01 simba kernel: RPC: bad TCP reclen 0x40038d02 (non-terminal)
Jan  4 11:25:01 simba kernel: RPC: bad TCP reclen 0x5302b302 (non-terminal)
Jan  4 11:25:01 simba kernel: RPC: bad TCP reclen 0x0b02ff02 (large)

We are currently trying some parameter changes to see if they help this:
echo '8388608' > /proc/sys/net/core/rmem_default
echo '8388608' > /proc/sys/net/core/rmem_max
echo '8388608' > /proc/sys/net/core/wmem_default
echo '8388608' > /proc/sys/net/core/wmem_max
echo '32768 65536 8388608' > /proc/sys/net/ipv4/tcp_rmem
echo '32768 65536 8388608' > /proc/sys/net/ipv4/tcp_wmem
echo '8388608 8388608 8388608' > /proc/sys/net/ipv4/tcp_mem
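A quick sanity check after setting those parameters is to read the core socket-buffer values back out of /proc; this sketch only reads, so it needs no privileges (an untuned machine will simply show values other than 8388608).

```shell
# Read back the core socket-buffer settings written above.
# Reading /proc/sys needs no root privileges.
for f in rmem_default rmem_max wmem_default wmem_max; do
    printf '%s = %s\n' "$f" "$(cat /proc/sys/net/core/$f)"
done
```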

To recap:
We can reproduce this bug within 5 minutes with an NFS connection between two 
workstations with two intervening gigabit switches, by running the test script 
while continuously copying a large file or directory tree (the copy 
occasionally gives Input/Output errors as well).  The parameters that give 
rapid failure are:
soft,intr,timeo=1,retrans=10,rsize=65536,wsize=65536,nfsvers=3
The mounts are all done with automount.

The failures occur with any kernel past 2.6.14.

The failures occur with tcp as well as udp.

Increasing the timeo to 10 or higher greatly reduces the failure rate: the 
simple test will not fail but our users still get data corruption if the 
network is busy.

The test also does not fail quickly with hard mounts but there is still 
corruption at some times.

As I said, we're getting desperate.  We're trying a few more things (including 
a new main switch) but will then have to go back to the 2.6.14 kernel, which 
may mean we are effectively stuck at Fedora 4 until this is resolved.

Comment 14 Toby Bereznak 2007-01-05 18:46:31 UTC
Current nfs mount options are
-hard,intr,timeo=35,retrans=35,rsize=65536,wsize=65536,nfsvers=3,tcp

Comment 15 Steve Dickson 2007-01-08 15:05:35 UTC
The corruption could be due to lower-level network corruption. So would it
be possible to get a packet trace when the corruption happens? Something
similar to:
    "tethereal -w /tmp/bz198565.pcap host <server> ; bzip2 /tmp/bz198565.pcap"

What I'm looking for is TCP checksum errors, TCP retransmissions, or other
TCP errors. If these types of errors are indeed happening, then your network
is dropping packets, which could be the cause of the corruption...
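Once such a capture exists, a display filter along these lines would pull out the retransmissions and lost segments being asked about. This is a sketch with several assumptions: tshark is the current name for tethereal, the -Y display-filter flag belongs to newer releases (older builds used -R), and the capture path simply mirrors the command above. The command is echoed rather than executed, since it needs a real capture file.

```shell
# Display filter for the TCP problems described above (retransmissions
# and lost segments). Echoed, not executed: drop the 'echo' once a real
# capture file exists.
filter='tcp.analysis.retransmission || tcp.analysis.lost_segment'
echo tshark -r /tmp/bz198565.pcap -Y "$filter"
```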



Comment 16 David Mastronarde 2007-01-08 20:51:32 UTC
Created attachment 145104 [details]
Image showing glitch of inserted bytes then real image data out of register

Comment 17 Steve Dickson 2007-01-09 14:41:39 UTC
That image definitely looks messed up... but without the packet 
trace described in Comment #15, it's hard to tell what is happening... 

Comment 18 Toby Bereznak 2007-01-17 18:36:49 UTC
We've discovered that the default NFS protocol is TCP, not UDP. 
The man page states that it's UDP, which is why we thought all along that we
were using UDP mounts, but we were actually using TCP!  Oops.

It looks like UDP with these options has been successful in avoiding data
corruption:
-hard,intr,timeo=1,retrans=10,rsize=65536,wsize=65536,nfsvers=3,udp
Now we get lots of retrans in 'nfsstat -rc' with _some_ of our machines--things
aren't perfect, and we would like to run NFS over TCP.
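For reference, a one-off mount using those options could be sketched as below. The server name and mount point are placeholders, not taken from this report, and the command is echoed rather than executed, since an actual mount needs root and a live NFS server.

```shell
# The UDP mount options reported above, as a one-off mount command.
# Server and mount point are placeholders; drop the 'echo' to run it
# for real (as root).
opts='hard,intr,timeo=1,retrans=10,rsize=65536,wsize=65536,nfsvers=3,udp'
echo mount -t nfs -o "$opts" fileserver:/export /mnt/data
```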

Another thing we need to correct from our earlier statements is the one about
hard versus soft mounts.  When we first tested the hard mounts we forced it to
mount UDP, thinking we were testing the worst case, and it didn't show
corruption simply because it was UDP.

Attached are two tethereal outputs taken while using the 'tcp' option.  In each
case the corruption occurred during the last 1-2 seconds of the tethereal output.
They show the following TCP errors consistently throughout the tethereal output:
-[Unreassembled Packet [incorrect TCP checksum]]
-NFS [TCP ACKed lost segment] [TCP Previous segment lost]
-[TCP ZeroWindow] [TCP ACKed lost segment] [TCP Previous segment lost]

Comment 19 Toby Bereznak 2007-01-17 18:40:01 UTC
Created attachment 145845 [details]
tethereal output using nfs over TCP

Comment 20 Christian Iseli 2007-01-20 00:21:26 UTC
This report targets the FC3 or FC4 products, which have now been EOL'd.

Could you please check whether it still applies to a current Fedora release,
and either update the target product or close it?

Thanks.

Comment 21 David Mastronarde 2007-01-22 15:35:27 UTC
This problem has occurred with every release kernel past 2.6.14 that we have
tested.  It occurred in Fedora 5 and it occurs in Fedora 6 with the current
update kernel.

Comment 22 David Mastronarde 2007-03-20 19:03:55 UTC
Congratulations.  The problem with data corruption under TCP appears to be
solved with the latest Fedora kernel, 2.6.19-1.2911.6.5.fc6.  The standard test
that fails in 5-10 minutes ran for 4 hours without a problem.  I did not test
the previous 2.6.19 kernels.

Comment 23 Steve Dickson 2007-03-21 17:24:23 UTC
Can we close this bug?

Comment 24 David Mastronarde 2007-03-21 23:54:33 UTC
I've tested it for 50 minutes under the 2.6.20 kernel and it is OK there too.
So yes, you can close the bug.