Bug 198565
Description
Toby Bereznak
2006-07-11 22:30:54 UTC
Created attachment 132270 [details]
script to copy files and diff them repeatedly
Could you please post the oops output?

Created attachment 132943 [details]
simple script to test with
Created attachment 132944 [details]
original binary file
Created attachment 132945 [details]
differing binary file after copy on NFS filesystem
Also, I'm removing one of the two 'script' attachments because they are identical.
Just curious... Does the same problem happen with TCP mounts?

Copying files on TCP mounts works fine; we tried it.

Slower, though...

With busy networks you really want to use TCP, since it knows how to deal with congestion much better than RPC over UDP will... Now I'm a bit surprised that TCP is not comparable to UDP, since with UDP I'm sure you're getting tons and tons of retransmits, which in turn just adds even more congestion to an already busy network... To prove this, simply run 'nfsstat -rc' using both UDP and TCP. You will see that the number of 'retrans' is much smaller with TCP than with UDP...

This problem seems to be related to our parallel processing system. When the nodes (usually ~8) are done processing, they copy the data back to an NFS-mounted filesystem. We are now using these mount options on the client nodes: '-o soft,intr,timeo=20,retrans=20,rsize=65536,wsize=65536,nfsvers=3,tcp'. With a low timeout of 1 (timeo=1) this bug can typically be reproduced in under 10 minutes. It happens even when using TCP mounts.

(In reply to comment #7)
> Copying files on TCP mounts works fine; we tried it.
> Slower, though...

This actually FAILS, although there is more success with TCP.

Try turning off soft mounts...

I'm Toby's supervisor, and we thought it would help if I weighed in at this point, since we are feeling rather desperate. To answer your latest question: we tried hard mounts between two machines and ran our standard copy test with timeo=1 to try to make it fail. It ran successfully for at least an hour, whereas soft mounts would fail within 5 minutes. But when we switched all of our machines to hard mounts a few days ago, with the setting timeo=25, our users still got data corruption. It's hard to say whether this was more or less frequent than with soft mounts, but one occurrence was quite severe, with over 30 glitches in a 2 GB file.
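As background on why a very low timeo makes soft mounts fragile: timeo is measured in tenths of a second, and per nfs(5) the client roughly doubles the timeout on each retransmission until a major timeout occurs after 'retrans' retries; on a soft mount, a major timeout returns an I/O error to the application. A minimal sketch of the total retry budget for the timeo=1, retrans=10 reproduction settings, assuming pure doubling and ignoring the 60-second per-retry cap:

```shell
timeo=1      # first timeout, in tenths of a second (i.e. 100 ms)
retrans=10   # retries before a soft mount gives up with an I/O error
total=0
t=$timeo
for i in $(seq 0 $retrans); do
  total=$(( total + t ))   # wait this long, then retransmit (or fail)
  t=$(( t * 2 ))           # timeout doubles on each retransmission
done
echo "soft mount gives up after ~$(( total / 10 )).$(( total % 10 )) s"
# prints: soft mount gives up after ~204.7 s
```

The point is that the very first timeout fires after only 100 ms, so on a busy network slow-but-healthy replies get treated as losses, triggering retransmit storms and, on soft mounts, premature I/O errors back to the copy.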
During this time there were a number of messages like this in the server log:

Jan 4 11:24:50 simba kernel: RPC: bad TCP reclen 0x08020703 (large)
Jan 4 11:24:50 simba kernel: RPC: bad TCP reclen 0x7902dc02 (large)
Jan 4 11:24:50 simba kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Jan 4 11:24:50 simba last message repeated 3 times
Jan 4 11:25:01 simba kernel: RPC: bad TCP reclen 0x40038d02 (non-terminal)
Jan 4 11:25:01 simba kernel: RPC: bad TCP reclen 0x5302b302 (non-terminal)
Jan 4 11:25:01 simba kernel: RPC: bad TCP reclen 0x0b02ff02 (large)

We are currently trying some parameter changes to see if they help:

echo '8388608' > /proc/sys/net/core/rmem_default
echo '8388608' > /proc/sys/net/core/rmem_max
echo '8388608' > /proc/sys/net/core/wmem_default
echo '8388608' > /proc/sys/net/core/wmem_max
echo '32768 65536 8388608' > /proc/sys/net/ipv4/tcp_rmem
echo '32768 65536 8388608' > /proc/sys/net/ipv4/tcp_wmem
echo '8388608 8388608 8388608' > /proc/sys/net/ipv4/tcp_mem

To recap: we can reproduce this bug within 5 minutes over an NFS connection between two workstations, with two intervening gigabit switches, by running the test script while continuously copying a large file or directory tree (the copy occasionally gives Input/Output errors as well). The parameters that give rapid failure are:

soft,intr,timeo=1,retrans=10,rsize=65536,wsize=65536,nfsvers=3

The mounts are all done with automount. The failures occur with every kernel past 2.6.14, and with TCP as well as UDP. Increasing timeo to 10 or higher greatly reduces the failure rate: the simple test will not fail, but our users still get data corruption when the network is busy. The test also does not fail quickly with hard mounts, but there is still corruption at times. As I said, we're getting desperate.
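The 'bad TCP reclen' values in the server log above can be sanity-checked by hand. Per the ONC RPC record-marking scheme (RFC 1831), each record fragment on a TCP stream is preceded by a 4-byte marker whose high bit flags the last fragment and whose low 31 bits give the fragment length. A sketch decoding one of the logged values:

```shell
marker=$(( 0x40038d02 ))          # one of the values from the server log
last=$(( (marker >> 31) & 1 ))    # high bit: last-fragment flag
len=$(( marker & 0x7fffffff ))    # low 31 bits: fragment length in bytes
echo "last-fragment=$last length=$len"
# prints: last-fragment=0 length=1073974530
```

A legitimate fragment here should be on the order of wsize (64 KB) plus headers, so a claimed non-terminal fragment of roughly 1 GB means the marker bytes themselves are garbage, i.e. the TCP stream the server is parsing has already been corrupted by the time nfsd reads it.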
We're trying a few more things (including a new main switch), but will then have to go back to the 2.6.14 kernel, which may mean we are effectively stuck at Fedora 4 until this is resolved. Current NFS mount options are:

-hard,intr,timeo=35,retrans=35,rsize=65536,wsize=65536,nfsvers=3,tcp

The corruption could be due to lower-level network corruption. So would it be possible to get a packet trace when the corruption happens? Something similar to:

tethereal -w /tmp/bz198565.pcap host <server> ; bzip2 /tmp/bz198565.pcap

What I'm looking for is TCP checksum errors, TCP retransmissions, or other TCP errors. If these types of errors are indeed happening, then your network is dropping packets, which could be the cause of the corruption...

Created attachment 145104 [details]
Image showing glitch of inserted bytes then real image data out of register
That image definitely looks messed up... but without the packet trace described in Comment #15 it's hard to tell what is happening.

We've discovered that the default NFS value for the protocol is TCP instead of UDP. The manpage states that it's UDP, and that's why we thought all along that we were using UDP mounts--but instead we were using TCP! Oops.

It looks like UDP with these options has been successful in avoiding data corruption:

-hard,intr,timeo=1,retrans=10,rsize=65536,wsize=65536,nfsvers=3,udp

Now we get lots of retrans in 'nfsstat -rc' on _some_ of our machines--things aren't perfect, and we would like to run NFS over TCP.

Another thing we need to correct from our earlier statements is the one about hard versus soft mounts. When we first tested hard mounts, we forced them to mount over UDP, thinking we were testing the worst case, and they didn't show corruption simply because they were using UDP.

Attached are two tethereal outputs taken while using the 'tcp' option. In each case the corruption occurred during the last 1-2 seconds of the tethereal output. They show the following TCP errors consistently throughout:

- [Unreassembled Packet [incorrect TCP checksum]]
- NFS [TCP ACKed lost segment] [TCP Previous segment lost]
- [TCP ZeroWindow] [TCP ACKed lost segment] [TCP Previous segment lost]

Created attachment 145845 [details]
tethereal output using nfs over TCP
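Since the manpage and the actual default disagreed here, it is worth confirming on the client which transport a mount is really using, and how many retransmissions it is seeing, rather than trusting the documentation. A minimal sketch of both checks; the /proc/mounts entry and nfsstat output below are hypothetical samples (on a live client, read /proc/mounts and run 'nfsstat -rc' instead):

```shell
# Hypothetical /proc/mounts entry for an NFS mount; real entries differ.
mounts_line='server:/export /mnt/data nfs rw,hard,intr,proto=tcp,timeo=35,retrans=35 0 0'
# Pull out the proto= option to see the transport actually in use.
proto=$(printf '%s\n' "$mounts_line" | tr ',' '\n' | sed -n 's/^proto=//p')
echo "transport in use: $proto"
# prints: transport in use: tcp

# Hypothetical 'nfsstat -rc' output; real counters will differ.
nfsstat_sample='Client rpc stats:
calls      retrans    authrefrsh
102348     4821       102360'
# The retrans counter is the second field of the data line.
retrans=$(printf '%s\n' "$nfsstat_sample" | awk 'NR==3 {print $2}')
echo "rpc retransmissions: $retrans"
# prints: rpc retransmissions: 4821
```

Running the retrans check under both UDP and TCP mounts gives a concrete number to back up the congestion comparison made earlier in this report.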
This report targets the FC3 or FC4 products, which have now been EOL'd. Could you please check whether it still applies to a current Fedora release, and either update the target product or close it? Thanks.

This problem has occurred with every release kernel past 2.6.14 that we have tested. It occurred in Fedora 5, and it occurs in Fedora 6 with the current update kernel.

Congratulations. The problem with data corruption under TCP appears to be solved with the latest Fedora kernel, 2.6.19-1.2911.6.5.fc6. The standard test that fails in 5-10 minutes ran for 4 hours without a problem. I did not test the previous 2.6.19 kernels.

Can we close this bug?

I've tested it for 50 minutes under the 2.6.20 kernel and it is OK there too. So yes, you can close the bug.