From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050302 Firefox/1.0.1 Fedora/1.0.1-1.3.2

Description of problem:
If no timeo= option is specified when mounting an NFS file system with TCP, the mount command provides a default value of 0.7 seconds. This value may be appropriate for NFS over UDP, but it is far too aggressive for TCP, and can result in performance loss or data corruption. The correct default settings for NFS over TCP on 2.4 kernels should be timeo=600,retrans=2. Note that RHEL AS 2.1 also has this problem, but the RHEL 4 mount command should include the patches that were added to support NFSv4, which contain a fix for this issue.

Version-Release number of selected component (if applicable):
util-linux-2.11y-31.2

How reproducible:
Always

Steps to Reproduce:
1. Add a printk in the NFS client's mount logic to show the timeo
2. mount -o tcp
3. Look at the output of the printk

Actual Results:
The printk shows that the mount command passes in default timeo and retrans values, and that the timeo value is too small for NFS over TCP mounts.

Expected Results:
The mount command should either pass in no timeo value (in which case the NFS client will pick an appropriate default), or pass in a reasonable timeo value such as the one described above.

Additional info:
This is a critical problem for customers who use NFS over TCP.
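
The expected behaviour can be sketched in C roughly as follows. This is only an illustration, not the util-linux code: the struct, the sentinel NFS_OPT_UNSET, and the function names are hypothetical, and the UDP retrans value of 3 is an assumption that this report does not state. The TCP values are the ones proposed above (timeo=600, retrans=2), and timeo is taken to be in tenths of a second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel: the option was not given on the command line. */
#define NFS_OPT_UNSET (-1)

/* Hypothetical container for the two options discussed in this report. */
struct nfs_timeouts {
    int timeo;   /* initial timeout, in tenths of a second */
    int retrans; /* retransmissions before a major timeout */
};

/*
 * Fill in defaults only for options the user left unset.
 * TCP: timeo=600, retrans=2 (the values proposed in this report).
 * UDP: timeo=7 (0.7 s, per this report); retrans=3 is an assumption.
 */
void nfs_apply_default_timeouts(struct nfs_timeouts *t, bool tcp)
{
    assert(t != NULL);
    if (t->timeo == NFS_OPT_UNSET)
        t->timeo = tcp ? 600 : 7;
    if (t->retrans == NFS_OPT_UNSET)
        t->retrans = tcp ? 2 : 3;
}

/* Convert a timeo value (tenths of a second) to milliseconds. */
int nfs_timeo_to_ms(int timeo)
{
    return timeo * 100;
}
```

With both options unset and tcp true, this yields timeo=600 (60000 ms) and retrans=2; a timeo given explicitly by the user is left alone.
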
One reason this problem has gone on for so long is that /proc/mounts does not display the actual timeo and retrans values in effect for an NFS mount point. As part of the fix for this bug, can we get support for displaying those mount options added to the NFS client's show_options method? I'm working on adding similar support in 2.6 mainline. Thanks!
What's the status of this issue? The problem can potentially result in data corruption, so we'd like a fix for this in the next update, if possible.
This impacts those with 'older' NFS appliances and fileservers. Those with NIS maps for automount, etc., will do well to set timeouts explicitly.
I'm having possibly-related problems with this issue under RHEL3, but using UDP. As previously mentioned, the UDP timeout is supposed to be 0.7 seconds, then double repeatedly after each timeout up to a max of 60 seconds. Looking at the source shows the line "data.timeo = tcp ? 70 : 7;", which I take to mean UDP has a 0.7 second timeout, and TCP has a 7 second timeout, by default. Problem is, that doesn't seem to be the case at all. I used tcpdump to get a packet capture that included some timeouts. The shocking thing is that it's not waiting anywhere near 0.7 seconds for the RPC response. It's actually much shorter. The first timeout seems to fluctuate a bit (latency in packet capture makes it hard to be precise), but it's on the order of 0.07 seconds. I'm not sure, but maybe the order-of-magnitude shift is because we're using gigabit? Another possible issue is we're using the SMP kernel. Anyway, this is a serious issue, since a moderately loaded fileserver will frequently take more than 0.07 seconds to respond. I have not yet tested whether setting the timeo= option will be respected, but I don't have high hopes given how quickly it's timing out right now. Should I submit this as a separate bug? It's not clear to me whether it's the same bug or a different one.
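
For reference, the schedule this comment expects (start at 0.7 seconds, double after each timeout, cap at 60 seconds) can be sketched as below. It models only the expected behaviour described here, not what the RHEL 3 client was observed to do; the function name and the NFS_MAX_TIMEO constant are hypothetical, and values are in tenths of a second.

```c
#include <assert.h>
#include <stddef.h>

/* 60 s cap, in tenths of a second (the maximum mentioned in this comment). */
#define NFS_MAX_TIMEO 600

/*
 * Write the first n timeouts of a doubling backoff into out[],
 * starting from initial_timeo and capping at NFS_MAX_TIMEO.
 * Returns the number of entries written.
 */
size_t nfs_backoff_schedule(int initial_timeo, int *out, size_t n)
{
    assert(initial_timeo > 0);
    assert(out != NULL || n == 0);

    int timeo = initial_timeo;
    if (timeo > NFS_MAX_TIMEO)
        timeo = NFS_MAX_TIMEO;

    for (size_t i = 0; i < n; i++) {
        out[i] = timeo;
        /* Double, but never exceed the cap. */
        timeo = (timeo > NFS_MAX_TIMEO / 2) ? NFS_MAX_TIMEO : timeo * 2;
    }
    return n;
}
```

Starting from 7, the schedule is 7, 14, 28, 56, 112, 224, 448, 600, 600, that is 0.7 s doubling up to the 60 s cap. A first timeout near 0.07 s, as seen in the capture, does not fit this schedule.
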
damian- Short UDP timeouts are normal. RHEL 3 uses a request round-trip time estimator which can trim the timeouts pretty short. It will ignore the mount command line setting. I believe the lower bound was raised in later updates of RHEL 3 to address the same issue you are reporting, but I can't find the bugzilla report where this is addressed. If you report this problem again, be sure to mention which update of RHEL 3 you are using.
My management is pressing me pretty hard on this issue, as it increases the potential for data corruption on NFS/TCP mounts that use the default timeout setting. When will we get a fix for this problem?
Yes, the short UDP timeouts are a result of the RTT estimator trimming it to HZ/30. Recent kernels use HZ/10. I've submitted Bug 155313 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=155313) on this issue, since it appears to be separate from your bug. We increased to retrans=10 in the meantime.
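
To put numbers on the two floors, here is a rough sketch of the clamp. It is not the kernel's estimator code: both function names are hypothetical, and the HZ value used in the example is an assumption, since this thread does not say what HZ the RHEL 3 kernel is built with.

```c
#include <assert.h>

/*
 * Clamp an estimated retransmit timeout (in jiffies) to a lower
 * bound of hz / divisor. Integer division, as in kernel arithmetic.
 */
long rto_with_floor(long estimated_jiffies, long hz, long divisor)
{
    assert(hz > 0 && divisor > 0);
    long floor_jiffies = hz / divisor;
    return estimated_jiffies < floor_jiffies ? floor_jiffies : estimated_jiffies;
}

/* Convert jiffies to milliseconds for a given HZ (hypothetical helper). */
long sketch_jiffies_to_ms(long jiffies, long hz)
{
    assert(hz > 0);
    return jiffies * 1000 / hz;
}
```

Assuming HZ=100, a floor of HZ/30 is 3 jiffies (30 ms) and a floor of HZ/10 is 10 jiffies (100 ms). An estimate that is already above the floor passes through unchanged.
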
Should be fixed in util-linux-2.11y-31.8
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-626.html