Bug 151097
Summary: | Default TCP retransmit timeout too fast on NFS mounts | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Chuck Lever <cel> |
Component: | util-linux | Assignee: | Steve Dickson <steved> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | buckh, kzak, menscher, mitch48 |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2005-626 | Doc Type: | Bug Fix |
Last Closed: | 2005-09-28 15:53:09 UTC | Type: | --- |
Bug Blocks: | 156320 |
Description
Chuck Lever
2005-03-14 21:26:27 UTC
One reason this problem has gone on for so long is that /proc/mounts does not display the actual timeo and retrans values in effect for an NFS mount point. As part of the fix for this bug, can we get support for displaying those mount options added to the NFS client's show_options method? I'm working on adding similar support in 2.6 mainline. Thanks!

What's the status of this issue? The problem can potentially result in data corruption, so we'd like a fix for this in the next update, if possible.

This affects sites with 'older' NFS appliances and file servers. Those using NIS maps for automount and the like would do well to set timeouts.

I'm having possibly related problems with this issue under RHEL 3, but using UDP. As previously mentioned, the UDP timeout is supposed to be 0.7 seconds, then double after each timeout up to a maximum of 60 seconds. The source contains the line "data.timeo = tcp ? 70 : 7;", which I take to mean that by default UDP has a 0.7 second timeout and TCP has a 7 second timeout. The problem is that this doesn't seem to be the case at all. I used tcpdump to get a packet capture that included some timeouts. The shocking thing is that the client is not waiting anywhere near 0.7 seconds for the RPC response. The wait is actually much shorter. The first timeout fluctuates a bit (latency in the packet capture makes it hard to be precise), but it is on the order of 0.07 seconds. I'm not sure, but maybe the order-of-magnitude shift is because we're using gigabit? Another possible factor is that we're using the SMP kernel. Either way, this is a serious issue, since a moderately loaded file server will frequently take more than 0.07 seconds to respond. I have not yet tested whether the timeo= option is respected, but I don't have high hopes given how quickly requests are timing out right now. Should I submit this as a separate bug? It's not clear to me whether it's the same bug or a different one.

Damian, short UDP timeouts are normal. RHEL 3 uses a request round-trip time estimator which can trim the timeouts pretty short. It will ignore the mount command line setting. I believe the lower bound was raised in later updates of RHEL 3 to address the same issue you are reporting, but I can't find the Bugzilla report where this is addressed. If you report this problem again, be sure to mention which update of RHEL 3 you are using.

My management is pressing me pretty hard on this issue, as it increases the potential for data corruption on NFS/TCP mounts that use the default timeout setting. When will we get a fix for this problem?

Yes, the short UDP timeouts are a result of the RTT estimator trimming the timeout to HZ/30. Recent kernels use HZ/10. I've submitted Bug 155313 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=155313) on this issue, since it appears to be separate from your bug.

We increased to retrans=10 in the meantime.

Should be fixed in util-linux-2.11y-31.8.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-626.html
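
For anyone checking the numbers quoted in this thread, here is a rough sketch of the arithmetic, not the actual mount or kernel code. It assumes timeo is counted in tenths of a second (as the "data.timeo = tcp ? 70 : 7;" line implies), that the timeout doubles on each retry up to the 60 second cap mentioned above, and that the RTT estimator floor is HZ divided by 30 or 10 using integer division. The helper names and the HZ value of 100 used in the note below are assumptions made for illustration only.

```c
#include <assert.h>

/* Default timeo as quoted from the mount source in this report:
 * "data.timeo = tcp ? 70 : 7;". The unit is tenths of a second. */
static int default_timeo(int tcp)
{
    return tcp ? 70 : 7;
}

/* Hypothetical helper: timeout (in tenths of a second) used for
 * transmission number `attempt` (0 = first try). The value doubles
 * after each timeout and is capped at max_timeo (600 = 60 seconds). */
static int backoff_timeo(int timeo, int attempt, int max_timeo)
{
    assert(timeo > 0 && attempt >= 0 && max_timeo >= timeo);
    int t = timeo;
    for (int i = 0; i < attempt && t < max_timeo; i++)
        t *= 2;
    return t > max_timeo ? max_timeo : t;
}

/* Hypothetical helper: lower bound on the estimated timeout, in
 * jiffies, computed as HZ / divisor with integer division. */
static int rtt_floor_jiffies(int hz, int divisor)
{
    assert(hz > 0 && divisor > 0);
    return hz / divisor;
}

/* Convert a jiffy count to seconds for a given HZ. */
static double jiffies_to_seconds(int jiffies, int hz)
{
    return (double)jiffies / hz;
}
```

With HZ assumed to be 100, rtt_floor_jiffies(100, 30) gives 3 jiffies (0.03 s) and rtt_floor_jiffies(100, 10) gives 10 jiffies (0.1 s). Both floors are well below the 0.7 s UDP default returned by default_timeo(0), which is consistent with the much shorter timeouts seen in the packet capture.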