From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
server: RH9, kernel-2.4.20-18.9, nfs-utils-1.0.1-3.9
clients: RH8.0, kernel 2.4.20-18.8, fileutils-4.1.9-11
When clients cp large (>30MB) files within the NFS share, they get an input/output error.

Version-Release number of selected component (if applicable):
RH9 nfs-utils-1.0.1-3.9

How reproducible:
Sometimes

Steps to Reproduce:
1. Export a share (rw,sync).
2. Mount it on a client.
3. cp a large file within the share from the client.

Actual Results:
Sometimes the file is copied OK, but most of the time cp gets an input/output error and only part of the file is copied.

Expected Results:
The whole file should be copied.

Additional info:
Are you mounting with the NFS 'soft' option? Check out man 5 nfs. There seems to be something in kernel-2.4.20 that causes more minor timeouts. Maybe it's a feature - better timeout reporting or something. Another suggested workaround was to raise 'retrans' to a higher level to deal with it, say 20 (from 3).
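To get a feel for what raising 'retrans' buys you, here is a rough sketch of how long a soft mount over UDP would wait before giving up. It assumes the behaviour described in nfs(5) for kernels of this era: the first timeout is 'timeo' tenths of a second (default 7), each retransmission doubles it up to a 60 second cap, and a major timeout is reported after 'retrans' minor timeouts. The function name and the exact way timeouts are counted are assumptions for illustration, not taken from the kernel source.

```python
def seconds_until_major_timeout(timeo_ds=7, retrans=3, cap_s=60.0):
    """Sum the waits for `retrans` minor timeouts with doubling backoff.

    timeo_ds is in tenths of a second, as in the 'timeo' mount option.
    This is an illustrative model, not the kernel's actual accounting.
    """
    wait = timeo_ds / 10.0
    total = 0.0
    for _ in range(retrans):
        total += min(wait, cap_s)  # each wait is capped at cap_s seconds
        wait *= 2                  # back off: double the next timeout
    return total


if __name__ == "__main__":
    for retrans in (3, 20):
        total = seconds_until_major_timeout(retrans=retrans)
        print(f"retrans={retrans}: about {total:.1f} s before an I/O error")
```

Under these assumptions the default settings give up after roughly 5 seconds, while retrans=20 keeps trying for well over ten minutes, which is why the workaround hides the errors on a congested network.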
Steve, do you have an insight into the cause of the I/O errors? I don't argue with NOTABUG, 'cause it's a kernel bug if anything, but there might be some default mount options that would be appropriate.
Yes... Soft mounts are generally the reason for I/O errors. With soft mounts, the client's requests to the server are only tried once, which means that on a busy network (especially with UDP) packets are dropped or, more likely, delayed long enough that the client times out. On normal mounts (i.e. hard mounts), the request is retried (which generally works), but with soft mounts it is not.
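The difference between the two mount types can be sketched as a toy model. Nothing here is real NFS client code: DroppyServer, nfs_call and max_tries are hypothetical names, and the attempt limit is a parameter (nfs(5) ties it to the 'retrans' option). The point is only that a soft call gives up with EIO once its attempts run out, while a hard call keeps retrying until the server answers.

```python
import errno


class DroppyServer:
    """Hypothetical stand-in for a server whose first `drops` replies are lost."""

    def __init__(self, drops):
        self.drops = drops

    def request(self):
        if self.drops > 0:
            self.drops -= 1
            return None  # reply dropped or delayed past the timeout
        return "data"


def nfs_call(server, soft, max_tries=1):
    """Toy model: soft gives up after max_tries attempts, hard never gives up."""
    tries = 0
    while True:
        tries += 1
        reply = server.request()
        if reply is not None:
            return reply
        if soft and tries >= max_tries:
            # This is the error that cp reports as "Input/output error".
            raise OSError(errno.EIO, "Input/output error")


if __name__ == "__main__":
    print(nfs_call(DroppyServer(drops=2), soft=False))  # hard: retries until it works
    try:
        nfs_call(DroppyServer(drops=2), soft=True)
    except OSError as err:
        print("soft mount failed:", err.strerror)
```

In this model the same two lost replies are invisible to the hard mount but fatal to the soft one, which matches the partial copies in the original report.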
Right, but do you know if something changed in kernel 2.4.20 to significantly increase the soft timeouts? The issue at hand was that after applying the update that installed 2.4.20, soft mounts that were fine under the previous kernel started getting lots of I/O errors. It's been a while, but as I recall it, you could take the machine, boot into the previous kernel (2.4.18?) and do all the NFS with soft mounts you wanted, and it would be fine (on a given network). Rebooting into 2.4.20 and repeating the tests showed lots of I/O errors.
> Right, but do you know if something changed in kernel 2.4.20 to
> significantly increase the soft timeouts?

No, not that I'm aware of... but I know there was a lot of work done in the 2.4.21 (RHEL3/FC1) kernels on congestion control (actually, I'm pretty sure we increased the timeout a bit to deal with 64-bit machines), so you might want to try one of those kernels, since RH9 is (at this point) an unsupported release...