Description of problem:
An NFS client (RHEL4 U6) attempts to write a file while the NFS server (RHEL4.4) is unavailable. In that case the NFS client should keep retrying until the major timeout expires, then return "server not responding". For example, with mount options -o soft, -o timeo=10, -o retrans=3, the major timeout should expire in 7 sec (1+2+4). An NFS client on RHEL5 works as expected. On RHEL4, however, the timeout never expires and no error message is ever displayed.

Version-Release number of selected component (if applicable):
Kernel 2.6.9-67
nfs-utils-1.0.6-84.EL4

How reproducible:
Always

Steps to Reproduce:
1. Mount the NFS server from the NFS client.
   # mount -o soft -o timeo=10 -o retrans=3 nfs_server:/export/dir /mnt/nfs
2. Touch a file from the NFS client. This should work.
3. Stop the NFS service on the NFS server.
4. Touch another file from the NFS client.
   # strace -o output_file -t touch /mnt/nfs/abc123
   The NFS client never detects the timeout.
5. Restart the NFS service on the server, then look at the /mnt/nfs directory. The abc123 file is there.

Actual results:
The NFS client never times out, even after the major timeout has passed. The NFS client writes the file once the NFS server comes back online.

Strace example:
18:21:38.795910 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
18:21:38.795985 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
18:21:38.796143 _sysctl({{CTL_KERN, KERN_VERSION}, 2, 0x7fbffff680, 30, (nil), 0}) = 0
18:21:38.796551 brk(0) = 0x509000
18:21:38.796614 brk(0x52a000) = 0x52a000
18:21:38.796736 open("/mnt5/touch4", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3

I see no error condition here. (The NFS server was restarted at this point.)

18:27:53.995461 close(3) = 0
18:27:53.996267 utime("/mnt5/touch4", NULL) = 0
18:27:54.007194 exit_group(0) = ?

Expected results:
The NFS client should report a timeout once the major timeout expires. The touch command should fail to write to the NFS server if the NFS client hits major timeout expiration.
The following output is from RHEL5.1.

17:45:39 set_tid_address(0x2aaaaaac7fa0) = 2540
17:45:39 set_robust_list(0x2aaaaaac7fb0, 0x18) = 0
17:45:39 rt_sigaction(SIGRTMIN, {0x3185205350, [], SA_RESTORER|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigaction(SIGRT_1, {0x31852052a0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
17:45:39 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
17:45:39 brk(0) = 0x14efc000
17:45:39 brk(0x14f1d000) = 0x14f1d000
17:45:39 close(0) = 0
17:45:39 open("/mnt5/touch54", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error)
17:45:42 futimesat(AT_FDCWD, "/mnt5/touch54", NULL) = -1 EIO (Input/output error)
17:45:45 write(2, "touch: ", 7) = 7
17:45:45 write(2, "cannot touch `/mnt5/touch54\'", 28) = 28
17:45:45 write(2, ": Input/output error", 20) = 20
17:45:45 write(2, "\n", 1) = 1
17:45:45 close(1) = 0
17:45:45 exit_group(1) = ?

On the RHEL5 NFS client, the failure message "cannot touch" is displayed about 6 sec after the command starts. This is close to the 7 sec major timeout expiration.

Additional info:
On a RHEL3 NFS client, an error message was at least returned after 60 sec of command execution.

17:07:09.602576 brk(0) = 0x8ba0000
17:07:09.602695 open("/mnt5/cc4", O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) = -1 EIO (Input/output error)
17:09:05.100872 utime("/mnt5/cc4", NULL) = -1 EIO (Input/output error)
17:10:18.600952 write(2, "touch: ", 7) = 7
17:10:18.601267 write(2, "creating `/mnt5/cc4\'", 20) = 20
17:10:18.601507 write(2, ": Input/output error", 20) = 20
17:10:18.601761 write(2, "\n", 1) = 1
17:10:18.602020 exit_group(1) = ?

It takes a little longer to report the failure, but at least the command fails as expected, so I think that is OK. In the RHEL4 case the command did not fail at all. This is a problem, AFAICT.
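For reference, the major-timeout arithmetic quoted in the description (1+2+4 = 7 sec for timeo=10, retrans=3) can be sketched as below. This is only an illustration of the reporter's calculation: it assumes the wait simply doubles after each attempt with no upper cap, and the helper name major_timeout_seconds is hypothetical. The retransmission logic in a given kernel may differ.

```python
def major_timeout_seconds(timeo_deciseconds: int, retrans: int) -> float:
    """Sum the waits for `retrans` attempts, doubling the wait each time.

    Assumption: plain exponential backoff with no cap, matching the
    1+2+4 arithmetic in the report, not any particular kernel's code.
    """
    wait = timeo_deciseconds / 10.0  # the timeo mount option is in tenths of a second
    total = 0.0
    for _ in range(retrans):
        total += wait
        wait *= 2  # double the wait before the next attempt
    return total


if __name__ == "__main__":
    # timeo=10, retrans=3 as used in the mount command above: 1 + 2 + 4
    print(major_timeout_seconds(10, 3))
```

With the options from the report this yields 7 seconds, which is the figure the RHEL5 trace (about 6 sec to failure) is being compared against.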
I think that U7 should have some patches to fix this. When I test this with a U7 kernel, I get an EIO error back on both syscalls that touch the mount. Using the same mount options as you are:

...
08:26:42.106601 close(3) = 0 <0.000024>
08:26:42.106902 open("/mnt/rhel5/testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error) <22.786684>
08:27:04.895655 --- SIGWINCH (Window changed) @ 0 (0) ---
08:27:04.895781 utime("/mnt/rhel5/testfile2", NULL) = -1 EIO (Input/output error) <59.993695>
...

...so a little longer on the timeouts than expected it looks like, but it did fail. I'm testing this on a FV xen guest, and timing there is a little screwy I think. So the syscall timing may not be perfect.

This is testing with:

kernel-smp-2.6.9-72.EL.jtltest.42.x86_64.rpm

...from my people page:

http://people.redhat.com/jlayton/

...though I think the patches that fix this are probably in all U7 kernels.

Could you test this someplace non-critical with either the kernels from my people page or something else >= -72.EL and see if it's still reproducible?
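As an alternative to reading strace timestamps, the check requested above could be scripted roughly as follows. This is a sketch, not part of the original test procedure: the helper timed_create is hypothetical, and it only mirrors the open() call that touch makes in the traces (same flags and mode), reporting how long the call took and which errno, if any, came back.

```python
import errno
import os
import sys
import time


def timed_create(path):
    """Try to create `path`; return (elapsed_seconds, errno_name or None)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_NOCTTY | os.O_NONBLOCK
    start = time.monotonic()
    try:
        fd = os.open(path, flags, 0o666)
    except OSError as exc:
        # Report the symbolic errno name, e.g. "EIO" on a soft-mount timeout.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return time.monotonic() - start, name
    os.close(fd)
    return time.monotonic() - start, None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python timed_create.py /mnt/nfs/abc123
    elapsed, err = timed_create(sys.argv[1])
    print(f"{sys.argv[1]}: {err or 'ok'} after {elapsed:.1f}s")
```

Run against a file on the soft mount while the NFS service is stopped, a fixed kernel would be expected to report EIO after a bounded delay, while the behavior described in this bug would show the call blocking until the server returns.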
I have tested kernel-smp-2.6.9-72.EL.jtltest.42.i686.rpm on the NFS client. This new kernel works as expected.

- EIO is returned just after the command is executed.
- The write to the NFS server failed, even after the NFS server was back in operation. (This is expected behavior.)

Command executed:
14:27:18 execve("/bin/touch", ["touch"..., "/mnt/nfs/abc123"...], [/* 23 vars */]) = 0
(snip)
14:27:18 open("/mnt/nfs/abc123", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK|O_LARGEFILE, 0666) = -1 EIO (Input/output error) <=
14:28:18 utime("/mnt/nfs/abc123", NULL) = -1 EIO (Input/output error)
14:29:18 open("/usr/share/locale/locale.alias", O_RDONLY) = 3
14:29:18 fstat64(3, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
14:29:18 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7cd5000

Thank you for your support.
Good. I'm going to go ahead and close this as a dupe. One of the patches included in U7 cleaned up soft task error handling and I believe that's what corrects this. *** This bug has been marked as a duplicate of 204309 ***