Bug 448479

Summary: NFS soft mount doesn't work as expected
Product: Red Hat Enterprise Linux 4
Reporter: masanari iida <masanari_iida>
Component: nfs-utils
Assignee: Jeff Layton <jlayton>
Status: CLOSED DUPLICATE
QA Contact:
Severity: medium
Docs Contact:
Priority: low
Version: 4.6
CC: staubach, steved
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-06-10 10:19:42 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description masanari iida 2008-05-27 07:28:07 UTC
Description of problem:
An NFS client (RHEL4 U6) attempts to write a file while
the NFS server (RHEL4.4) is unavailable. In that case,
the NFS client should retry until the major timeout expires,
and then return "server not responding".

For example, when mounting with -o soft, -o timeo=10, -o retrans=3,
the major timeout should expire in 7 sec (1+2+4).
On a RHEL5 NFS client, this works as expected.
But on RHEL4, the timeout never expires, and
no error message is ever displayed.
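
The 7 sec figure can be reproduced with a small calculation. This is only a sketch of the arithmetic used in this report, not the kernel's actual logic: it assumes timeo is given in tenths of a second, that each retry interval doubles, and that an interval is capped at 60 sec. The function name and the cap parameter are made up for illustration, and the exact number of intervals counted before a major timeout may differ between kernel versions and transports.

```python
def major_timeout_seconds(timeo_deciseconds, retrans, max_interval=60.0):
    """Sum the retry intervals under the model described in this report.

    The first wait is timeo/10 seconds and each following wait doubles.
    max_interval is an assumed cap on a single wait, not a measured value.
    """
    interval = timeo_deciseconds / 10.0
    total = 0.0
    for _ in range(retrans):
        total += min(interval, max_interval)
        interval *= 2
    return total


# timeo=10, retrans=3 gives 1 + 2 + 4
print(major_timeout_seconds(10, 3))  # 7.0
```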


Version-Release number of selected component (if applicable):
Kernel 2.6.9-67
nfs-utils-1.0.6-84.EL4

How reproducible:
Always

Steps to Reproduce:
1. Mount the NFS export on the NFS client.
# mount -o soft -o timeo=10 -o retrans=3  nfs_server:/export/dir  /mnt/nfs

2. Touch a file from the NFS client.
   This should work OK.

3. Stop the NFS service on the NFS server.

4. Touch another file from the NFS client.
   # strace -o output_file -t touch /mnt/nfs/abc123

   The NFS client never detects the timeout.

5. Restart the NFS service on the server, then look in the
   /mnt/nfs directory. The abc123 file is there.
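
As an alternative to strace in step 4, the same check can be scripted. The following is a rough sketch, not part of the original test: it makes roughly the same open and utime calls that touch makes in the traces in this report, and reports how long they took and which error, if any, came back. The helper name timed_touch is hypothetical.

```python
import errno
import os
import time


def timed_touch(path):
    """Create or update `path` roughly the way touch(1) does, and time it.

    Returns (elapsed_seconds, error_name). error_name is None on success,
    or a symbolic errno name such as "EIO" if a call failed.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_NOCTTY | os.O_NONBLOCK
    start = time.monotonic()
    try:
        fd = os.open(path, flags, 0o666)
        os.close(fd)
        os.utime(path, None)  # set atime/mtime to the current time
        error_name = None
    except OSError as exc:
        error_name = errno.errorcode.get(exc.errno, str(exc.errno))
    return time.monotonic() - start, error_name
```

For example, calling timed_touch("/mnt/nfs/abc123") while the NFS service is stopped would be expected to return "EIO" after roughly the major timeout on a soft mount that behaves correctly, and to block until the server returns on an affected client. Unlike touch, this helper stops at the first failing call.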

Actual results:
The NFS client never times out, even after the major timeout has passed.
The NFS client writes the file once the NFS server comes
back online.

Strace example:

18:21:38.795910 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
18:21:38.795985 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
18:21:38.796143 _sysctl({{CTL_KERN, KERN_VERSION}, 2, 0x7fbffff680, 30, (nil), 0}) = 0
18:21:38.796551 brk(0)                  = 0x509000
18:21:38.796614 brk(0x52a000)           = 0x52a000
18:21:38.796736 open("/mnt5/touch4", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3

No error is reported here.
(The NFS server was restarted at this point.)

18:27:53.995461 close(3)                = 0
18:27:53.996267 utime("/mnt5/touch4", NULL) = 0 
18:27:54.007194 exit_group(0)           = ?


Expected results:
The NFS client should report a timeout once the major timeout expires.
The touch command should fail to write to the NFS server
if the NFS client hits the major timeout.

The following output is from RHEL5.1:
17:45:39 set_tid_address(0x2aaaaaac7fa0) = 2540
17:45:39 set_robust_list(0x2aaaaaac7fb0, 0x18) = 0
17:45:39 rt_sigaction(SIGRTMIN, {0x3185205350, [], SA_RESTORER|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigaction(SIGRT_1, {0x31852052a0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
17:45:39 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
17:45:39 brk(0)                         = 0x14efc000
17:45:39 brk(0x14f1d000)                = 0x14f1d000
17:45:39 close(0)                       = 0
17:45:39 open("/mnt5/touch54", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error)
17:45:42 futimesat(AT_FDCWD, "/mnt5/touch54", NULL) = -1 EIO (Input/output error)
17:45:45 write(2, "touch: ", 7)         = 7
17:45:45 write(2, "cannot touch `/mnt5/touch54\'", 28) = 28
17:45:45 write(2, ": Input/output error", 20) = 20
17:45:45 write(2, "\n", 1)              = 1
17:45:45 close(1)                       = 0
17:45:45 exit_group(1)                  = ?

On the RHEL5 NFS client, the failure message "cannot touch"
is displayed about 6 sec after the command starts.
This is almost the same as the expected 7 sec major timeout.
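
The 6 sec figure comes from the strace -t timestamps above. A small sketch of that subtraction follows; the helper name elapsed_seconds is made up, and it assumes whole-second HH:MM:SS timestamps taken on the same day, as in the RHEL5.1 trace.

```python
from datetime import datetime


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS timestamps taken on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# From the RHEL5.1 trace: the failing open() and the first error write().
print(elapsed_seconds("17:45:39", "17:45:45"))  # 6.0
```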

Additional info:

On a RHEL3 NFS client, at least an error message was returned,
60 sec after the command was executed.
17:07:09.602576 brk(0)                  = 0x8ba0000
17:07:09.602695 open("/mnt5/cc4", O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) = -1 EIO (Input/output error)
17:09:05.100872 utime("/mnt5/cc4", NULL) = -1 EIO (Input/output error)
17:10:18.600952 write(2, "touch: ", 7)  = 7
17:10:18.601267 write(2, "creating `/mnt5/cc4\'", 20) = 20
17:10:18.601507 write(2, ": Input/output error", 20) = 20
17:10:18.601761 write(2, "\n", 1)       = 1
17:10:18.602020 exit_group(1)           = ?

It takes a little longer to display the failure,
but at least the command failed as expected, so I think
that is OK.
In the RHEL4 case, it did not fail.
This is a problem, AFAICT.

Comment 1 Jeff Layton 2008-06-09 12:38:36 UTC
I think that U7 should have some patches to fix this. When I test this with a U7
kernel, I get an EIO error back on both syscalls that touch the mount. Using the
same mount options as you are:

...
08:26:42.106601 close(3)                = 0 <0.000024>
08:26:42.106902 open("/mnt/rhel5/testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error) <22.786684>
08:27:04.895655 --- SIGWINCH (Window changed) @ 0 (0) ---
08:27:04.895781 utime("/mnt/rhel5/testfile2", NULL) = -1 EIO (Input/output error) <59.993695>
...

...so a little longer on the timeouts than expected it looks like, but it did
fail. I'm testing this on a FV xen guest, and timing there is a little screwy I
think. So the syscall timing may not be perfect.

This is testing with:

kernel-smp-2.6.9-72.EL.jtltest.42.x86_64.rpm

...from my people page:

http://people.redhat.com/jlayton/

...though I think the patches that fix this are probably in all U7 kernels.
Could you test this someplace non-critical with either the kernels from my
people page or something else >= -72.EL and see if it's still reproducible?



Comment 2 masanari iida 2008-06-10 09:46:05 UTC
I have tested kernel-smp-2.6.9-72.EL.jtltest.42.i686.rpm
on the NFS client.
This new kernel works as expected.

- EIO is returned just after the command is executed.
- The write to the NFS server failed and was not completed after the NFS server came back into operation.
(This is the expected behavior.)

Command executed:
14:27:18 execve("/bin/touch", ["touch"..., "/mnt/nfs/abc123"...], [/* 23 vars */]) = 0

(snip)

14:27:18 open("/mnt/nfs/abc123", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK|O_LARGEFILE, 0666) = -1 EIO (Input/output error) <=
14:28:18 utime("/mnt/nfs/abc123", NULL) = -1 EIO (Input/output error)
14:29:18 open("/usr/share/locale/locale.alias", O_RDONLY) = 3
14:29:18 fstat64(3, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
14:29:18 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7cd5000

Thank you for your support.


Comment 3 Jeff Layton 2008-06-10 10:19:42 UTC
Good. I'm going to go ahead and close this as a dupe. One of the patches
included in U7 cleaned up soft task error handling and I believe that's what
corrects this.


*** This bug has been marked as a duplicate of 204309 ***