448479 – NFS soft mount doesn't work as expected

Summary: NFS soft mount doesn't work as expected

Status:	CLOSED DUPLICATE of bug 204309
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	nfs-utils
Version:	4.6
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton

Reported:	2008-05-27 07:28 UTC by masanari iida
Modified:	2008-06-10 10:19 UTC
CC List:	2 users
Doc Type:	Bug Fix
Last Closed:	2008-06-10 10:19:42 UTC

Description masanari iida 2008-05-27 07:28:07 UTC

Description of problem:
An NFS client (RHEL4 U6) attempts to write a file while
the NFS server (RHEL4.4) is unavailable. In that
case, the NFS client should retry until the major timeout expires
and then return "server not responding".

For example, when mounted with -o soft -o timeo=10 -o retrans=3,
the major timeout should expire after about 7 seconds (1+2+4).
On a RHEL5 NFS client this works as expected.
On RHEL4, however, the timeout never expires and
no error message is ever displayed.
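The 7-second figure can be reproduced with a small calculation. This is a minimal sketch that mirrors the reporter's 1+2+4 arithmetic, assuming that timeo is given in tenths of a second, that the first interval equals timeo, that each later interval doubles, and that retrans intervals are summed. The real retransmission schedule depends on the kernel version and transport, and the function name major_timeout_seconds is hypothetical.

```python
def major_timeout_seconds(timeo_deciseconds: int, retrans: int) -> float:
    """Sum `retrans` doubling intervals starting at timeo.

    Hypothetical helper. Assumption (mirrors the report's 1+2+4 arithmetic):
    timeo is in tenths of a second and the interval doubles after each timeout.
    """
    interval = timeo_deciseconds / 10.0  # timeo is expressed in 1/10 s
    total = 0.0
    for _ in range(retrans):
        total += interval
        interval *= 2  # exponential backoff between attempts
    return total


if __name__ == "__main__":
    # Mount options from the report: -o soft -o timeo=10 -o retrans=3
    print(major_timeout_seconds(10, 3))  # 7.0
```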


Version-Release number of selected component (if applicable):
Kernel 2.6.9-67
nfs-utils-1.0.6-84.EL4

How reproducible:
Always

Steps to Reproduce:
1. Mount the NFS export on the NFS client.
# mount -o soft -o timeo=10 -o retrans=3  nfs_server:/export/dir  /mnt/nfs

2. Touch a file from the NFS client.
   This should succeed.

3. Stop the NFS service on the NFS server.

4. Touch another file from the NFS client.
   # strace -o output_file -t touch /mnt/nfs/abc123

   The NFS client never detects the timeout.

5. Restart the NFS service on the server, then list the
   /mnt/nfs directory: the abc123 file is there.
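Step 4 can also be scripted so that the elapsed time and the resulting errno are recorded directly instead of being read from strace. This is a minimal sketch assuming Python 3 on the client; it repeats the open()/close()/utime() sequence that touch performs in the strace output in this report. The path /mnt/nfs/abc123 comes from the steps; the function name timed_touch is hypothetical.

```python
import errno
import os
import time

# Path from the reproduction steps; adjust to the real mount point.
TARGET = "/mnt/nfs/abc123"


def timed_touch(path):
    """Hypothetical helper: touch `path`, return (elapsed_seconds, errno_or_None)."""
    start = time.monotonic()
    try:
        # Same flags as touch's open() call in the strace output of this report.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_NOCTTY | os.O_NONBLOCK, 0o666)
        os.close(fd)
        os.utime(path, None)  # set atime/mtime to the current time
        err = None
    except OSError as exc:
        err = exc.errno
    return time.monotonic() - start, err


if __name__ == "__main__":
    elapsed, err = timed_touch(TARGET)
    if err is None:
        print("succeeded after %.1f s" % elapsed)
    else:
        print("failed with %s after %.1f s" % (errno.errorcode.get(err, err), elapsed))
```

On a correctly behaving soft mount with the server stopped, the expectation from this report is a failure with EIO after roughly the major timeout.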

Actual results:
The NFS client never times out, even after the major timeout has passed.
The NFS client writes the file once the NFS server comes
back online.

strace example:

18:21:38.795910 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
18:21:38.795985 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
18:21:38.796143 _sysctl({{CTL_KERN, KERN_VERSION}, 2, 0x7fbffff680, 30, (nil), 0}) = 0
18:21:38.796551 brk(0)                  = 0x509000
18:21:38.796614 brk(0x52a000)           = 0x52a000
18:21:38.796736 open("/mnt5/touch4", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3

No error condition is reported here; the next syscall does not occur until about six minutes later.
(The NFS server was restarted at this point.)

18:27:53.995461 close(3)                = 0
18:27:53.996267 utime("/mnt5/touch4", NULL) = 0 
18:27:54.007194 exit_group(0)           = ?


Expected results:
The NFS client should report a timeout once the major timeout expires,
and the touch command should fail rather than write the file to the NFS server.

The following output is from RHEL5.1:
17:45:39 set_tid_address(0x2aaaaaac7fa0) = 2540
17:45:39 set_robust_list(0x2aaaaaac7fb0, 0x18) = 0
17:45:39 rt_sigaction(SIGRTMIN, {0x3185205350, [], SA_RESTORER|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigaction(SIGRT_1, {0x31852052a0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x318520de60}, NULL, 8) = 0
17:45:39 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
17:45:39 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
17:45:39 brk(0)                         = 0x14efc000
17:45:39 brk(0x14f1d000)                = 0x14f1d000
17:45:39 close(0)                       = 0
17:45:39 open("/mnt5/touch54", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error)
17:45:42 futimesat(AT_FDCWD, "/mnt5/touch54", NULL) = -1 EIO (Input/output error)
17:45:45 write(2, "touch: ", 7)         = 7
17:45:45 write(2, "cannot touch `/mnt5/touch54\'", 28) = 28
17:45:45 write(2, ": Input/output error", 20) = 20
17:45:45 write(2, "\n", 1)              = 1
17:45:45 close(1)                       = 0
17:45:45 exit_group(1)                  = ?

On the RHEL5 NFS client, the "cannot touch" failure message is displayed
about 6 seconds after the command starts, which is close to the
expected 7-second major timeout.
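The "about 6 seconds" figure is read off the strace -t timestamps, which mark the time each syscall starts. A minimal sketch for computing such intervals, assuming each line begins with an HH:MM:SS or HH:MM:SS.ffffff timestamp (both forms appear in the traces in this report) and that the trace does not cross midnight; the function names parse_ts and elapsed_between are hypothetical.

```python
from datetime import datetime


def parse_ts(line):
    """Hypothetical helper: datetime for the leading strace timestamp of `line`."""
    stamp = line.split(None, 1)[0]
    fmt = "%H:%M:%S.%f" if "." in stamp else "%H:%M:%S"
    return datetime.strptime(stamp, fmt)


def elapsed_between(first_line, last_line):
    """Seconds between two strace lines (assumes both are on the same day)."""
    return (parse_ts(last_line) - parse_ts(first_line)).total_seconds()


if __name__ == "__main__":
    start = '17:45:39 open("/mnt5/touch54", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1'
    end = '17:45:45 write(2, "touch: ", 7)         = 7'
    print(elapsed_between(start, end))  # 6.0
```

Because -t records start times, the interval from open() to the first write(2) covers both the open() and the futimesat() calls. Applied to the RHEL3 trace, the same calculation gives roughly 189 seconds between open() and the first error write.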

Additional info:

On a RHEL3 NFS client, at least an error was returned, although only
after more than 60 seconds of command execution:
17:07:09.602576 brk(0)                  = 0x8ba0000
17:07:09.602695 open("/mnt5/cc4", O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) = -1 EIO (Input/output error)
17:09:05.100872 utime("/mnt5/cc4", NULL) = -1 EIO (Input/output error)
17:10:18.600952 write(2, "touch: ", 7)  = 7
17:10:18.601267 write(2, "creating `/mnt5/cc4\'", 20) = 20
17:10:18.601507 write(2, ": Input/output error", 20) = 20
17:10:18.601761 write(2, "\n", 1)       = 1
17:10:18.602020 exit_group(1)           = ?

It takes a little longer to report the failure, but at least the
command fails as expected, so I think that is acceptable.
On RHEL4, however, the command does not fail at all.
As far as I can tell, this is a problem.

Comment 1 Jeff Layton 2008-06-09 12:38:36 UTC

I think U7 should have some patches that fix this. When I test with a U7
kernel, I get EIO back from both syscalls that touch the mount, using the
same mount options as you:

...
08:26:42.106601 close(3)                = 0 <0.000024>
08:26:42.106902 open("/mnt/rhel5/testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error) <22.786684>
08:27:04.895655 --- SIGWINCH (Window changed) @ 0 (0) ---
08:27:04.895781 utime("/mnt/rhel5/testfile2", NULL) = -1 EIO (Input/output error) <59.993695>
...

...so the timeouts look a little longer than expected, but the calls did
fail. I'm testing this on a fully virtualized (FV) Xen guest, where timing is
a little unreliable, so the syscall timings may not be exact.
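The <...> suffixes in the U7 trace are per-syscall durations of the kind strace's -T option appends. A minimal sketch that extracts them, assuming the duration is the last token on each (unwrapped) line; the function name syscall_durations is hypothetical. By that measure, open() took about 22.8 s and utime() about 60.0 s, roughly 82.8 s in total, compared with the 7 seconds the reporter expected.

```python
import re

# strace -T appends the time spent in the syscall as "<seconds>" at line end.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")


def syscall_durations(lines):
    """Hypothetical helper: return the -T durations (seconds) found in `lines`."""
    found = []
    for line in lines:
        match = DURATION.search(line)
        if match:
            found.append(float(match.group(1)))
    return found


if __name__ == "__main__":
    trace = [
        '08:26:42.106902 open("/mnt/rhel5/testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EIO (Input/output error) <22.786684>',
        "08:27:04.895655 --- SIGWINCH (Window changed) @ 0 (0) ---",
        '08:27:04.895781 utime("/mnt/rhel5/testfile2", NULL) = -1 EIO (Input/output error) <59.993695>',
    ]
    durations = syscall_durations(trace)
    print(durations, round(sum(durations), 6))
```

Lines without a duration, such as the SIGWINCH notice, are skipped.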

This is testing with:

kernel-smp-2.6.9-72.EL.jtltest.42.x86_64.rpm

...from my people page:

http://people.redhat.com/jlayton/

...though I think the patches that fix this are probably in all U7 kernels.
Could you test this someplace non-critical with either the kernels from my
people page or something else >= -72.EL and see if it's still reproducible?

Comment 2 masanari iida 2008-06-10 09:46:05 UTC

I have tested kernel-smp-2.6.9-72.EL.jtltest.42.i686.rpm
on the NFS client.
This new kernel works as expected:

- EIO is returned just after the command is executed.
- The write does not reach the NFS server after the server comes back into operation.
  (This is the expected behavior.)

Command executed:
14:27:18 execve("/bin/touch", ["touch"..., "/mnt/nfs/abc123"...], [/* 23 vars */]) = 0

(snip)

14:27:18 open("/mnt/nfs/abc123", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK|O_LARGEFILE, 0666) = -1 EIO (Input/output error)  <=
14:28:18 utime("/mnt/nfs/abc123", NULL) = -1 EIO (Input/output error)
14:29:18 open("/usr/share/locale/locale.alias", O_RDONLY) = 3
14:29:18 fstat64(3, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
14:29:18 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7cd5000

Thank you for your support.

Comment 3 Jeff Layton 2008-06-10 10:19:42 UTC

Good. I'm going to go ahead and close this as a dupe. One of the patches
included in U7 cleaned up soft task error handling and I believe that's what
corrects this.


*** This bug has been marked as a duplicate of 204309 ***
