Bug 477951 - Kernel Hangs at __rpc_execute with NFS
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: i686 Linux
Priority: medium Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jeff Layton
QA Contact: Martin Jenner
Reported: 2008-12-26 05:09 EST by CAI Qian
Modified: 2009-01-05 06:59 EST (History)
Doc Type: Bug Fix
Last Closed: 2009-01-05 06:59:21 EST


Attachments:
/var/log/message with "echo [wtm] >/proc/sysrq-trigger" while hanging (802.79 KB, text/plain)
2008-12-26 05:14 EST, CAI Qian

Description CAI Qian 2008-12-26 05:09:32 EST
Description of problem:

Investigation so far:
* The latest RHEL 4.8 kernel, 2.6.9-78.23.EL, shows the same behaviour.
* This is not NFSv2-specific; NFSv4 is also affected.
* This is not a regression; both the RHEL 4.6 and 4.7 kernels are affected.
* I cannot reproduce this on anything other than IA-32 UP kernels.
* Changing the NFS export option from "sync" to "async" does not help.
* Even if the file is not listed before NFS is stopped, the system still hangs.

The following reproducer hangs forever with an IA-32 UP kernel:

# uname -ra
Linux dell-pe860-01.rhts.bos.redhat.com 2.6.9-78.0.12.EL #1 Thu Dec 18 07:11:33 EST 2008 i686 i686 i386 GNU/Linux

# cat export.sh 
#!/bin/sh -x
rm -rf /tmp/import /tmp/export
mkdir /tmp/import /tmp/export
date >/tmp/export/mydate
echo '/tmp/export *(rw,sync,no_root_squash)' >/etc/exports
service nfs restart
mount $(hostname):/tmp/export /tmp/import
ls -al /tmp/import/mydate
cat /tmp/import/mydate
service nfs stop
cat /tmp/import/mydate

+ rm -rf /tmp/import /tmp/export
+ mkdir /tmp/import /tmp/export
+ date
+ echo '/tmp/export *(rw,sync,no_root_squash)'
+ service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services:  [FAILED]
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: [  OK  ]
Starting NFS mountd: [  OK  ]
++ hostname
+ mount dell-pe860-01.rhts.bos.redhat.com:/tmp/export /tmp/import
+ ls -al /tmp/import/mydate
-rw-r--r--  1 root root 29 Dec 26 05:05 /tmp/import/mydate
+ cat /tmp/import/mydate
Fri Dec 26 05:05:14 EST 2008
+ service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon: [  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]
+ cat /tmp/import/mydate

As a result, the system is left in a bad state and cannot recover: /tmp/import can no longer be removed without a reboot, and any RPM installation appears to hang as well:

# rpm -ivh kernel-2.6.9-78.0.8.EL.i686.rpm --force
<hang ...>

I cannot even interrupt the reproducer with Ctrl-C. Several messages are printed on the serial console:

nfsd: last server has exited
nfsd: unexporting all filesystems
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
...

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.12.EL
kernel-2.6.9-78.0.8.EL
kernel-2.6.9-78.EL
kernel-2.6.9-67.0.22.EL
kernel-2.6.9-78.23.EL

How reproducible:
always

Steps to Reproduce:
1. Run the reproducer with an IA-32 UP kernel.
  
Actual results:
System hangs.

Expected results:
No hang.
Comment 1 CAI Qian 2008-12-26 05:14:20 EST
Created attachment 327853 [details]
/var/log/message with "echo [wtm] >/proc/sysrq-trigger" while hanging
Comment 2 Jeff Layton 2008-12-28 22:24:32 EST
Is this reproducible when you use a different machine for client and server? 

Having a server be a client of itself is a known problematic configuration. It's fine for testing when it works, but it won't always. When you have problems with such a setup, it's important to reproduce it with separate client and server.
Comment 3 CAI Qian 2008-12-28 23:17:40 EST
Yes, the behaviour is the same when using different machines for client and server.

Server side:
# mkdir /tmp/export
# date >/tmp/export/mydate
# echo '/tmp/export *(rw,sync,no_root_squash)' >/etc/exports
# service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services:  [FAILED]
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: [  OK  ]
Starting NFS mountd: [  OK  ]

Client side:
# mkdir /tmp/import
# mount z209.z900.redhat.com:/tmp/export /tmp/import
# ls -al /tmp/import/mydate
-rw-r--r--  1 root root 29 Dec 28 23:10 /tmp/import/mydate
# cat /tmp/import/mydate
Sun Dec 28 23:10:28 EST 2008

Server side:
# service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon: [  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]

Client side:
# cat /tmp/import/mydate
<hang...>
Comment 4 Jeff Layton 2008-12-29 13:51:32 EST
Wait...I misread the reproducer before.

This is expected behavior. You're shutting down the server and then doing some activity on the mount. The syscall should hang indefinitely at that point since this is a hard mount.

You said:

* I cannot reproduce this other than IA-32 UP kernels.

What happens on other machines here? (SMP or other arch)
Comment 5 CAI Qian 2008-12-29 22:41:05 EST
(In reply to comment #4)
> Wait...I misread the reproducer before.
> 
> This is expected behavior. You're shutting down the server and then doing some
> activity on the mount. The syscall should hang indefinitely at that point since
> this is a hard mount.
> 
> You said:
> 
> * I cannot reproduce this other than IA-32 UP kernels.
> 
> What happens on other machines here? (SMP or other arch)

Sorry, the information there is slightly incorrect. I originally saw the problem with NFSv4. That is to say, an IA-32 UP kernel behaves differently from SMP kernels or other architectures. Consider the following example.

Server side:
# uname -ra
Linux sun-v40z-01.rhts.bos.redhat.com 2.6.9-78.0.12.EL #1 Thu Dec 18 07:11:33 EST 2008 i686 athlon i386 GNU/Linux

# mkdir /tmp/export

# date >/tmp/export/mydate

# echo '/tmp/export   *(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)' >/etc/exports

# service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services:  [FAILED]
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: [  OK  ]
Starting NFS mountd: [  OK  ]

Client side:
# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# mkdir /tmp/import

# mount -t nfs4 sun-v40z-01.rhts.bos.redhat.com:/ /tmp/import

# cat /tmp/import/mydate 
Mon Dec 29 22:22:24 EST 2008

Server side:
# service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon: [  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]

Client side:
# cat /tmp/import/mydate &

# ps aux
...
root      5003  0.0  0.0 49924  444 pts/0    D    22:24   0:00 cat /mnt/import/mydate
...

As shown above, the process is in D (uninterruptible sleep) state. As a result, it is not possible to kill the process or unmount the filesystem. The only way I found to recover is to start the NFS server again, after which the process can be killed.

However, with SMP kernels or other architectures, it is in S (interruptible sleep) state.

Server side:
# uname -ra
Linux hp-xw4550-01.rhts.bos.redhat.com 2.6.9-78.0.12.ELsmp #1 SMP Thu Dec 18 07:23:42 EST 2008 i686 athlon i386 GNU/Linux

Client side:
# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# ps aux
...
root      4991  0.0  0.0 49924  444 pts/0    S    22:18   0:00 cat /tmp/import/mydate
...
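
For anyone comparing the two cases, a minimal sketch of how the process state could be checked is given here. The helper names describe_state and list_blocked are hypothetical and not part of any tool mentioned in this report; the only assumption is a procps-style ps that accepts -eo with the pid, stat, wchan and args columns.

```shell
#!/bin/sh
# describe_state is a hypothetical helper for illustration only.
# It maps the first character of a ps STAT field to a short description.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep" ;;
        S*) echo "interruptible sleep" ;;
        R*) echo "running or runnable" ;;
        T*) echo "stopped" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

# list_blocked is also hypothetical. It prints the ps header plus every
# process whose STAT starts with D, together with the kernel wait channel.
list_blocked() {
    ps -eo pid,stat,wchan,args | awk 'NR == 1 || $2 ~ /^D/'
}
```

Calling describe_state with the STAT value from the ps output above (D or S) prints the matching description, and list_blocked can be run on the client while the "cat" is hanging to see which kernel function it is sleeping in.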
Comment 6 CAI Qian 2008-12-29 23:46:25 EST
Interestingly, if I drop the "insecure" NFS server option from the above NFSv4 example, the "cat" process is in D state even when using kernels other than IA-32 UP on the server side.

Server side:
# uname -ra
Linux hp-xw4550-01.rhts.bos.redhat.com 2.6.9-78.0.12.ELsmp #1 SMP Thu Dec 18 07:23:42 EST 2008 i686 athlon i386 GNU/Linux

# echo '/tmp/export   *(rw,fsid=0,no_subtree_check,sync,no_root_squash)' >/etc/exports

Client side:
# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# ps aux
...
root     18803  0.0  0.0 49924  396 pts/0    D    23:36   0:00 cat /tmp/import/mydate
...

This only applies to NFSv4, though. With NFSv2, neither the "insecure" option nor kernels other than IA-32 UP make any difference: the "cat" process is always in an unrecoverable D state.
Comment 7 CAI Qian 2008-12-30 01:02:11 EST
Please disregard comment #5 and comment #6. After further investigation, the different behaviour of the IA-32 UP kernel with NFSv4 and the "insecure" option appears to be caused by the NFS server grace period. If I wait 100 seconds after the NFS server starts up, I no longer see the different behaviours.

In summary, a process accessing an NFSv2 mount is in uninterruptible sleep after the server shuts down, whereas a process accessing an NFSv4 mount is in interruptible sleep after the server shuts down. Is this expected?
Comment 8 Jeff Layton 2009-01-04 07:34:43 EST
Adding Steve and Peter in case they have opinions...

I see why this is occurring, but I'm not sure whether it's intended behavior or not. util-linux seems to make nfsv4 mounts default to "intr" unless "nointr" is explicitly specified. The reverse is true for nfsv2/3.

This also seems to be the case in RHEL5's mount helper. "intr" and "nointr" are considered deprecated upstream. This may be by design due to the stateful nature of NFSv4, but I'm not sure.

Given that this difference doesn't seem to be causing a problem, I'm inclined not to change it in either current RHEL release.

I suggest we mark this as NOTABUG but if you can make a case as to why this is a problem, I'm willing to listen.
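
To see which default a given mount ended up with, something like the following sketch might help. It assumes the client kernel lists "intr" in the options column of /proc/mounts when the flag is set, which should be verified on the kernel in question. The helper names has_mount_option and nfs_intr_report are hypothetical.

```shell
#!/bin/sh
# has_mount_option is a hypothetical helper.
# $1 = comma-separated option list, $2 = option name to look for.
# Wrapping both in commas keeps "intr" from matching "nointr".
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# nfs_intr_report is a hypothetical helper. It reads lines in /proc/mounts
# format on stdin and prints one line per nfs or nfs4 mount.
nfs_intr_report() {
    while read -r dev mountpoint fstype opts rest; do
        case "$fstype" in
            nfs|nfs4)
                if has_mount_option "$opts" intr; then
                    echo "$mountpoint $fstype intr=yes"
                else
                    echo "$mountpoint $fstype intr=no"
                fi
                ;;
        esac
    done
}
```

On a client this would be run as "nfs_intr_report < /proc/mounts". Passing "intr" or "nointr" explicitly with "mount -o" avoids depending on the per-version default in either case.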
Comment 9 CAI Qian 2009-01-05 06:44:15 EST
I have one question. Is it expected that the "cat" process stays in D state forever? If so, how does the client recover without rebooting or the server starting up again? For example, in the NFSv2 case,

# mount z209.z900.redhat.com:/tmp/export /tmp/import

doesn't it use the default timeo value of 7 (0.7 seconds)? However, after the server stopped, the following command seemed to hang forever,

# cat /tmp/import/mydate

Apart from this, I have no objection to closing it as NOTABUG.
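
As a rough illustration of what the 0.7 second value controls, here is a small sketch of the retransmission waits. It assumes the simplified model from the nfs(5) description of timeo for UDP mounts: the wait starts at timeo tenths of a second, doubles after each timeout, and is capped at 60 seconds. It ignores the major-timeout cascade that the man page also describes, and the helper name retrans_schedule is hypothetical.

```shell
#!/bin/sh
# retrans_schedule is a hypothetical helper for illustration only.
# $1 = initial timeo in tenths of a second, $2 = number of waits to print.
# Assumes each wait is double the previous one, capped at 600 (60 seconds).
retrans_schedule() {
    delay=$1 count=$2 i=0 out=""
    while [ "$i" -lt "$count" ]; do
        out="$out${out:+ }$delay"      # append, space-separated
        delay=$((delay * 2))
        if [ "$delay" -gt 600 ]; then
            delay=600
        fi
        i=$((i + 1))
    done
    echo "$out"
}

retrans_schedule 7 8
# prints: 7 14 28 56 112 224 448 600
```

Under this model the timeo value only sets how soon the first retry happens. It does not by itself bound how long the call blocks; whether the client ever gives up depends on the hard/soft mount option.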
Comment 10 Jeff Layton 2009-01-05 06:59:21 EST
> I have one question here. Is it expected that "cat" process was in D state forever?

Yes. See the explanation of the hard/soft options (and intr/nointr) in the nfs(5) manpage.

> If so, how does the client recover from it without rebooting or the
> server starting up again.

It doesn't.

I'll go ahead and close this as NOTABUG. Please reopen if you want to discuss it further.
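
For reference, a minimal sketch of how the hard/soft and intr choices discussed above could be spelled out on the mount command line. The helper nfs_mount_cmd is hypothetical and only prints the command rather than running it. The option names hard, soft, intr, timeo and retrans are the ones documented in nfs(5); the specific values timeo=7 and retrans=3 are illustrative assumptions.

```shell
#!/bin/sh
# nfs_mount_cmd is a hypothetical helper; it prints a mount command and
# does not execute it.
# Usage: nfs_mount_cmd MODE SERVER EXPORT MOUNTPOINT
#   hard  retry forever; the blocked call cannot be interrupted
#   intr  retry forever, but allow signals to interrupt the blocked call
#   soft  return an error to the caller after a major timeout
nfs_mount_cmd() {
    mode=$1 server=$2 export_path=$3 mountpoint=$4
    case "$mode" in
        hard) opts="hard" ;;
        intr) opts="hard,intr" ;;
        soft) opts="soft,timeo=7,retrans=3" ;;
        *)    echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "mount -t nfs -o $opts $server:$export_path $mountpoint"
}

nfs_mount_cmd intr z209.z900.redhat.com /tmp/export /tmp/import
# prints: mount -t nfs -o hard,intr z209.z900.redhat.com:/tmp/export /tmp/import
```

Note that nfs(5) warns that soft mounts can lead to data corruption when a request times out, so "hard,intr" is usually the safer way to keep a stuck process killable on kernels that honour "intr".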
