Description of problem:

Investigation so far:

* The latest RHEL 4.8 kernel, 2.6.9-78.23.EL, shows the same behaviour.
* This is not NFSv2 specific; NFSv4 is also affected.
* This is not a regression; both RHEL 4.6 and 4.7 kernels are affected.
* I cannot reproduce this on anything other than IA-32 UP kernels.
* Changing the NFS option from "sync" to "async" does not help.
* Even if the file is not listed before NFS is stopped, the system still hangs.

The following reproducer hangs forever with an IA-32 UP kernel:

# uname -ra
Linux dell-pe860-01.rhts.bos.redhat.com 2.6.9-78.0.12.EL #1 Thu Dec 18 07:11:33 EST 2008 i686 i686 i386 GNU/Linux

# cat export.sh
#!/bin/sh -x
rm -rf /tmp/import /tmp/export
mkdir /tmp/import /tmp/export
date >/tmp/export/mydate
echo '/tmp/export *(rw,sync,no_root_squash)' >/etc/exports
service nfs restart
mount $(hostname):/tmp/export /tmp/import
ls -al /tmp/import/mydate
cat /tmp/import/mydate
service nfs stop
cat /tmp/import/mydate

+ rm -rf /tmp/import /tmp/export
+ mkdir /tmp/import /tmp/export
+ date
+ echo '/tmp/export *(rw,sync,no_root_squash)'
+ service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services: [FAILED]
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
++ hostname
+ mount dell-pe860-01.rhts.bos.redhat.com:/tmp/export /tmp/import
+ ls -al /tmp/import/mydate
-rw-r--r-- 1 root root 29 Dec 26 05:05 /tmp/import/mydate
+ cat /tmp/import/mydate
Fri Dec 26 05:05:14 EST 2008
+ service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
+ cat /tmp/import/mydate

As a result, the system is left in a bad state and cannot recover: /tmp/import can no longer be removed without a reboot, and any RPM installation appears to hang as well:

# rpm -ivh kernel-2.6.9-78.0.8.EL.i686.rpm --force
<hang ...>

I cannot even interrupt the reproducer with Ctrl-C. Several messages are printed on the serial console:

nfsd: last server has exited
nfsd: unexporting all filesystems
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
nfs: server dell-pe860-01.rhts.bos.redhat.com not responding, timed out
...

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.12.EL
kernel-2.6.9-78.0.8.EL
kernel-2.6.9-78.EL
kernel-2.6.9-67.0.22.EL
kernel-2.6.9-78.23.EL

How reproducible:
always

Steps to Reproduce:
1. Run the reproducer with an IA-32 UP kernel.

Actual results:
System hangs.

Expected results:
No hang.
Created attachment 327853
/var/log/message with "echo [wtm] >/proc/sysrq-trigger" while hanging
Is this reproducible when you use a different machine for client and server? Having a server be a client of itself is a known problematic configuration. It's fine for testing when it works, but it won't always. When you have problems with such a setup, it's important to reproduce it with separate client and server.
Yes, it is the same behaviour when using different machines for client and server.

Server side:

# mkdir /tmp/export
# date >/tmp/export/mydate
# echo '/tmp/export *(rw,sync,no_root_squash)' >/etc/exports
# service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services: [FAILED]
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]

Client side:

# mkdir /tmp/import
# mount z209.z900.redhat.com:/tmp/export /tmp/import
# ls -al /tmp/import/mydate
-rw-r--r-- 1 root root 29 Dec 28 23:10 /tmp/import/mydate
# cat /tmp/import/mydate
Sun Dec 28 23:10:28 EST 2008

Server side:

# service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]

Client side:

# cat /tmp/import/mydate
<hang...>
Wait...I misread the reproducer before.

This is expected behavior. You're shutting down the server and then doing some activity on the mount. The syscall should hang indefinitely at that point since this is a hard mount.

You said:

* I cannot reproduce this other than IA-32 UP kernels.

What happens on other machines here? (SMP or other arch)
(In reply to comment #4)
> Wait...I misread the reproducer before.
>
> This is expected behavior. You're shutting down the server and then doing some
> activity on the mount. The syscall should hang indefinitely at that point since
> this is a hard mount.
>
> You said:
>
> * I cannot reproduce this other than IA-32 UP kernels.
>
> What happens on other machines here? (SMP or other arch)

Sorry, the information there was slightly incorrect. I originally saw the problem with NFSv4; that is, the IA-32 UP kernel behaves differently from SMP kernels or other architectures. Consider the following example.

Server side:

# uname -ra
Linux sun-v40z-01.rhts.bos.redhat.com 2.6.9-78.0.12.EL #1 Thu Dec 18 07:11:33 EST 2008 i686 athlon i386 GNU/Linux
# mkdir /tmp/export
# date >/tmp/export/mydate
# echo '/tmp/export *(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)' >/etc/exports
# service nfs restart
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services: [FAILED]
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]

Client side:

# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# mkdir /tmp/import
# mount -t nfs4 sun-v40z-01.rhts.bos.redhat.com:/ /tmp/import
# cat /tmp/import/mydate
Mon Dec 29 22:22:24 EST 2008

Server side:

# service nfs stop
Shutting down RPC svcgssd: [FAILED]
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]

Client side:

# cat /tmp/import/mydate &
# ps aux
...
root 5003 0.0 0.0 49924 444 pts/0 D 22:24 0:00 cat /mnt/import/mydate
...

As shown above, the process is in D (uninterruptible sleep) state. As a result, it is not possible to kill the process or to umount. The only way I found to recover is to start the NFS server again, after which the process can be killed.

However, with SMP or other architectures, it is in S (interruptible sleep) state.

Server side:

# uname -ra
Linux hp-xw4550-01.rhts.bos.redhat.com 2.6.9-78.0.12.ELsmp #1 SMP Thu Dec 18 07:23:42 EST 2008 i686 athlon i386 GNU/Linux

Client side:

# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# ps aux
...
root 4991 0.0 0.0 49924 444 pts/0 S 22:18 0:00 cat /tmp/import/mydate
...
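
For anyone repeating this comparison, the D and S values in the ps output above can be read for a single process instead of scanning the full "ps aux" listing. The following is only a sketch: describe_state and state_of_pid are hypothetical helper names introduced here, and it assumes a procps-style ps that accepts "-o stat=" and "-p".

```shell
#!/bin/sh
# describe_state (hypothetical helper): map the first letter of a ps STAT
# value to a short description. Extra flag characters (e.g. "Ss", "D+")
# are ignored because only the leading letter is matched.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep" ;;
        S*) echo "interruptible sleep" ;;
        R*) echo "running or runnable" ;;
        *)  echo "other ($1)" ;;
    esac
}

# state_of_pid (hypothetical helper): print the STAT column for one PID.
# Prints nothing if the process has exited or ps does not support the options.
state_of_pid() {
    ps -o stat= -p "$1" 2>/dev/null | tr -d ' '
}

# Example: describe the state of the shell running this script.
describe_state "$(state_of_pid $$)"
```

On the client, the PID of the backgrounded "cat" is available as $! right after it is started, so describe_state "$(state_of_pid $!)" would report which of the two sleep states it is in.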
Interestingly, if I drop the "insecure" NFS server option from the NFSv4 example above, the "cat" process is in D state even when the server runs a kernel other than IA-32 UP.

Server side:

# uname -ra
Linux hp-xw4550-01.rhts.bos.redhat.com 2.6.9-78.0.12.ELsmp #1 SMP Thu Dec 18 07:23:42 EST 2008 i686 athlon i386 GNU/Linux
# echo '/tmp/export *(rw,fsid=0,no_subtree_check,sync,no_root_squash)' >/etc/exports

Client side:

# uname -ra
Linux hp-bl460c-02.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# ps aux
...
root 18803 0.0 0.0 49924 396 pts/0 D 23:36 0:00 cat /tmp/import/mydate
...

This only applies to NFSv4, though. With NFSv2, neither the "insecure" option nor a kernel other than IA-32 UP makes any difference: the "cat" process is always in an unrecoverable D state.
Please scratch comment #5 and comment #6. After further investigation, the differing behaviour with the IA-32 UP kernel, NFSv4 and the "insecure" option appears to be caused by the NFS server grace period. If I wait 100 seconds after the NFS server starts up, I no longer see the difference.

In summary: a process accessing an NFSv2 mount is in uninterruptible sleep after the server shuts down, while a process accessing an NFSv4 mount is in interruptible sleep after the server shuts down. Is this expected?
Adding Steve and Peter in case they have opinions... I see why this is occurring, but I'm not sure whether it's intended behavior or not. util-linux seems to make nfsv4 mounts default to "intr" unless "nointr" is explicitly specified. The reverse is true for nfsv2/3. This also seems to be the case in RHEL5's mount helper. "intr" and "nointr" are considered deprecated upstream. This may be by design due to the stateful nature of NFSv4, but I'm not sure. Given that this difference doesn't seem to be causing a problem, I'm inclined not to change it in either current RHEL release. I suggest we mark this as NOTABUG but if you can make a case as to why this is a problem, I'm willing to listen.
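
If the differing defaults matter for a particular test, one way to take them out of the picture is to state the option explicitly for both filesystem types rather than relying on the mount helper. The sketch below only prints the mount commands and does not run them; nfs_mount_cmd is a hypothetical helper name, and the server names are the ones used earlier in this report.

```shell
#!/bin/sh
# nfs_mount_cmd (hypothetical helper): print, but do not run, a hard mount
# command that spells out the interruptibility option.
# usage: nfs_mount_cmd <nfs|nfs4> <intr|nointr> <server:/path> <mountpoint>
nfs_mount_cmd() {
    fstype=$1
    intr_opt=$2
    remote=$3
    mountpoint=$4
    # Reject anything other than the two options discussed above.
    case "$intr_opt" in
        intr|nointr) ;;
        *) echo "second argument must be intr or nointr" >&2; return 1 ;;
    esac
    echo "mount -t $fstype -o hard,$intr_opt $remote $mountpoint"
}

# Same interruptibility setting requested for both filesystem types.
nfs_mount_cmd nfs  intr z209.z900.redhat.com:/tmp/export /tmp/import
nfs_mount_cmd nfs4 intr sun-v40z-01.rhts.bos.redhat.com:/ /tmp/import
```

With the option stated on both mounts, any remaining difference in process state would no longer be attributable to the helper's defaults.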
I have one question here. Is it expected that "cat" process was in D state forever? If so, how does the client recover from it without rebooting or the server starting up again.

For example, in the NFSv2 case,

# mount z209.z900.redhat.com:/tmp/export /tmp/import

doesn't the mount use the default timeout value -- 7 (0.7 seconds)? However, after the server stopped, the following command seemed to hang forever,

# cat /tmp/import/mydate

Apart from this, I have no other argument against closing it as NOTABUG.
> I have one question here. Is it expected that "cat" process was in D state forever?

Yes. See the explanation of the hard/soft options (and intr/nointr) in the nfs(5) manpage.

> If so, how does the client recover from it without rebooting or the server starting up again.

It doesn't.

I'll go ahead and close this as NOTABUG. Please reopen if you want to discuss it further.
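
As a rough illustration of the nfs(5) options referred to above: the timeout value is given in tenths of a second, so 7 corresponds to 0.7 seconds, but on a hard mount it only sets the retransmission interval and the client keeps retrying. A soft mount instead returns an I/O error to the caller after a major timeout. The sketch below is not a recommendation; soft mounts can fail operations that a hard mount would eventually complete. The helper names are hypothetical, the values are illustrative, and the mount command is printed rather than run.

```shell
#!/bin/sh
# timeo_to_seconds (hypothetical helper): convert an nfs(5) timeo value,
# which is expressed in tenths of a second, to seconds.
timeo_to_seconds() {
    echo "$(( $1 / 10 )).$(( $1 % 10 ))"
}

# soft_mount_cmd (hypothetical helper): print, but do not run, a soft mount
# command with explicit timeo and retrans values.
# usage: soft_mount_cmd <timeo> <retrans> <server:/path> <mountpoint>
soft_mount_cmd() {
    timeo=$1
    retrans=$2
    remote=$3
    mountpoint=$4
    echo "mount -o soft,timeo=$timeo,retrans=$retrans $remote $mountpoint"
}

# Illustrative values only: timeo=7 is the 0.7 second value asked about above.
timeo_to_seconds 7
soft_mount_cmd 7 3 z209.z900.redhat.com:/tmp/export /tmp/import
```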