Red Hat Bugzilla – Bug 484167
processes being stuck in rpc_wait for nfs mounts
Last modified: 2009-05-07 23:02:04 EDT
Description of problem:
I have a Netgear ReadyNAS NV+ storage device that shares its contents over NFS (latest firmware installed).
I mount this in Fedora 9 using the soft and intr options (I also tried "hard" earlier, as that is preferred).
This is then shared out throughout the office using samba.
At least twice a day I'm having to do a forced umount of this share because samba or even a terminal shell end up stuck in rpc_wait state.
The only way I seem to be able to revive the shell is through the forced umount.
Sometimes the forced umount doesn't actually un-mount the nfs share, but it seems to restart things as they become responsive again. Other times it will actually un-mount the nfs share. I'm assuming this is something to do with the intr behaviour (but it seems inconsistent).
I'm not sure if this is an RPC bug or a kernel bug?
How would I narrow it down?
Kernel is 18.104.22.168-78.2.8.fc9.x86_64
It could also be a server problem (i.e. server has gone unresponsive)...
The best way to start here is get some sysrq-t info when this occurs. If you have a responsive shell somewhere:
# echo t > /proc/sysrq-trigger
...then dump dmesg to a file:
# dmesg -s 131072 > /tmp/dmesg.out
...and attach it here. Once you do this, please do what you can to identify the PIDs of the processes that are actually hung.
With that info we can look at where these processes are hung in the kernel, which might give us some clues about what's going wrong.
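As a rough sketch of how the hung PIDs could be picked out: the helper names `filter_nfs_waiters` and `list_nfs_waiters` below are made up for illustration, and matching on an `rpc_`/`nfs_` prefix is an assumption based on the wait states mentioned in this bug, so adjust it to whatever your `ps` actually shows.

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose wait channel starts with
# rpc_ or nfs_ (assumed to cover the rpc_wait/nfs_wait states seen here).
# Expects "PID STAT WCHAN COMMAND" rows on stdin; the header line is skipped.
filter_nfs_waiters() {
    awk 'NR > 1 && $3 ~ /^(rpc_|nfs_)/ { print $1, $2, $3, $4 }'
}

# Hypothetical wrapper: list every process with its state and wait
# channel, then pass the listing through the filter above.
list_nfs_waiters() {
    ps -eo pid,stat,wchan:32,comm | filter_nfs_waiters
}
```

Running `list_nfs_waiters > /tmp/nfs-waiters.out` while the hang is in progress should leave a short list of candidate PIDs that can be matched against the sysrq-t output.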
Created attachment 331285
Created attachment 331286
ps -Al dump
OK, as usual, when you want something to die it doesn't.
Finally got a failure this afternoon.
I've attached the dmesg dump and a ps -Al listing. The listing shows multiple smbd processes etc. in either rpc_wait or nfs_wait.
I also had a terminal session locked in rpc_wait while doing an ls.
Interesting that the last line in this dump shows
nfs: server 192.168.0.147 not responding, timed out
so perhaps something is going wrong with the NAS all the time.
However I would expect slightly more manageable recovery from this situation.
Having to shutdown samba, and do multiple forced umounts to try and get things unlocked is a bit of a pain.
I will in parallel to this investigate more thoroughly any reports of others having similar issues with these NAS's.
(In reply to comment #5)
> Interesting that the last line in this dump shows
> nfs: server 192.168.0.147 not responding, timed out
> so perhaps something is going wrong with the NAS all the time.
> However I would expect slightly more manageable recovery from this situation.
> Having to shutdown samba, and do multiple forced umounts to try and get things
> unlocked is a bit of a pain.
> I will in parallel to this investigate more thoroughly any reports of others
> having similar issues with these NAS's.
The sysrq-t output isn't quite complete (rpciod isn't listed there, for instance), but from what I can see the processes that are in NFS code are just waiting for the server to respond.
As to why you have to do all of that to unwedge things...
The RPC layer only does what it's told, and the default is to use hard
mounts. This means we have to retry RPC calls indefinitely until the
server responds. See the nfs(5) manpage for discussion of why we use
hard by default.
If the ability to unwedge processes is more important to you than data
integrity, then you might consider soft mounts. The processes may also
be killable using SIGKILL.
If this problem just cropped up after a kernel upgrade, it's possible
that there's a problem somewhere in the underlying networking stack.
I'll leave this open for now with "needinfo" set, if you get more info
about the problem, I'll have another look.
Ahh sorry, just noticed that you are using soft mounts. "intr" is now deprecated since the TASK_KILLABLE stuff was added a few months ago (the nfs(5) manpage needs to be updated to reflect that).
In any case, "soft" doesn't give you any sort of guarantee about how long it will be before your syscall errors out. It just causes the RPC call to error out after a major timeout. You may want to try using SIGKILL on the processes to unwedge them, but you may of course be risking data integrity with that, particularly with something like smbd.
I found that the last firmware update I thought I had applied had not stuck; it seems the Netgear NAS didn't actually reboot (typical).
I haven't seen this issue for the last week, since I power-cycled the Netgear NAS, so I think we can close the bug.
Though now that I've said this it'll probably fail on Monday.. :)
Created attachment 342964
I'm back. This problem has occurred again a couple of times and I'm lost as to where further to look.
One other thing I'd like to add to this is that samba seems to end up with numerous processes with the same user and same mount point.
It's as though the original process for that user has failed and the Windows machines are trying to reconnect again and again, each attempt resulting in a new process for that user. This is not normal behavior, as usually the one process will handle that user's requests.
I believe this stems from the original samba process being hung up in a wait state and never recovering. As for why the NFS client is crapping itself, I'm a tad stumped.
I've noticed that I now seem to be seeing these stuck processes in nfs_wait.
I grabbed another sysrq dump and have attached it.
Is there any polling that goes on down in the NFS client to make sure a connection is still valid? Can I turn on some more debugging for this?