Description of problem:
My wife's computer mounts /home from my machine via NFSv4. I get semi-regular hangs when pulling up images in eog. All access to /home hangs. umount doesn't work, but umount -l does. I am unable to remount /home after unmounting it. There are no messages in dmesg on either computer generated by this, and no syslog messages are logged.

# grep /home /etc/fstab
192.168.1.52:/home /home nfs4 auto,rw,nodev,async,_netdev,proto=tcp,retry=10,rsize=32768,wsize=32768,hard,intr 0 0

Both machines run Fedora 8.

My wife's machine:
Linux d500.michaels-house.net 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST 2007 i686 i686 i386 GNU/Linux

My machine:
Linux localhost.localdomain 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:36 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Version-Release number of selected component (if applicable):
kernel 2.6.23.9-85.fc8 (this has been happening over several kernel releases now)

How reproducible:
Open a picture via 'Open with "Image Viewer"' in nautilus; repeat until it hangs. Hangs are not 100% reproducible. It happens about once a week unless she is doing heavy picture work, when it hangs more often (every couple of days).
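Additional info:
For the record, the only recovery I've found short of a reboot is a lazy unmount, and the remount just hangs again:

# umount /home        (hangs)
# umount -l /home     (detaches the mountpoint)
# mount /home         (hangs; /home stays inaccessible)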
Happened again today while opening the GIMP.
Created attachment 292272 [details] /var/log/messages This is /var/log/messages from a normal boot up until the hang occurs. Note that there are no messages added due to the hang.
Created attachment 292273 [details] /var/log/messages sysrq-trigger output This is a pretty fresh boot and hang with output of "echo t > /proc/sysrq-trigger"
Some additional info: on my system (the NFS server), I had the firewall enabled. While debugging this morning, I disabled the firewall and had the hang happen again, but with different behaviour. With the firewall disabled, going through photos in eog, I had nautilus and eog both lock up, but ssh-ing in I can access all of /home with no problems. Her desktop session is completely locked up, though. Attachment #3 [details], above, is the output of sysrq-trigger while the system was in this state.
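Next time it wedges, I can also grab what the stuck processes are sleeping on from the ssh session, with something like:

# ps -o pid,stat,wchan:35,comm -C nautilus,eog

(the wchan column should show the kernel function each process is waiting in, if that's useful.)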
Thanks for the sysrq-t info. It helps some, but unfortunately doesn't tell me the complete story...

> I had nautilus and eog both lock up

Looking in the log, I see nautilus in the sysrq-t info, but not eog. syslog tends not to be very reliable for capturing sysrq-t info; it ends up dropping portions of it. In any case, here's the top of nautilus' stack:

Jan 19 14:53:16 d500 kernel: nautilus S 00000000 0 2877 2754
Jan 19 14:53:16 d500 kernel: e9b79c8c 00000086 000b0000 00000000 00000000 000000fd e9b79000 0dee710c
Jan 19 14:53:16 d500 kernel: 00000070 ef1fb230 ef1fb3d4 c180c180 00000000 ebfdb660 00000000 00000000
Jan 19 14:53:16 d500 kernel: d08841d4 d6cec9c0 e9b79cc0 c18026f8 e9b79cb8 00000000 e9b79cc0 c18026f8
Jan 19 14:53:16 d500 kernel: Call Trace:
Jan 19 14:53:16 d500 kernel: [<f8c9015d>] rpc_wait_bit_interruptible+0x1a/0x1f [sunrpc]
Jan 19 14:53:16 d500 kernel: [<c061e060>] __wait_on_bit+0x33/0x58
Jan 19 14:53:16 d500 kernel: [<f8c90143>] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
Jan 19 14:53:16 d500 kernel: [<f8c90143>] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
Jan 19 14:53:16 d500 kernel: [<c061e0e8>] out_of_line_wait_on_bit+0x63/0x6b
Jan 19 14:53:16 d500 kernel: [<c043d4ca>] wake_bit_function+0x0/0x3c
Jan 19 14:53:16 d500 kernel: [<f8c900f1>] __rpc_wait_for_completion_task+0x31/0x38 [sunrpc]
Jan 19 14:53:16 d500 kernel: [<f8e56f24>] nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs]
Jan 19 14:53:16 d500 kernel: [<f8e5925b>] _nfs4_proc_open+0x75/0x1b4 [nfs]
Jan 19 14:53:16 d500 kernel: [<f8e598da>] nfs4_do_open+0x112/0x23b [nfs]
Jan 19 14:53:16 d500 kernel: [<f8e5ab21>] nfs4_open_revalidate+0x60/0x116 [nfs]

It's trying to open a file. _nfs4_proc_open() calls rpc_run_task() to actually issue the open call, and then calls nfs4_wait_for_completion_rpc_task() to wait for it to complete. The task should get picked up by rpciod, and when it's complete it'll set the completion variable and things will continue. That looks like where things are stuck here. I also see a few other processes in similar states.

The other thing missing from the sysrq-t info is the stack traces for the rpciod processes, which would probably be very helpful for figuring out why they're not coming back. If you're able to reproduce this again, please get sysrq-t again, and instead of relying on syslog to capture the info, just dump dmesg to a file:

# dmesg > /tmp/dmesg.out

That should hopefully be a bit more complete, and I might be able to tell more.
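One more note: if writing to /proc/sysrq-trigger gives you trouble, make sure sysrq is enabled first. The whole capture, from a shell that still works (an ssh session, say), would be roughly:

# echo 1 > /proc/sys/kernel/sysrq      (in case sysrq isn't already enabled)
# echo t > /proc/sysrq-trigger
# dmesg > /tmp/dmesg.out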
I'll attempt to reproduce and get the additional dmesg dump. I've switched her client over to NFSv3 for the time being and have not seen any crashes over the weekend with this setup. She did a bit of photo work over the weekend, so I would have expected to see a crash. I'll switch back to NFSv4 tonight and see if I can get this data.
> I've switched her client over to NFSv3 for the time being and have not seen
> any crashes over the weekend with this setup. She did a bit of photo work
> over the weekend, so I would have expected to see a crash.

Doesn't surprise me a bit. NFSv4 is really a completely different animal, so it's entirely possible for there to be bugs there that don't exist with NFSv3.
Created attachment 292415 [details] Interesting threads... Here is an "echo t > /proc/sysrq-trigger" dump from a RHEL-5 system running the Connectathon testsuite. The stack traces look similar to those previously attached.
Interesting. It looks like rpciod/1 has ended up trying to return a delegation and has called __rpc_wait_for_completion_task(). That seems like it could be a deadlock situation: rpciod is waiting on an RPC task to complete, and that task will never complete because rpciod can't pick it up until it's through waiting. Are the other rpciod threads stuck too? Of course, some of this could just be junk on the stack; I always have a hard time telling when that's the case. It might be worth getting a core, since the stack trace in it might be more reliable...
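In the meantime, a cheap way to answer the "are the other rpciod threads stuck too" question is to look at what each of them is sleeping on, something like:

# ps -eo pid,stat,wchan:35,comm | grep rpciod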
Ok. If you tell me how to get a core dump, I can try again. This is not 100% reproducible, but it pretty reliably locks the machine.
Actually, I was mentioning that to Peter, but a core from your machine might also be helpful. These instructions are for FC6, but I think setting up F8 should be pretty much the same: http://fedoraproject.org/wiki/FC6KdumpKexecHowTo ...let me know if you have questions.
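The short version from that howto, if memory serves (the crashkernel reservation size may need tweaking for your machine):

# yum install kexec-tools
...add "crashkernel=128M@16M" to the kernel line in /boot/grub/grub.conf, reboot, then:
# chkconfig kdump on
# service kdump start
...and once it hangs:
# echo c > /proc/sysrq-trigger

The vmcore should end up under /var/crash/.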
My wife's machine is a Dell D500 with a dead internal hard drive, so I run it off an external USB disk. Two attempts at forcing a crashdump failed. I see "SysRq : Trigger a crashdump" on the screen, then at the top of the screen "Uncompressing Linux... Ok, booting the kernel.", and the cursor goes to the next line, but it doesn't output anything else.
Created attachment 292593 [details] dmesg output from hang This is output from dmesg after echo t > /proc/sysrq-trigger. It looks like the first part got truncated a bit.
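If the truncation is a problem, I believe I can boot with a bigger kernel log buffer and redo the capture, e.g. by adding to the kernel line in grub.conf:

log_buf_len=1M

(the size has to be a power of two, as I understand it.) Just say the word.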
Created attachment 292594 [details] /var/log/messages sysrq-trigger output This is /var/log/messages and should contain the same info as the dmesg output above. Note that this hang was produced simply clicking on "Next" repeatedly in eog.
If there is any other way to get the data you need off of the machine, please let me know. So far, I have been able to pretty reliably reproduce this issue.
Created attachment 292959 [details] Possible patch I exchanged email with Trond about the stack traces that I am seeing and posted here. He responded with a diagnosis and a patch which seems to address the problems that I am seeing. I am still testing, but things look good so far. I need to look at the patch more as well.
Thanks Peter. Looks like this patch is already in Trond's nfs-2.6 git tree. Michael, this patch is experimental, but since you're able to reproduce this issue it would be good if you could test it someplace non-critical and see whether it seems to fix the issue for you.
Actually...I may have spoken too soon. Looks like Trond is going to respin this upstream, so you might want to hold off for now...
If you can point me at a koji build with any kind of patch for this, I would be happy to test.
Yes, I believe that he is switching all of the issync values in the calls to nfs_do_return_delegation() because they were backwards. Strangely, the original patch made the hang I was seeing unreproducible. I'll try his latest patch and see what that does.
Created attachment 292987 [details] Possible patch Here is the updated patch from Trond...
So, you are able to reproduce my hang, yes? Can you do a koji scratch build with this patch for me to test? If I have to apply it myself, it will take me much longer to get that done.
I don't know whether or not I am reproducing your hang. I am able to reproduce a hang, yes, with a similar look, but I can't say for sure that it's the same situation you are encountering. I am using the Connectathon testsuite and running it against some local servers. Against some, the testsuite runs fine; against others, it hangs.
I've never done a kernel build out of Koji. I'll have to look and see what's involved. It may take me a while to get that together, however, so if you are able to patch and build your own kernels that would probably be the fastest way to test this.
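If you do go the build-it-yourself route, the quick-and-dirty way (assuming a vanilla 2.6.23 tree rather than the Fedora kernel RPM, and assuming the patch applies cleanly there) is roughly:

# cd linux-2.6.23.9
# patch -p1 < nfs-delegation-fix.patch     (filename here is just a placeholder)
# make oldconfig && make && make modules_install install

The rpmbuild route against the Fedora kernel SRPM is cleaner, but takes longer to describe.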
Created attachment 293154 [details] Possible patch It turns out that the other versions of the patch weren't quite right. This latest version is somewhat lightly tested, but does seem to work.
I've done a koji build with a backport of the patch in Trond's tree. The kernels are here (this was a scratch build, so IIRC they won't be there more than a few days): http://koji.fedoraproject.org/koji/taskinfo?taskID=381663 Please test them somewhere non-critical and let me know whether they work for you.
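To install one, grab the kernel RPM matching your arch from that task and do something like:

# rpm -ivh kernel-2.6.23*.rpm

(-i rather than -U, so your current kernel stays around as a fallback.)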
I'm installing this kernel now. Thanks.
Spent 10 minutes going through pictures and no crash. Will spend some more time banging on it when the wife gets done with what she is doing. Thanks!
Setting to NEEDINFO to await further test results...
Got sick literally hours after sending the last message... :( We have been running the system for a few days now with the koji scratch kernel build without any problems. My wife has done some photo work with eog/gimp over the past couple of days, but not heavy usage. Our big photo day is Saturdays, so I should have more data tomorrow. So far, though, it looks good.
No crashes today after more pictures. When can we get this patch into an official kernel update?
Thanks, that's good to know. AFAIK, Trond has this patch in his tree and it's being pushed to Linus, so we'll pick this up for Fedora when 2.6.25 is released.