Bug 429391 - complete hang of NFSv4 client using Eye of Gnome
Summary: complete hang of NFSv4 client using Eye of Gnome
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jeff Layton
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-01-19 05:52 UTC by Michael E Brown
Modified: 2008-02-05 12:32 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-02-05 12:32:48 UTC
Type: ---
Embargoed:


Attachments
/var/log/messages (49.35 KB, text/plain)
    2008-01-19 20:56 UTC, Michael E Brown
/var/log/messages sysrq-trigger output (345.28 KB, text/plain)
    2008-01-19 20:57 UTC, Michael E Brown
Interesting threads... (10.78 KB, text/plain)
    2008-01-21 21:27 UTC, Peter Staubach
dmesg output from hang (119.74 KB, text/plain)
    2008-01-23 05:30 UTC, Michael E Brown
/var/log/messages sysrq-trigger output (304.74 KB, application/octet-stream)
    2008-01-23 05:31 UTC, Michael E Brown
Possible patch (6.99 KB, patch)
    2008-01-25 16:35 UTC, Peter Staubach
Possible patch (6.99 KB, patch)
    2008-01-25 20:15 UTC, Peter Staubach
Possible patch (7.40 KB, patch)
    2008-01-28 13:20 UTC, Peter Staubach

Description Michael E Brown 2008-01-19 05:52:18 UTC
Description of problem:

My wife's computer mounts /home from my machine via NFSv4. I get semi-regular
hangs when pulling up images in eog. All access to /home/ hangs. umount doesn't
work, but umount -l does, and I am unable to remount /home after unmounting it.
The hang generates no dmesg messages on either computer, and nothing is logged
to syslog.

# grep /home /etc/fstab 
192.168.1.52:/home /home nfs4
auto,rw,nodev,async,_netdev,proto=tcp,retry=10,rsize=32768,wsize=32768,hard,intr 0 0
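
A quick way to double-check which options the client actually negotiated, as
opposed to what is listed in /etc/fstab (illustrative commands, run on the
client while the mount is healthy):

# grep ' /home ' /proc/mounts
# nfsstat -m

The first shows the options the kernel is really using for the mount; the
second is the nfs-utils per-mount summary.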

Both machines run Fedora 8. My wife's machine:
   Linux d500.michaels-house.net 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST
2007 i686 i686 i386 GNU/Linux

My machine:
   Linux localhost.localdomain 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:36 EST
2007 x86_64 x86_64 x86_64 GNU/Linux


Version-Release number of selected component (if applicable):
kernel 2.6.23.9-85.fc8

This has been happening over several kernel releases now.

How reproducible:

Open a picture via 'Open with "Image Viewer"' in nautilus. Repeat until it hangs.

The hang is not 100% reproducible. It happens about once a week, or a bit more
often (every couple of days) when she is doing heavy picture work.

Comment 1 Michael E Brown 2008-01-19 19:31:27 UTC
Happened again today while opening Gimp.

Comment 2 Michael E Brown 2008-01-19 20:56:34 UTC
Created attachment 292272 [details]
/var/log/messages 

This is /var/log/messages from a normal boot up until the hang occurs. Note
that there are no messages added due to the hang.

Comment 3 Michael E Brown 2008-01-19 20:57:57 UTC
Created attachment 292273 [details]
/var/log/messages sysrq-trigger output

This is a pretty fresh boot and hang with output of "echo t >
/proc/sysrq-trigger"

Comment 4 Michael E Brown 2008-01-19 21:01:03 UTC
Some additional info:

On my system (the NFS server), I had a firewall configured. While debugging this
morning, I disabled the firewall and had the hang happen again, but with different
behaviour: going through photos in eog, nautilus and eog both locked up, yet after
ssh-ing in I could access all of /home with no problems. Her desktop session was
completely locked up, though.

Attachment #3 [details], above, is the sysrq-trigger output taken while the system
was in this state.

Comment 5 Jeff Layton 2008-01-21 13:28:12 UTC
Thanks for the sysrq-t info. It helps some, but doesn't tell me the complete
story unfortunately...

> I had nautilus and eog both lock up

Looking in the log, I see nautilus in the sysrq-t info, but not eog. syslog
tends not to be very reliable for capturing sysrq-t output; it ends up dropping
portions of it.

In any case, here's the top of nautilus' stack:

Jan 19 14:53:16 d500 kernel: nautilus      S 00000000     0  2877   2754
Jan 19 14:53:16 d500 kernel:        e9b79c8c 00000086 000b0000 00000000 00000000
000000fd e9b79000 0dee710c 
Jan 19 14:53:16 d500 kernel:        00000070 ef1fb230 ef1fb3d4 c180c180 00000000
ebfdb660 00000000 00000000 
Jan 19 14:53:16 d500 kernel:        d08841d4 d6cec9c0 e9b79cc0 c18026f8 e9b79cb8
00000000 e9b79cc0 c18026f8 
Jan 19 14:53:16 d500 kernel: Call Trace:
Jan 19 14:53:16 d500 kernel:  [<f8c9015d>] rpc_wait_bit_interruptible+0x1a/0x1f
[sunrpc]
Jan 19 14:53:16 d500 kernel:  [<c061e060>] __wait_on_bit+0x33/0x58
Jan 19 14:53:16 d500 kernel:  [<f8c90143>] rpc_wait_bit_interruptible+0x0/0x1f
[sunrpc]
Jan 19 14:53:16 d500 kernel:  [<f8c90143>] rpc_wait_bit_interruptible+0x0/0x1f
[sunrpc]
Jan 19 14:53:16 d500 kernel:  [<c061e0e8>] out_of_line_wait_on_bit+0x63/0x6b
Jan 19 14:53:16 d500 kernel:  [<c043d4ca>] wake_bit_function+0x0/0x3c
Jan 19 14:53:16 d500 kernel:  [<f8c900f1>]
__rpc_wait_for_completion_task+0x31/0x38 [sunrpc]
Jan 19 14:53:16 d500 kernel:  [<f8e56f24>]
nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs]
Jan 19 14:53:16 d500 kernel:  [<f8e5925b>] _nfs4_proc_open+0x75/0x1b4 [nfs]
Jan 19 14:53:16 d500 kernel:  [<f8e598da>] nfs4_do_open+0x112/0x23b [nfs]
Jan 19 14:53:16 d500 kernel:  [<f8e5ab21>] nfs4_open_revalidate+0x60/0x116 [nfs]

It's trying to open a file. _nfs4_proc_open() calls rpc_run_task() to actually
issue the open call and then calls nfs4_wait_for_completion_rpc_task() to wait
for it to complete. The task should get picked up by rpciod, and when it's
complete it will set the completion variable and things will continue. That wait
looks like where things are stuck here. I also see a few other processes in
similar states.

The other thing missing from the sysrq-t info is the stack trace for the rpciod
processes, which would probably be very helpful for figuring out why they're not
coming back.

If you're able to reproduce this again, please capture sysrq-t again, but instead
of relying on syslog to record the info, just dump dmesg to a file:

# dmesg > /tmp/dmesg.out

That should hopefully be a bit more complete, and I might be able to tell more.
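
Roughly, the full capture sequence would look like this (paths are only
examples, and the log_buf_len= step is only needed if the start of the trace
gets truncated):

# echo 1 > /proc/sys/kernel/sysrq          (if sysrq is not already enabled)
# echo t > /proc/sysrq-trigger
# dmesg -s 1048576 > /tmp/dmesg.out

The -s 1048576 tells dmesg to read up to 1 MB of the kernel ring buffer rather
than its smaller default. If the beginning of the trace is still missing,
booting with log_buf_len=1M on the kernel command line and repeating the
capture should help.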


Comment 10 Michael E Brown 2008-01-21 17:37:45 UTC
I'll attempt to reproduce and get the additional dmesg dump.

I've switched her client over to nfsv3 for the time being and have not seen any
crashes over the weekend with this setup. She did a bit of photo work over the
weekend, so I would have expected to see a crash.

I'll switch back to nfsv4 tonight and see if I can get this data.
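
For reference, the interim NFSv3 mount is just the same export with the fstab
type switched from nfs4 to nfs, along the lines of the entry below (the exact
options shown are illustrative):

192.168.1.52:/home /home nfs
auto,rw,nodev,async,_netdev,proto=tcp,vers=3,retry=10,rsize=32768,wsize=32768,hard,intr 0 0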

Comment 11 Jeff Layton 2008-01-21 17:58:54 UTC
> I've switched her client over to nfsv3 for the time being and have not seen any
> crashes over the weekend with this setup. She did a bit of photo work over the
> weekend, so I would have expected to see a crash.

Doesn't surprise me a bit. NFSv4 is really a completely different animal, so it's
entirely possible for there to be bugs there that don't exist with NFSv3.

Comment 12 Peter Staubach 2008-01-21 21:27:08 UTC
Created attachment 292415 [details]
Interesting threads...

Here is an "echo t > /proc/sysrq-trigger" capture from a RHEL-5 system running
the Connectathon testsuite.

The stack traces look similar to those previously attached.

Comment 13 Jeff Layton 2008-01-22 02:42:27 UTC
Interesting. It looks like rpciod/1 has ended up trying to return a delegation,
calling __rpc_wait_for_completion_task() in the process. That seems like it could
be a deadlock situation -- rpciod is waiting on an RPC task to complete, and the
task will never complete because rpciod can't pick it up until it's through waiting.
Are the other rpciod threads stuck too?

Of course, some of this could just be junk on the stack. I always have a hard
time telling when that's the case. It might be worth getting a core since the
stack trace in it might be more reliable...


Comment 14 Michael E Brown 2008-01-22 05:48:56 UTC
OK. If you tell me how to get a core dump, I can try again. This is not 100%
reproducible, but it locks the machine pretty reliably.

Comment 15 Jeff Layton 2008-01-22 11:06:35 UTC
Actually, I was mentioning that to Peter, but a core from your machine might
also be helpful. These instructions are for F6, but I think setting up F8 should
be pretty much the same:

http://fedoraproject.org/wiki/FC6KdumpKexecHowTo

...let me know if you have Q's.
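
In short, the kdump setup should boil down to something like this (package name
and sizes are from memory, so treat it as a sketch):

# yum install kexec-tools
# vi /boot/grub/grub.conf        (append crashkernel=128M@16M to the kernel line)
# chkconfig kdump on
# reboot

Once that is in place, "echo c > /proc/sysrq-trigger" at the next hang should
force a crash dump, and the vmcore should land under /var/crash/ by default.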


Comment 16 Michael E Brown 2008-01-23 04:35:17 UTC
My wife's machine is a Dell D500 with dead internal hard drives, so I run it off
an external USB disk.

Two attempts at forcing a crash dump failed. I see "SysRq : Trigger a crashdump"
on the screen, then "Uncompressing Linux... Ok, booting the kernel." at the top
of the screen; the cursor goes to the next line, but nothing else is printed.



Comment 17 Michael E Brown 2008-01-23 05:30:19 UTC
Created attachment 292593 [details]
dmesg output from hang

This is output from dmesg after echo t > /proc/sysrq-trigger. It looks like the
first part got truncated a bit.

Comment 18 Michael E Brown 2008-01-23 05:31:46 UTC
Created attachment 292594 [details]
/var/log/messages sysrq-trigger output

This is /var/log/messages and should contain the same info as the dmesg output
above.

Note that this hang was produced simply by clicking "Next" repeatedly in eog.

Comment 19 Michael E Brown 2008-01-23 05:57:39 UTC
If there is any other way to get the data you need off of the machine, please
let me know. So far, I have been able to pretty reliably reproduce this issue.

Comment 20 Peter Staubach 2008-01-25 16:35:13 UTC
Created attachment 292959 [details]
Possible patch

I exchanged email with Trond about the stack traces that I am seeing and posted
here.  He responded with a diagnosis and a patch which seems to address the
problems that I am seeing.  I am still testing, but things look good so far.  I
still need to look at the patch more closely as well.

Comment 21 Jeff Layton 2008-01-25 17:11:12 UTC
Thanks, Peter. It looks like this patch is already in Trond's nfs-2.6 git tree.

Michael, this patch is experimental, but since you're able to reproduce this
issue, it would be good if you could test it someplace non-critical and see
whether it seems to fix the issue for you.


Comment 22 Jeff Layton 2008-01-25 18:22:51 UTC
Actually...I may have spoken too soon. Looks like Trond is going to respin this
upstream, so you might want to hold off for now...


Comment 23 Michael E Brown 2008-01-25 18:49:08 UTC
If you can point me at a koji build with any kind of patch for this, I would be
happy to test.

Comment 24 Peter Staubach 2008-01-25 19:55:41 UTC
Yes, I believe that he is switching all of the issync values in the calls to
nfs_do_return_delegation() because they were backwards.

Strangely, the original patch made the hang that I was seeing no longer
reproducible.  I will try his latest patch and see what that does.

Comment 25 Peter Staubach 2008-01-25 20:15:10 UTC
Created attachment 292987 [details]
Possible patch

Here is the updated patch from Trond...

Comment 26 Michael E Brown 2008-01-25 20:31:10 UTC
So, you are able to reproduce my hang, yes?

Can you do a koji scratch build with this patch for me to test? If I have to
apply it myself, it will take me much longer to get that done.

Comment 27 Peter Staubach 2008-01-25 21:00:28 UTC
I don't know whether or not I am reproducing your hang.  I am able to reproduce
a hang, yes, with a similar look, but I can't say for sure that it is the same
situation you are encountering.  I am using the Connectathon testsuite and
running it against some local servers.  Against some, the testsuite runs fine;
against others, it hangs.

Comment 28 Jeff Layton 2008-01-27 12:36:00 UTC
I've never done a kernel build out of Koji. I'll have to look and see what's
involved. It may take me a while to get that together, however, so if you are
able to patch and build your own kernels that would probably be the fastest way
to test this.
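
For what it's worth, the usual recipe for a one-off patched Fedora kernel goes
roughly like this (commands and paths are from memory, so treat them as a
sketch rather than exact instructions):

# yumdownloader --source kernel
# rpm -ivh kernel-2.6.23*.src.rpm
# cd ~/rpmbuild/SPECS            (or wherever the spec lands on your setup)
# (add a PatchNNN: line and a matching ApplyPatch for the new patch to kernel.spec)
# rpmbuild -bb --target $(uname -m) kernel.spec

then install the resulting packages from the RPMS directory.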


Comment 29 Peter Staubach 2008-01-28 13:20:32 UTC
Created attachment 293154 [details]
Possible patch

It turns out that the other versions of the patch weren't quite right.
This latest version is only lightly tested, but it does seem to work.

Comment 34 Jeff Layton 2008-01-29 20:06:14 UTC
I've done a koji build with a backport of the patch from Trond's tree. The kernels
are here (this was a scratch build, so they won't be there for more than a few
days, IIRC):

http://koji.fedoraproject.org/koji/taskinfo?taskID=381663

Please test them somewhere non-critical and let me know whether they work for you.
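
To install them, download the kernel package for the client's arch from the task
page above and then (the file name here is illustrative):

# rpm -ivh kernel-<version>.i686.rpm

Using -i rather than -U keeps the current kernel in the grub menu as a fallback
in case the test kernel misbehaves.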


Comment 35 Michael E Brown 2008-01-30 01:51:07 UTC
I'm installing this kernel now. Thanks.

Comment 36 Michael E Brown 2008-01-30 02:42:56 UTC
Spent 10 minutes going through pictures with no crash. I will spend some more
time banging on it when my wife is done with what she is doing.

Thanks!

Comment 37 Jeff Layton 2008-02-01 11:43:41 UTC
Setting to NEEDINFO to await further test results...

Comment 38 Michael E Brown 2008-02-01 21:13:02 UTC
Got sick literally hours after sending the last message... :(

We have been running the system for a few days now with the koji scratch kernel
build without any problems. My wife has done some photo work with eog/gimp over
the past couple of days, but not heavy usage. Our big photo day is Saturday, so I
should have more data tomorrow.

So far, though, it looks good.

Comment 39 Michael E Brown 2008-02-02 20:25:37 UTC
No crashes today after more pictures.

When can we get this patch into an official kernel update?

Comment 40 Jeff Layton 2008-02-05 12:32:48 UTC
Thanks, that's good to know. AFAIK, Trond has this patch in his tree and it's
being pushed to Linus, so we'll pick this up for Fedora when 2.6.25 is released.


