Bug 690196

Summary: System hangs occasionally when using NFSv4
Product: Red Hat Enterprise Linux 5 Reporter: Ondrej Valousek <ondrejv>
Component: kernelAssignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: medium    
Version: 5.6CC: bfields, dhowells, jlayton, rwheeler, sprabhu, steved, tsobral
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-16 10:14:48 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
diskless machine pisces log (2 occurences)
none
diskless machine sagitta log (1 occurence)
none
/var/log/messages from a diskless system running RHEL-6 none

Description Ondrej Valousek 2011-03-23 14:44:58 UTC
Description of problem:
I am mounting several directories using NFSv4 (system authentication) from NFS server (NetApp NAS filer) and I am experiencing frequent system hangs when the following is printed on the system console:
2011-03-20T20:07:52.657919+01:00 pisces kernel: INFO: task rpciod/6:833 blocked for more than 120 seconds.
2011-03-20T20:07:52.658055+01:00 pisces kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2011-03-20T20:07:52.670631+01:00 pisces kernel: rpciod/6      D ffff8102eb520050     0   833    307           834   832 (L-TLB)
2011-03-20T20:07:52.677520+01:00 pisces kernel:  ffff81087ce75c40 0000000000000046 0000000000000006 ffffffff800fb1de
2011-03-20T20:07:52.683742+01:00 pisces kernel:  ffff8102f1ab02f0 000000000000000a ffff81087de7d7e0 ffff81087c612820
2011-03-20T20:07:52.689972+01:00 pisces kernel:  000e2aa8f01d9b39 000000000000098e ffff81087de7d9c8 0000000600000001
2011-03-20T20:07:52.690047+01:00 pisces kernel: Call Trace:
2011-03-20T20:07:52.695671+01:00 pisces kernel:  [<ffffffff800fb1de>] ep_poll_callback+0x64/0x8b
2011-03-20T20:07:52.695739+01:00 pisces kernel:  [<ffffffff8002e283>] __wake_up+0x38/0x4f
2011-03-20T20:07:52.704515+01:00 pisces kernel:  [<ffffffff880be996>] :nfs:nfs_idmap_id+0x189/0x209
2011-03-20T20:07:52.704596+01:00 pisces kernel:  [<ffffffff8008e40a>] default_wake_function+0x0/0xe
2011-03-20T20:07:52.714428+01:00 pisces kernel:  [<ffffffff880b97fc>] :nfs:decode_getfattr+0x73c/0xc5e
2011-03-20T20:07:52.719541+01:00 pisces kernel:  [<ffffffff880ba9b4>] :nfs:nfs4_xdr_dec_write+0x0/0xbd
2011-03-20T20:07:52.724834+01:00 pisces kernel:  [<ffffffff8803dca0>] :sunrpc:rpc_async_schedule+0x0/0x5
2011-03-20T20:07:52.730049+01:00 pisces kernel:  [<ffffffff880baa67>] :nfs:nfs4_xdr_dec_write+0xb3/0xbd
2011-03-20T20:07:52.735071+01:00 pisces kernel:  [<ffffffff8803890d>] :sunrpc:call_decode+0x4b7/0x509
2011-03-20T20:07:52.735137+01:00 pisces kernel:  [<ffffffff8803dae4>] :sunrpc:__rpc_execute+0x92/0x24e
2011-03-20T20:07:52.744531+01:00 pisces kernel:  [<ffffffff8004d7d0>] run_workqueue+0x99/0xf6
2011-03-20T20:07:52.744608+01:00 pisces kernel:  [<ffffffff8004a018>] worker_thread+0x0/0x122
2011-03-20T20:07:52.753842+01:00 pisces kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
2011-03-20T20:07:52.753908+01:00 pisces kernel:  [<ffffffff8004a108>] worker_thread+0xf0/0x122
2011-03-20T20:07:52.763107+01:00 pisces kernel:  [<ffffffff8008e40a>] default_wake_function+0x0/0xe
2011-03-20T20:07:52.763175+01:00 pisces kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
2011-03-20T20:07:52.773137+01:00 pisces kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
2011-03-20T20:07:52.773203+01:00 pisces kernel:  [<ffffffff80032996>] kthread+0xfe/0x132
2011-03-20T20:07:52.780948+01:00 pisces kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
2011-03-20T20:07:52.781016+01:00 pisces kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
2011-03-20T20:07:52.789760+01:00 pisces kernel:  [<ffffffff80032898>] kthread+0x0/0x132
2011-03-20T20:07:52.789817+01:00 pisces kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
2011-03-20T20:07:52.794055+01:00 pisces kernel:
2011-03-20T20:07:52.799621+01:00 pisces kernel: INFO: task adclient:3918 blocked for more than 120 seconds.
2011-03-20T20:07:52.799690+01:00 pisces kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2011-03-20T20:07:52.812675+01:00 pisces kernel: adclient      D ffff810001036b20     0  3918      1          3919  3917 (NOTLB)
2011-03-20T20:07:52.819828+01:00 pisces kernel:  ffff8108775efbb8 0000000000000086 ffff810697135148 ffff81087cf9dc00
2011-03-20T20:07:52.826198+01:00 pisces kernel:  0000000000000286 000000000000000a ffff81087b85b100 ffff81011de58080
2011-03-20T20:07:52.832659+01:00 pisces kernel:  000e2aafbbd86a3d 00000000000067a8 ffff81087b85b2e8 0000000600000286
2011-03-20T20:07:52.832777+01:00 pisces kernel: Call Trace:
2011-03-20T20:07:52.838891+01:00 pisces kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
2011-03-20T20:07:52.844801+01:00 pisces kernel:  [<ffffffff880a6941>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
2011-03-20T20:07:52.844864+01:00 pisces kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67
2011-03-20T20:07:52.854822+01:00 pisces kernel:  [<ffffffff880a694a>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
2011-03-20T20:07:52.854925+01:00 pisces kernel:  [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
2011-03-20T20:07:52.859120+01:00 pisces kernel:  [<ffffffff8009a446>] recalc_sigpending+0xe/0x25
2011-03-20T20:07:52.875885+01:00 pisces kernel:  [<ffffffff880a6941>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
2011-03-20T20:07:52.875980+01:00 pisces kernel:  [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
2011-03-20T20:07:52.885654+01:00 pisces kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
...

The system becomes unresponsive and the only help is to reboot the system.

How reproducible:
There is no clear way how to reproduce the error.



Additional info:

Comment 1 Ondrej Valousek 2011-03-24 07:40:51 UTC
I believe I am experiencing the same behavior as http://www.spinics.net/lists/linux-nfs/msg13022.html

The NFS client is also diskless in my case but it is using NFSv3 to mount the root FS so it should not cause idmapper to hang.

Comment 2 Ondrej Valousek 2011-03-24 08:05:30 UTC
The problem has also been described here:
https://bugzilla.redhat.com/show_bug.cgi?id=483365
I wonder why was the fix declined?

Comment 3 Ondrej Valousek 2011-03-24 16:24:56 UTC
Created attachment 487391 [details]
diskless machine pisces log (2 occurences)

Comment 4 Ondrej Valousek 2011-03-24 16:26:27 UTC
Created attachment 487392 [details]
diskless machine sagitta log (1 occurence)

Comment 5 Ondrej Valousek 2011-03-24 16:30:46 UTC
Attached logs from the affected machines (3 occurences in total).
Note - the process 'adclient' is a daemon for the NSS library - provides username/uid mappings (so it is indirectly being used by the rpc.idmapd).

No system vmcore available unfortunately.

Comment 6 Ondrej Valousek 2011-05-03 09:49:50 UTC
I am able to replicate the problem on RHEL-6 (kernel Linux data 2.6.32-71.el6.x86_64) too. In fact RHEL-6 in the similar configuration hangs almost immediately.
SysRq trace is attached. Could you tell me if we are experiencing a similar issue or is this one completely different?

Comment 7 Ondrej Valousek 2011-05-03 09:52:48 UTC
Created attachment 496471 [details]
/var/log/messages from a diskless system running RHEL-6

/var/log/messages file containing the SysRq traces

Comment 9 Ric Wheeler 2011-08-01 21:37:15 UTC
Hi Ondrej,

Redhat's bugzilla is not really meant to be used to go around the normal Red Hat support channel. You should talk with your Red Hat service people and open a ticket so they can help gather information and prioritize the response.

If you are not a red hat customer, please report this as well to the upstream nfs developer list.

Thanks!

Comment 10 Ondrej Valousek 2011-08-02 07:47:55 UTC
Hi Ric,

I am a RedHat customer and I have opened service case (#00441148) already few months ago. No satisfactory update yet.
This case has been opened so that other people can see the problem.

Ondrej

Comment 14 Sachin Prabhu 2011-08-16 10:11:10 UTC
Ondrej, 

bz 609252 deals with the issue you reported on RHEL 5.

Sachin Prabhu

Comment 15 Jeff Layton 2011-08-16 10:14:48 UTC
There's little info to go on here, but this is almost certainly a duplicate of bug 609252. Marking it as such...

*** This bug has been marked as a duplicate of bug 609252 ***