Description of problem: An NFSv4 client will hang when writing a large amount of data to an NFS share for a prolonged period of time. There is a deadlock scenario involving rpc.idmapd and rpciod:

- The NFS client is doing lots of writes.
- The client system has many dirty and "nfs unstable" pages.
- The NFS client (an rpciod kernel thread) goes to handle a write reply from the NFS server.
- The reply contains a GETATTR response, which means the client must map the username returned by the server to a numeric user ID.
- The NFS client normally gets this mapping from a cache, idmap_hashtable, but this time the cache has no valid entry for the username because the entry has expired.
- The NFS client kernel code calls out to the userspace rpc.idmapd process (via an RPC pipe) to look up the user ID.
- rpc.idmapd does something that requires a memory allocation, such as reading /etc/passwd, and ends up waiting for memory reclaim.
- Since most dirty pages are NFS write cache, memory reclaim involves waiting for NFS to commit some writes. However, this will never happen: rpciod would normally do this, and it is currently waiting for rpc.idmapd to return the user ID so it can handle the most recent reply it got from the NFS server.
I think the following kernel stacks (from a kdump triggered with Alt-SysRq-c during a deadlock) may illustrate this more clearly:

PID: 8813   TASK: ffff81008303d040  CPU: 3   COMMAND: "rpciod/3"
 #0 [ffff81007f839b60] schedule at ffffffff80062f4b
 #1 [ffff81007f839c38] nfs_idmap_id at ffffffff88512ce0
 #2 [ffff81007f839cf8] decode_getfattr at ffffffff8850c589
 #3 [ffff81007f839d88] nfs4_xdr_dec_write at ffffffff8850ee73
 #4 [ffff81007f839de8] call_decode at ffffffff882ef90f
 #5 [ffff81007f839e18] __rpc_execute at ffffffff882f4a56
 #6 [ffff81007f839e38] run_workqueue at ffffffff8004cf77
 #7 [ffff81007f839e78] worker_thread at ffffffff80049946
 #8 [ffff81007f839ee8] kthread at ffffffff80032569
 #9 [ffff81007f839f48] kernel_thread at ffffffff8005dfb1

PID: 5967   TASK: ffff810037e63820  CPU: 1   COMMAND: "rpc.idmapd"
 #0 [ffff8100b3a69448] schedule at ffffffff80062f4b
 #1 [ffff8100b3a69520] nfs_wait_bit_interruptible at ffffffff884fb033
 #2 [ffff8100b3a69530] __wait_on_bit at ffffffff800639be
 #3 [ffff8100b3a69570] out_of_line_wait_on_bit at ffffffff80063a58
 #4 [ffff8100b3a695e0] nfs_wait_on_request at ffffffff884faff7
 #5 [ffff8100b3a69620] nfs_wait_on_requests_locked at ffffffff884fea02
 #6 [ffff8100b3a69670] nfs_sync_inode_wait at ffffffff884ffb7d
 #7 [ffff8100b3a696e0] nfs_release_page at ffffffff884f5e6c
 #8 [ffff8100b3a69700] shrink_inactive_list at ffffffff800c4c3c
 #9 [ffff8100b3a698f0] shrink_zone at ffffffff80012c49
#10 [ffff8100b3a69930] try_to_free_pages at ffffffff800c5651
#11 [ffff8100b3a699b0] __alloc_pages at ffffffff8000f234
#12 [ffff8100b3a69a10] __handle_mm_fault at ffffffff80008c4b
#13 [ffff8100b3a69ab0] do_page_fault at ffffffff8006686f
#14 [ffff8100b3a69ba0] error_exit at ffffffff8005dde9
    [exception RIP: file_read_actor+37]
    RIP: ffffffff8000cdd9  RSP: ffff8100b3a69c58  RFLAGS: 00010246
    RAX: 000000000001006c  RBX: ffff810009b87528  RCX: 00000000000005d2
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00002b8ca0381000
    RBP: 00000000000005d2  R8:  ffff8100b3a69ce8  R9:  0000000000000001
    R10: 0000000000000020  R11: ffffffffffffffff  R12: ffff8100b3a69d88
    R13: 0000000000001000  R14: 0000000000000000  R15: ffff8100b3a69d88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#15 [ffff8100b3a69c50] page_cache_readahead at ffffffff80013aa9
#16 [ffff8100b3a69c80] do_generic_mapping_read at ffffffff8000bf25
#17 [ffff8100b3a69d70] __generic_file_aio_read at ffffffff8000c2d6
#18 [ffff8100b3a69de0] generic_file_aio_read at ffffffff80016839
#19 [ffff8100b3a69e00] do_sync_read at ffffffff8000cadd
#20 [ffff8100b3a69f10] vfs_read at ffffffff8000b370
#21 [ffff8100b3a69f40] sys_read at ffffffff80011778
#22 [ffff8100b3a69f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b8ca0eb33a0  RSP: 00007fff0a72dab8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000001000  RSI: 00002b8ca0381000  RDI: 000000000000000c
    RBP: 00002b8cae785750  R8:  00000000ffffffff  R9:  0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00000000000003ff
    R13: 0000000000000000  R14: 00002b8cae785b60  R15: 000000000000000a
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

My tests were done with nfs-utils-1.0.9-35z.el5_2 and kernel-2.6.18-92.1.22.el5. This also reproduces (with slightly different stacks, but the same basic code path) on a custom-compiled 2.6.27 kernel.

How reproducible: A simple test case will reproduce the deadlock every time on my test setup.

Steps to Reproduce:

Setup: Client and server are both commodity desktop hardware running RHEL 5.2 (but I can reproduce with other distros acting as the server, and this is presumably not RHEL-specific on the client side either).

On SERVER:
1. Create an NFSv4 share (my export has sync,rw,no_root_squash).
2. Start monitoring network throughput.

On CLIENT:
1. Mount the NFSv4 share (I'm using mount -t nfs4 -o hard,intr,proto=tcp,timeo=600,retrans=15 ...).
2.
To speed up reproduction, reduce idmap_cache_timeout (otherwise it will still reproduce, but may take several days):

   # echo 1 > /proc/sys/fs/nfs/idmap_cache_timeout

3. Start some dd processes, each writing zeros to a different file on the share. (The bug reproduces with a single writer, but will reproduce faster with more; 10 is good.)

Monitor network throughput on the server. After a few minutes (anywhere from 2 to 10 in my tests), throughput will drop to near zero. On the client, you'll find that one of the rpciod kernel threads is waiting in nfs_idmap_id (the others are idle), and the rpc.idmapd process is waiting in nfs_wait_bit_interruptible. At this point, anything that tries to allocate memory will hang.

One potential workaround is to give rpc.idmapd realtime priority. This way, __alloc_pages tries harder to find free pages before doing synchronous reclaim (and getting stuck in try_to_free_pages). This fixes the deadlock in my testing, but of course may have undesirable side effects.
Event posted on 07-10-2009 10:35am EDT by jkachuck This event sent from IssueTracker by jkachuck issue 315313
Event posted on 07-10-2009 10:36am EDT by jkachuck

From client: This issue can be recreated on an ext3 file system. But the sosreport command hangs when the NFS client hangs, so I cannot get that from the NFS client. I can get a sosreport from the NFS server; if that's what you need, I can attach it.

------- Comment From ffilz.com 2009-07-09 17:05 EDT-------
I had a conversation with Steve French and we've got a thought for how to fix this. If you could get the source rpm installed (if it isn't already), I can code up the fix and put a new kernel on. I'd also like to verify that the problem occurs on latest mainline, and if so (and I expect so), I can post the fix upstream as well. Basically, the fix Steve and I came up with is to set things up so nfs4_xdr_dec_write() can pass a flag down to nfs_idmap_id() indicating it cannot block on the upcall. If nfs_idmap_id() would have to block, it would return EWOULDBLOCK, which would cause decode_getfattr() to fail; but nfs4_xdr_dec_write() doesn't care about the return value from decode_getfattr(), and nfs_idmap_id() can already return EIO.

------- Comment From wangwhao.com 2009-07-09 22:46 EDT-------
Hi, Frank: Where will your efix code be, in the Linux kernel or in the NFS packages? Please note that the current cluster has the kernel patch for bug 52649; without that patch, the NFS client does not hang every time. I suppose the efix for bug 52649 is included in RHEL 5.4. If that's the case and your code is also in the Linux kernel, please do include the 52649 code in your test kernel, so that I can correctly verify whether the kernel patch works. Thanks.

------- Comment From ffilz.com 2009-07-10 01:48 EDT-------
If you can install the RHEL 5.4 kernel source, I will apply both the fix for bug 52649 (if it isn't in RHEL 5.4 already) and my proposed fix, then we can test it.

------- Comment From wangwhao.com 2009-07-10 03:32 EDT-------
As I wrote before, with the RHEL 5.4 beta kernel I hit an input/output error in my program, while the NFS client stays alive. I am not sure whether this is a new issue or not.
If the input/output error is another bug, we cannot verify whether your efix works with the RHEL 5.4 kernel. Will you have a look at the input/output error with the RHEL 5.4 kernel? Or could you, or anyone, check whether the bug 52649 efix code is included in RHEL 5.4? The kernel patch version is 2.6.18-138.el5.jtltest.74. Thanks.
Event posted on 07-10-2009 10:38am EDT by Glen Johnson

------- Comment From ffilz.com 2009-07-10 10:32 EDT-------
I can verify from the source whether the patch is included in the RHEL 5.4 beta. If not, we should try to re-create with the patch applied. If you could put the RHEL 5.3 and RHEL 5.4 beta source rpms on the machine, I will work out what we need to do next. I don't want to get hung up on the IO error.
Event posted on 07-22-2009 06:22pm EDT by Glen Johnson

------- Comment From ffilz.com 2009-07-22 18:18 EDT-------
We are having trouble re-creating this bug on RHEL 5.4. We have applied the patch from IBM Bug 52649 - RIT283765 - Operations to all nfs mount hang, updatedb running (RH BZ#495059). I am curious what other patches are in Jeff Layton's 2.6.18-138.el5.jtltest.74 test kernel; perhaps one of the other patches is the cause, especially if there are some patches other than 52649 that are not in RHEL 5.4.
Event posted on 08-04-2009 02:06pm EDT by Glen Johnson

------- Comment From ffilz.com 2009-08-04 14:03 EDT-------
At this point, this bug appears to be exposed only when running the particular test kernel in which the fix for bug 52649 appeared. We have been unable to re-create it using the RHEL 5.4 beta with the fix for 52649 patched in. The deadlock occurring here is one that can be seen to be possible when examining the structure of the code involved; however, it is clearly extremely unlikely to be hit. It seems that the particular conditions created by the test kernel happened to make the deadlock extremely easy to hit. There is a proposed fix for this deadlock, but there has been no discussion yet in the community, and the fix does have some downsides. Given that, and the unlikeliness of hitting the deadlock, it doesn't seem to be a good idea to take the fix. Based on that, I am deferring the bug to RHEL 5.5. I think we should also reduce the severity of this bug.

Ticket type changed from 'Problem' to ''
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
This particular problem seems to have been reported again in bz 609252.