Description of problem: An NFSv4 client will hang when writing a large amount of data to an NFS share for a prolonged period of time. There is a deadlock scenario involving rpc.idmapd and rpciod:

- The NFS client is doing lots of writes.
- The client system has many dirty and "nfs unstable" pages.
- The NFS client (an rpciod kernel thread) goes to handle a write reply from the NFS server.
- The reply contains a GETATTR response, which means the client must map the username returned by the server to a numeric user ID.
- The NFS client normally gets this mapping from a cache, idmap_hashtable, but this time the cache has no valid entry for the username because the entry has expired.
- The NFS client kernel code calls out to the userspace rpc.idmapd process (via an RPC pipe) to look up the user ID.
- rpc.idmapd does something that requires a memory allocation, such as reading /etc/passwd, and ends up waiting for memory reclaim.
- Since most dirty pages are NFS write cache, memory reclaim involves waiting for NFS to commit some writes. However, this will never happen: rpciod would normally do this, and it is currently waiting for rpc.idmapd to return the user ID so it can handle the most recent reply it got from the NFS server.
I think the following kernel stacks (from a kdump triggered with Alt-SysRq-c during a deadlock) may illustrate this more clearly:

PID: 8813   TASK: ffff81008303d040  CPU: 3   COMMAND: "rpciod/3"
 #0 [ffff81007f839b60] schedule at ffffffff80062f4b
 #1 [ffff81007f839c38] nfs_idmap_id at ffffffff88512ce0
 #2 [ffff81007f839cf8] decode_getfattr at ffffffff8850c589
 #3 [ffff81007f839d88] nfs4_xdr_dec_write at ffffffff8850ee73
 #4 [ffff81007f839de8] call_decode at ffffffff882ef90f
 #5 [ffff81007f839e18] __rpc_execute at ffffffff882f4a56
 #6 [ffff81007f839e38] run_workqueue at ffffffff8004cf77
 #7 [ffff81007f839e78] worker_thread at ffffffff80049946
 #8 [ffff81007f839ee8] kthread at ffffffff80032569
 #9 [ffff81007f839f48] kernel_thread at ffffffff8005dfb1

PID: 5967   TASK: ffff810037e63820  CPU: 1   COMMAND: "rpc.idmapd"
 #0 [ffff8100b3a69448] schedule at ffffffff80062f4b
 #1 [ffff8100b3a69520] nfs_wait_bit_interruptible at ffffffff884fb033
 #2 [ffff8100b3a69530] __wait_on_bit at ffffffff800639be
 #3 [ffff8100b3a69570] out_of_line_wait_on_bit at ffffffff80063a58
 #4 [ffff8100b3a695e0] nfs_wait_on_request at ffffffff884faff7
 #5 [ffff8100b3a69620] nfs_wait_on_requests_locked at ffffffff884fea02
 #6 [ffff8100b3a69670] nfs_sync_inode_wait at ffffffff884ffb7d
 #7 [ffff8100b3a696e0] nfs_release_page at ffffffff884f5e6c
 #8 [ffff8100b3a69700] shrink_inactive_list at ffffffff800c4c3c
 #9 [ffff8100b3a698f0] shrink_zone at ffffffff80012c49
#10 [ffff8100b3a69930] try_to_free_pages at ffffffff800c5651
#11 [ffff8100b3a699b0] __alloc_pages at ffffffff8000f234
#12 [ffff8100b3a69a10] __handle_mm_fault at ffffffff80008c4b
#13 [ffff8100b3a69ab0] do_page_fault at ffffffff8006686f
#14 [ffff8100b3a69ba0] error_exit at ffffffff8005dde9
    [exception RIP: file_read_actor+37]
    RIP: ffffffff8000cdd9  RSP: ffff8100b3a69c58  RFLAGS: 00010246
    RAX: 000000000001006c  RBX: ffff810009b87528  RCX: 00000000000005d2
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00002b8ca0381000
    RBP: 00000000000005d2  R8:  ffff8100b3a69ce8  R9:  0000000000000001
    R10: 0000000000000020  R11: ffffffffffffffff  R12: ffff8100b3a69d88
    R13: 0000000000001000  R14: 0000000000000000  R15: ffff8100b3a69d88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#15 [ffff8100b3a69c50] page_cache_readahead at ffffffff80013aa9
#16 [ffff8100b3a69c80] do_generic_mapping_read at ffffffff8000bf25
#17 [ffff8100b3a69d70] __generic_file_aio_read at ffffffff8000c2d6
#18 [ffff8100b3a69de0] generic_file_aio_read at ffffffff80016839
#19 [ffff8100b3a69e00] do_sync_read at ffffffff8000cadd
#20 [ffff8100b3a69f10] vfs_read at ffffffff8000b370
#21 [ffff8100b3a69f40] sys_read at ffffffff80011778
#22 [ffff8100b3a69f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b8ca0eb33a0  RSP: 00007fff0a72dab8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000001000  RSI: 00002b8ca0381000  RDI: 000000000000000c
    RBP: 00002b8cae785750  R8:  00000000ffffffff  R9:  0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00000000000003ff
    R13: 0000000000000000  R14: 00002b8cae785b60  R15: 000000000000000a
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

My tests were done with nfs-utils-1.0.9-35z.el5_2 and kernel-2.6.18-92.1.22.el5. This also reproduces (with slightly different stacks, but the same basic code path) on a custom-compiled 2.6.27 kernel.

How reproducible: A simple test case will reproduce the deadlock every time on my test setup.

Steps to Reproduce:

Setup: Client and server are both commodity desktop hardware running RHEL 5.2 (but I can reproduce with other distros acting as the server, and this is presumably not RHEL-specific on the client side either).

On SERVER:
1. Create an NFSv4 share (my export has sync,rw,no_root_squash).
2. Start monitoring network throughput.

On CLIENT:
1. Mount the NFSv4 share (I'm using mount -t nfs4 -o hard,intr,proto=tcp,timeo=600,retrans=15 ...).
2.
To speed up reproduction, reduce idmap_cache_timeout (otherwise it will still reproduce, but may take several days):

   # echo 1 > /proc/sys/fs/nfs/idmap_cache_timeout

3. Start some dd processes, each writing zeros to a different file on the share. (The bug reproduces with a single writer, but will reproduce faster with more; 10 is good.)

Monitor network throughput on the server. After a few minutes (anywhere from 2 to 10 in my tests), throughput will drop to near zero. On the client, you'll find that one of the rpciod kernel threads is waiting in nfs_idmap_id (the others are idle), and the rpc.idmapd process is waiting in nfs_wait_bit_interruptible. At this point, anything that tries to allocate memory will hang.

One potential workaround is to give rpc.idmapd realtime priority. This way, __alloc_pages tries harder to find free pages before doing synchronous reclaim (and getting stuck in try_to_free_pages). This fixes the deadlock in my testing, but of course may have undesirable side effects.
Event posted on 07-10-2009 10:35am EDT by jkachuck This event sent from IssueTracker by jkachuck issue 315313
Event posted on 07-10-2009 10:36am EDT by jkachuck

From client: This issue can be recreated on an ext3 file system. But the sosreport command hangs when the NFS client hangs, so I cannot get that from the NFS client. I can get a sosreport from the NFS server; if that's what you need, I can attach it.

------- Comment From ffilz.com 2009-07-09 17:05 EDT-------
I had a conversation with Steve French and we've got a thought for how to fix this. If you could get the source rpm installed (if it isn't already), I can code up the fix and put a new kernel on. I'd also like to verify that the problem occurs on latest mainline, and if so (and I expect so), I can post the fix upstream as well. Basically, the fix Steve and I came up with is to set things up so nfs4_xdr_dec_write() can pass a flag down to nfs_idmap_id() indicating it cannot block on the upcall. If nfs_idmap_id() would have to block, it would return EWOULDBLOCK, which would cause decode_getfattr() to fail; but nfs4_xdr_dec_write() doesn't care about the return value from decode_getfattr(), and nfs_idmap_id() can already return EIO.

------- Comment From wangwhao.com 2009-07-09 22:46 EDT-------
Hi, Frank: Where will your efix code be, in the Linux kernel or in the NFS packages? Please note that the current cluster has the kernel patch for bug 52649; without that patch, the NFS client does not hang every time. I suppose the efix for bug 52649 is included in RHEL 5.4. If that's the case and your code is also in the Linux kernel, please do include the 52649 code in your test kernel, so that I can correctly verify whether the kernel patch works. Thanks.

------- Comment From ffilz.com 2009-07-10 01:48 EDT-------
If you can install the RHEL 5.4 kernel source, I will apply both the fix for bug 52649 (if it isn't in RHEL 5.4 already) and my proposed fix, then we can test it.

------- Comment From wangwhao.com 2009-07-10 03:32 EDT-------
As I wrote before, with the RHEL 5.4 beta kernel I hit an input/output error in my program, while the NFS client stays alive. I am not sure whether this is a new issue or not.
If the input/output error is another bug, we cannot verify whether your efix works with the RHEL 5.4 kernel. Will you have a look at the input/output error with the RHEL 5.4 kernel? Or could you, or anyone, check whether the bug 52649 efix code is included in RHEL 5.4? The kernel patch version is 2.6.18-138.el5.jtltest.74. Thanks.
Event posted on 07-10-2009 10:38am EDT by Glen Johnson

------- Comment From ffilz.com 2009-07-10 10:32 EDT-------
I can verify from the source whether the patch is included in the RHEL 5.4 beta. If not, we should try to re-create with the patch applied. If you could put the RHEL 5.3 and RHEL 5.4 beta source rpms on the machine, I will work out what we need to do next. I don't want to get hung up on the IO error.
Event posted on 07-22-2009 06:22pm EDT by Glen Johnson

------- Comment From ffilz.com 2009-07-22 18:18 EDT-------
We are having trouble re-creating this bug on RHEL 5.4. We have applied the patch from IBM Bug 52649 - RIT283765 - Operations to all nfs mount hang, updatedb running (RH BZ#495059). I am curious what other patches are in Jeff Layton's 2.6.18-138.el5.jtltest.74 test kernel; perhaps one of the other patches is the cause, especially if there are some patches other than 52649 that are not in RHEL 5.4.
Event posted on 08-04-2009 02:06pm EDT by Glen Johnson

------- Comment From ffilz.com 2009-08-04 14:03 EDT-------
At this point, this bug appears to be exposed only when running the particular test kernel in which the fix for bug 52649 appeared. We have been unable to re-create it using the RHEL 5.4 beta with the fix for 52649 patched in. The deadlock occurring here is one that can be seen to be possible when examining the structure of the code involved; however, it is clearly extremely unlikely to be hit. It seems that the particular conditions created by the test kernel happened to make the deadlock extremely easy to hit. There is a proposed fix for this deadlock, but there has been no discussion yet in the community, and the fix does have some downsides. Given that, and the unlikeliness of hitting the deadlock, it doesn't seem to be a good idea to take the fix. Based on that, I am deferring the bug to RHEL 5.5. I think we should also reduce the severity of this bug.

Ticket type changed from 'Problem' to ''
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
This particular problem seems to have been reported again in bz 609252.