Bug 797209
Summary: | NFS3 mounts hang on clients after getting "nfsd: Dropping request; may be revisited later" on server | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Stefan Walter <walteste>
Component: | nfs-utils | Assignee: | Steve Dickson <steved>
Status: | CLOSED ERRATA | QA Contact: | Petr Beňas <pbenas>
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 6.2 | CC: | baumanmo, bfields, jiali, jlayton, jwest, ngupta, pbenas, pstehlik, rdassen, rrajaram, spanjikk, steved, vgaikwad, yanwang
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | NFS server | |
Fixed In Version: | nfs-utils-1.2.3-35.el6 | Doc Type: | Bug Fix
Doc Text: | Cause: a bug in rpc.mountd. Consequence: causes the client to hang. Fix: a number of upstream patches. Result: the bug is fixed. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2013-02-21 10:55:06 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 727267, 856347 | |
Attachments: | | |

Description
Stefan Walter
2012-02-24 15:03:20 UTC
After some testing I have now built a new kernel and added a dump_stack() in svc_defer() in net/sunrpc/svc_xprt.c. The result is the following:

    Pid: 8051, comm: nfsd Not tainted 2.6.32-220.4.2.4.el6.x86_64 #1
    Call Trace:
     [<ffffffffa050391e>] ? svc_defer+0x9e/0x160 [sunrpc]
     [<ffffffffa04fec55>] ? cache_check+0x135/0x360 [sunrpc]
     [<ffffffffa059e632>] ? exp_find_key+0x62/0xb0 [nfsd]
     [<ffffffffa059ee44>] ? exp_find+0x24/0x90 [nfsd]
     [<ffffffffa059f209>] ? rqst_exp_find+0x49/0xf0 [nfsd]
     [<ffffffffa0598cd6>] ? fh_verify+0x436/0x640 [nfsd]
     [<ffffffff814ec3c6>] ? printk+0x41/0x43
     [<ffffffffa05a3b1c>] ? nfsd3_proc_getattr+0x6c/0xe0 [nfsd]
     [<ffffffffa059543e>] ? nfsd_dispatch+0xfe/0x240 [nfsd]
     [<ffffffffa04f45a4>] ? svc_process_common+0x344/0x640 [sunrpc]
     [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
     [<ffffffffa04f4be0>] ? svc_process+0x110/0x160 [sunrpc]
     [<ffffffffa0595b62>] ? nfsd+0xc2/0x160 [nfsd]
     [<ffffffffa0595aa0>] ? nfsd+0x0/0x160 [nfsd]
     [<ffffffff81090726>] ? kthread+0x96/0xa0
     [<ffffffff8100c14a>] ? child_rip+0xa/0x20
     [<ffffffff81090690>] ? kthread+0x0/0xa0
     [<ffffffff8100c140>] ? child_rip+0x0/0x20
    nfsd: Dropping request; may be revisited later
    ...
    Pid: 8050, comm: nfsd Not tainted 2.6.32-220.4.2.4.el6.x86_64 #1
    Call Trace:
     [<ffffffffa050391e>] ? svc_defer+0x9e/0x160 [sunrpc]
     [<ffffffffa04fec55>] ? cache_check+0x135/0x360 [sunrpc]
     [<ffffffffa059e632>] ? exp_find_key+0x62/0xb0 [nfsd]
     [<ffffffffa059ee44>] ? exp_find+0x24/0x90 [nfsd]
     [<ffffffffa059f209>] ? rqst_exp_find+0x49/0xf0 [nfsd]
     [<ffffffffa0598cd6>] ? fh_verify+0x436/0x640 [nfsd]
     [<ffffffffa05999b8>] ? nfsd_access+0x38/0x150 [nfsd]
     [<ffffffff814ec3c6>] ? printk+0x41/0x43
     [<ffffffffa05a3cc7>] ? nfsd3_proc_access+0x77/0xd0 [nfsd]
     [<ffffffffa059543e>] ? nfsd_dispatch+0xfe/0x240 [nfsd]
     [<ffffffffa04f45a4>] ? svc_process_common+0x344/0x640 [sunrpc]
     [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
     [<ffffffffa04f4be0>] ? svc_process+0x110/0x160 [sunrpc]
     [<ffffffffa0595b62>] ? nfsd+0xc2/0x160 [nfsd]
     [<ffffffffa0595aa0>] ? nfsd+0x0/0x160 [nfsd]
     [<ffffffff81090726>] ? kthread+0x96/0xa0
     [<ffffffff8100c14a>] ? child_rip+0xa/0x20
     [<ffffffff81090690>] ? kthread+0x0/0xa0
     [<ffffffff8100c140>] ? child_rip+0x0/0x20
    nfsd: Dropping request; may be revisited later

It seems that the export cache is the culprit.

I actually found an easy reproducer that triggers the bug on a simple NFS server:

1. Create the directories /export/test1 and /export/test2 on a server.
2. Create /etc/exports with the following line:
       /export/test1 *(rw,async,no_root_squash,no_subtree_check,fsid=99999)
3. Start the nfs service.
4. Mount /export/test1 on a workstation.
5. On the workstation run something like this:
       cd {mountpoint of server:/export/test1}
       while (true); do date ; ls ; touch x ; sleep 1; done
6. On the server run:
       for i in `seq 1 1000`; do mkdir -p /export/test2/$i ; exportfs -o rw,async,no_root_squash,no_subtree_check,fsid=19898$i @$i:/export/test2/$i ; done

The task on the workstation almost immediately freezes because accesses to server:/export/test1 hang. On the server an "nfsd: Dropping request; may be revisited later" message appears in dmesg.

I have filed BZ798251 against F17 because the same bug is also present there.

After some more testing and experimenting I found the bug in rpc.mountd!
The following part in the function nfsd_fh() in utils/mountd/cache.c is causing our problems:

    if (use_ipaddr) {
        if (ai == NULL) {
            struct addrinfo *tmp;
            tmp = host_pton(dom);
            if (tmp == NULL)
                goto out;
            ai = client_resolve(tmp->ai_addr);
            freeaddrinfo(tmp);
        }
        if (!client_check(exp->m_client, ai))
            continue;
    }
    /* It's a match !! */

'use_ipaddr' on our cluster and in the reproducer is initially 0 and becomes 1 as more exports are added. The code in the if clause tries to parse 'dom' (the export target) as an IP address. host_pton() cannot parse export targets that are netmasks, netgroups or '*' and fails. If rpc.mountd is run with '-F -d all' it will print

    mountd: nfsd_fh: found (nil) path (null)

The quick fix is to ensure that use_ipaddr always stays 0. The case when it is 1 is apparently not well tested anyway. The right thing would be to restructure the code above to fail more gracefully for non-IP exports.

I tried both fixes I suggested in my previous post. Setting use_ipaddr to 0 permanently works fine as far as I can see, but makes rpc.mountd slow. The other approach will IMO be tougher: the whole code that uses use_ipaddr apparently does not really support use_ipaddr changing between 0 and 1 while rpc.mountd runs.

Good catch, but there's something I don't quite understand: the cache_flush(1) in check_useipaddr should have flushed any exports using the use_ipaddr=0 format from the kernel's caches, so it should prevent any more upcalls for such a client. Looking quickly at net/sunrpc/cache.c:write_flush(), I'm not sure we're guaranteed on return that no cache items are referenced; perhaps in-progress requests could keep them around? If so, maybe mountd needs to be prepared to parse netgroups even in the use_ipaddr case?

Another update: I modified the code in comment 5 to look like this:

    if (use_ipaddr) {
        if (ai == NULL) {
            struct addrinfo *tmp;
            tmp = host_pton(dom);
            if (tmp != NULL) {
                ai = client_resolve(tmp->ai_addr);
                freeaddrinfo(tmp);
            }
        }
        if (ai != NULL) {
            if (!client_check(exp->m_client, ai))
                continue;
        }
    }
    /* It's a match !! */

An rpc.mountd patched like this will work fine through the reproducer. But when one mounts 'server:/export/test1' while all exports are there ('use_ipaddr' is 1) and then removes all exports with

    for i in `seq 1 1000`; do exportfs -u @$i:/export/test2/$i; done

the client will get a stale NFS handle. The reason seems to be the following if clause, which becomes true (also in nfsd_fh()):

    if (!use_ipaddr && !client_member(dom, exp->m_client->m_hostname))
        continue;

Again, with '-F -d all' rpc.mountd will print:

    mountd: nfsd_fh: found (nil) path (null)

I haven't tried to find a fix for this one yet.

(In reply to comment #9)
> Another update: I modified the code in comment 5 to look like this:
>
>     if (use_ipaddr) {
>         if (ai == NULL) {
>             struct addrinfo *tmp;
>             tmp = host_pton(dom);
>             if (tmp != NULL) {
>                 ai = client_resolve(tmp->ai_addr);
>                 freeaddrinfo(tmp);
>             }
>         }
>         if (ai != NULL) {
>             if (!client_check(exp->m_client, ai))
>                 continue;
>         }
>     }
>     /* It's a match !! */
>
> An rpc.mountd patched like this will work fine through the reproducer.

It's unclear to me how this is working... all you're doing is ignoring the host_pton() failure instead of erroring out.

My question is: how is the dom string, which is passed up from the kernel, a netmask, netgroup or '*', which causes host_pton() to fail?

With mountd debug enabled (aka -d all), what is the 'nfsd_fh: inbuf' debug output which happens at the beginning of nfsd_fh()?
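As an aside, here is a small standalone demonstration of the parsing problem being discussed. It is an illustration only, using glibc's inet_pton() rather than nfs-utils' host_pton(), and the sample strings are made up; the point is simply that a wildcard, netgroup or netmask export target is not a literal address:

    /* demo.c: which export-target strings parse as literal IPv4 addresses?
     * Build with: cc -o demo demo.c
     */
    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        const char *targets[] = { "129.132.10.7", "*", "@trusted", "129.132.0.0/16" };
        struct in_addr addr;

        for (unsigned i = 0; i < sizeof(targets) / sizeof(targets[0]); i++) {
            /* inet_pton() returns 1 only for a well-formed literal address. */
            int ok = inet_pton(AF_INET, targets[i], &addr);
            printf("%-16s -> %s\n", targets[i],
                   ok == 1 ? "literal address"
                           : "not an address (needs name/netgroup matching)");
        }
        return 0;
    }

Only the first string parses; the other three are exactly the kind of targets that show up as 'dom' in the failing upcalls above.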
Also, is it valid for use_ipaddr to become non-zero in the first place? Do you have a large number of netgroups being parsed?

(In reply to comment #10)
> (In reply to comment #9)
> > Another update: I modified the code in comment 5 to look like this:
> >
> >     if (use_ipaddr) {
> >         if (ai == NULL) {
> >             struct addrinfo *tmp;
> >             tmp = host_pton(dom);
> >             if (tmp != NULL) {
> >                 ai = client_resolve(tmp->ai_addr);
> >                 freeaddrinfo(tmp);
> >             }
> >         }
> >         if (ai != NULL) {
> >             if (!client_check(exp->m_client, ai))
> >                 continue;
> >         }
> >     }
> >     /* It's a match !! */
> >
> > An rpc.mountd patched like this will work fine through the reproducer.
> It's unclear to me how this is working... all you're doing is ignoring the host_pton() failure instead of erroring out.
>
> My question is: how is the dom string, which is passed up from the kernel, a netmask, netgroup or '*', which causes host_pton() to fail?

'dom' contains exactly the string of the export target, such as '*', '@123', '123.45.67.89/16', which all fail to be parsed by host_pton(). I verified that with printf(). The code above indeed does just continue because the verification cannot succeed. And I'm sure that there are security implications too - the client should be verified at this point by some means. The code was intended to show that even if you fix this right at that spot alone, it will not be enough and will just uncover more bugs. The whole 'use_ipaddr mode' needs to be revisited.

> With mountd debug enabled (aka -d all), what is the 'nfsd_fh: inbuf' debug output which happens at the beginning of nfsd_fh()?

    mountd: nfsd_fh: inbuf '* 1 \x9f860100'

> Also, is it valid for use_ipaddr to become non-zero in the first place? Do you have a large number of netgroups being parsed?

Yes, our entire NFS3 access management currently relies on netgroups. When dumped to a file (from LDAP), /etc/netgroup is 1165 lines and 63 KB. Apparently we are the only people who seriously use them. :) An rpc.mountd rigged with printf()s clearly shows me that use_ipaddr starts with 0 and becomes 1.

My stable fix so far is to modify check_useipaddr() in utils/mountd/auth.c to only ever switch from -1 to 0, but that has a performance penalty: a mount takes two seconds, with rpc.mountd and sssd_* totaling 100% load on a 1-cpu VM.

check_useipaddr() depends on a write to auth.unix.ip/flush to remove all the client names of the previous type from the auth.unix.ip cache before returning. However, from inspection of the kernel code I don't believe it does that, for two reasons:

- cache_clean() appears to remove only entries with a last_update time not later than the current second; entries created in the last second will remain.
- cache_clean() removes entries from the cache, but can't do anything about entries currently in use by nfsd threads. Those nfsd threads may still do export upcalls using those entries.

That strikes me as a bug in the /flush interface. Note that exportfs also depends on it to remove old entries when changing the export table.

Possible fixes for the use_ipaddr bug:

- We could try to fix /flush. I don't see how to do that yet. Note that whatever solution we come up with has to be careful about waiting on nfsd threads, since if those threads are themselves waiting on mountd, we could create a deadlock.
- As noted in comment 11, we could disable the use_ipaddr feature entirely in return for a severe performance regression for some setups. I don't consider that acceptable.
- We could try to teach the export upcall to handle client names of either type. (Is it possible to do that unambiguously?)

Created attachment 579035 [details]
try resolving client as either ip address or list of "domains"
Apologies for the delay. Here's an attempt at the third approach above. Stefan Walter, would this be easy for you to test with your reproducer? WARNING: it compiles, but I haven't yet tested it otherwise.
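For orientation, here is a rough sketch of what that third approach amounts to inside nfsd_fh()'s matching loop. This is an illustration of the idea only, not the code in attachment 579035; it reuses the helper names already shown in the snippets above (host_pton, client_resolve, client_check, client_member):

    if (use_ipaddr) {
        if (ai == NULL) {
            struct addrinfo *tmp;

            /* First see whether the upcalled name is a literal address
             * (the format mountd itself writes in use_ipaddr mode). */
            tmp = host_pton(dom);
            if (tmp != NULL) {
                ai = client_resolve(tmp->ai_addr);
                freeaddrinfo(tmp);
            }
        }
        if (ai != NULL) {
            /* It parsed as an address: match it against the export's client. */
            if (!client_check(exp->m_client, ai))
                continue;
        } else {
            /* It did not parse ('*', a netgroup, a netmask, a hostname):
             * fall back to ordinary client-name matching instead of
             * dropping the request. */
            if (!client_member(dom, exp->m_client->m_hostname))
                continue;
        }
    }
    /* It's a match !! */

The actual patch may structure this differently; the point is only that a name which fails address parsing still has to be matched somehow rather than dropped.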
I have built a custom nfs-utils package that contains the patch from comment 15 and the patches from BZ 812364 and deployed that on a test system. Both test cases from comments 2 and 9 work fine; there are no hanging mounts and no ESTALE errors.

I did, however, accidentally try to mount the test share with nfsvers=4 while all 1000 additional exports from comment 2 were present. The mount command hung and I could see the following with tcpdump:

    15:34:42.496976 IP chibi.inf.ethz.ch.3345383294 > debu.inf.ethz.ch.nfs: 172 getattr fh 0,0/22
    15:34:42.497013 IP debu.inf.ethz.ch.nfs > chibi.inf.ethz.ch.vpps-qua: Flags [.], ack 819275482, win 5185, options [nop,nop,TS val 6371042 ecr 6397866], length 0
    15:34:47.497550 IP debu.inf.ethz.ch.nfs > chibi.inf.ethz.ch.3345383294: reply ok 44 getattr ERROR: Request couldn't be completed in time
    15:34:47.498161 IP chibi.inf.ethz.ch.vpps-qua > debu.inf.ethz.ch.nfs: Flags [.], ack 48, win 279, options [nop,nop,TS val 6402867 ecr 6376043], length 0
    15:35:02.497586 IP chibi.inf.ethz.ch.3362160510 > debu.inf.ethz.ch.nfs: 172 getattr fh 0,0/22
    15:35:02.497631 IP debu.inf.ethz.ch.nfs > chibi.inf.ethz.ch.vpps-qua: Flags [.], ack 177, win 5201, options [nop,nop,TS val 6391043 ecr 6417867], length 0
    15:35:07.497453 IP debu.inf.ethz.ch.nfs > chibi.inf.ethz.ch.3362160510: reply ok 44 getattr ERROR: Request couldn't be completed in time
    15:35:07.497986 IP chibi.inf.ethz.ch.vpps-qua > debu.inf.ethz.ch.nfs: Flags [.], ack 96, win 279, options [nop,nop,TS val 6422867 ecr 6396043], length 0

rpc.mountd is not involved in this case (strace shows it's sleeping). I guess the reproducer triggers a related bug in the nfs4 code in the kernel.

(In reply to comment #18)
> I have built a custom nfs-utils package that contains the patch from comment 15 and the patches from BZ 812364 and deployed that on a test system. Both test cases from comments 2 and 9 work fine; there are no hanging mounts and no ESTALE errors.

Thanks for confirming. Is the performance penalty mentioned in comment 11 ("a mount takes two seconds with rpc.mountd and sssd_* totaling 100% load on a 1-cpu VM") also gone?

> I did, however, accidentally try to mount the test share with nfsvers=4 while all 1000 additional exports from comment 2 were present. The mount command hung and I could see the following with tcpdump:
> [tcpdump output quoted above]
> rpc.mountd is not involved in this case (strace shows it's sleeping). I guess the reproducer triggers a related bug in the nfs4 code in the kernel.

That doesn't look familiar. But the tcpdump output isn't very informative. Could you get a binary capture (tcpdump -s0 -wtmp.pcap) and attach it? Given that mountd isn't involved, we may want to open a new bug.

Also, the contents of /proc/net/rpc/nfsd.fh/content and /proc/net/rpc/nfsd.export captured after the problem is seen might be interesting.

(In reply to comment #19)
> Thanks for confirming. Is the performance penalty mentioned in comment 11 ("a mount takes two seconds with rpc.mountd and sssd_* totaling 100% load on a 1-cpu VM") also gone?

Yes, the mount succeeds immediately and no particular load is seen on the server.

> That doesn't look familiar. But the tcpdump output isn't very informative. Could you get a binary capture (tcpdump -s0 -wtmp.pcap) and attach it? Given that mountd isn't involved, we may want to open a new bug.
>
> Also, the contents of /proc/net/rpc/nfsd.fh/content and /proc/net/rpc/nfsd.export captured after the problem is seen might be interesting.

I will open a new case for that. After some more tests I noticed that the first mount of a share from a client is still slow, with high load on the server. Unmounting and mounting the same share again is instantaneous. The authentication info seems to be cached in this case.

Created attachment 581507 [details]
make use_ipaddr clients self-describing

After further discussion upstream (http://thread.gmane.org/gmane.linux.nfs/48853) I think we have an improved fix. Untested rhel6 backport attached.

I have tested the new patch (together with the patches for BZ 812364) and it did not work. I also tried a modified patch with

    + if (dom[0] == '$') {
    + ai = lookup_client_addr(dom+1);

as the comment http://article.gmane.org/gmane.linux.nfs/49311 implies; same result. The mount command on the client just hangs. Here is what I get as debug output from 'rpc.mountd -F -d all' for a hanging mount:

    rpc.mountd: from_local: updating local if addr list
    rpc.mountd: from_local: checked 6 local if addrs; incoming address not found
    rpc.mountd: check_default: access by 129.132.10.7 ALLOWED
    rpc.mountd: Received NULL request from 129.132.10.7
    rpc.mountd: check_default: access by 129.132.10.7 ALLOWED (cached)
    rpc.mountd: Received NULL request from 129.132.10.7
    rpc.mountd: check_default: access by 129.132.10.7 ALLOWED (cached)
    rpc.mountd: Received MNT3(/export/test1) request from 129.132.10.7
    rpc.mountd: authenticated mount request from puchi.inf.ethz.ch:700 for /export/test1 (/export/test1)
    rpc.mountd: nfsd_fh: inbuf '$129.132.10.7 1 \x9f860100'
    rpc.mountd: nfsd_fh: found (nil) path (null)

Sorry, I made a mistake when I tested the package with the modified patch. The package built with the patch modified as follows works fine for both test cases (mounting and ls after adding or removing 1000 exports):

    - ai = lookup_client_addr(dom);
    + ai = lookup_client_addr(dom+1);

So that looks like a good fix, thanks.

(In reply to comment #25)
> The package built with the patch modified as follows works fine for both test cases (mounting and ls after adding or removing 1000 exports):
>
>     - ai = lookup_client_addr(dom);
>     + ai = lookup_client_addr(dom+1);
>
> So that looks like a good fix, thanks.

Thanks again for the testing. The nfsd_export upcall also needs the same fix, so I've fixed the implementation of lookup_client_addr() rather than the caller.
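To make the "self-describing" convention concrete, here is a minimal standalone sketch; it is an illustration only, with hypothetical helper names, not the nfs-utils implementation. In use_ipaddr mode the client name handed to the kernel carries a '$' prefix, so any upcalled name starting with '$' is unambiguously an address, and everything else is matched as a domain, netgroup or wildcard:

    /* selfdesc.c: sketch of the '$'-prefixed client-name convention.
     * resolve_as_address() and match_as_domain() are hypothetical stand-ins. */
    #include <stdio.h>

    static void resolve_as_address(const char *addr)
    {
        printf("address client: %s\n", addr);
    }

    static void match_as_domain(const char *dom)
    {
        printf("domain client:  %s\n", dom);
    }

    static void handle_upcall_name(const char *dom)
    {
        if (dom[0] == '$')
            resolve_as_address(dom + 1);   /* strip the '$' before the lookup */
        else
            match_as_domain(dom);          /* '*', '@netgroup', netmask, hostname */
    }

    int main(void)
    {
        handle_upcall_name("$129.132.10.7");   /* use_ipaddr format, cf. the inbuf debug line */
        handle_upcall_name("*");
        handle_upcall_name("@trusted");
        return 0;
    }

This is also why the tested fix had to strip the prefix (lookup_client_addr(dom+1), or equivalently inside lookup_client_addr() itself) before resolving the address.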
At the same time I also noticed a pre-existing bug in the nfsd_export() call that will cause it to fail in the use_ipaddr case. (Which *might* also explain the failure you were seeing in the nfsv4 case.) The version I've submitted upstream: http://thread.gmane.org/gmane.linux.nfs/49326

Here are the three upstream patches that fix this problem:

    commit c2544b77566690ebec32a2d47c9249548b1a0941
    Author: J. Bruce Fields <bfields>
    Date:   Thu May 3 15:06:21 2012 -0400

        mountd: prepend '$' to make use_ipaddr clients self-describing

    commit 236025253c6ec534cd816390464435ec18c499f7
    Author: J. Bruce Fields <bfields>
    Date:   Thu May 3 15:01:20 2012 -0400

        mountd: add trivial helpers for client-matching

    commit 005c49a063544b4005b5b7c3c28ed48c197f66ac
    Author: J. Bruce Fields <bfields>
    Date:   Thu May 3 14:59:27 2012 -0400

        mountd: parse ip address earlier

Reproduced in nfs-utils-1.2.3-28.el6.x86_64 and verified in nfs-utils-1.2.3-29.el6.x86_64.

*** Bug 798251 has been marked as a duplicate of this bug. ***

Re-assigning the bug since a regression was introduced; please check https://bugzilla.redhat.com/show_bug.cgi?id=876847#c61 through https://bugzilla.redhat.com/show_bug.cgi?id=876847#c71.

Hi Petr, since the bug has been re-fixed and you are the QA contact, could you help to retest it? Thanks.

Verified in nfs-utils-1.2.3-35.el6.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0468.html