Bug 254195
Description
Jeff Layton
2007-08-24 17:35:22 UTC
I've not seen this on RHEL5, but a brief look seems to indicate that it has the same problem. There are 2 pieces to fixing this, I think:

1) make it so that nlm_destroy_host actually calls rpc_shutdown_client (like upstream does), which should make it cancel any outstanding RPC calls and callbacks when lockd goes down. Upstream already does this, so it's probably just a matter of backporting patches for it.

2) make it so that lockd can't be started until the old lockd is completely down. The current code is set up in such a way as to allow a new lockd to start before the old one goes down all the way, but there are data structures that are not adequately protected for this.

I proposed a patch upstream for #2, but it got no comments. I'll see if I can revive it (or at least prod along some discussion of whether we need to allow for more than one lockd to run at once).

I've spent some more time today tracking this down and have found a few interesting things. I've experimented with this patch:

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 38b0e8a..098594d 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -331,15 +331,8 @@ nlm_gc_hosts(void)
 			/* Don't unmonitor hosts that have been invalidated */
 			if (host->h_monitored && !host->h_killed)
 				nsm_unmonitor(host);
-			if ((clnt = host->h_rpcclnt) != NULL) {
-				if (atomic_read(&clnt->cl_users)) {
-					printk(KERN_WARNING
-						"lockd: active RPC handle\n");
-					clnt->cl_dead = 1;
-				} else {
-					rpc_destroy_client(host->h_rpcclnt);
-				}
-			}
+			if ((clnt = host->h_rpcclnt) != NULL)
+				rpc_shutdown_client(host->h_rpcclnt);
 			kfree(host);
 			nrhosts--;
 		}

...but it does not fix the issue. The problem is that on RHEL5, when shutting down the server's lockd while there's an outstanding grant callback, h_count is never 0. The host in question is skipped, and rpc_shutdown_client (and rpc_destroy_client) is never called.

On more recent kernels, when the grant callback comes in, h_rpcclnt is almost always NULL for the host and we end up calling rpc_create for it. That always fails since it can't talk to the portmapper, so the RPC call is never attempted (and svc_wake_up isn't called). This is not always the case though; I've had a couple of occurrences where recent kernels have crashed. The trick seems to be to catch it so that h_rpcclnt is not NULL when nlm_bind_host is called, but I haven't figured out a way to reliably do that.

In short, my earlier theory of why recent kernels don't seem to hit this as easily is wrong. The suggestion Trond made on the upstream mailing list to add some new reference counting to lockd will probably actually fix this after all. For now though, I'm focusing on getting this to reliably reproduce on rawhide.

I think I can now reproduce this at will on recent upstream kernels. The recent sunrpc changes make it a little tougher to reproduce than on RHEL4/5, but the problem is still there. The trick is to make sure that the nlm_host has a valid and bound rpc_clnt when the traffic is blocked and the callback comes in. So it's important to do a couple of lock/block/unlock/callback operations before blocking the traffic with iptables.

I think Trond's suggestion here will actually fix this:

--------------[snip]-----------------
For cleanliness, I suggest adding a new refcount to lockd: nlmsvc_ref.
Replace the test for (nlmsvc_users || !signalled()) in lockd() with a test for nlmsvc_ref != 0, and then have lockd_up() take one reference to it whenever it bumps nlmsvc_users == 0 / lockd_down() release a reference whenever it decrements nlmsvc_users == 1. Have nlmsvc_insert_block() take a reference whenever adding something to an empty nlm_blocked / nlmsvc_unlink_block release the reference whenever nlm_blocked is emptied.
-------------[snip]------------------

I'm going to look at implementing it.

Created attachment 277081 [details]
patch -- add reference counting to lockd
First pass at a patch to implement Trond's suggestion. This cleans up the lockd
startup/shutdown and adds a new reference counter. A side benefit of this patch
is that only one lockd can run at a time. If lockd_up is done while lockd is
still waiting for the nlmsvc_ref to go to 0, then the old lockd will just stay
up.
The problem is that this doesn't actually fix the bug. With the reproducer,
lockd still goes down before the async callback is done. I need to do some
debugging work, but I'm thinking we'll have to take a reference in more places
than just the ones Trond suggested.
The other issue is that this triggers a lockdep warning. I don't think it's
possible to deadlock since the only lock taker is lockd itself. Regardless, I
think I can work around this by making nlmsvc_ref an atomic_t and not requiring
the nlmsvc_mutex around it.
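For reference, a rough sketch of the shape I have in mind. The names follow Trond's suggestion, but the exact placement, locking, and the atomic_t conversion are my own reading here, not necessarily what the attached patch does:

static atomic_t nlmsvc_ref = ATOMIC_INIT(0);	/* pins the lockd thread */

int lockd_up(void)
{
	mutex_lock(&nlmsvc_mutex);
	if (nlmsvc_users++ == 0)
		atomic_inc(&nlmsvc_ref);	/* first user takes a reference */
	/* ... start the thread if it isn't already running ... */
	mutex_unlock(&nlmsvc_mutex);
	return 0;
}

void lockd_down(void)
{
	mutex_lock(&nlmsvc_mutex);
	if (--nlmsvc_users == 0)
		atomic_dec(&nlmsvc_ref);	/* last user drops its reference */
	mutex_unlock(&nlmsvc_mutex);
}

/* in lockd()'s main loop, instead of testing (nlmsvc_users || !signalled()): */
while (atomic_read(&nlmsvc_ref) != 0) {
	/* ... svc_recv()/svc_process() as before ... */
}

nlmsvc_insert_block() would likewise take a reference whenever it adds an entry to an empty nlm_blocked, and nlmsvc_unlink_block() would drop it when the list empties again.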
Created attachment 278361 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown
I think this patch fixes most of the issues from the old patch and does fix the
original problem. The main differences are:
1) nlmsvc_ref has been changed to atomic_t, so there's no need for the
nlmsvc_mutex locking when changing it.
2) we take an extra reference when an nlm_granted callback is issued, and drop
it after the callback completes and lockd has been woken via svc_wake_up.
Simply tying the nlmsvc_ref to the list_empty(&nlm_blocked) is not sufficient.
When lockd comes down it invalidates all of the locks (via nlm_traverse_files)
and ends up emptying the list.
Dec 5 08:02:23 dhcp231-229 kernel: lockd: unlinking block ffff8100144c5000...
If the rpc client is still active though, nlm_gc_hosts won't kill it:
Dec 5 08:03:19 dhcp231-229 kernel: lockd: server dhcp231-236.rdu.redhat.com
not responding, still trying
Dec 5 08:03:42 dhcp231-229 kernel: lockd: request from 10.11.231.236, port=822
Dec 5 08:03:42 dhcp231-229 kernel: lockd: shutting down host module
Dec 5 08:03:42 dhcp231-229 kernel: lockd: nuking all hosts...
Dec 5 08:03:42 dhcp231-229 kernel: lockd: host garbage collection
Dec 5 08:03:42 dhcp231-229 kernel: lockd: nlmsvc_mark_resources
Dec 5 08:03:42 dhcp231-229 kernel: nlm_gc_hosts skipping
dhcp231-236.rdu.redhat.com (cnt 1 use 1 exp 4295561826)
Dec 5 08:03:42 dhcp231-229 kernel: lockd: couldn't shutdown host module!
Dec 5 08:03:42 dhcp231-229 kernel: lockd: 1 hosts left:
Dec 5 08:03:42 dhcp231-229 kernel: dhcp231-236.rdu.redhat.com (cnt 1
use 1 exp 4295561826)
...so lockd goes down anyway because the nlm_blocked list was emptied, but
there are still outstanding async rpc calls. This patch fixes this by taking an
extra reference in nlm_granted callbacks, but I'm not certain that this is the
best fix. It might be better to change how nlm_gc_hosts works so that it
cancels the RPC calls in this situation.
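To make that concrete, here's roughly what the extra reference looks like in this patch as I understand it, assuming the atomic_t nlmsvc_ref described above (the exact functions touched and error handling may differ from the attached patch):

static void nlmsvc_grant_blocked(struct nlm_block *block)
{
	int error;

	/* ... set up block->b_call as before ... */

	atomic_inc(&nlmsvc_ref);		/* async GRANT_MSG in flight */
	error = nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
						&nlmsvc_grant_ops);
	if (error < 0)
		atomic_dec(&nlmsvc_ref);	/* the call never went out */
}

static void nlmsvc_grant_callback(struct rpc_task *task, void *data)
{
	struct nlm_rqst *call = data;
	struct nlm_block *block = call->a_block;

	/* ... requeue or finish the block as before ... */

	svc_wake_up(block->b_daemon);
	atomic_dec(&nlmsvc_ref);		/* callback finished, unpin lockd */
}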
Created attachment 280141 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown (upstream)
Actually, I've had good results with this after all. I must have made a mistake
in my testing before when I thought I had reproduced the problem. It doesn't
seem to be necessary to take an extra reference in nlmsvc_grant_blocked() after
all.
This is the patch against current upstream code. I'll plan to submit it there
next week.
Created attachment 280151 [details]
patch -- correctly handle errors from lockd_up (rhel5)
This is a foundational patch. It's not strictly required for the fix to this
bug, but it makes the fix much cleaner.
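As a rough illustration of what "correctly handle errors" means here (the caller below is just an example; the specific call sites the RHEL5 patch touches may differ, and the lockd_up signature varies by kernel version):

static int start_nfs_services(void)		/* hypothetical caller */
{
	int error;

	error = lockd_up();
	if (error < 0)
		return error;	/* propagate instead of silently continuing */

	/* ... start the rest of the services ... */
	return 0;
}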
Created attachment 280161 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown (rhel5)
Backport of the patch in comment #8. I've verified that this fixes the reproducible oops on RHEL5.

Patch sent upstream to the linux-nfs list. Awaiting comment at this point.

The only comment I got was from Christoph Hellwig, who said:
> might be better to do the refcounting outside the thread and use the
> kthread api, which is something we still need to do for lockd anyway.
I've spent the last day or so working on an implementation of that and have sent
a patchset upstream to the linux-nfs list entitled:
Subject: [PATCH 0/6] Intro: convert lockd to kthread and fix use-after-free
...unfortunately we can't move the refcounting outside the thread altogether,
but I've at least got an implementation that uses kthreads (and also fixes a
couple of other bugs I stumbled across).
Awaiting upstream comment at this point...
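For context, the rough shape of the kthread-based lockd (a sketch of the approach in the patchset as I understand it, not the patches verbatim; nlmsvc_task and nlmsvc_rqst are the names I'm using here):

#include <linux/kthread.h>

static int lockd(void *vrqstp)
{
	struct svc_rqst *rqstp = vrqstp;
	int err;

	/* ... signal setup, grace period handling, etc. ... */

	while (!kthread_should_stop()) {
		long timeout = MAX_SCHEDULE_TIMEOUT;

		/* wait for an NLM request (or a stop/signal) and handle it */
		err = svc_recv(rqstp, timeout);
		if (err == -EAGAIN || err == -EINTR)
			continue;
		svc_process(rqstp);
	}

	/* ... flush signals, release the svc_rqst, clean up ... */
	return 0;
}

/* lockd_up() then creates the thread with something like: */
nlmsvc_task = kthread_run(lockd, nlmsvc_rqst, "lockd");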
The only upstream comment on the last set was from HCH, and it was just a recommendation to export svc_prepare_thread and have lockd_up call it directly. I've made some changes to the patchset based on the recommendation and have sent a new set to linux-nfs. New patchset starts with this email:

Subject: [PATCH 0/7] Intro: convert lockd to kthread and fix use-after-free (try #2)

Progress report... I've posted several patchsets in the last few weeks. The most recent was today, and basically has nlm_shutdown_hosts do an rpc_killall_tasks on the host->h_rpcclnt before trying to GC them. This seems to fix the problem and doesn't require that lockd stay up past the last lockd_down call.

...that should read that I've posted several patchsets upstream in the last few weeks.

The fix for this is still being hashed out there. On the bright side, the upstream fix should correct a lot of ambiguity about how lockd startup and shutdown should work. The current implementation is a bit of a mess...

Patchset has been taken into Bruce Fields' git tree upstream. After it's had some soak time upstream, I'll plan to do a more targeted fix of this problem for 5.3. I suspect we might be able to get away with the last patch in the set that just does an rpc_killall_tasks on each host->h_rpcclnt, but we'll have to test that...

I found a problem with the last patch in the set and had to respin it. I *think* we can probably just use that last patch to fix this problem, though I need to test it.

Yeargh....I've got a patch that does pretty much what upstream does for this -- we kill off the RPC tasks when shutting down hosts. On RHEL5 though, I *still* get this message when I take down lockd:

lockd: nlmsvc_mark_resources
nlm_gc_hosts skipping 10.11.231.229 (cnt 2 use 1 exp 4294894621)
lockd: couldn't shutdown host module!

...we don't have any more RPC tasks in queue at this point, so I think that we may have some sort of nlm_block refcount leak. Still, that seems to be an entirely different issue, so I'll probably just run with the patch that I have to fix this problem and open a new BZ for that issue...

Created attachment 306390 [details]
patch 1 - tear down rpc_clients in nlm_shutdown_hosts
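Roughly, the idea of patch 1 (a simplified sketch; the iteration, locking, and cleanup details in the attached patch may differ):

void nlm_shutdown_hosts(void)
{
	struct hlist_head *chain;
	struct hlist_node *pos;
	struct nlm_host *host;

	mutex_lock(&nlm_host_mutex);

	/* Mark every host expired and kill any RPC tasks still bound to
	 * its client, so nothing keeps the host pinned past GC. */
	for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
		hlist_for_each_entry(host, pos, chain, h_hash) {
			host->h_expires = jiffies - 1;
			if (host->h_rpcclnt)
				rpc_killall_tasks(host->h_rpcclnt);
		}
	}

	/* now the normal garbage collection can actually reap the hosts */
	nlm_gc_hosts();
	mutex_unlock(&nlm_host_mutex);
}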
Created attachment 306391 [details]
patch 2 - don't reattempt GRANT_MSG when there is already an RPC in flight
Created attachment 306392 [details]
patch 3 -- don't requeue block if it was invalidated while GRANT_MSG was in flight
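The gist of patches 2 and 3, sketched together. The flag names B_IN_GRANT_RPC and B_INVALIDATED are made up here for illustration; the real patches may structure this quite differently:

static void nlmsvc_grant_blocked(struct nlm_block *block)
{
	/* patch 2: if a GRANT_MSG is already in flight for this block,
	 * don't fire another one; park the block and let the outstanding
	 * callback requeue it when it completes. */
	if (block->b_flags & B_IN_GRANT_RPC) {
		nlmsvc_insert_block(block, NLM_NEVER);
		return;
	}
	block->b_flags |= B_IN_GRANT_RPC;

	/* ... issue the async NLMPROC_GRANTED_MSG call as before ... */
}

static void nlmsvc_grant_callback(struct rpc_task *task, void *data)
{
	struct nlm_rqst *call = data;
	struct nlm_block *block = call->a_block;

	block->b_flags &= ~B_IN_GRANT_RPC;

	/* patch 3: if lockd invalidated the block while the RPC was in
	 * flight (in this sketch, nlmsvc_unlink_block() would set
	 * B_INVALIDATED), don't put it back on nlm_blocked. */
	if (block->b_flags & B_INVALIDATED)
		return;

	/* otherwise requeue it for retry and wake up lockd */
	nlmsvc_insert_block(block, 10 * HZ);
	svc_wake_up(block->b_daemon);
}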
Variants of these three patches went upstream a few months ago and seem to fix this problem in RHEL5, at least against the reproducer that I have.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-100.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html