+++ This bug was initially created as a clone of Bug #253754 +++

-- Additional comment from jburke on 2007-08-20 19:21 EST --

Jeff,
I found out that there was a second issue last week while I was on vacation. Steve D was looking at it. It looks like it may have panicked (but I am not sure), so this may or may not be a different issue. Here is a link to the vmcore from when the system had an issue:

ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-08-15-11:36

Thanks,
Jeff

-- Additional comment from jlayton on 2007-08-20 21:35 EST --

This issue looks different, but I suppose there's a chance that they ultimately have the same cause:

kernel BUG at include/asm/spinlock.h:133!

That error means that the spinlock magic was corrupt:

void svc_wake_up(struct svc_serv *serv)
{
        struct svc_rqst *rqstp;

        spin_lock_bh(&serv->sv_lock);

...so serv->sv_lock was bad (and that likely means that the whole svc_serv struct was bad). The ring buffer in the core didn't have any info from before the panic, so it's hard to tell what happened leading up to the crash. Here's the stack:

PID: 3618   TASK: f63deeb0  CPU: 0   COMMAND: "rpciod"
 #0 [caeb6dc0] netpoll_start_netdump at f8a52570
 #1 [caeb6de0] die at c0106045
 #2 [caeb6e14] do_invalid_op at c0106420
 #3 [caeb6ec4] error_code (via invalid_op) at c02d51dd
    EAX: f8aa7c45  EBX: c1afd518  ECX: c02e6fa1  EDX: c02e6fa1  EBP: c28ea808
    DS:  007b      ESI: c1afd500  ES:  007b      EDI: ec2a4a80
    CS:  0060      EIP: c02d33c2  ERR: ffffffff  EFLAGS: 00010216
 #4 [caeb6f00] _spin_lock_bh at c02d33c2
 #5 [caeb6f08] svc_wake_up at f8aa7c40
 #6 [caeb6f14] nlmsvc_grant_callback at f8c41f1d
 #7 [caeb6f2c] __rpc_execute at f8aa50f7
 #8 [caeb6f84] __rpc_schedule at f8aa520c
 #9 [caeb6f90] rpciod_killall at f8aa5ad8
#10 [caeb6f9c] rpciod at f8aa5a52
#11 [caeb6ff0] kernel_thread_helper at c01041f3

rpciod_killall() is in the stack here, which indicates that rpciod was in the process of shutting down at the time. It looks like lockd was in the process of shutting down too:

crash> bt
PID: 3617   TASK: f63de930  CPU: 1   COMMAND: "lockd"
 #0 [f2c26f14] schedule at c02d25c9
 #1 [f2c26f78] wait_for_completion at c02d26f6
 #2 [f2c26fcc] rpciod_down at f8aa5cca
 #3 [f2c26fd4] lockd at f8c40f31
 #4 [f2c26ff0] kernel_thread_helper at c01041f3

I don't see any NFS filesystems in the mount table, so I'm guessing this happened after unmounting the last NFS filesystem on the box. Is there already a separate BZ for this?

-- Additional comment from jlayton on 2007-08-21 10:30 EST --

Ahh, Jeff pointed out that this is actually an NFS server, so lockd should only be shutting down if nfs is going down. Here's something odd though -- there are 2 lockd threads in the process list:

   3617      1   1  f63de930  UN   0.0       0      0  [lockd]
  27857      1   1  c1b46e30  UN   0.0       0      0  [lockd]

The second thread looks like this:

crash> set 27857
    PID: 27857
COMMAND: "lockd"
   TASK: c1b46e30  [THREAD_INFO: f152b000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 27857  TASK: c1b46e30  CPU: 1   COMMAND: "lockd"
 #0 [f152bf20] schedule at c02d25c9
 #1 [f152bf84] __sched_text_start at c02d1bb4
 #2 [f152bfb8] __down_failed at c02d1d2b
 #3 [f152bfc8] .text.lock.sched (via rpciod_up) at f8aa5e92
 #4 [f152bfd4] lockd at f8c40d7f
 #5 [f152bff0] kernel_thread_helper at c01041f3

I think it's blocked on the nlmsvc_sema (which is probably being held by whoever is calling lockd_down). One possibility: lockd was taking an inordinate amount of time to come down, nfs was restarted or something, and a second lockd thread was started...
-- Additional comment from jlayton on 2007-08-21 12:14 EST --

Err... actually the second lockd thread is blocked on the rpciod_sema from rpciod_up. This makes sense, since rpciod is coming down and the other lockd is holding that semaphore in rpciod_down.

My guess is that rpciod was taking a long time to come down, and the svc_serv struct that it crashed while accessing was freed and reused before it could get to it. Either that or we have memory corruption of some sort...

-- Additional comment from jlayton on 2007-08-24 10:15 EST --

Ok, I think I see what could be the issue here. From lockd():

        /* Exit the RPC thread */
        svc_exit_thread(rqstp);

        /* release rpciod */
        rpciod_down();

...svc_exit_thread does this:

        /* Release the server */
        if (serv)
                svc_destroy(serv);

which eventually kfree's serv. So this codepath seems to pretty much always be using memory after it has been freed. This might be reproducible with slab poisoning turned on, presuming you can hit it so that there's an outstanding lock call that needs to be cleaned up when rpciod_down is called.

It looks like a recent upstream patch removed the rpciod_up/down calls from lockd, but I don't think that will work for RHEL4.

-- Additional comment from jlayton on 2007-08-24 11:12 EST --

Created an attachment (id=172425) untested proposed patch

I think a patch like this would probably fix this particular issue (though it's untested so far, even for compilation). It just separates out the freeing of "serv" from everything else, and delays freeing it until after rpciod_down returns.

The problem is going to be reproducing and testing this, since this race looks like it'll be tricky to hit. The race is (I'm pretty sure) something like this:

lockd
  nlmsvc_retry_blocked
    nlmsvc_grant_blocked
      nlmsvc_async_call

...which queues up nlmsvc_granted_callback for rpciod. Then, before that call actually gets serviced by rpciod, you'll need lockd to begin exiting and call rpciod_down. It's probably reproducible but may take a while.

-- Additional comment from jlayton on 2007-08-24 12:31 EST --

...that last comment should be nlmsvc_grant_callback (not granted)...

I'm thinking a reproducer could go something like this:

1) local process locks file on exported filesystem on NFS server
2) nfs client tries to lock same file and blocks waiting for callback from server
3) set up iptables rules to drop traffic between client and server
4) have local process unlock file
5) do service nfs stop on server
6) remove iptables rules to allow communication to continue

The iptables rule would hopefully delay the callback long enough to allow lockd to call svc_exit_thread. We'd also want the server to run a kernel with memory poisoning.
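To make the ordering idea behind that proposed patch concrete, here is a rough sketch (illustrative only, not the actual attachment; it assumes the patch splits svc_exit_thread() so the final svc_destroy() of serv can be deferred):

    #include <linux/sunrpc/svc.h>	/* svc_exit_thread, svc_destroy */
    #include <linux/sunrpc/sched.h>	/* rpciod_down (RHEL4-era sunrpc) */

    /* Illustrative sketch of the shutdown-ordering fix, not the attached
     * patch. The point is that serv must stay allocated until rpciod has
     * drained, so a late nlmsvc_grant_callback() running from
     * rpciod_killall() can't call svc_wake_up() on freed memory. */
    static void lockd_exit_sketch(struct svc_rqst *rqstp, struct svc_serv *serv)
    {
    	/* Tear down this thread's rqstp, but don't free serv here.
    	 * (Assumes the svc_exit_thread/svc_destroy split that the patch
    	 * introduces -- in the stock code this call frees serv.) */
    	svc_exit_thread(rqstp);

    	/* Wait for rpciod to finish (or kill) every queued task, including
    	 * any pending NLM grant callbacks that still reference serv. */
    	rpciod_down();

    	/* Only now is it safe to free the svc_serv. */
    	svc_destroy(serv);
    }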
I've not seen this on RHEL5, but a brief look seems to indicate that it has the same problem.
There are 2 pieces to fixing this, I think:

1) make it so that nlm_destroy_host actually calls rpc_shutdown_client, which should make it cancel any outstanding RPC calls and callbacks when lockd goes down. Upstream already does this, so it's probably just a matter of backporting patches for it.

2) make it so that lockd can't be started until the old lockd is completely down. The current code is set up in such a way as to allow a new lockd to start before the old one goes down all the way, but there are data structures that are not adequately protected for this.

I proposed a patch upstream for #2, but it got no comments. I'll see if I can revive it (or at least prod along some discussion of whether we need to allow for more than one lockd to run at once).
I've spent some more time today tracking this down and have found a few interesting things. I've experimented with this patch:

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 38b0e8a..098594d 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -331,15 +331,8 @@ nlm_gc_hosts(void)
 			/* Don't unmonitor hosts that have been invalidated */
 			if (host->h_monitored && !host->h_killed)
 				nsm_unmonitor(host);
-			if ((clnt = host->h_rpcclnt) != NULL) {
-				if (atomic_read(&clnt->cl_users)) {
-					printk(KERN_WARNING
-						"lockd: active RPC handle\n");
-					clnt->cl_dead = 1;
-				} else {
-					rpc_destroy_client(host->h_rpcclnt);
-				}
-			}
+			if ((clnt = host->h_rpcclnt) != NULL)
+				rpc_shutdown_client(host->h_rpcclnt);
 			kfree(host);
 			nrhosts--;
 		}

...but it does not fix the issue. The problem is that on RHEL5, when shutting down the server's lockd while there's an outstanding grant callback, h_count is never 0. The host in question is skipped and rpc_shutdown_client (and rpc_destroy_client) is never called.

On more recent kernels, when the grant callback comes in, h_rpcclnt is almost always NULL for the host and we end up calling rpc_create for it. That always fails since it can't talk to the portmapper, so the RPC call is never attempted (and svc_wake_up isn't called). This is not always the case though; I've had a couple of occurrences where recent kernels have crashed. The trick seems to be to catch it so that h_rpcclnt is not NULL when nlm_bind_host is called, but I haven't figured out a way to reliably do that.

In short, my earlier theory of why recent kernels don't seem to hit this as easily is wrong. The suggestion Trond made on the upstream mailing list to add some new reference counting to lockd will probably actually fix this after all. For now though, I'm focusing on getting this to reliably reproduce on rawhide.
I think I can now reproduce this at will on recent upstream kernels. The recent sunrpc changes make it a little tougher to reproduce than on RHEL4/5, but the problem is still there. The trick is to make sure that the nlm_host has a valid and bound rpc_clnt when the traffic is blocked and the callback comes in. So it's important to do a couple of lock/block/unlock/callback operations before blocking the traffic with iptables.

I think Trond's suggestion here will actually fix this:

--------------[snip]-----------------
For cleanliness, I suggest adding a new refcount to lockd: nlmsvc_ref.

Replace the test for (nlmsvc_users || !signalled()) in lockd() with a test for nlmsvc_ref != 0, and then have lockd_up() take one reference to it whenever it bumps nlmsvc_users == 0 / lockd_down() release a reference whenever it decrements nlmsvc_users == 1.

Have nlmsvc_insert_block() take a reference whenever adding something to an empty nlm_blocked / nlmsvc_unlink_block release the reference whenever nlm_blocked is emptied.
-------------[snip]------------------

I'm going to look at implementing it.
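A minimal sketch of that scheme, assuming the counter ends up as an atomic_t (which is where the later comments land); everything with a _sketch suffix is hypothetical and only meant to illustrate where references are taken and dropped, not the posted patch:

    #include <linux/list.h>
    #include <asm/atomic.h>
    #include <linux/lockd/lockd.h>	/* struct nlm_block */

    /* Hypothetical sketch of Trond's scheme. lockd()'s main loop tests
     * atomic_read(&nlmsvc_ref) != 0 instead of (nlmsvc_users || !signalled());
     * anything that needs lockd to stay alive holds a reference. */
    static atomic_t nlmsvc_ref = ATOMIC_INIT(0);
    static unsigned long nlmsvc_users;
    static LIST_HEAD(nlm_blocked);

    static void lockd_up_ref_sketch(void)	/* caller holds nlmsvc_mutex */
    {
    	if (nlmsvc_users++ == 0)
    		atomic_inc(&nlmsvc_ref);	/* first user pins lockd */
    }

    static void lockd_down_ref_sketch(void)	/* caller holds nlmsvc_mutex */
    {
    	if (--nlmsvc_users == 0)
    		atomic_dec(&nlmsvc_ref);	/* last user drops the pin */
    }

    static void nlmsvc_insert_block_sketch(struct nlm_block *block)
    {
    	if (list_empty(&nlm_blocked))
    		atomic_inc(&nlmsvc_ref);	/* pending block pins lockd */
    	list_add_tail(&block->b_list, &nlm_blocked);
    }

    static void nlmsvc_unlink_block_sketch(struct nlm_block *block)
    {
    	list_del_init(&block->b_list);
    	if (list_empty(&nlm_blocked))
    		atomic_dec(&nlmsvc_ref);	/* list drained, drop the pin */
    }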
Created attachment 277081 [details] patch -- add reference counting to lockd

First pass at a patch to implement Trond's suggestion. This cleans up the lockd startup/shutdown and adds a new reference counter. A side benefit of this patch is that only one lockd can run at a time. If lockd_up is done while lockd is still waiting for the nlmsvc_ref to go to 0, then the old lockd will just stay up.

The problem is that this doesn't actually fix the bug. With the reproducer, lockd still goes down before the async callback is done. I need to do some debugging work, but I'm thinking we'll have to take a reference in more places than just the ones Trond suggested.

The other issue is that this triggers a lockdep warning. I don't think it's possible to deadlock since the only lock taker is lockd itself. Regardless, I think I can work around this by making nlmsvc_ref an atomic_t and not requiring the nlmsvc_mutex around it.
Created attachment 278361 [details] patch -- Add lockd reference counting and clean up lockd startup/shutdown

I think this patch fixes most of the issues from the old patch and does fix the original problem. The main differences are:

1) nlmsvc_ref has been changed to atomic_t, so there's no need for the nlmsvc_mutex locking when changing it.

2) we take an extra reference when the nlm_granted callback is done, and drop the reference after the svc_wake_up of lockd is done.

Simply tying nlmsvc_ref to list_empty(&nlm_blocked) is not sufficient. When lockd comes down it invalidates all of the locks (via nlm_traverse_files) and ends up emptying the list:

Dec 5 08:02:23 dhcp231-229 kernel: lockd: unlinking block ffff8100144c5000...

If the rpc client is still active though, nlm_gc_hosts won't kill it:

Dec 5 08:03:19 dhcp231-229 kernel: lockd: server dhcp231-236.rdu.redhat.com not responding, still trying
Dec 5 08:03:42 dhcp231-229 kernel: lockd: request from 10.11.231.236, port=822
Dec 5 08:03:42 dhcp231-229 kernel: lockd: shutting down host module
Dec 5 08:03:42 dhcp231-229 kernel: lockd: nuking all hosts...
Dec 5 08:03:42 dhcp231-229 kernel: lockd: host garbage collection
Dec 5 08:03:42 dhcp231-229 kernel: lockd: nlmsvc_mark_resources
Dec 5 08:03:42 dhcp231-229 kernel: nlm_gc_hosts skipping dhcp231-236.rdu.redhat.com (cnt 1 use 1 exp 4295561826)
Dec 5 08:03:42 dhcp231-229 kernel: lockd: couldn't shutdown host module!
Dec 5 08:03:42 dhcp231-229 kernel: lockd: 1 hosts left:
Dec 5 08:03:42 dhcp231-229 kernel:        dhcp231-236.rdu.redhat.com (cnt 1 use 1 exp 4295561826)

...so lockd goes down anyway because the nlm_blocked list was emptied, but there are still outstanding async rpc calls. This patch fixes that by taking an extra reference in nlm_granted callbacks, but I'm not certain that this is the best fix. It might be better to change how nlm_gc_hosts works so that it cancels the RPC calls in this situation.
Created attachment 280141 [details] patch -- Add lockd reference counting and clean up lockd startup/shutdown (upstream) Actually, I've had good results with this after all. I must have made a mistake in my testing before when I thought I had reproduced the problem. It doesn't seem to be necessary to take an extra reference in nlmsvc_grant_blocked() after all. This is the patch against current upstream code. I'll plan to submit it there next week.
Created attachment 280151 [details] patch -- correctly handle errors from lockd_up (rhel5)

This is a foundational patch. It's not strictly required for the fix to this bug, but it makes the fix much cleaner.
Created attachment 280161 [details] patch -- Add lockd reference counting and clean up lockd startup/shutdown (rhel5)

Backport of the patch in comment #8. I've verified that this fixes the reproducible oops on RHEL5.
Patch sent upstream to linux-nfs list. Awaiting comment at this point.
The only comment I got was from Christoph Hellwig who said:

> might be better to do the refcounting outside the thread and use the
> kthread api, which is something we still need to do for lockd anyway.

I've spent the last day or so working on an implementation of that and have sent a patchset upstream to the linux-nfs list entitled:

Subject: [PATCH 0/6] Intro: convert lockd to kthread and fix use-after-free

...unfortunately we can't move the refcounting outside the thread altogether, but I've at least got an implementation that uses kthreads (and also fixes a couple of other bugs I stumbled across). Awaiting upstream comment at this point...
The only upstream comment on the last set was from HCH, and it was just a recommendation to export svc_prepare_thread and have lockd_up call it directly. I've made some changes to the patchset based on the recommendation and have sent a new set to linux-nfs. The new patchset starts with this email:

Subject: [PATCH 0/7] Intro: convert lockd to kthread and fix use-after-free (try #2)
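Roughly, the kthread-based start/stop looks like the sketch below. This is not the posted patchset: the error handling is simplified and the svc_prepare_thread() usage reflects my reading of the newly exported helper, so treat those details as assumptions.

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sunrpc/svc.h>

    static struct task_struct *nlmsvc_task;
    static struct svc_rqst *nlmsvc_rqst;
    static int lockd(void *vrqstp);

    /* Sketch of a kthread-based lockd_up(); assumes svc_prepare_thread()
     * allocates and registers the svc_rqst without spawning a thread
     * itself, leaving thread creation to the caller. */
    static int lockd_start_sketch(struct svc_serv *serv)
    {
    	nlmsvc_rqst = svc_prepare_thread(serv, &serv->sv_pools[0]);
    	if (IS_ERR(nlmsvc_rqst))
    		return PTR_ERR(nlmsvc_rqst);

    	/* lockd() now loops until kthread_should_stop() instead of
    	 * watching for signals. */
    	nlmsvc_task = kthread_run(lockd, nlmsvc_rqst, "lockd");
    	if (IS_ERR(nlmsvc_task)) {
    		int err = PTR_ERR(nlmsvc_task);
    		svc_exit_thread(nlmsvc_rqst);
    		return err;
    	}
    	return 0;
    }

    /* Matching lockd_down() path: kthread_stop() blocks until lockd() has
     * returned, which closes the window where a second lockd could be
     * started while the old one is still tearing down. */
    static void lockd_stop_sketch(void)
    {
    	kthread_stop(nlmsvc_task);
    	nlmsvc_task = NULL;
    }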
Progress report... I've posted several patchsets in the last few weeks. The most recent was today, and basically has nlm_shutdown_hosts do an rpc_killall_tasks on the host->h_rpcclnt before trying to GC them. This seems to fix the problem and doesn't require that lockd stay up past the last lockd_down call.
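The shape of that change, roughly (a sketch, not the posted patch: the hash-table walk and locking mirror my reading of fs/lockd/host.c at the time, so treat those details as assumptions):

    #include <linux/mutex.h>
    #include <linux/sunrpc/clnt.h>	/* rpc_killall_tasks */
    #include <linux/lockd/lockd.h>

    /* Sketch of killing outstanding RPC tasks (e.g. pending GRANT_MSG
     * callbacks) on each host's client before the usual GC pass, so that
     * nothing is left in rpciod's queue that could dereference lockd's
     * svc_serv after lockd has exited. */
    static void nlm_shutdown_hosts_sketch(void)
    {
    	struct hlist_head *chain;
    	struct hlist_node *pos;
    	struct nlm_host *host;

    	mutex_lock(&nlm_host_mutex);
    	for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
    		hlist_for_each_entry(host, pos, chain, h_hash) {
    			host->h_expires = jiffies - 1;	/* make GC-eligible */
    			if (host->h_rpcclnt)
    				rpc_killall_tasks(host->h_rpcclnt);
    		}
    	}
    	/* ... then nlmsvc_mark_resources() and nlm_gc_hosts() as before ... */
    	mutex_unlock(&nlm_host_mutex);
    }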
...that should read that I've posted several patchsets upstream in the last few weeks. The fix for this is still being hashed out there. On the bright side, the upstream fix should correct a lot of ambiguity about how lockd startup and shutdown should work. The current implementation is a bit of a mess...
Patchset has been taken into Bruce Fields' git tree upstream. After it's had some soak time upstream, I'll plan to do a more targeted fix of this problem for 5.3. I suspect we might be able to get away with the last patch in the set that just does a rpc_killall_tasks on each host->h_rpcclnt, but we'll have to test that...
I found a problem with the last patch in the set and had to respin it. I *think* we can probably just use that last patch to fix this problem, though I need to test it.
Yeargh... I've got a patch that does pretty much what upstream does for this -- we kill off the RPC tasks when shutting down hosts. On RHEL5 though, I *still* get this message when I take down lockd:

lockd: nlmsvc_mark_resources
nlm_gc_hosts skipping 10.11.231.229 (cnt 2 use 1 exp 4294894621)
lockd: couldn't shutdown host module!

...we don't have any more RPC tasks in queue at this point, so I think that we may have some sort of nlm_block refcount leak. Still, that seems to be an entirely different issue, so I'll probably just run with the patch that I have to fix this problem and open a new BZ for that issue...
Created attachment 306390 [details] patch 1 - tear down rpc_clients in nlm_shutdown_hosts
Created attachment 306391 [details] patch 2 - don't reattempt GRANT_MSG when there is already an RPC in flight
Created attachment 306392 [details] patch 3 -- don't requeue block if it was invalidated while GRANT_MSG was in flight
Variants of these three patches went upstream a few months ago and seem to fix this problem in RHEL5, at least against the reproducer that I have.
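For reference, the idea behind the second patch looks roughly like the sketch below. B_GRANT_IN_FLIGHT is a hypothetical flag name used only for illustration; the real patch may track the in-flight call differently.

    #include <linux/lockd/lockd.h>

    #define B_GRANT_IN_FLIGHT	8	/* hypothetical b_flags bit */

    /* Sketch of suppressing duplicate GRANT_MSG calls for the same block.
     * If the previous async GRANT_MSG hasn't completed, just reschedule
     * the block instead of queueing a second RPC that could race with the
     * callback (or with lockd shutdown). */
    static void nlmsvc_grant_blocked_sketch(struct nlm_block *block)
    {
    	if (block->b_flags & B_GRANT_IN_FLIGHT) {
    		nlmsvc_insert_block(block, 10 * HZ);	/* retry later */
    		return;
    	}
    	block->b_flags |= B_GRANT_IN_FLIGHT;

    	/* ... issue the async NLMPROC_GRANTED_MSG call here. The RPC
    	 * completion callback clears B_GRANT_IN_FLIGHT and, per the third
    	 * patch, only requeues the block if it wasn't invalidated while
    	 * the call was in flight. */
    }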
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-100.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html