Bug 254195

Summary:       use after free in nlm subsystem
Product:       Red Hat Enterprise Linux 5
Component:     kernel
Version:       5.0
Hardware:      All
OS:            Linux
Status:        CLOSED ERRATA
Severity:      low
Priority:      low
Reporter:      Jeff Layton <jlayton>
Assignee:      Jeff Layton <jlayton>
QA Contact:    Martin Jenner <mjenner>
CC:            dzickus, jburke, jiali, staubach, steved
URL:           http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=324671
Doc Type:      Bug Fix
Last Closed:   2009-01-20 20:23:32 UTC
Bug Blocks:    425461
Attachments:
  patch -- add reference counting to lockd (flags: none)
  patch -- Add lockd reference counting and clean up lockd startup/shutdown (flags: none)
  patch -- Add lockd reference counting and clean up lockd startup/shutdown (upstream) (flags: none)
  patch -- correctly handle errors from lockd_up (rhel5) (flags: none)
  patch -- Add lockd reference counting and clean up lockd startup/shutdown (rhel5) (flags: none)
  patch 1 - tear down rpc_clients in nlm_shutdown_hosts (flags: none)
  patch 2 - don't reattempt GRANT_MSG when there is already an RPC in flight (flags: none)
  patch 3 -- don't requeue block if it was invalidated while GRANT_MSG was in flight (flags: none)

Description Jeff Layton 2007-08-24 17:35:22 UTC
+++ This bug was initially created as a clone of Bug #253754 +++

-- Additional comment from jburke on 2007-08-20 19:21 EST --
Jeff,
  I found out that there was a second issue last week while I was on vacation.
Steve D was looking at it. It looks like it may have panicked (but I am not
sure). This may or may not be a different issue.

 Here is a link to the vmcore from when the system had an issue.
ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-08-15-11:36

Thanks,
Jeff

-- Additional comment from jlayton on 2007-08-20 21:35 EST --
This issue looks different, but I suppose there's a chance that they ultimately
have the same cause:

kernel BUG at include/asm/spinlock.h:133!

That error means that the spinlock magic was corrupt:

void
svc_wake_up(struct svc_serv *serv)
{
        struct svc_rqst *rqstp;

        spin_lock_bh(&serv->sv_lock);

...so the serv->sv_lock was bad (and that likely means that the whole svc_serv
struct was bad). The ring buffer in the core didn't have any info before the
panic, so it's hard to tell what happened leading up to the crash.

Here's the stack:

PID: 3618   TASK: f63deeb0  CPU: 0   COMMAND: "rpciod"
 #0 [caeb6dc0] netpoll_start_netdump at f8a52570
 #1 [caeb6de0] die at c0106045
 #2 [caeb6e14] do_invalid_op at c0106420
 #3 [caeb6ec4] error_code (via invalid_op) at c02d51dd
    EAX: f8aa7c45  EBX: c1afd518  ECX: c02e6fa1  EDX: c02e6fa1  EBP: c28ea808 
    DS:  007b      ESI: c1afd500  ES:  007b      EDI: ec2a4a80 
    CS:  0060      EIP: c02d33c2  ERR: ffffffff  EFLAGS: 00010216 
 #4 [caeb6f00] _spin_lock_bh at c02d33c2
 #5 [caeb6f08] svc_wake_up at f8aa7c40
 #6 [caeb6f14] nlmsvc_grant_callback at f8c41f1d
 #7 [caeb6f2c] __rpc_execute at f8aa50f7
 #8 [caeb6f84] __rpc_schedule at f8aa520c
 #9 [caeb6f90] rpciod_killall at f8aa5ad8
#10 [caeb6f9c] rpciod at f8aa5a52
#11 [caeb6ff0] kernel_thread_helper at c01041f3

rpciod_killall() is in the stack here, which indicates that rpciod was in the
process of shutting down at the time. Looks like lockd was in the process of
shutting down too:

crash> bt
PID: 3617   TASK: f63de930  CPU: 1   COMMAND: "lockd"
 #0 [f2c26f14] schedule at c02d25c9
 #1 [f2c26f78] wait_for_completion at c02d26f6
 #2 [f2c26fcc] rpciod_down at f8aa5cca
 #3 [f2c26fd4] lockd at f8c40f31
 #4 [f2c26ff0] kernel_thread_helper at c01041f3

I don't see any NFS filesystems in the mount table, so I'm guessing this
happened after unmounting the last NFS filesystem on the box.

Is there already a separate BZ for this?


-- Additional comment from jlayton on 2007-08-21 10:30 EST --
Ahh, Jeff pointed out that this is actually an NFS server. So lockd should only
be shutting down if NFS is going down.

Here's something odd though -- there are 2 lockd threads in the process list:

   3617      1   1  f63de930  UN   0.0       0      0  [lockd]
  27857      1   1  c1b46e30  UN   0.0       0      0  [lockd]

the second thread looks like this:

crash> set 27857
    PID: 27857
COMMAND: "lockd"
   TASK: c1b46e30  [THREAD_INFO: f152b000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 27857  TASK: c1b46e30  CPU: 1   COMMAND: "lockd"
 #0 [f152bf20] schedule at c02d25c9
 #1 [f152bf84] __sched_text_start at c02d1bb4
 #2 [f152bfb8] __down_failed at c02d1d2b
 #3 [f152bfc8] .text.lock.sched (via rpciod_up) at f8aa5e92
 #4 [f152bfd4] lockd at f8c40d7f
 #5 [f152bff0] kernel_thread_helper at c01041f3

I think it's blocked on the nlmsvc_sema (which is probably being held by whoever
is calling lockd_down).

One possibility: lockd was taking an inordinate amount of time to come down, NFS
was restarted or something, and a second lockd thread was started...



-- Additional comment from jlayton on 2007-08-21 12:14 EST --
err...actually the second lockd thread is blocked on the rpciod_sema from
rpciod_up. This makes sense since rpciod is coming down and the other lockd is
holding it in rpciod_down.

My guess is that rpciod was taking a long time to come down and the svc_serv
struct that it crashed while accessing was freed and reused before it could get
to it. Either that or we have memory corruption of some sort...

-- Additional comment from jlayton on 2007-08-24 10:15 EST --
Ok, I think I see what could be the issue here. From lockd():

        /* Exit the RPC thread */
        svc_exit_thread(rqstp);

        /* release rpciod */
        rpciod_down();


...svc_exit_thread does this:

        /* Release the server */
        if (serv)
                svc_destroy(serv);

which eventually kfree's serv. So this codepath seems to be pretty much always
using memory after free. This might be reproducible with slab poisoning turned
on, presuming you can hit it so that there's an outstanding lock call that needs
to be cleaned up when rpciod_down is called.

It looks like a recent upstream patch removed the rpciod_up/down calls from
lockd, but I don't think that will work for RHEL4.



-- Additional comment from jlayton on 2007-08-24 11:12 EST --
Created an attachment (id=172425)
untested proposed patch

I think a patch like this would probably fix this particular issue (though it's
untested so far, even for compilation). It just separates the freeing of
"serv" from everything else, and delays freeing it until after rpciod_down
returns.
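
As a rough illustration of the idea (a hypothetical sketch only, not the
attached patch -- it assumes svc_exit_thread() is split so that it no longer
drops the final reference to "serv"), the exit path in lockd() would become
something like:

        struct svc_serv *serv = rqstp->rq_server;

        /* Exit the RPC thread, but keep our reference to "serv" */
        svc_exit_thread(rqstp);

        /* release rpciod -- this waits for rpciod to finish, so any queued
         * callbacks (e.g. nlmsvc_grant_callback -> svc_wake_up) have run
         * against a still-valid svc_serv by the time it returns */
        rpciod_down();

        /* now it's safe to drop the last reference and free "serv" */
        svc_destroy(serv);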

The problem is going to be reproducing and testing this, since this race looks
like it'll be tricky to hit. The race is (I'm pretty sure) something like
this:

lockd
    nlmsvc_retry_blocked
        nlmsvc_grant_blocked
            nlmsvc_async_call, which queues up nlmsvc_granted_callback for rpciod

...then before that call actually gets serviced by rpciod, you'll need lockd to
begin exiting and call rpciod_down. It's probably reproducible but may take a
while.


-- Additional comment from jlayton on 2007-08-24 12:31 EST --
...that last comment should be nlmsvc_grant_callback (not granted)...

I'm thinking a reproducer could go something like this:

1) local process locks file on exported filesystem on NFS server
2) nfs client tries to lock same file and blocks waiting for callback
from server 
3) set up iptables rules to drop traffic between client and server
4) have local process unlock file
5) do service nfs stop on server
6) remove iptables rules to allow communication to continue

The iptables rule would hopefully delay the callback long enough to allow lockd
to call svc_exit_thread. We'd also want the server to run a kernel with memory
poisoning.

Comment 1 Jeff Layton 2007-08-24 17:36:00 UTC
I've not seen this on RHEL5, but a brief look seems to indicate that it has the
same problem.


Comment 3 Jeff Layton 2007-11-17 13:56:06 UTC
There are 2 pieces to fixing this, I think:

1) make it so that nlm_destroy_host actually calls rpc_shutdown_client, which
should make it cancel any outstanding RPC calls and callbacks when lockd goes
down. Upstream already does this, so it's probably just a matter of backporting
patches for it.

2) make it so that lockd can't be started until the old lockd is completely
down. The current code is set up in such a way as to allow a new lockd to
start before the old one goes down all the way, but there are data structures
that are not adequately protected for this.

I proposed a patch upstream for #2, but it got no comments. I'll see if I can
revive it (or at least prod along some discussion of whether we need to allow
more than one lockd to run at once).


Comment 4 Jeff Layton 2007-11-30 18:35:19 UTC
I've spent some more time today tracking this down and have found a few
interesting things. I've experimented with this patch:

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 38b0e8a..098594d 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -331,15 +331,8 @@ nlm_gc_hosts(void)
                        /* Don't unmonitor hosts that have been invalidated */
                        if (host->h_monitored && !host->h_killed)
                                nsm_unmonitor(host);
-                       if ((clnt = host->h_rpcclnt) != NULL) {
-                               if (atomic_read(&clnt->cl_users)) {
-                                       printk(KERN_WARNING
-                                               "lockd: active RPC handle\n");
-                                       clnt->cl_dead = 1;
-                               } else {
-                                       rpc_destroy_client(host->h_rpcclnt);
-                               }
-                       }
+                       if ((clnt = host->h_rpcclnt) != NULL)
+                               rpc_shutdown_client(host->h_rpcclnt);
                        kfree(host);
                        nrhosts--;
                }

...but it does not fix the issue. The problem is that on RHEL5, when shutting
down the server's lockd while there's an outstanding grant callback, h_count is
never 0. The host in question is skipped, and rpc_shutdown_client (and
rpc_destroy_client) is never called.

On more recent kernels, when the grant callback comes in, h_rpcclnt is almost
always NULL for the host and we end up calling rpc_create for it. That always
fails since it can't talk to the portmapper, and the RPC call is never attempted
(and svc_wake_up isn't called). This is not always the case, though; I've had a
couple of occurrences where recent kernels have crashed. The trick seems to be
to catch it so that h_rpcclnt is not NULL when nlm_bind_host is called, but I
haven't figured out a way to reliably do that.

In short, my earlier theory of why recent kernels don't seem to hit this as
easily is wrong. The suggestion Trond made on the upstream mailing list to add
some new reference counting to lockd will probably actually fix this after all. 

For now though, I'm focusing on getting this to reliably reproduce on rawhide.


Comment 5 Jeff Layton 2007-12-03 13:30:56 UTC
I think I can now reproduce this at will on recent upstream kernels. The recent
sunrpc changes make it a little tougher to reproduce than on RHEL4/5, but the
problem is still there. The trick is to make sure that the nlm_host has a valid
and bound rpc_clnt when the traffic is blocked and the callback comes in. So
it's important to do a couple of lock/block/unlock/callback operations before
blocking the traffic with iptables.

I think Trond's suggestion here will actually fix this:

--------------[snip]-----------------
For cleanliness, I suggest adding a new refcount to lockd: nlmsvc_ref.
Replace the test for (nlmsvc_users || !signalled()) in lockd() with a
test for nlmsvc_ref != 0, and then have lockd_up() take one reference to
it whenever it bumps nlmsvc_users == 0  / lockd_down() release a
reference whenever it decrements nlmsvc_users == 1.

Have nlmsvc_insert_block() take a reference whenever adding something to
an empty nlm_blocked / nlmsvc_unlink_block release the reference
whenever nlm_blocked is emptied.
-------------[snip]------------------

I'm going to look at implementing it.
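
As a rough sketch of how that suggestion maps onto the code (a hypothetical
illustration only, not the patch that gets posted below -- nlmsvc_ref and the
call sites are taken straight from Trond's description above):

        static atomic_t nlmsvc_ref = ATOMIC_INIT(0);

        /* in lockd_up(): the first user pins lockd */
        if (nlmsvc_users++ == 0)
                atomic_inc(&nlmsvc_ref);

        /* in lockd_down(): the last user unpins it */
        if (--nlmsvc_users == 0)
                atomic_dec(&nlmsvc_ref);

        /* in nlmsvc_insert_block(): adding to an empty nlm_blocked pins lockd;
         * nlmsvc_unlink_block() drops that reference when the list empties */
        if (list_empty(&nlm_blocked))
                atomic_inc(&nlmsvc_ref);

        /* and lockd()'s main loop tests the refcount instead of
         * (nlmsvc_users || !signalled()): */
        while (atomic_read(&nlmsvc_ref) != 0) {
                /* receive and service lock requests, retry blocked locks */
        }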


Comment 6 Jeff Layton 2007-12-04 15:24:59 UTC
Created attachment 277081 [details]
patch -- add reference counting to lockd

First pass at a patch to implement Trond's suggestion. This cleans up the lockd
startup/shutdown and adds a new reference counter. A side benefit of this patch
is that only one lockd can run at a time. If lockd_up is done while lockd is
still waiting for the nlmsvc_ref to go to 0, then the old lockd will just stay
up.

The problem is that this doesn't actually fix the bug. With the reproducer,
lockd still goes down before the async callback is done. I need to do some
debugging work, but I'm thinking we'll have to take a reference in more places
than just the ones Trond suggested.

The other issue is that this triggers a lockdep warning. I don't think it's
possible to deadlock since the only lock taker is lockd itself. Regardless, I
think I can work around this by making nlmsvc_ref an atomic_t and not requiring
the nlmsvc_mutex around it.

Comment 7 Jeff Layton 2007-12-05 14:34:01 UTC
Created attachment 278361 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown

I think this patch fixes most of the issues from the old patch and does fix the
original problem. The main differences are:

1) nlmsvc_ref has been changed to atomic_t, so there's no need for the
nlmsvc_mutex locking when changing it.

2) we take an extra reference when the nlm_granted callback is issued, and drop
the reference after the svc_wake_up of lockd is done.

Simply tying the nlmsvc_ref to the list_empty(&nlm_blocked) is not sufficient.
When lockd comes down it invalidates all of the locks (via nlm_traverse_files)
and ends up emptying the list.

Dec  5 08:02:23 dhcp231-229 kernel: lockd: unlinking block ffff8100144c5000...

If the rpc client is still active though, nlm_gc_hosts won't kill it:

Dec  5 08:03:19 dhcp231-229 kernel: lockd: server dhcp231-236.rdu.redhat.com
not responding, still trying
Dec  5 08:03:42 dhcp231-229 kernel: lockd: request from 10.11.231.236, port=822

Dec  5 08:03:42 dhcp231-229 kernel: lockd: shutting down host module
Dec  5 08:03:42 dhcp231-229 kernel: lockd: nuking all hosts...
Dec  5 08:03:42 dhcp231-229 kernel: lockd: host garbage collection
Dec  5 08:03:42 dhcp231-229 kernel: lockd: nlmsvc_mark_resources
Dec  5 08:03:42 dhcp231-229 kernel: nlm_gc_hosts skipping
dhcp231-236.rdu.redhat.com (cnt 1 use 1 exp 4295561826)
Dec  5 08:03:42 dhcp231-229 kernel: lockd: couldn't shutdown host module!
Dec  5 08:03:42 dhcp231-229 kernel: lockd: 1 hosts left:
Dec  5 08:03:42 dhcp231-229 kernel:	   dhcp231-236.rdu.redhat.com (cnt 1
use 1 exp 4295561826)

...so lockd goes down anyway because the nlm_blocked list was emptied, but
there are still outstanding async rpc calls. This patch fixes this by taking an
extra reference in nlm_granted callbacks, but I'm not certain that this is the
best fix. It might be better to change how nlm_gc_hosts works so that it
cancels the RPC calls in this situation.

Comment 8 Jeff Layton 2007-12-06 19:28:15 UTC
Created attachment 280141 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown (upstream)

Actually, I've had good results with this. I must have made a mistake in my
testing before when I thought I had reproduced the problem. It doesn't seem to
be necessary to take an extra reference in nlmsvc_grant_blocked() after all.

This is the patch against current upstream code. I'll plan to submit it there
next week.

Comment 9 Jeff Layton 2007-12-06 19:30:19 UTC
Created attachment 280151 [details]
patch -- correctly handle errors from lockd_up (rhel5)

This is a foundational patch. It's not strictly required to fix this bug, but
it makes the fix much cleaner.

Comment 10 Jeff Layton 2007-12-06 19:32:35 UTC
Created attachment 280161 [details]
patch -- Add lockd reference counting and clean up lockd startup/shutdown (rhel5)

Backport of the patch in comment #8. I've verified that this fixes the
reproducible oops on RHEL5.

Comment 11 Jeff Layton 2007-12-08 13:19:13 UTC
Patch sent upstream to linux-nfs list. Awaiting comment at this point.

Comment 12 Jeff Layton 2007-12-13 20:52:26 UTC
The only comment I got was from Christoph Hellwig who said:

> might be better to do the refcounting outside the thread and use the
> kthread api, which is something we still need to do for lockd anyway. 

I've spent the last day or so working on an implementation of that and have sent
a patchset upstream to the linux-nfs list entitled:

Subject: [PATCH 0/6] Intro: convert lockd to kthread and fix use-after-free

...unfortunately we can't move the refcounting outside the thread altogether,
but I've at least got an implementation that uses kthreads (and also fixes a
couple of other bugs I stumbled across).

Awaiting upstream comment at this point...
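
For reference, the kthread conversion means lockd_up()/lockd_down() manage the
thread roughly along these lines (a hypothetical sketch of the general kthread
pattern, not the posted series; nlmsvc_task is an assumed name):

        #include <linux/kthread.h>

        static struct task_struct *nlmsvc_task;

        /* in lockd_up(), instead of svc_create_thread()/kernel_thread(): */
        nlmsvc_task = kthread_run(lockd, rqstp, "lockd");
        if (IS_ERR(nlmsvc_task)) {
                error = PTR_ERR(nlmsvc_task);
                nlmsvc_task = NULL;
                /* clean up the svc_rqst/svc_serv here */
        }

        /* in lockd_down(), instead of signalling lockd and waiting: */
        kthread_stop(nlmsvc_task);  /* lockd()'s loop checks kthread_should_stop() */
        nlmsvc_task = NULL;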


Comment 13 Jeff Layton 2007-12-18 16:45:47 UTC
The only upstream comment on the last set was from HCH, and it was just a
recommendation to export svc_prepare_thread and have lockd_up call it directly.
I've made some changes to the patchset based on the recommendation and have sent
a new set to linux-nfs. New patchset starts with this email:

Subject: [PATCH 0/7] Intro: convert lockd to kthread and fix use-after-free (try #2)



Comment 14 Jeff Layton 2008-01-10 20:07:10 UTC
Progress report...

I've posted several patchsets in the last few weeks. The most recent was today,
and basically has nlm_shutdown_hosts do an rpc_killall_tasks on the
host->h_rpcclnt before trying to GC them. This seems to fix the problem and
doesn't require that lockd stay up past the last lockd_down call.
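
In rough terms, that approach looks something like the following (a hypothetical
sketch, not the posted patch; NLM_HOST_NRHASH, nlm_hosts, and the hash walk are
approximations of the real structures in fs/lockd/host.c):

        /* called from nlm_shutdown_hosts() before the hosts are GC'ed, so no
         * GRANT_MSG callback can run after lockd and its svc_serv are gone */
        static void nlm_kill_host_rpc_tasks(void)
        {
                struct nlm_host *host;
                struct hlist_node *pos;
                int i;

                for (i = 0; i < NLM_HOST_NRHASH; i++) {
                        hlist_for_each_entry(host, pos, &nlm_hosts[i], h_hash) {
                                if (host->h_rpcclnt)
                                        rpc_killall_tasks(host->h_rpcclnt);
                        }
                }
        }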


Comment 15 Jeff Layton 2008-01-10 20:08:46 UTC
...that should read that I've posted several patchsets upstream in the last few
weeks. The fix for this is still being hashed out there.

On the bright side, the upstream fix should correct a lot of ambiguity about how
lockd startup and shutdown should work. The current implementation is a bit of a
mess...


Comment 16 Jeff Layton 2008-01-21 20:23:47 UTC
Patchset has been taken into Bruce Fields' git tree upstream. After it's had
some soak time upstream, I'll plan to do a more targeted fix of this problem for
5.3. I suspect we might be able to get away with the last patch in the set that
just does a rpc_killall_tasks on each host->h_rpcclnt, but we'll have to test
that...


Comment 17 Jeff Layton 2008-01-28 20:14:23 UTC
I found a problem with the last patch in the set and had to respin it. I *think*
we can probably just use that last patch to fix this problem, though I need to
test it.


Comment 18 Jeff Layton 2008-02-12 21:10:52 UTC
Yeargh....I've got a patch that does pretty much what upstream does for this --
we kill off the RPC tasks when shutting down hosts. On RHEL5 though, I *still*
get this message when I take down lockd:

lockd: nlmsvc_mark_resources
nlm_gc_hosts skipping 10.11.231.229 (cnt 2 use 1 exp 4294894621)
lockd: couldn't shutdown host module!

...we don't have any more RPC tasks in queue at this point, so I think that we
may have some sort of nlm_block refcount leak. Still, that seems to be an
entirely different issue, so I'll probably just run with the patch that I have
to fix this problem and open a new BZ for that issue...



Comment 19 Jeff Layton 2008-05-22 15:27:57 UTC
Created attachment 306390 [details]
patch 1 - tear down rpc_clients in nlm_shutdown_hosts

Comment 20 Jeff Layton 2008-05-22 15:29:13 UTC
Created attachment 306391 [details]
patch 2 - don't reattempt GRANT_MSG when there is already an RPC in flight

Comment 21 Jeff Layton 2008-05-22 15:30:05 UTC
Created attachment 306392 [details]
patch 3 -- don't requeue block if it was invalidated while GRANT_MSG was in flight

Comment 22 Jeff Layton 2008-05-22 15:31:09 UTC
Variants of these three patches went upstream a few months ago and seem to fix
this problem in RHEL5, at least against the reproducer that I have.


Comment 24 RHEL Program Management 2008-05-28 15:28:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 25 Don Zickus 2008-07-25 20:26:35 UTC
in kernel-2.6.18-100.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 28 errata-xmlrpc 2009-01-20 20:23:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html