Bug 207123

Summary: kernel assert in nlm_release_host
Product: Red Hat Enterprise Linux 4
Reporter: Lenny Maiorani <lenny>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.3
CC: jbaron, jlayton, staubach, steved, tao
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-23 14:15:57 UTC
Attachments (flags: none):
  Changes to at least alleviate this bug
  Changes to at least alleviate this bug
  Changes to at least alleviate this bug
  Changes to at least alleviate this bug - in unified diff format
  Changes to at least alleviate this bug - in unified diff format

Description Lenny Maiorani 2006-09-19 15:40:28 UTC
Description of problem:
Kernel asserts in nlm_release_host.

Here is the backtrace:
[0]kdb> bt
Stack traceback for pid 8380
0x00000100dee83030     8380        1  1    0   R  0x00000100dee83430 *rpciod
RSP           RIP                Function (args)
0x100deeb3df0 0xffffffffa02918b0 [lockd]nlm_release_host+0x2f
0x100deeb3e18 0xffffffffa0298666 [lockd]nlm4svc_callback_exit+0x38 (0x2e79f50f9dab, 0x100e5ed57f0, 0x100dee832d0, 0x0, 0x2e7761d0ca48)
0x100deeb3e28 0xffffffffa0193e3d [sunrpc]__rpc_execute+0x2e9
0x100deeb3eb8 0xffffffffa0193f0f [sunrpc]__rpc_schedule+0x4a (0x0, 0x100dee83030, 0xffffffff801351da, 0x100deeb3f20, 0x100deeb3f20)
0x100deeb3ed8 0xffffffffa01943cf [sunrpc]rpciod+0x12d
0x100deeb3f58 0xffffffff80110e17 child_rip+0x8


Here is the register dump. 
[0]kdb> rd
     r15 = 0xffffffffa0292088      r14 = 0x00000000fffffff4 
     r13 = 0x0000000000000000      r12 = 0x00000101fd63f880 
     rbp = 0x0000000000000002      rbx = 0x0000010193194180 
     r11 = 0xffffffff8015c142      r10 = 0xffffffff8015c142 
      r9 = 0x00000101050d33c0       r8 = 0x0000000000000003 
     rax = 0x00000000ffffffff      rcx = 0x0000000000000006 
     rdx = 0x00000000fffffff3      rsi = 0x0000000000000008 
     rdi = 0x0000010193194180 orig_rax = 0xffffffffffffffff 
     rip = 0xffffffffa02918b0       cs = 0x0000000000000010 
  eflags = 0x0000000000010286      rsp = 0x00000100deeb3df0 
      ss = 0x00000100deeb2000 &regs = 0x00000100deeb3d58


So, I took a look at the RIP just before the instruction in nlm_release_host
that caused the panic:
[0]kdb> id 0xffffffffa0291891
0xffffffffa0291891 nlm_release_host+0x10:    je     0xffffffffa02918a5 nlm_release_host+0x24
0xffffffffa0291893 nlm_release_host+0x12:    lea    0x20(%rdi),%rsi
0xffffffffa0291897 nlm_release_host+0x16:    xor    %eax,%eax
0xffffffffa0291899 nlm_release_host+0x18:    mov    $0xffffffffa0298bf8,%rdi
0xffffffffa02918a0 nlm_release_host+0x1f:    callq  0xffffffff80137fb2 printk
0xffffffffa02918a5 nlm_release_host+0x24:    lock decl 0x64(%rbx)
0xffffffffa02918a9 nlm_release_host+0x28:    mov    0x64(%rbx),%eax
0xffffffffa02918ac nlm_release_host+0x2b:    test   %eax,%eax
0xffffffffa02918ae nlm_release_host+0x2d:    jns    0xffffffffa02918bc nlm_release_host+0x3b
0xffffffffa02918b0 nlm_release_host+0x2f:    ud2a   
0xffffffffa02918b2 nlm_release_host+0x31:    adc    %cl,0xffffffffffffffa0(%rcx,%rbp,1)
0xffffffffa02918b9 nlm_release_host+0x38:    (bad)  
0xffffffffa02918ba nlm_release_host+0x39:    cld    
0xffffffffa02918bb nlm_release_host+0x3a:    add    %bl,0xffffffffffffffc3(%rbx)
0xffffffffa02918be nlm_gc_hosts:         push   %r14
0xffffffffa02918c0 nlm_gc_hosts+0x2:     push   %r13


I just backed up from the failing instruction (nlm_release_host+0x2f) toward the
start of nlm_release_host so we can see the code leading up to the ud2a. From this
we can see that the assert is examining 0x64(%rbx), that is, the value at a
positive offset of 0x64 from the memory address held in %rbx.

%rbx is 0x0000010193194180, which we know from the register dump. So I added 0x64 to
that and did a memory dump at 0x101931941e4:
[0]kdb> mds 0x101931941e4
0x00000101931941e0 ffffffff00000000   ....ÿÿÿÿ
0x00000101931941e8 0000000000000001   ........
0x00000101931941f0 dead4ead00000001   ....­N­Þ
0x00000101931941f8 00000101931941f8   øA......
0x0000010193194200 00000101931941f8   øA......
0x0000010193194208 0000000103081229   ).......
0x0000010193194210 000000010308fcb5   µü......
0x0000010193194218 0000010193194218   .B......


The qword shown at 0x00000101931941e0 is 0xffffffff00000000; since x86_64 is
little-endian, the 32-bit value at 0x101931941e4 is 0xffffffff. Now time to look at
the code. Here is nlm_release_host():
/*
 * Release NLM host after use
 */
void nlm_release_host(struct nlm_host *host)
{
      if (host != NULL) {
            dprintk("lockd: release host %s\n", host->h_name);
            atomic_dec(&host->h_count);
            BUG_ON(atomic_read(&host->h_count) < 0);
      }
}


Here is the struct we are referencing:
/*
 * Lockd host handle (used both by the client and server personality).
 */
struct nlm_host {
      struct nlm_host * h_next;           /* linked list (hash table) */
      struct sockaddr_in      h_addr;           /* peer address */
      struct rpc_clnt   *     h_rpcclnt;  /* RPC client to talk to peer */
      char              h_name[20]; /* remote hostname */
      u32               h_version;  /* interface version */
      rpc_authflavor_t  h_authflavor;     /* NOTUSED: RPC authentication type */
      unsigned short          h_proto;    /* transport proto */
      unsigned short          h_reclaiming : 1,
                        h_server     : 1, /* server side, not client side */
                        h_inuse      : 1,
                        h_killed     : 1,
                        h_monitored  : 1;
      wait_queue_head_t h_gracewait;      /* wait while reclaiming */
      u32               h_state;    /* pseudo-state counter */
      u32               h_nsmstate; /* true remote NSM state */
      u32               h_pidcount; /* Pseudopids */
      atomic_t          h_count;    /* reference count */
      struct semaphore  h_sema;           /* mutex for pmap binding */
      unsigned long           h_nextrebind;     /* next portmap call */
      unsigned long           h_expires;  /* eligible for GC */
      struct list_head  h_lockowners;     /* Lockowners for the client */
      spinlock_t        h_lock;
};


From this we see that h_count is 0xffffffff, i.e. -1. This is less than 0, so the
BUG_ON() fires and the kernel panics (dropping into KDB).
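
To make the failure mode concrete, here is a minimal user-space model (a sketch for
illustration only; the struct and function names are invented stand-ins, not the
lockd code) showing how releasing the host reference one time too many drives the
count negative and trips the check that stands in for the BUG_ON():

/* Illustrative model only -- not kernel code. */
#include <assert.h>
#include <stdio.h>

struct host {
        const char *name;
        int count;                   /* stands in for atomic_t h_count */
};

static void release_host(struct host *h)
{
        if (h != NULL) {
                h->count--;
                printf("release %s, count now %d\n", h->name, h->count);
                assert(h->count >= 0);   /* models BUG_ON(h_count < 0) */
        }
}

int main(void)
{
        struct host h = { .name = "peer", .count = 1 };

        release_host(&h);   /* legitimate release: 1 -> 0 */
        release_host(&h);   /* extra release: 0 -> -1, the assert fires */
        return 0;
}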

Version-Release number of selected component (if applicable):
2.6.9-34

How reproducible:
This is occurring about 1-2 times per day, but so far it is not reproducible on demand.


Expected results:
No kernel panic.

Comment 1 Lenny Maiorani 2006-09-21 17:21:30 UTC
Created attachment 136891 [details]
Changes to at least alleviate this bug

After doing some code tracing I came to the conclusion that if the
__rpc_execute returned an error condition (and it can return -EIO or
-ETIMEDOUT) then this nlm_release_host is called twice. 

Attached is my patch to fs/lockd/host.c so that this double decrement doesn't
cause a kernel panic.
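
The attachment itself is not inlined in this report. Purely as an illustration of
the approach (an assumption about its general shape, not the actual attached
patch), a guard against the double decrement could look something like this:

/*
 * Hypothetical sketch only -- NOT the attached patch. It refuses to let
 * h_count go negative instead of hitting the BUG_ON() (ignoring the
 * check/decrement race for brevity), so it masks the symptom rather
 * than fixing the extra release.
 */
void nlm_release_host(struct nlm_host *host)
{
        if (host != NULL) {
                dprintk("lockd: release host %s\n", host->h_name);
                if (atomic_read(&host->h_count) <= 0) {
                        printk(KERN_WARNING
                               "lockd: extra release of host %s ignored\n",
                               host->h_name);
                        return;
                }
                atomic_dec(&host->h_count);
        }
}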

Question: is there a better way of fixing this?

Comment 2 Lenny Maiorani 2006-09-21 17:40:15 UTC
Created attachment 136894 [details]
Changes to at least alleviate this bug

Oops

Comment 3 Lenny Maiorani 2006-09-21 20:39:20 UTC
Created attachment 136907 [details]
Changes to at least alleviate this bug

Ok, this one really works.

Comment 5 Peter Staubach 2007-04-19 22:11:12 UTC
Yes, the proposed patch may alleviate the bug, but it doesn't fix it and probably
just delays something else bad from happening.

We need to figure out why nlm_release_host() is called too often
and address that.

Comment 6 Peter Staubach 2008-06-18 15:46:52 UTC
Is there some way to reproduce this situation?

I have not seen failures like this during my testing of the NFS and
locking code.

Comment 7 Lenny Maiorani 2008-06-18 15:56:31 UTC
I don't know of any way to reproduce it. It is a timing issue that shows up when items
are not being serviced fast enough (if I remember the bug correctly). It happened
rarely, and I was only able to trigger it when saturating many (8) interfaces at once.

It is no longer happening with this patch and I haven't seen any side-effects.


Comment 8 Peter Staubach 2008-06-18 19:37:57 UTC
Well, unfortunately, I can't use the patch.  It doesn't address the
underlying problem; it just avoids the symptom caused by the imbalance
in the h_count value.

Without some way to cause this to happen and given that I don't see
it in my own testing, this may be difficult to find.  I will need to
do it via code inspection and this can be slow.

Comment 9 Jeff Layton 2008-12-04 21:10:53 UTC
Lenny,
   Any chance you can attach your patch in unified-diff format? It's hard to tell what part of the code you're touching here...

i.e.:

diff -u ...

Comment 10 Lenny Maiorani 2008-12-12 22:13:42 UTC
Created attachment 326787 [details]
Changes to at least alleviate this bug - in unified diff format

Attached is the unified diff.

Comment 11 Lenny Maiorani 2008-12-12 22:14:45 UTC
Created attachment 326789 [details]
Changes to at least alleviate this bug - in unified diff format

Attached is a unified diff and it is marked as a patch this time...

Comment 12 Jeff Layton 2008-12-12 22:47:21 UTC
Thanks Lenny...

I have to agree with Peter here. This patch is just treating the symptom and not the real bug. If nlm_release_host is being called too often then that probably means that we have a potential use-after-free. The host could be freed while it's still in use because the refcount is too low. We need to address the root cause.
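
To illustrate the hazard in user space (a sketch under the assumption of an extra
release; the struct and function names here are invented, not lockd code): once the
count hits zero too early, the object can be freed while another path still holds a
pointer to it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct host {
        int  count;
        char name[20];
};

static void put_host(struct host *h)
{
        if (--h->count == 0) {
                printf("freeing host %s (count hit 0)\n", h->name);
                free(h);                 /* models the GC reaping the host */
        }
}

int main(void)
{
        struct host *h = malloc(sizeof(*h));
        struct host *still_in_use = h;   /* another path still holds a reference */

        h->count = 1;
        strcpy(h->name, "peer");

        put_host(h);                     /* the extra/early release frees it */

        /* 'still_in_use' now dangles; any later access is a use-after-free */
        printf("stale pointer still held: %p\n", (void *)still_in_use);
        return 0;
}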

The best thing you could do to help us nail this is come up with a way to reliably reproduce the problem...

A while back, you said:

"After doing some code tracing I came to the conclusion that if the
__rpc_execute returned an error condition (and it can return -EIO or
-ETIMEDOUT) then this nlm_release_host is called twice."

...how did you determine this?

Comment 14 Lenny Maiorani 2008-12-12 23:01:03 UTC
I agree with Peter also. This isn't a great solution. It only lets the machine keep running instead of crashing.

The way I determined this was just to follow through the code and see what was being called and keep tracing. 

Basically, this spot gets called regardless of whether or not an error was returned from __rpc_execute(). This is a problem because, if __rpc_execute() has returned an error, the counter has already been decremented by that point.

That is my memory of this issue without diving back into it.

Comment 15 Jeff Layton 2008-12-15 15:36:58 UTC
Ok, so it sounds like lockd is occasionally calling into __rpc_execute, getting an error back, and then calling an extra nlm_release_host. We'll need to determine how we're calling into __rpc_execute when it returns an error.

Lenny, I'm assuming the answer here is "no" but do you have a way to reliably reproduce this?

Also, is this machine acting as a NFS server, client or both?

Comment 16 Lenny Maiorani 2008-12-15 15:46:37 UTC
I do not have a way to reproduce this. I know that it was happening under heavy load when we were running an rsync, but that is about it.

It was the NFS server.

Comment 17 Jeff Layton 2008-12-15 16:05:20 UTC
I also have a number of NLM patches that are slated for release in RHEL4.8. It's hard to know for sure whether they will help this problem, but they might. It would be good to rule them out in any case. The case in question is here:

https://bugzilla.redhat.com/show_bug.cgi?id=253754

Given that you're running a custom-patched kernel already, testing the patches associated with that BZ may be helpful.

Comment 18 Lenny Maiorani 2008-12-15 23:52:49 UTC
Unfortunately, the company making the product on which this was occurring went out of business in early 2007. I do not have any way of testing this.

Comment 19 Jeff Layton 2009-06-23 14:15:57 UTC
Hmm...given that we haven't seen this in testing and have no way to reproduce it, we're at a bit of an impasse here. It's quite possible that this was already fixed by one of the other patches that went into later kernels.

I'm going to move that we close this with a resolution of INSUFFICIENT_DATA. If this pops up again or if you come up with a way to reproduce it, please reopen it and I'll have another look.