Description of problem:

Kernel asserts in nlm_release_host. Here is the backtrace:

[0]kdb> bt
Stack traceback for pid 8380
0x00000100dee83030  8380  1  1  0  R  0x00000100dee83430 *rpciod
RSP           RIP                Function (args)
0x100deeb3df0 0xffffffffa02918b0 [lockd]nlm_release_host+0x2f
0x100deeb3e18 0xffffffffa0298666 [lockd]nlm4svc_callback_exit+0x38 (0x2e79f50f9dab, 0x100e5ed57f0, 0x100dee832d0, 0x0, 0x2e7761d0ca48)
0x100deeb3e28 0xffffffffa0193e3d [sunrpc]__rpc_execute+0x2e9
0x100deeb3eb8 0xffffffffa0193f0f [sunrpc]__rpc_schedule+0x4a (0x0, 0x100dee83030, 0xffffffff801351da, 0x100deeb3f20, 0x100deeb3f20)
0x100deeb3ed8 0xffffffffa01943cf [sunrpc]rpciod+0x12d
0x100deeb3f58 0xffffffff80110e17 child_rip+0x8

Here is the register dump:

[0]kdb> rd
r15      = 0xffffffffa0292088  r14      = 0x00000000fffffff4
r13      = 0x0000000000000000  r12      = 0x00000101fd63f880
rbp      = 0x0000000000000002  rbx      = 0x0000010193194180
r11      = 0xffffffff8015c142  r10      = 0xffffffff8015c142
r9       = 0x00000101050d33c0  r8       = 0x0000000000000003
rax      = 0x00000000ffffffff  rcx      = 0x0000000000000006
rdx      = 0x00000000fffffff3  rsi      = 0x0000000000000008
rdi      = 0x0000010193194180  orig_rax = 0xffffffffffffffff
rip      = 0xffffffffa02918b0  cs       = 0x0000000000000010
eflags   = 0x0000000000010286  rsp      = 0x00000100deeb3df0
ss       = 0x00000100deeb2000  &regs    = 0x00000100deeb3d58

So I took a look at the instructions just before the RIP in nlm_release_host that caused the panic:

[0]kdb> id 0xffffffffa0291891
0xffffffffa0291891 nlm_release_host+0x10:  je     0xffffffffa02918a5 nlm_release_host+0x24
0xffffffffa0291893 nlm_release_host+0x12:  lea    0x20(%rdi),%rsi
0xffffffffa0291897 nlm_release_host+0x16:  xor    %eax,%eax
0xffffffffa0291899 nlm_release_host+0x18:  mov    $0xffffffffa0298bf8,%rdi
0xffffffffa02918a0 nlm_release_host+0x1f:  callq  0xffffffff80137fb2 printk
0xffffffffa02918a5 nlm_release_host+0x24:  lock decl 0x64(%rbx)
0xffffffffa02918a9 nlm_release_host+0x28:  mov    0x64(%rbx),%eax
0xffffffffa02918ac nlm_release_host+0x2b:  test   %eax,%eax
0xffffffffa02918ae nlm_release_host+0x2d:  jns    0xffffffffa02918bc nlm_release_host+0x3b
0xffffffffa02918b0 nlm_release_host+0x2f:  ud2a
0xffffffffa02918b2 nlm_release_host+0x31:  adc    %cl,0xffffffffffffffa0(%rcx,%rbp,1)
0xffffffffa02918b9 nlm_release_host+0x38:  (bad)
0xffffffffa02918ba nlm_release_host+0x39:  cld
0xffffffffa02918bb nlm_release_host+0x3a:  add    %bl,0xffffffffffffffc3(%rbx)
0xffffffffa02918be nlm_gc_hosts:           push   %r14
0xffffffffa02918c0 nlm_gc_hosts+0x2:       push   %r13

I just backed up 0x2f from the failing instruction so we would be right where the nlm_release_host call started. From this we can see that the assert is looking at 0x64(%rbx), i.e. a positive offset of 0x64 from the memory address in %rbx. From the register dump, %rbx is 0x0000010193194180, so I added 0x64 to that and did a memory dump at 0x101931941e4:

[0]kdb> mds 0x101931941e4
0x00000101931941e0 ffffffff00000000   ....ÿÿÿÿ
0x00000101931941e8 0000000000000001   ........
0x00000101931941f0 dead4ead00000001   ....NÞ
0x00000101931941f8 00000101931941f8   øA......
0x0000010193194200 00000101931941f8   øA......
0x0000010193194208 0000000103081229   ).......
0x0000010193194210 000000010308fcb5   µü......
0x0000010193194218 0000010193194218   .B......

The 64-bit word at 0x00000101931941e0 reads 0xffffffff00000000, so (little-endian) the 32-bit value at 0x101931941e4 is 0xffffffff.
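As a sanity check on the offset arithmetic above, here is a trivial standalone program (plain user-space C, just an illustration, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long rbx = 0x0000010193194180UL; /* struct nlm_host * from the register dump */
	unsigned long off = 0x64;                 /* field offset taken from the disassembly */

	/* Prints 0x101931941e4, the address dumped with mds above. */
	printf("h_count lives at 0x%lx\n", rbx + off);
	return 0;
}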
Now it's time to look at the code. Here is nlm_release_host():

/*
 * Release NLM host after use
 */
void
nlm_release_host(struct nlm_host *host)
{
	if (host != NULL) {
		dprintk("lockd: release host %s\n", host->h_name);
		atomic_dec(&host->h_count);
		BUG_ON(atomic_read(&host->h_count) < 0);
	}
}

Here is the struct we are referencing:

/*
 * Lockd host handle (used both by the client and server personality).
 */
struct nlm_host {
	struct nlm_host *	h_next;		/* linked list (hash table) */
	struct sockaddr_in	h_addr;		/* peer address */
	struct rpc_clnt	*	h_rpcclnt;	/* RPC client to talk to peer */
	char			h_name[20];	/* remote hostname */
	u32			h_version;	/* interface version */
	rpc_authflavor_t	h_authflavor;	/* NOTUSED: RPC authentication type */
	unsigned short		h_proto;	/* transport proto */
	unsigned short		h_reclaiming : 1,
				h_server     : 1, /* server side, not client side */
				h_inuse      : 1,
				h_killed     : 1,
				h_monitored  : 1;
	wait_queue_head_t	h_gracewait;	/* wait while reclaiming */
	u32			h_state;	/* pseudo-state counter */
	u32			h_nsmstate;	/* true remote NSM state */
	u32			h_pidcount;	/* Pseudopids */
	atomic_t		h_count;	/* reference count */
	struct semaphore	h_sema;		/* mutex for pmap binding */
	unsigned long		h_nextrebind;	/* next portmap call */
	unsigned long		h_expires;	/* eligible for GC */
	struct list_head	h_lockowners;	/* Lockowners for the client */
	spinlock_t		h_lock;
};

From this we see that h_count is 0xffffffff, or -1. This is less than 0, so the BUG_ON asserts, causing a kernel panic (into KDB).

Version-Release number of selected component (if applicable):
2.6.9-34

How reproducible:
This is occurring about 1-2 times per day, but so far is not reproducible on demand.

Expected results:
No kernel panic.
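To spell out the signedness step in the analysis above: the raw 32-bit word read back is 0xffffffff, and atomic_read() yields a signed int, so the BUG_ON() comparison sees -1. A standalone illustration (plain C, using an ordinary int in place of the kernel's atomic_t):

#include <stdio.h>

int main(void)
{
	unsigned int raw = 0xffffffff;	/* 32-bit value found at 0x101931941e4 */
	int h_count = (int)raw;		/* atomic_read() returns a signed int: -1 */

	if (h_count < 0)		/* the BUG_ON(atomic_read(&host->h_count) < 0) condition */
		printf("h_count = %d -> BUG_ON fires and the kernel panics into KDB\n", h_count);
	return 0;
}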
Created attachment 136891 [details]
Changes to at least alleviate this bug

After doing some code tracing, I came to the conclusion that if __rpc_execute() returns an error condition (and it can return -EIO or -ETIMEDOUT), then nlm_release_host() is called twice. Attached is my patch to fs/lockd/host.c so that this double decrement doesn't cause a kernel panic.

Question: is there a better way of fixing this?
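The attachment itself is not inlined in this report, so the following is only a hypothetical sketch of the kind of guard described above (assuming the patch simply refuses to let h_count go negative):

void
nlm_release_host(struct nlm_host *host)
{
	if (host != NULL) {
		dprintk("lockd: release host %s\n", host->h_name);
		/* Guard against the suspected double release: never
		 * decrement h_count below zero. */
		if (atomic_read(&host->h_count) > 0)
			atomic_dec(&host->h_count);
		else
			printk(KERN_WARNING "lockd: h_count underflow on %s\n",
			       host->h_name);
	}
}

Note that the read-then-decrement pair is not itself atomic, which underlines that a guard like this can only mask the refcount imbalance, not fix it.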
Created attachment 136894 [details]
Changes to at least alleviate this bug

Oops
Created attachment 136907 [details]
Changes to at least alleviate this bug

Ok, this one really works.
Yes, the proposed patch may alleviate the bug, but it doesn't fix it and probably just delays something else bad from happening. We need to figure out why nlm_release_host() is being called too often and address that.
Is there some way to reproduce this situation? I have not seen failures like this during my testing of the NFS and locking code.
I don't know of any way to reproduce it. It is a timing issue caused when items are not being serviced fast enough (if I remember the bug correctly). This happened rarely, but I was able to reproduce it only when saturating many (8) interfaces at once. It is no longer happening with this patch and I haven't seen any side-effects.
Well, unfortunately, I can't use the patch. It doesn't address the underlying problem; it just avoids the symptom of the imbalance in the h_count value. Without some way to cause this to happen, and given that I don't see it in my own testing, this may be difficult to find. I will need to do it via code inspection, and that can be slow.
Lenny,

Any chance you can attach your patch in unified-diff format? It's hard to tell what part of the code you're touching here... i.e.:

diff -u ...
Created attachment 326787 [details]
Changes to at least alleviate this bug - in unified diff format

Attached is the unified diff.
Created attachment 326789 [details]
Changes to at least alleviate this bug - in unified diff format

Attached is a unified diff, and it is marked as a patch this time...
Thanks Lenny... I have to agree with Peter here. This patch just treats the symptom, not the real bug. If nlm_release_host() is being called too often, that probably means we have a potential use-after-free: the host could be freed while it's still in use because the refcount is too low. We need to address the root cause.

The best thing you could do to help us nail this down is to come up with a way to reliably reproduce the problem.

A while back, you said:

"After doing some code tracing I came to the conclusion that if the __rpc_execute returned an error condition (and it can return -EIO or -ETIMEDOUT) then this nlm_release_host is called twice."

...how did you determine this?
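To illustrate why a too-low refcount is dangerous rather than merely cosmetic, here is a toy model (user-space C with made-up names, not lockd code) where one extra put frees the object out from under a live user:

#include <stdio.h>
#include <stdlib.h>

struct obj {
	int count;		/* stands in for h_count */
};

static void put(struct obj *o)
{
	if (--o->count == 0) {
		printf("object freed\n");
		free(o);
	}
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	o->count = 2;	/* two legitimate users each hold a reference */
	put(o);		/* the spurious extra put from the error path */
	put(o);		/* user A's put now frees the object... */
	/* ...but user B still holds a pointer to it, so any further
	 * access through that pointer is a use-after-free. An
	 * underflowed count, as seen in the crash, is the visible
	 * symptom of exactly this kind of imbalance. */
	return 0;
}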
I agree with Peter also. This isn't a great solution; it only lets the machine keep running instead of crashing.

The way I determined this was simply to trace through the code and follow what was being called. Basically, this release gets called regardless of whether or not an error was returned from __rpc_execute(). That is a problem, because if __rpc_execute() has returned an error, it has already decremented this counter. That is my memory of the issue without diving back into it.
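A toy model of the double decrement I am describing (a user-space sketch with invented names, not the actual RHEL4 source):

#include <stdio.h>

static int h_count = 1;		/* one outstanding reference */

static void release_host_model(void)
{
	h_count--;
	if (h_count < 0)
		printf("BUG_ON: h_count = %d\n", h_count);
}

static int rpc_execute_model(int error)
{
	if (error) {
		release_host_model();	/* error path drops the reference... */
		return -5;		/* ...and returns -EIO */
	}
	return 0;
}

int main(void)
{
	int status = rpc_execute_model(1);	/* simulate an error under load */

	release_host_model();	/* exit callback releases unconditionally:
				 * the second decrement underflows */
	printf("status = %d, h_count = %d\n", status, h_count);
	return 0;
}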
Ok, so it sounds like lockd is occasionally calling into __rpc_execute(), getting an error, and then calling nlm_release_host() an extra time. We'll need to determine how we're calling into __rpc_execute() in the cases where it returns an error.

Lenny, I'm assuming the answer here is "no", but do you have a way to reliably reproduce this? Also, is this machine acting as an NFS server, a client, or both?
I do not have a way to reproduce this. I know that it was happening under heavy load when we were running an rsync, but that is about it. It was the NFS server.
I also have a number of NLM patches that are slated for release in RHEL4.8. It's hard to know for sure whether they will help this problem, but they might. It would be good to rule them out in any case. The case in question is here: https://bugzilla.redhat.com/show_bug.cgi?id=253754 Given that you're running a custom-patched kernel already, testing the patches associated with that BZ may be helpful.
Unfortunately the company making the product which this was occurring on went out of business in early 2007. I do not have any way of testing this.
Hmm... given that we haven't seen this in testing and have no way to reproduce it, we're at a bit of an impasse here. It's quite possible that this was already fixed by one of the other patches that went into later kernels. I'm going to move that we close this with a resolution of INSUFFICIENT_DATA. If this pops up again, or if you find a way to reproduce it, please reopen this bug and I'll have another look.