Bug 207123
| Summary: | kernel assert in nlm_release_host | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Lenny Maiorani <lenny> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3 | CC: | jbaron, jlayton, staubach, steved, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-06-23 14:15:57 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Lenny Maiorani
2006-09-19 15:40:28 UTC
Created attachment 136891 [details]
Changes to at least alleviate this bug
After doing some code tracing, I came to the conclusion that if __rpc_execute() returns an error condition (it can return -EIO or -ETIMEDOUT), then nlm_release_host() is called twice.

Attached is my patch to fs/lockd/host.c so that this double decrement doesn't cause a kernel panic.

Question: is there a better way of fixing this?
Created attachment 136894 [details]
Changes to at least alleviate this bug
Oops
Created attachment 136907 [details]
Changes to at least alleviate this bug
Ok, this one really works.
Yes, the proposed patch may alleviate the bug, but it doesn't fix it, and it probably just delays something else bad from happening. We need to figure out why nlm_release_host() is called too often and address that. Is there some way to reproduce this situation? I have not seen failures like this during my testing of the NFS and locking code.

I don't know of any way to reproduce it. It is a timing issue caused when items are not being serviced fast enough (if I remember the bug correctly). It happened rarely, and I was able to reproduce it only when saturating many (8) interfaces at once. It is no longer happening with this patch, and I haven't seen any side effects.

Well, unfortunately, I can't use the patch. It doesn't address the underlying problem; it just avoids the symptom caused by the imbalance in the h_count values. Without some way to cause this to happen, and given that I don't see it in my own testing, this may be difficult to find. I will need to do it via code inspection, and that can be slow.

Lenny, any chance you can attach your patch in unified-diff format? It's hard to tell what part of the code you're touching here... i.e.: diff -u ...

Created attachment 326787 [details]
Changes to at least alleviate this bug - in unified diff format
Attached is the unified diff.
Created attachment 326789 [details]
Changes to at least alleviate this bug - in unified diff format
Attached is a unified diff and it is marked as a patch this time...
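For anyone following along, a unified diff like the one requested above can be generated against a backup copy of the edited file. The paths below are illustrative, not taken from Lenny's actual attachment:

```shell
# Keep a pristine copy before editing, then generate a unified diff.
cp fs/lockd/host.c fs/lockd/host.c.orig
# ... edit fs/lockd/host.c ...
diff -u fs/lockd/host.c.orig fs/lockd/host.c > nlm_release_host.patch
```

The `-u` flag produces the `---`/`+++` header and `@@` hunk markers that make it clear which function and lines the change touches.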
Thanks Lenny... I have to agree with Peter here. This patch is just treating the symptom and not the real bug. If nlm_release_host() is being called too often, that probably means we have a potential use-after-free: the host could be freed while it's still in use because the refcount is too low. We need to address the root cause. The best thing you could do to help us nail this down is to come up with a way to reliably reproduce the problem...

A while back, you said: "After doing some code tracing I came to the conclusion that if the __rpc_execute returned an error condition (and it can return -EIO or -ETIMEDOUT) then this nlm_release_host is called twice." ...how did you determine this?

I agree with Peter also. This isn't a great solution; it only lets the machine keep running instead of crashing. The way I determined this was simply to follow through the code, see what was being called, and keep tracing. Basically, this spot gets called regardless of whether an error was returned from __rpc_execute(). That is a problem, because if __rpc_execute() has returned an error, it has already decremented this counter. That is my memory of the issue without diving back into it.

Ok, so it sounds like lockd is occasionally calling into __rpc_execute(), getting an error, and then calling an extra nlm_release_host(). We'll need to determine how we're calling into __rpc_execute() when it returns an error. Lenny, I'm assuming the answer here is "no", but do you have a way to reliably reproduce this? Also, is this machine acting as an NFS server, a client, or both?

I do not have a way to reproduce this. I know that it was happening under heavy load when we were running an rsync, but that is about it. It was the NFS server.

I also have a number of NLM patches that are slated for release in RHEL 4.8. It's hard to know for sure whether they will help this problem, but they might. It would be good to rule them out in any case.
The case in question is here: https://bugzilla.redhat.com/show_bug.cgi?id=253754 Given that you're running a custom-patched kernel already, testing the patches associated with that BZ may be helpful.

Unfortunately, the company making the product on which this was occurring went out of business in early 2007. I do not have any way of testing this.

Hmm... given that we haven't seen this in testing and have no way to reproduce it, we're at a bit of an impasse here. It's quite possible that this was already fixed by one of the other patches that went into later kernels. I'm going to move that we close this with a resolution of INSUFFICIENT_DATA. If this pops up again, or if you find a way to reproduce it, please reopen it and I'll have another look.