Bug 235712

Summary:

Kernel Bug fs/lockd/host.c

Product:

Red Hat Enterprise Linux 4

Reporter:

Jim Summers <jbsummers>

Component:

kernel

Assignee:

Jeff Layton <jlayton>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Martin Jenner <mjenner>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.0

CC:

staubach, steved

Target Milestone:

---

Target Release:

---

Hardware:

i386

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2007-05-03 14:06:41 UTC

Attachments:

Description	Flags
Log Messages right before the panic	none

Description Jim Summers 2007-04-09 19:19:22 UTC

Description of problem:
The server panicked, and after recovery the logs contained messages reporting a
kernel bug at fs/lockd/host.c, line 252.

Version-Release number of selected component (if applicable):
2.6.9-42.0.3.ELsmp


How reproducible:
It has happened twice in about a month and a half.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

The attached file has the kernel messages logged before the 5-second panic.

Comment 1 Jim Summers 2007-04-09 19:19:22 UTC

Created attachment 152008 [details]
Log Messages right before the panic

Comment 2 Jeff Layton 2007-04-26 16:55:05 UTC

Looking at this case, it appears the machine panicked on this assertion:

        BUG_ON(atomic_read(&host->h_count) < 0);

So h_count had already dropped below zero when this function was called. The big
question is why. Unfortunately, it's hard to tell much from the oops here. What
can you tell me about the conditions under which this is occurring? Would it be
possible to get a crash dump?

Comment 3 Jim Summers 2007-04-27 02:51:39 UTC

The first time it happened, the main campus's IT department was having a big
outage of network services.  So I had written it off to some severe delay in
name resolution or something.  

But this last time, it seemed to be a regular day and then all of a sudden the
calls started coming in and when I got to the console it had panicked.  

The setup is nothing out of the ordinary: a Red Hat server, Fedora 5 clients,
and a few OS X machines.

What does h_count keep track of?  Maybe that will give me a clue to some more
info I can provide.

I looked in /var/crash and did not see any files.  Are they saved somewhere else?

Thanks for your time and efforts.

Comment 4 Jeff Layton 2007-04-27 11:33:05 UTC

h_count is a reference count for the nlm_host structure. The kernel crashed
because the count was less than 0 right after being decremented, which means the
structure should not have been in use in the first place. Either we have a case
of too many "releases" and not enough "gets" on this struct, or some sort of
memory corruption is occurring. Unfortunately, I can't tell much from the oops
messages alone. I've seen some upstream reports of panics that look similar to
this, but nothing that tells me whether they were ever resolved.
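
To make the "gets" versus "releases" idea concrete, here is a minimal user-space
sketch of a reference count with an underflow check. It is not the lockd code:
struct host_sketch, host_get() and host_release() are hypothetical names made up
for illustration, and C11 atomics stand in for the kernel's atomic_t. Where the
kernel hits BUG_ON and panics, this sketch just returns false.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for nlm_host; only the reference count is modeled. */
struct host_sketch {
    atomic_int h_count;
};

/* "get": take a reference before using the host. */
static void host_get(struct host_sketch *h)
{
    atomic_fetch_add(&h->h_count, 1);
}

/*
 * "release": drop a reference. Returns false in the situation where the
 * kernel's BUG_ON(atomic_read(&host->h_count) < 0) would fire, that is,
 * when the count is negative after the decrement.
 */
static bool host_release(struct host_sketch *h)
{
    /* atomic_fetch_sub returns the old value, so subtract 1 for the new one. */
    int after = atomic_fetch_sub(&h->h_count, 1) - 1;

    if (after < 0) {
        fprintf(stderr, "refcount underflow: h_count=%d\n", after);
        return false;
    }
    return true;
}
```

With a count starting at 0, one host_get() followed by one host_release() is
balanced and leaves the count at 0. A second host_release() with no matching
host_get() drives it to -1, which is the "too many releases" case described
above. Memory corruption could produce the same negative value without any
unbalanced call.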

If you can get a core, that may help answer some questions. Our support people
should be able to help you do this. There are also some kbase articles on how to
set up diskdump and netdump. It would also be good to open a support case anyway
so that we can track this as an actual customer issue internally.

If you do so, reference this BZ case number so they're aware of it.

Comment 5 Jeff Layton 2007-05-02 13:36:54 UTC

Setting to needinfo until we can get some more info on how this is reproduced,
or a vmcore...

Comment 6 Jim Summers 2007-05-03 13:59:36 UTC

Sounds good here.

I am not sure I will be able to provide much more info.  I have since upgraded
to a newer kernel.  This is a production machine and I can't really shut it down
for testing and such.

Thanks

Comment 7 Jeff Layton 2007-05-03 14:06:41 UTC

Fair enough. I've not heard of anyone else hitting this issue and it sounds like
you've worked around it by using a different kernel. I'm going to close this
case with a resolution of INSUFFICIENT_DATA. If you find you're able to
reproduce it (and optimally, get a vmcore), then please reopen this case and
I'll have another look.