Bug 119571

Summary: kernel BUG at transaction.c:2025 (panic)
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Jon Jensen <jon>
Assignee: Stephen Tweedie <sct>
CC: petrides, riel
Doc Type: Bug Fix
Last Closed: 2004-04-21 13:33:17 UTC
Attachments:
Kernel panic messages from /var/log/messages (flags: none)

Description Jon Jensen 2004-03-31 14:16:47 UTC
While running a program that created many thousands of roughly 3-10 KB
files on an ext3 filesystem, this kernel panic occurred. (The program
is run at night, usually a few times a week, and has never caused any
problem before.) The server was otherwise mostly idle. It has a few
rarely-used websites.

Panic information was sent to an open ssh session, as well as to
/var/log/messages, and is attached.

Comment 1 Jon Jensen 2004-03-31 14:17:58 UTC
Created attachment 99000
Kernel panic messages from /var/log/messages

Comment 2 Ernie Petrides 2004-03-31 22:59:40 UTC
Stephen, this looks like something you fixed in RHEL 3 U1
(2.4.21-9.EL) in my tracking file 0169.sct.critical-ext3-fixes.patch
for bugzilla 77839.  But according to the attachment in comment #1,
customer is running -9.0.1.EL (the 1st security errata built on U1),
so your fix should already be incorporated there.

Could you please investigate this one?  Thanks.  -ernie


Comment 3 Stephen Tweedie 2004-04-01 23:01:57 UTC
Looking more closely, the error looks quite distinct from #77839.  I
don't recall _ever_ seeing this footprint before, on any kernel.
Very odd.

journal.c:499 in journal_write_metadata_buffer() is committing a
buffer, so we know that the buffer was just recently found on the
committing transaction's metadata list.  But when we go to refile the
buffer to indicate that it's now being journalled, we find that the
buffer is not marked as being part of this transaction any longer.
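
To make the failed check concrete, here is a toy model of the invariant
described above. It is written in Python rather than kernel C, and every
name in it (Transaction, Buffer, file_buffer, refile_for_commit, commit)
is hypothetical; it is only a sketch of the consistency check, not the
actual jbd code.

```python
class Transaction:
    """Toy stand-in for a journal transaction (not the kernel structure)."""

    def __init__(self, tid):
        self.tid = tid
        self.metadata_list = []  # buffers queued to be written by this commit


class Buffer:
    """Toy stand-in for a journalled metadata buffer."""

    def __init__(self, name):
        self.name = name
        self.transaction = None  # back-pointer to the owning transaction


def file_buffer(buf, txn):
    """Attach a buffer to a transaction, updating both sides of the link."""
    buf.transaction = txn
    txn.metadata_list.append(buf)


def refile_for_commit(buf, committing):
    """Move a buffer off the metadata list once it is being journalled.

    The buffer was just taken from the committing transaction's list, so
    its back-pointer must still name that transaction. If it does not, the
    two views of ownership disagree and we stop, much as the kernel
    assertion does.
    """
    if buf.transaction is not committing:
        raise AssertionError(
            "buffer %r is on transaction %d's list but is not marked "
            "as belonging to it" % (buf.name, committing.tid)
        )
    committing.metadata_list.remove(buf)
    return buf


def commit(txn):
    """Walk the metadata list and refile every buffer found on it."""
    written = []
    for buf in list(txn.metadata_list):
        written.append(refile_for_commit(buf, txn).name)
    return written
```

In this model the only way to trip the check is for the back-pointer to
change while the buffer is still on the list, which is why a failure in
such heavily exercised code raises the suspicion of corrupted memory.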

Now, that part of ext3 gets exercised *all* the time.  Most times when
we see a new, random inexplicable error from ext3 like this, it turns
out to be hardware error.  I reckon that's true in ~99% of the cases.
But it's just hard to be sure: there's not enough information in
this oops to go on.  We can see the transaction that's being committed
(it's in %edi), but not the transaction that the buffer belongs to
(that was in %eax and has been overwritten by the assert()'s printk
call by the time we get to the oops.)

Is there anything else in the logs at all that might point to a kernel
or system problem?  Is this at all reproducible?  Have there been any
other fs problems in the past?

Comment 4 Jon Jensen 2004-04-21 13:33:17 UTC
There haven't been any problems before or since, though I scanned the
logs just now to check for any anomalies, and found this:

Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.

That was 4 days after the panic. That hasn't shown up in the logs
before or since.
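
For anyone wanting to repeat that log check, here is a minimal sketch in
Python that pulls the "bad pmd" values out of syslog-style lines. The
function name find_bad_pmds and the regular expression are my own
assumptions, modelled only on the message format quoted above; other
kernels may word the message differently.

```python
import re

# Pattern modelled on the lines quoted above, e.g.
# "... kernel: memory.c:189: bad pmd 1c000000."
BAD_PMD = re.compile(r"memory\.c:\d+: bad pmd ([0-9a-fA-F]+)\.")


def find_bad_pmds(lines):
    """Return the pmd values (as integers) from 'bad pmd' kernel messages."""
    values = []
    for line in lines:
        match = BAD_PMD.search(line)
        if match:
            values.append(int(match.group(1), 16))
    return values


# Two of the lines from the log excerpt above, used as sample input.
SAMPLE = [
    "Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.",
    "Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.",
]

if __name__ == "__main__":
    for value in find_bad_pmds(SAMPLE):
        print("bad pmd 0x%08x" % value)
```

To scan a real log, pass the open file object instead of SAMPLE, for
example find_bad_pmds(open("/var/log/messages")).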

So I ran memtester and it failed every test at the same location in
memory, so it looks like bad RAM. I have never before seen a machine
remain so functional for months with bad RAM. Sorry for the false alarm!