119571 – kernel BUG at transaction.c:2025 (panic)

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie

Reported:	2004-03-31 14:16 UTC by Jon Jensen
Modified:	2007-11-30 22:07 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-04-21 13:33:17 UTC
Target Upstream Version:
Embargoed:

Attachments
Kernel panic messages from /var/log/messages (2.50 KB, text/plain) 2004-03-31 14:17 UTC, Jon Jensen

Description Jon Jensen 2004-03-31 14:16:47 UTC

While running a program that created many thousands of roughly 3-10 KB
files on an ext3 filesystem, this kernel panic occurred. (The program
is run at night, usually a few times a week, and has never caused any
problem before.) The server was otherwise mostly idle. It has a few
rarely-used websites.
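
A workload of this shape can be approximated with a short script. The
following is only a minimal sketch, not the reporter's program, which is
not shown in this report: the function name, file naming scheme, seed,
and target directory are all assumptions made for illustration. It writes
a given number of files whose sizes fall in the 3-10 KB range described
above.

```python
import os
import random


def create_small_files(directory, count, min_size=3 * 1024,
                       max_size=10 * 1024, seed=0):
    """Write `count` files of random size in [min_size, max_size] bytes.

    Returns the total number of bytes written. All names and defaults
    here are hypothetical; they only mirror the workload description.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    os.makedirs(directory, exist_ok=True)
    total = 0
    for i in range(count):
        size = rng.randint(min_size, max_size)
        path = os.path.join(directory, f"file_{i:06d}.dat")
        with open(path, "wb") as f:
            f.write(os.urandom(size))  # arbitrary payload
        total += size
    return total
```

Calling `create_small_files("/path/on/ext3", 50000)` on a scratch
directory would generate a comparable burst of small-file creation; the
count is a guess, since the report only says "many thousands".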

Panic information was sent to an open ssh session, as well as to
/var/log/messages, and is attached.

Comment 1 Jon Jensen 2004-03-31 14:17:58 UTC

Created attachment 99000
Kernel panic messages from /var/log/messages

Comment 2 Ernie Petrides 2004-03-31 22:59:40 UTC

Stephen, this looks like something you fixed in RHEL 3 U1
(2.4.21-9.EL) in my tracking file 0169.sct.critical-ext3-fixes.patch
for bugzilla 77839.  But according to the attachment in comment #1,
customer is running -9.0.1.EL (the 1st security errata built on U1),
so your fix should already be incorporated there.

Could you please investigate this one?  Thanks.  -ernie

Comment 3 Stephen Tweedie 2004-04-01 23:01:57 UTC

Looking more closely, the error looks quite distinct from #77839.  I
don't recall _ever_ seeing this footprint before, on any kernel.
Very odd.

journal.c:499 in journal_write_metadata_buffer() is committing a
buffer, so we know that the buffer was just recently found on the
committing transaction's metadata list.  But when we go to refile the
buffer to indicate that it's now being journalled, we find that the
buffer is not marked as being part of this transaction any longer.
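
The invariant being described can be illustrated with a toy model. This
is a sketch only, written in Python rather than kernel C, and every
class and function name in it is hypothetical; it does not reproduce the
actual jbd code or its data structures. It shows a commit loop that
takes buffers from the committing transaction's metadata list and
refuses to refile one whose owner pointer no longer matches, which is
the kind of mismatch a corrupted pointer in memory would produce.

```python
class InvariantViolation(Exception):
    """Raised when a buffer is not owned by the committing transaction."""


class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.metadata_list = []  # buffers queued for journalling


class Buffer:
    def __init__(self, name):
        self.name = name
        self.transaction = None  # owning transaction, if any
        self.journalled = False


def file_metadata(buf, txn):
    """Attach a buffer to a transaction's metadata list."""
    buf.transaction = txn
    txn.metadata_list.append(buf)


def refile_as_journalled(buf, committing):
    """Mark the buffer as journalled, checking ownership first."""
    if buf.transaction is not committing:
        raise InvariantViolation(
            f"buffer {buf.name!r} is not owned by transaction "
            f"{committing.tid}"
        )
    buf.journalled = True


def commit(txn):
    """Refile every buffer found on the transaction's metadata list."""
    for buf in list(txn.metadata_list):
        refile_as_journalled(buf, txn)
    txn.metadata_list.clear()
```

In this model a buffer found on the list always passes the check unless
something outside the normal code path has altered `buf.transaction`,
which is why an unexplained failure here raises suspicion of hardware.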

Now, that part of ext3 gets exercised *all* the time.  Most times when
we see a new, random, inexplicable error from ext3 like this, it turns
out to be a hardware error.  I reckon that's true in ~99% of the cases.
But it's just hard to be sure: there's not enough information in
this oops to go on.  We can see the transaction that's being committed
(it's in %edi), but not the transaction that the buffer belongs to
(that was in %eax and has been overwritten by the assert()'s printk
call by the time we get to the oops).

Is there anything else in the logs at all that might point to a kernel
or system problem?  Is this at all reproducible?  Have there been any
other fs problems in the past?

Comment 4 Jon Jensen 2004-04-21 13:33:17 UTC

There haven't been any problems before or since, though I scanned the
logs just now to check for any anomalies, and found this:

Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.

That was 4 days after the panic. That hasn't shown up in the logs
before or since.
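
A scan like this is easy to repeat with a short script. The following is
a minimal sketch that assumes syslog lines in exactly the format quoted
above; the function name and the regular expression are illustrative
assumptions, not part of any existing tool.

```python
import re

# Matches lines such as:
#   Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
BAD_PMD = re.compile(r"kernel: memory\.c:\d+: bad pmd ([0-9a-fA-F]+)\.")


def find_bad_pmd(lines):
    """Return the reported pmd values (as integers) found in `lines`."""
    hits = []
    for line in lines:
        match = BAD_PMD.search(line)
        if match:
            hits.append(int(match.group(1), 16))
    return hits
```

It could be applied to a log with
`find_bad_pmd(open("/var/log/messages", errors="replace"))`; any
non-empty result is worth following up with a memory test.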

So I ran memtester and it failed every test at the same location in
memory, so it looks like bad RAM. I have never before seen a machine
remain so functional for months with bad RAM. Sorry for the false alarm!
