Bug 119571 - kernel BUG at transaction.c:2025 (panic)
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Stephen Tweedie
Reported: 2004-03-31 09:16 EST by Jon Jensen
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-04-21 09:33:17 EDT

Attachments
Kernel panic messages from /var/log/messages (2.50 KB, text/plain)
2004-03-31 09:17 EST, Jon Jensen

Description Jon Jensen 2004-03-31 09:16:47 EST
While running a program that created many thousands of roughly 3-10 KB
files on an ext3 filesystem, this kernel panic occurred. (The program
is run at night, usually a few times a week, and has never caused any
problem before.) The server was otherwise mostly idle. It has a few
rarely-used websites.

Panic information was sent to an open ssh session, as well as to
/var/log/messages, and is attached.
Comment 1 Jon Jensen 2004-03-31 09:17:58 EST
Created attachment 99000
Kernel panic messages from /var/log/messages
Comment 2 Ernie Petrides 2004-03-31 17:59:40 EST
Stephen, this looks like something you fixed in RHEL 3 U1
(2.4.21-9.EL) in my tracking file 0169.sct.critical-ext3-fixes.patch
for bugzilla 77839.  But according to the attachment in comment #1,
customer is running -9.0.1.EL (the 1st security errata built on U1),
so your fix should already be incorporated there.

Could you please investigate this one?  Thanks.  -ernie
Comment 3 Stephen Tweedie 2004-04-01 18:01:57 EST
Looking more closely, the error looks quite distinct from #77839.  I
don't recall _ever_ seeing this footprint before, on any kernel.
Very odd.

journal.c:499 in journal_write_metadata_buffer() is committing a
buffer, so we know that the buffer was just recently found on the
committing transaction's metadata list.  But when we go to refile the
buffer to indicate that it's now being journalled, we find that the
buffer is not marked as being part of this transaction any longer.
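
The inconsistency described above can be illustrated with a toy model.
This is only a sketch, not the kernel's jbd code: the class and function
names are hypothetical, and the "corruption" is simulated by overwriting
the buffer's owner field by hand.

```python
# Toy model of the ownership check described above.
# NOT the kernel's jbd code; every name here is hypothetical.

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.metadata_list = []   # buffers waiting to be written to the journal
        self.io_list = []         # buffers currently being journalled


class JournalHead:
    def __init__(self, name):
        self.name = name
        self.transaction = None   # owning transaction, or None if unowned


def file_buffer(jh, transaction, target_list):
    # Invariant: a buffer may only be filed on the transaction that already
    # owns it, or on a transaction that is adopting an unowned buffer.
    if not (jh.transaction is transaction or jh.transaction is None):
        raise AssertionError(f"buffer {jh.name} belongs to another transaction")
    jh.transaction = transaction
    target_list.append(jh)


def commit_metadata(transaction):
    # Take each buffer off the metadata list and refile it as "being journalled".
    while transaction.metadata_list:
        jh = transaction.metadata_list.pop(0)
        file_buffer(jh, transaction, transaction.io_list)


# Normal case: the buffer is owned by the committing transaction.
committing = Transaction(tid=1)
jh = JournalHead("inode-table-block")
file_buffer(jh, committing, committing.metadata_list)
commit_metadata(committing)

# Failure case: the buffer is on the committing transaction's metadata list,
# but its owner field points somewhere else.
other = Transaction(tid=2)
bad = JournalHead("bitmap-block")
file_buffer(bad, committing, committing.metadata_list)
bad.transaction = other           # simulate the inconsistent state
try:
    commit_metadata(committing)
    tripped = False
except AssertionError:
    tripped = True
```

In the model, the check trips only because the owner field was changed
behind the list's back, which is the kind of state that should never arise
in normal operation.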

Now, that part of ext3 gets exercised *all* the time.  Most times when
we see a new, random inexplicable error from ext3 like this, it turns
out to be hardware error.  I reckon that's true in ~99% of the cases.
But it's just hard to be sure: there's not enough information in
this oops to go on.  We can see the transaction that's being committed
(it's in %edi), but not the transaction that the buffer belongs to
(that was in %eax and has been overwritten by the assert()'s printk
call by the time we get to the oops.)

Is there anything else in the logs at all that might point to a kernel
or system problem?  Is this at all reproducible?  Have there been any
other fs problems in the past?
Comment 4 Jon Jensen 2004-04-21 09:33:17 EDT
There haven't been any problems before or since, though I scanned the
logs just now to check for any anomalies, and found this:

Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.

That was 4 days after the panic. That hasn't shown up in the logs
before or since.
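
A log scan like this one can be sketched in a few lines. The sample text
is copied from the excerpt above; the regular expression assumes the
message layout seen in those lines, and the function name is made up for
illustration.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
LOG = """\
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr  3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.
"""

# Assumed layout: "... kernel: <file>:<line>: bad pmd <hex value>."
BAD_PMD = re.compile(r"kernel: (\S+):(\d+): bad pmd ([0-9a-fA-F]+)\.")


def find_bad_pmd(text):
    """Return the reported pmd values (as integers) found in the log text."""
    hits = []
    for line in text.splitlines():
        match = BAD_PMD.search(line)
        if match:
            hits.append(int(match.group(3), 16))
    return hits


values = find_bad_pmd(LOG)
counts = Counter(values)
for value, n in sorted(counts.items()):
    print(f"{value:08x} seen {n} time(s)")
```

To scan a real log, the text could be read from a file such as
/var/log/messages and passed to find_bad_pmd() instead of the sample.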

So I ran memtester and it failed every test at the same location in
memory, so it looks like bad RAM. I have never before seen a machine
remain so functional for months with bad RAM. Sorry for the false alarm!
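
Why repeated failures at one location point to hardware can be shown with
a small simulation. This is not how memtester is implemented; it is a
sketch with a hypothetical simulated memory in which one bit of one cell
is stuck at 0, tested with a few fixed bit patterns.

```python
class FaultyMemory:
    """Simulated byte-wide RAM with one bit of one cell stuck at 0 (hypothetical)."""

    def __init__(self, size, bad_index, stuck_bit):
        self.cells = [0] * size
        self.bad_index = bad_index
        self.mask = ~(1 << stuck_bit) & 0xFF

    def write(self, index, value):
        if index == self.bad_index:
            value &= self.mask    # the stuck bit never stores a 1
        self.cells[index] = value & 0xFF

    def read(self, index):
        return self.cells[index]


def pattern_test(mem, size, pattern):
    """Fill memory with the pattern, read it back, return mismatching indices."""
    for i in range(size):
        mem.write(i, pattern)
    return [i for i in range(size) if mem.read(i) != pattern]


SIZE = 64
mem = FaultyMemory(size=SIZE, bad_index=37, stuck_bit=3)
failures = {p: pattern_test(mem, SIZE, p) for p in (0xFF, 0xAA, 0x55, 0x0F)}
for pattern, bad in failures.items():
    print(f"pattern {pattern:#04x}: failing indices {bad}")
```

Every pattern that sets the stuck bit fails at the same index, while a
pattern that leaves the bit clear passes, which is why such tests use
several complementary patterns.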
