Bug 119571
Summary: kernel BUG at transaction.c:2025 (panic)
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Jon Jensen <jon>
Assignee: Stephen Tweedie <sct>
CC: petrides, riel
Doc Type: Bug Fix
Last Closed: 2004-04-21 13:33:17 UTC

Description
Jon Jensen
2004-03-31 14:16:47 UTC
Created attachment 99000
Kernel panic messages from /var/log/messages

Stephen, this looks like something you fixed in RHEL 3 U1 (2.4.21-9.EL) in my tracking file 0169.sct.critical-ext3-fixes.patch for bugzilla 77839. But according to the attachment in comment #1, customer is running -9.0.1.EL (the 1st security errata built on U1), so your fix should already be incorporated there. Could you please investigate this one? Thanks. -ernie

Looking more closely, the error looks quite distinct from #77839. I don't recall _ever_ seeing this footprint before, on any kernel. Very odd.

journal.c:499 in journal_write_metadata_buffer() is committing a buffer, so we know that the buffer was just recently found on the committing transaction's metadata list. But when we go to refile the buffer to indicate that it's now being journalled, we find that the buffer is not marked as being part of this transaction any longer.

Now, that part of ext3 gets exercised *all* the time. Most times when we see a new, random inexplicable error from ext3 like this, it turns out to be hardware error. I reckon that's true in ~99% of the cases. But it's just hard to be sure: there's not enough information in this oops to go on. We can see the transaction that's being committed (it's in %edi), but not the transaction that the buffer belongs to (that was in %eax and has been overwritten by the assert()'s printk call by the time we get to the oops.)

Is there anything else in the logs at all that might point to a kernel or system problem? Is this at all reproducible? Have there been any other fs problems in the past?

There haven't been any problems before or since, though I scanned the logs just now to check for any anomalies, and found this:

Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.

That was 4 days after the panic. That hasn't shown up in the logs before or since. So I ran memtester and it failed every test at the same location in memory, so it looks like bad RAM. I haven't before seen a machine remain so functional for months with bad RAM. Sorry for the false alarm!
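
For anyone chasing a similar footprint, the log scan described above can be roughly sketched as a small script. This is only an illustrative sketch, not a tool used in this report: the function name find_bad_pmd and the regular expression are assumptions, modelled on the syslog lines quoted in this bug, and other kernels may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the lines quoted in this report, e.g.
#   Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
# The exact message layout is an assumption and may differ on other kernels.
BAD_PMD = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"memory\.c:\d+: bad pmd (?P<value>[0-9a-fA-F]+)\."
)


def find_bad_pmd(lines):
    """Return (timestamp, host, pmd_value) for every 'bad pmd' kernel message."""
    hits = []
    for line in lines:
        match = BAD_PMD.search(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), int(match.group("value"), 16))
            )
    return hits


# Sample input: the seven lines quoted in this report.
SAMPLE = """\
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 1c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 20000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 3c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 24000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 2c000000.
Apr 3 13:06:21 rs8 kernel: memory.c:189: bad pmd 08000000.
"""

hits = find_bad_pmd(SAMPLE.splitlines())
# Count how often each pmd value was reported.
value_counts = Counter(value for _, _, value in hits)
```

To run it against a live system, pass an open log file instead of the sample, for example find_bad_pmd(open("/var/log/messages")). Hits like these are only a hint that something is corrupting memory; in this report it took a memtester run to confirm the bad RAM.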