Red Hat Bugzilla – Bug 108948
File system complaints ultimately trigger networking BUG
Last modified: 2007-04-18 12:59:05 EDT
Description of problem:
System became sluggish and failed to start "up2date", "ps -ef"
and "top" - was able to run dmesg to collect some information.
System refused to reboot cleanly, and a subsequent e2fsck run after
a power cycle reported many filesystem errors.
Version-Release number of selected component (if applicable):
System is treated almost as a headless lonesome server that controls
the firewall for the DSL connection PPPOE. Also runs services for
four other systems - a mix of Linux and Windows including Samba
Master, DNS master, DHCP master, Fax server (hylafax). Firewall is
Usually network traffic is about 500MB/week unless I'm downloading
ISOs... or doing a major upgrade.
Recently updated the system using RHN ( in the last three days ) and
rebooted using the new updated kernel. Everything seemed ok -
checked firewall for functionality with external port scan and came
up clean. System had been running linux-2.4.21-ac3 without incident
for several months.
Steps to Reproduce:
1. Unsure of what activity triggers the problem - this occurred
overnight, most likely. Usually, overnight activity amounts to
tripwire run that thrashes the disk pretty good.
Created attachment 95680
Dmesg output showing file system complaints and BUG output
This is the data I was able to collect from the system.
That's a corruption of something. Without more data, it's impossible
to say what: whether it's bad memory, a bad disk/controller/cable, or
a kernel fault in a driver or filesystem or in the VM core.
The BUG() looks like it was triggered by previous failed IO. The
kernel should respond more gracefully to that, but this doesn't tell
us what the corruption was caused by in the first place. We need to
know if it is reproducible, and how to reproduce it.
BUG() appears to be in ext2_get_branch() -> sb_bread(). We've got to
bh = sb_bread(sb, le32_to_cpu(p->key));
which looks up an indirect block, but gets one which is both not
uptodate and not mapped. bread() gets the BUG when it tries to read
in the buffer.
So a previous IO error has left behind an unmapped buffer, but one
that _is_ hashed. Odd. Sounds like one for Al to check out.
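
To make the buffer state described above concrete, here is a minimal userspace sketch. It is a simplified model and not kernel source: the names model_buffer, model_bread and model_buffer_after_failed_io are hypothetical, and the three flags only stand in for the uptodate, mapped and hashed states mentioned in this comment. It assumes the read path trips BUG() when asked to read a buffer that has no disk mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's buffer state. */
struct model_buffer {
    bool uptodate;  /* contents are known to match the disk */
    bool mapped;    /* buffer has a valid on-disk block address */
    bool hashed;    /* buffer can still be found by a cache lookup */
};

enum read_outcome {
    READ_CACHED,     /* already uptodate, no IO needed */
    READ_SUBMITTED,  /* not uptodate but mapped: IO can be issued */
    READ_WOULD_BUG   /* not uptodate and not mapped: nowhere to read from */
};

/* Model of the decision a bread()-style read has to make (assumption). */
enum read_outcome model_bread(const struct model_buffer *bh)
{
    assert(bh != NULL);
    if (bh->uptodate)
        return READ_CACHED;
    if (!bh->mapped)
        return READ_WOULD_BUG;
    return READ_SUBMITTED;
}

/* The odd state from this report: still hashed, but unmapped and stale. */
struct model_buffer model_buffer_after_failed_io(void)
{
    struct model_buffer bh = { .uptodate = false, .mapped = false, .hashed = true };
    return bh;
}
```

In this model, a buffer that a failed IO left hashed but unmapped is still returned by the lookup, and the next read of it lands in the READ_WOULD_BUG case. A healthy cached buffer never reaches the mapping check at all.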
We're looking at it, but the kernel log is seriously truncated and
it's hard for us to identify what the *first* thing that went wrong is.
We may find a route to the BUG(), but without more information there's
almost no chance that we'll be able to diagnose whether the initial
problem here was due to hardware or software.
Do you want me to set up a serial console and catch all the
messages? - Most likely, we could catch the first complaint...
Will setting debug on the kernel boot help?
Were you able to capture any more here? These problems still look
more like hardware than anything else, and they don't match any other
footprint I've been looking at. Please reopen if you can still
reproduce against a current kernel.
I've updated the system to 2.4.27 now, and have not seen the issue in
21 days of uptime. I hadn't gotten a response on my queries in
comment #5, so haven't set up the serial console recording.
Setting debug on the kernel boot won't help here. A full debug kernel
build might (enable things like slab poisoning in .config).