Bug 108948

Summary: File system complaints ultimately trigger networking BUG
Product: [Retired] Red Hat Linux Reporter: Richard Schaal <rschaal_95135>
Component: kernel Assignee: Stephen Tweedie <sct>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 7.2   
Target Milestone: ---   
Target Release: ---   
Hardware: i586   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2004-09-10 14:12:00 UTC Type: ---
Attachments:
Description Flags
Dmesg output showing file system complaints and BUG output none

Description Richard Schaal 2003-11-03 16:54:55 UTC
Description of problem:
System became sluggish and failed to start "up2date", "ps -ef" 
and "top" - was able to run dmesg to collect some information.  
System refused to reboot cleanly, and the subsequent e2fsck run 
after a power cycle reported many filesystem errors.

Version-Release number of selected component (if applicable):

linux-2.4.20-20.7

How reproducible:

System is treated almost as a headless lonesome server that controls 
the firewall for the DSL connection (PPPoE).  It also runs services for 
four other systems - a mix of Linux and Windows - including Samba 
master, DNS master, DHCP master, and fax server (hylafax).  Firewall is 
using ipchains.

Usually network traffic is about 500MB/week unless I'm downloading 
ISOs... or doing a major upgrade.

Recently updated the system using RHN ( in the last three days ) and 
rebooted using the new updated kernel.  Everything seemed ok - 
checked firewall for functionality with external port scan and came 
up clean.  System had been running linux-2.4.21-ac3 without incident 
for several months.

Steps to Reproduce:
1.  Unsure of what activity triggers the problem - this occurred 
overnight, most likely.  Usually, overnight activity amounts to 
tripwire run that thrashes the disk pretty good.
  
Actual results:


Expected results:


Additional info:

Comment 1 Richard Schaal 2003-11-03 16:58:52 UTC
Created attachment 95680 [details]
Dmesg output showing file system complaints and BUG output

This is the data I was able to collect from the system.

Comment 2 Stephen Tweedie 2003-11-03 18:27:08 UTC
That's a corruption of something.  Without more data, it's impossible
to say what: whether it's bad memory, a bad disk/controller/cable, or
a kernel fault in a driver or filesystem or in the VM core.

The BUG() looks like it was triggered by previous failed IO.  The
kernel should respond more gracefully to that, but this doesn't tell
us what the corruption was caused by in the first place.  We need to
know if it is reproducible, and how to reproduce it.


Comment 3 Stephen Tweedie 2003-11-03 18:34:27 UTC
BUG() appears to be in ext2_get_branch() -> sb_bread().  We've got to

		bh = sb_bread(sb, le32_to_cpu(p->key));

which looks up an indirect block, but gets one which is both not
uptodate and not mapped.  bread() gets the BUG when it tries to read
in the buffer.

So a previous IO error has left behind an unmapped buffer, but one
that _is_ hashed.  Odd.  Sounds like one for Al to check out.
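
For readers following the state described above, here is a minimal
userspace sketch of the three buffer conditions involved (uptodate,
mapped, hashed) and why a read of a "hashed but unmapped and not
uptodate" buffer cannot proceed.  This is an illustrative model only,
not kernel source: the MODEL_* flags, struct model_buffer and
model_bread() are hypothetical names invented for this sketch, and the
place where the real kernel performs the equivalent check is not
reproduced here.

```c
/* Hypothetical userspace model; these names imitate, but are not,
 * the kernel's buffer_head state bits. */
enum {
    MODEL_UPTODATE = 1 << 0,  /* contents are valid, no read needed */
    MODEL_MAPPED   = 1 << 1,  /* a disk block is assigned to the buffer */
    MODEL_HASHED   = 1 << 2   /* buffer can be found by block lookup */
};

struct model_buffer {
    unsigned state;         /* bitwise OR of MODEL_* flags */
    unsigned long blocknr;  /* block number used for the lookup */
};

enum read_result {
    READ_CACHED,     /* buffer was already valid */
    READ_SUBMITTED,  /* a disk read would be issued */
    READ_WOULD_BUG   /* inconsistent state: nothing to read from */
};

/* Model of a bread()-style read path (assumed shape, simplified):
 * an up-to-date buffer is returned as is; otherwise a read must be
 * submitted, which only makes sense if the buffer is mapped. */
static enum read_result model_bread(const struct model_buffer *bh)
{
    if (bh->state & MODEL_UPTODATE)
        return READ_CACHED;
    if (!(bh->state & MODEL_MAPPED))
        return READ_WOULD_BUG;  /* the state reported in this bug */
    return READ_SUBMITTED;
}
```

In this model, a buffer whose state is only MODEL_HASHED (found by the
lookup, but neither mapped nor uptodate) yields READ_WOULD_BUG, which
corresponds to the leftover buffer suspected after the earlier IO error.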

Comment 4 Stephen Tweedie 2003-11-03 18:52:25 UTC
We're looking at it, but the kernel log is seriously truncated and
it's hard for us to identify what the *first* thing that went wrong is.

We may find a route to the BUG(), but without more information there's
almost no chance that we'll be able to diagnose whether the initial
problem here was due to hardware or software.

Comment 5 Richard Schaal 2003-11-03 19:05:26 UTC
Do you want me to set up a serial console and catch all the 
messages? - Most likely, we could catch the first complaint...

Will setting debug on the kernel boot help?



Comment 6 Stephen Tweedie 2004-09-10 14:12:00 UTC
Were you able to capture any more here?  These problems still look
more like hardware than anything else, and they don't match any other
footprint I've been looking at.  Please reopen if you can still
reproduce against a current kernel.


Comment 7 Richard Schaal 2004-09-10 14:39:32 UTC
I've updated the system to 2.4.27 now, and have not seen the issue in
21 days of uptime.  I hadn't gotten a response on my queries in
comment#5, so haven't set up the serial console recording.  

Comment 8 Stephen Tweedie 2004-09-10 15:42:09 UTC
debug on the kernel boot won't help here.  A full debug kernel build
might (enable things like slab poisoning in .config.)