Bug 102792

Summary: Kernel 2.4.20-20.7 dies under very heavy network traffic
Product: [Retired] Red Hat Linux
Reporter: abs01
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 7.3
CC: gary.mansell, pfrields, riel, snielsen
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:41:27 UTC

Description abs01 2003-08-21 05:14:51 UTC
Description of problem: Kernel 2.4.20-20.7 installs fine and under a normal load
may work just fine for most users. But under very heavy network traffic the
kernel hangs. Probably a problem with the network driver (gigabit tg3, I
believe; the machine is a Dell 2650).

Your Bugzilla wouldn't let me select "Kernel" as the component.


Version-Release number of selected component (if applicable): 2.4.20-20.7


How reproducible: every time!


Steps to Reproduce:
1. Run a robot fetching 2+ million URLs a day and you'll see the bug.
    
Actual results:


Expected results:


Additional info:
This same thing happened with an older kernel for 7.3, but I can't remember
which one. Kernel 2.4.18-27.7.xsmp works just fine. Load up the new Kernel
2.4.20-20.7 and it dies every time under heavy load.

Comment 1 Gary Mansell 2003-11-26 09:23:02 UTC
I see this bug too - I have a Dell PE2650 running Red Hat Linux 7.3.
The machine is an NFS server that has been running for at least 6
months on kernel 2.4.18-27.7.xsmp quite happily.

I tested the new 2.4.20-20.7smp kernel on my test machine for a week
with no problems (admittedly I did not hammer it with NFS traffic) and
presumed it OK. When I then up2dated my production server to this
kernel, the machine crashed within one hour. I then rebooted and it
crashed after about 4 hours. Since rebooting into the old
2.4.18-27.7.xsmp kernel, the machine has been fine again.

The symptoms of the crash were that the server just hung/locked up -
it would not respond to pings. There were no messages in the log files.

Comment 2 Gary Mansell 2003-12-16 16:29:50 UTC
Is anything being done about this ???

Comment 3 Steve Nielsen 2003-12-16 16:47:42 UTC
I too have been experiencing crashes under heavy network/disk I/O load.
I am (was) using kernel 2.4.20-20.7smp on a Dell 2450 with a 10/100
Ethernet adapter. When I get a crash the box is completely dead
to the world and no log messages related to the crash are written to
syslog or the console. Is there a way to increase verbosity so
something will get logged? Anyway, to work around the crashing issue I
now boot off the kernel that was stable for me before the update
(2.4.18-10smp). I am not using any binary-only modules. I am taint
free. If you need more info from me please let me know.

Comment 4 abs01 2003-12-16 22:22:08 UTC
Red Hat isn't going to do anything about it. After all, they can't wait to drop
support for 7.3 on December 31st and try to push you to Enterprise or something
else. I'm moving away from Red Hat and compiling my own from sources at
kernel.org. I can't wait to get away from all the rpm crap anyway.

Comment 5 acount closed by user 2003-12-16 22:32:31 UTC
tg3 is not the problem, it's *aacraid* driver.

Comment 6 abs01 2003-12-16 23:43:22 UTC
If you think it's the *aacraid* driver then I guess you can download the most recent 
one at:

http://domsch.com/linux/

There's quite a bit of discussion here and, to be honest with you, I'm not sure
what to do or how to fix my current box. Currently I'm running kernel
2.4.20-24.7smp without a crash, but CPU is being eaten by kscand, which is the
#1 CPU user on the box, running almost constantly at 5% on a dual
XEON(TM) CPU 1.80GHz.

If anyone has any suggestions on calming this beast down without breaking the box, 
let me know.

Comment 7 Steve Nielsen 2003-12-17 14:36:13 UTC
Xose, why do you say it's the aacraid driver? Are there settings
where I can have the kernel print something out when things crash?
Currently I am not getting anything written to the console or to
syslog.
Thanks,
Steve

 

Comment 8 acount closed by user 2003-12-18 01:24:01 UTC
abs01: 

The kscand bug is another thing, and that bug is already open in Bugzilla.

Dave Jones is going to release a new kernel errata very soon: try the
*beta* release http://people.redhat.com/davej/rhl-errata/2.4.20-27.7/


Comment 9 acount closed by user 2003-12-18 02:10:02 UTC
Steve Nielsen:

A colleague of mine has some Dell 2650s. She updated to the latest BIOS,
backplane firmware, RAID firmware and RHL kernel (2.4.20-20.x) and the
systems are stable.

Other people on the Dell mailing list have problems with the aacraid
driver, but the tg3 driver is usually stable. If you have any doubt, try
bcm-5700 instead of tg3 (danger!! unsupported by Red Hat):
http://www.broadcom.com/drivers/downloaddrivers.php

But I am sure that the problem is aacraid.

To catch the bug, see /usr/src/linux-2.4/Documentation/nmi_watchdog.txt
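
For anyone following that pointer, here is a rough sketch of how one might confirm the NMI watchdog is actually ticking after enabling it. It assumes the watchdog was turned on at boot with the nmi_watchdog= parameter that file describes (nmi_watchdog=1 for the IO-APIC mode, as far as I recall), that /proc/interrupts has an "NMI:" row with one count per CPU, and that a reasonably recent Python 3 is available, which would not be the stock interpreter on Red Hat Linux 7.3. The function names are made up for illustration. In the other watchdog mode the counts may not rise on an idle CPU, so a "False" result is only a hint, not proof.

```python
# Sketch only: sample the NMI row of /proc/interrupts twice and compare.
# All function names here are hypothetical, chosen for this example.
import time


def nmi_counts(text):
    """Return the per-CPU NMI counts found in /proc/interrupts content."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Newer kernels append a text description after the counts.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI row at all


def watchdog_ticking(before, after):
    """True if every CPU's NMI count increased between the two samples."""
    return (bool(before)
            and len(before) == len(after)
            and all(b < a for b, a in zip(before, after)))


def read_counts(path="/proc/interrupts"):
    """Read and parse the live interrupt table."""
    with open(path) as f:
        return nmi_counts(f.read())


def main(interval=5):
    """Take two samples `interval` seconds apart and report the result."""
    first = read_counts()
    time.sleep(interval)
    second = read_counts()
    print("NMI counts:", first, "->", second)
    print("watchdog ticking:", watchdog_ticking(first, second))
```

Calling main() on the affected box prints the two samples. If the counts stay at zero, the watchdog is probably not enabled and a hard lockup will still leave nothing on the console.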

Comment 10 Steve Nielsen 2003-12-22 15:09:49 UTC
I tried using the beta errata kernel from Dave Jones and my system
crashed again after a couple of days. Same symptoms (no syslog, no
text printed to the console). I am using RHL 7.3, a Dell 2450, 10/100
Ethernet cards, and no hardware RAID, only software RAID.

Comment 11 Bugzilla owner 2004-09-30 15:41:27 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/