From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131
Description of problem:
Running kernel 2.4.20-9 on an i686 [GenuineIntel Pentium III (Coppermine) 1004 MHz], I
have had consistent resets during large FTP and HTTP data transfers.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. FTP a directory full of large files (20-300 MB each) [this also happens during
some large, single-file HTTP transfers]
2. The machine resets/reboots somewhere in the middle of the transfer.
It does not consistently reboot at a particular file or after a particular quantity of
data. I've had about 10 resets over a week or so, and on a few occasions
it has simply locked hard without resetting (it won't respond to pings or accept
any input other than a manual reset/power-off).
It seems tied to large data transfers, because rsync (which generates a lot of
traffic but involves no really large files) does not trigger the problem. I haven't tried
scp'ing the large files, but I don't see why it would matter.
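Because the reset leaves no log entries, one way to see at least when it hits is to timestamp each transfer to a local log and force it to disk. Below is a minimal sketch of that idea. It assumes the destination is a directory on the other machine that is already mounted (for example over NFS), and `repro_transfer` and its arguments are hypothetical, not an existing tool. `sync` is called after every log line so each entry reaches the disk before a hard reset can lose it.

```shell
#!/bin/sh
# Hypothetical reproduction loop: copy one large file to a destination
# directory over and over, logging the start and end of each pass to a
# local file and flushing it with sync so the log survives a hard reset.
repro_transfer() {
    src=$1      # large source file, e.g. one of the 20-300 MB files
    dest=$2     # destination directory (assumed to be a mounted remote path)
    log=$3      # local log file
    passes=$4   # number of copies to attempt
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') pass $i start" >> "$log"
        sync
        cp "$src" "$dest/" || return 1
        echo "$(date '+%Y-%m-%d %H:%M:%S') pass $i done" >> "$log"
        sync
        i=$((i + 1))
    done
}
```

For example, `repro_transfer /data/big.iso /mnt/backup /root/repro.log 50` (all paths hypothetical). After a reset, the last line of the log shows which pass was in progress and when it started.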
To the best of my knowledge, this is not caused by faulty hardware. I
have run memory tests as well as other tests and found no hardware faults.
Output from lsmod:
Module               Size  Used by    Not tainted
emu10k1             69032   1  (autoclean)
ac97_codec          13640   0  (autoclean) [emu10k1]
sound               74228   0  (autoclean) [emu10k1]
soundcore            6404   7  (autoclean) [emu10k1 sound]
parport_pc          19076   1  (autoclean)
lp                   8996   0  (autoclean)
parport             37056   1  (autoclean) [parport_pc lp]
nfsd                80176   8  (autoclean)
iptable_filter       2412   0  (autoclean) (unused)
ip_tables           15096   1  [iptable_filter]
autofs              13268   0  (autoclean) (unused)
nfs                 81336   3  (autoclean)
lockd               58704   1  (autoclean) [nfsd nfs]
sunrpc              81564   1  (autoclean) [nfsd nfs lockd]
3c59x               30704   1
sg                  36524   0  (autoclean)
sr_mod              18136   0  (autoclean)
ide-scsi            12208   0
ide-cd              35708   0
cdrom               33728   0  [sr_mod ide-cd]
st                  31248   0  (unused)
loop                12152   0  (autoclean)
lvm-mod             64000   0
ext3                70784   9
jbd                 51892   9  [ext3]
aic7xxx            141204   0  (unused)
sd_mod              13452   0  (unused)
scsi_mod           107128   6  [sg sr_mod ide-scsi st aic7xxx sd_mod]
Output from free:
             total       used       free     shared    buffers     cached
Mem:        255328     173800      81528          0      61168      54684
-/+ buffers/cache:      57948     197380
Swap:       521632      86376     435256
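The `-/+ buffers/cache` line is derived from the `Mem:` line. `free` subtracts buffers and page cache from "used" and adds them to "free", because the kernel can reclaim that memory. The hypothetical `buffers_cache_line` helper below is a sketch that reproduces only that arithmetic. With the figures above, it shows that programs held about 57 MB of the 255 MB, and the rest of "used" was reclaimable buffers and cache.

```shell
#!/bin/sh
# Sketch of how free(1) derives its "-/+ buffers/cache" line.
# Arguments: the Mem: used, free, buffers and cached columns, in KB.
buffers_cache_line() {
    used=$1 free=$2 buffers=$3 cached=$4
    # Memory held by programs: "used" minus reclaimable buffers and cache.
    app_used=$((used - buffers - cached))
    # Memory available to programs: "free" plus reclaimable buffers and cache.
    app_free=$((free + buffers + cached))
    printf '%s\n' "-/+ buffers/cache: $app_used $app_free"
}
```

`buffers_cache_line 173800 81528 61168 54684` prints `-/+ buffers/cache: 57948 197380`, the same figures as the output above.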
Output from cat /proc/pci:
PCI devices found:
  Bus 0, device 0, function 0:
    Host bridge: Intel Corp. 82815 815 Chipset Host Bridge and Memory Controller Hub (rev 4).
      Prefetchable 32 bit memory at 0xd0000000 [0xd3ffffff].
  Bus 0, device 1, function 0:
    PCI bridge: Intel Corp. 82815 815 Chipset AGP Bridge (rev 4).
      Master Capable. Latency=32. Min Gnt=12.
  Bus 0, device 30, function 0:
    PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 17).
      Master Capable. No bursts. Min Gnt=6.
  Bus 0, device 31, function 0:
    ISA bridge: Intel Corp. 82801BA ISA Bridge (LPC) (rev 17).
  Bus 0, device 31, function 1:
    IDE interface: Intel Corp. 82801BA IDE U100 (rev 17).
      I/O at 0xf000 [0xf00f].
  Bus 0, device 31, function 3:
    SMBus: Intel Corp. 82801BA/BAM SMBus (rev 17).
      I/O at 0x5000 [0x500f].
  Bus 1, device 0, function 0:
    VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
      Master Capable. Latency=32. Min Gnt=16.Max Lat=32.
      Prefetchable 32 bit memory at 0xd4000000 [0xd5ffffff].
      Non-prefetchable 32 bit memory at 0xd6000000 [0xd6003fff].
      Non-prefetchable 32 bit memory at 0xd7000000 [0xd77fffff].
  Bus 2, device 0, function 0:
    SCSI storage controller: Adaptec AHA-7850 (rev 3).
      Master Capable. Latency=32. Min Gnt=4.Max Lat=4.
      I/O at 0x9000 [0x90ff].
      Non-prefetchable 32 bit memory at 0xda001000 [0xda001fff].
  Bus 2, device 1, function 0:
    Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 8).
      Master Capable. Latency=32. Min Gnt=2.Max Lat=20.
      I/O at 0x9400 [0x941f].
  Bus 2, device 1, function 1:
    Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 8).
      Master Capable. Latency=32.
      I/O at 0x9800 [0x9807].
  Bus 2, device 2, function 0:
    Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 120).
      Master Capable. Latency=32. Min Gnt=10.Max Lat=10.
      I/O at 0x9c00 [0x9c7f].
      Non-prefetchable 32 bit memory at 0xda000000 [0xda00007f].
  Bus 2, device 3, function 0:
    SCSI storage controller: Adaptec AHA-2940U/UW/D / AIC-7881U (rev 1).
      Master Capable. Latency=32. Min Gnt=8.Max Lat=8.
      I/O at 0xa000 [0xa0ff].
      Non-prefetchable 32 bit memory at 0xda002000 [0xda002fff].
The machine also seems significantly slower for interactive use in KDE during
rsync operations, and during any disk-intensive operation, compared with the kernel
from Red Hat Linux 8.0... but that's also a different KDE, so who knows?
There are no /var/log/messages entries generated by the problem and no kernel panic
warnings on the screen when it happens... it's as if someone just hit the reset
button during the transfer.
The default runlevel is 5... and I'm usually in KDE with lots of apps open when the problem occurs.
Nothing other than transferring large files (from the troubled machine to another
machine) seems to trigger the resets.
Will provide additional information upon request.
I did a bit more testing to try to eliminate a few more things.
Getting rid of the VMware modules (not listed above) really improved the
performance of the system. That got rid of the slowness problem... but had
nothing to do with the reset/lockup situation.
Booted into runlevel 3 and did a bunch of transfers. It locked up... so I'm
assuming X has nothing to do with the problem.
I changed network cards from a 3Com 3c905C to a 3c590B, and that didn't fix
anything... it still reset within a couple of minutes of booting... again...
only when transferring a big chunk of data via FTP.
I updated to kernel-2.4.20-13.9 and that only seemed to make the problem worse...
NFS transfers over the LAN locked the system up in about 2 minutes flat... twice.
I'm back to booting the previous kernel. But that does indicate that the problem isn't specific to
FTP. Oddly enough, I can transfer hordes of files over the LAN via scp and
it doesn't hurt anything. Please tell me I'm not imagining that!
I haven't done systematic testing... and it really isn't fun to lock up/reset
one's machine over and over... so I don't anticipate doing systematic testing
unless directed to do so.
It seems that any heavy traffic (LAN only, not normal Internet traffic) has the
potential to lock up or reset my machine... but if there isn't any heavy
traffic, it can go for days/weeks without a problem... although lately I've had
a lot of transfers/lockups.
I wish I knew how to gather more useful information about this problem... but
when it resets/locks up, I get no log entries or anything else that
would indicate the problem... I'm stumped. Any suggestions? ...or is there any
further diagnostic information I could provide that would be helpful?
Are you sure that your problem is not hardware-related?
OK. I downloaded the latest ISO of memtest-86, burned a copy, and booted from
it. It has been running for about 3.5 hours now. It has passed the Standard test
7 times, with Cache and ECC turned off, with 0 errors. I just went into the
configuration, turned on ECC and Cache, and turned on the Extended test. I'll
let that run for the remainder of the weekend. If it passes... then what? So
far, it appears that it will pass, but I'm willing to give it an extended period
of time... just in case there are some heat issues or something with my machine.
I have checked all of the fan operations... blown out the computer... checked
all of the connections... and so far, if there is a hardware problem... it is
undetectable other than by the symptoms of my problem.
The machine is rather heavily loaded for a workstation (two SCSI cards, one for a scanner
and one for an external VX1 tape drive), with all four IDE channels in use including a
120GB second hard drive... it serves as a tape backup unit for a number of
servers... and gets a ton of network traffic when engaged in backups (rsync over
ssh)... and it had performed flawlessly until the kernel before last, 2.4.20-9...
although prior to that it did have a few swap storms... but I attributed those to
periodic use of VMware (tainted kernel)... and just to clarify, I
have disabled the VMware modules and have not been using VMware since the
problem got worse with 2.4.20-9 and beyond.
Yeah, I know it could be a hardware problem... I'm not going to ask why a
hardware problem happened, because I know that when a lightbulb goes out, it just goes out.
I would consider reverting to RHL 8 or a pre-2.4.20-9 kernel if someone
thinks that would be helpful for troubleshooting... although every release from
RHL 7.1 up has recently moved to 2.4.20-13.
OK, I admit it... I'm not convinced it is the kernel... but what else, then?
Shall I just order a new motherboard/computer... or have I run across an obscure
bug resulting from a certain combination of hardware and kernel modules?
Oh, I forgot to mention that I switched to a 3C905B (no difference) and now I'm on
a Kingston (no difference). It is *SO* cool how Kudzu detects the hardware
change and migrates the settings so easily... try switching network cards
four times on Windows.
I give up on changing network cards as it is painfully obvious that if it is a
kernel issue, it is above the network card driver.
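If you think your problem is the network driver/NIC, try a stress tool. ttcp is the easiest:
# Create a 10 MB test file on the transmitting machine (server_2)
dd if=/dev/zero of=testfile bs=1048576 count=10
# Run the ttcp receiver over and over on server_1 (-s discards the received data)
testing=yes
while [ "$testing" = yes ]; do
    ./ttcp -r -s
done
# Run the ttcp transmitter over and over on server_2, where $1 is server_1's hostname.
# Without -s, ttcp -t sends its standard input (testfile) instead of a generated pattern.
testing=yes
while [ "$testing" = yes ]; do
    ./ttcp -t "$1" < testfile
done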
I ran the RAM test for the remainder of the weekend... and after 44 hours and 17
minutes it had passed the extended memtest-86 v3.0 test 87 times with 0 errors.
I then attempted to boot the computer and got continued resets during boot.
I removed everything except the network card, floppy drive, video
card, and hard drive (removed all SCSI, sound, and CD-ROM drives)... and still had the
issue. I changed the floppy drive, floppy cable, and IDE cable... and still had the issue.
I gave up and put in a new motherboard. So far so good.
cpuburn is a good system stress test:
"The goal has been to maximize heat
production from the CPU, putting stress on the CPU itself, cooling
system, motherboard (especially voltage regulators) and power supply"
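Below is a minimal sketch for running such a test for a bounded time. It assumes cpuburn is installed and that `burnP6`, the variant intended for P6-family CPUs such as this Pentium III, is on the `PATH`. `burn_for` is a hypothetical helper, not part of cpuburn. If the machine resets or locks up during the run with no network traffic at all, that would point toward hardware (cooling, voltage regulators, power supply) rather than the network stack.

```shell
#!/bin/sh
# Hypothetical helper, not part of cpuburn: run a stress program such as
# burnP6 in the background for a fixed number of seconds, then stop it.
burn_for() {
    seconds=$1
    prog=$2
    "$prog" > /dev/null 2>&1 &
    burn_pid=$!
    sleep "$seconds"
    # Stop the stress program; ignore the error if it already exited.
    kill "$burn_pid" 2>/dev/null
    # Reap it so no zombie process is left behind.
    wait "$burn_pid" 2>/dev/null
    return 0
}
```

For example, `burn_for 3600 burnP6` stresses the CPU for an hour. If the machine is still up and responsive afterwards, the CPU, cooling, and power path survived that load.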