From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131

Description of problem:
Running 2.4.20-9 on an i686 [GenuineIntel Pentium III (Coppermine) 1004 MHZ] and have had consistent resets during large FTP and HTTP data transfers.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. ftp a directory full of large files (20 - 300MB in size) [also happens during some large, single-file http transfers]
2. machine resets/reboots somewhere in the middle of the transfer
3.

Additional info:
It does not consistently reboot at a particular file or after a particular quantity of data. I've had about 10 resets over a period of a week or so, and on a few occasions it has simply locked hard without resetting (won't respond to pings nor accept any input other than a manual reset/poweroff). It seems tied to large data transfers because rsync (where there is a lot of traffic but no really large files) does not trigger the problem. I haven't tried scp'ing the large files but I don't see why it would matter.

To the best of my knowledge, this is not being caused by faulty hardware. I have done the memory tests as well as other tests and can find no hardware faults.
Output from lsmod:

Module                  Size  Used by    Not tainted
emu10k1                69032   1  (autoclean)
ac97_codec             13640   0  (autoclean) [emu10k1]
sound                  74228   0  (autoclean) [emu10k1]
soundcore               6404   7  (autoclean) [emu10k1 sound]
parport_pc             19076   1  (autoclean)
lp                      8996   0  (autoclean)
parport                37056   1  (autoclean) [parport_pc lp]
nfsd                   80176   8  (autoclean)
iptable_filter          2412   0  (autoclean) (unused)
ip_tables              15096   1  [iptable_filter]
autofs                 13268   0  (autoclean) (unused)
nfs                    81336   3  (autoclean)
lockd                  58704   1  (autoclean) [nfsd nfs]
sunrpc                 81564   1  (autoclean) [nfsd nfs lockd]
3c59x                  30704   1
sg                     36524   0  (autoclean)
sr_mod                 18136   0  (autoclean)
ide-scsi               12208   0
ide-cd                 35708   0
cdrom                  33728   0  [sr_mod ide-cd]
st                     31248   0  (unused)
loop                   12152   0  (autoclean)
lvm-mod                64000   0
ext3                   70784   9
jbd                    51892   9  [ext3]
aic7xxx               141204   0  (unused)
sd_mod                 13452   0  (unused)
scsi_mod              107128   6  [sg sr_mod ide-scsi st aic7xxx sd_mod]

Output from free:

             total       used       free     shared    buffers     cached
Mem:        255328     173800      81528          0      61168      54684
-/+ buffers/cache:      57948     197380
Swap:       521632      86376     435256

Output from cat /proc/pci:

PCI devices found:
  Bus  0, device   0, function  0:
    Host bridge: Intel Corp. 82815 815 Chipset Host Bridge and Memory Controller Hub (rev 4).
      Prefetchable 32 bit memory at 0xd0000000 [0xd3ffffff].
  Bus  0, device   1, function  0:
    PCI bridge: Intel Corp. 82815 815 Chipset AGP Bridge (rev 4).
      Master Capable.  Latency=32.  Min Gnt=12.
  Bus  0, device  30, function  0:
    PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 17).
      Master Capable.  No bursts.  Min Gnt=6.
  Bus  0, device  31, function  0:
    ISA bridge: Intel Corp. 82801BA ISA Bridge (LPC) (rev 17).
  Bus  0, device  31, function  1:
    IDE interface: Intel Corp. 82801BA IDE U100 (rev 17).
      I/O at 0xf000 [0xf00f].
  Bus  0, device  31, function  3:
    SMBus: Intel Corp. 82801BA/BAM SMBus (rev 17).
      IRQ 9.
      I/O at 0x5000 [0x500f].
  Bus  1, device   0, function  0:
    VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=16.Max Lat=32.
      Prefetchable 32 bit memory at 0xd4000000 [0xd5ffffff].
      Non-prefetchable 32 bit memory at 0xd6000000 [0xd6003fff].
      Non-prefetchable 32 bit memory at 0xd7000000 [0xd77fffff].
  Bus  2, device   0, function  0:
    SCSI storage controller: Adaptec AHA-7850 (rev 3).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=4.Max Lat=4.
      I/O at 0x9000 [0x90ff].
      Non-prefetchable 32 bit memory at 0xda001000 [0xda001fff].
  Bus  2, device   1, function  0:
    Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 8).
      IRQ 9.
      Master Capable.  Latency=32.  Min Gnt=2.Max Lat=20.
      I/O at 0x9400 [0x941f].
  Bus  2, device   1, function  1:
    Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 8).
      Master Capable.  Latency=32.
      I/O at 0x9800 [0x9807].
  Bus  2, device   2, function  0:
    Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 120).
      IRQ 5.
      Master Capable.  Latency=32.  Min Gnt=10.Max Lat=10.
      I/O at 0x9c00 [0x9c7f].
      Non-prefetchable 32 bit memory at 0xda000000 [0xda00007f].
  Bus  2, device   3, function  0:
    SCSI storage controller: Adaptec AHA-2940U/UW/D / AIC-7881U (rev 1).
      IRQ 11.
      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=8.
      I/O at 0xa000 [0xa0ff].
      Non-prefetchable 32 bit memory at 0xda002000 [0xda002fff].

Machine also seems significantly slower for interactive usage in KDE during rsync operations as well as any disk intensive operation when compared to kernel from Red Hat Linux 8.0... but that's also a different KDE so who knows?

There are no /var/log/message entries generated by the problem nor kernel panic warnings on the screen when it happens... it's like someone just hit the reset button during the transfer.

Default runlevel is 5... and usually in KDE with lots of apps open when problem happens. No other operations seem to trigger the reset behavior other than the transfer (from the trouble machine to another machine) of large files.

Will provide additional information upon request.
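For anyone trying to reproduce this, the following is a minimal sketch of how a directory of large test files (roughly matching the 20 - 300MB range from the steps above) might be generated before starting the FTP transfer. The function name make_testfiles, the file naming, and the even spread of sizes are assumptions for illustration; they are not part of the original report.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
# make_testfiles DIR COUNT MIN_MB MAX_MB
# Create COUNT zero-filled files in DIR, with sizes spread evenly
# from MIN_MB to MAX_MB megabytes.
make_testfiles() {
    dir=$1
    count=$2
    min_mb=$3
    max_mb=$4
    mkdir -p "$dir" || return 1
    i=0
    while [ "$i" -lt "$count" ]
    do
        if [ "$count" -gt 1 ]
        then
            # Linear interpolation between the smallest and largest size
            size=$((min_mb + (max_mb - min_mb) * i / (count - 1)))
        else
            size=$min_mb
        fi
        dd if=/dev/zero of="$dir/test_$i.bin" bs=1048576 count="$size" 2>/dev/null || return 1
        i=$((i + 1))
    done
}
```

For example, `make_testfiles /tmp/ftptest 8 20 300` would create eight files between 20MB and 300MB, which can then be fetched by FTP from another machine.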
Did a bit more testing to attempt to eliminate a few more things. Getting rid of the vmware modules (not listed above) really improved the performance of the system. That got rid of the problem with slowness... but had nothing to do with the reset/lockup situation. Booted into runlevel 3 and did a bunch of transfers. It locked up... so I'm assuming X has nothing to do with the problem. Changed network cards from a 3Com 3c905C to a 3c905B and that didn't fix anything... it still reset just a couple of minutes after booting... again... only when transferring a big chunk of data via FTP.
Updated to kernel-2.4.20-13.9 and that only seemed to make the problem worse... as NFS transfers over LAN locked the system up in about 2 minutes flat... twice. Am booting previous kernel. But that does indicate that it isn't specific to FTP. Oddly enough, I can seem to transfer hordes of files via LAN over scp and it doesn't hurt anything. Please tell me I'm not imagining that! I haven't done systematic testing... and it really isn't fun to lock up/reset one's machine over and over... so I don't anticipate doing systematic testing unless directed to do so. Seems as if any heavy traffic (LAN only and not normal internet traffic) has the potential to lock up or reset my machine... but if there isn't any heavy traffic, it can go for days/weeks without a problem... although lately I've had a lot of transfers/lockups. I wish I knew how to better gather useful information about this problem... but since when it resets/locks... I have no log entries or anything else... that would indicate the problem... I'm stumped. Any suggestions? ...or is there any further diagnostic information that I could provide that would be helpful???
Are you sure that your problem is not in the hardware? Try: http://www.memtest86.com
Ok. I downloaded the latest iso of the memtest-86, burned a copy, booted with it. It has been running for about 3.5 hours now. It passed 7 times on the Standard test with Cache and ECC turned off with 0 errors. I just went into the configuration and turned on ECC and Cache and turned on the Extended test. I'll let that run for the remainder of the weekend. If it passes... then what? So far, it appears that it will pass but I'm willing to give it an extended period of time... just in case there are some heat issues or something with my machine. I have checked all of the fan operations... blown out the computer... checked all of the connections... and so far, if there is a hardware problem... it is undetectable other than by the symptoms of my problem.

The machine is rather stocky for a workstation (two SCSI cards - one for scanner and one for external VX1 tape drive), all four IDE channels in use including a 120GB second hard drive... it serves as a tape backup unit for a number of servers... and gets a ton of network traffic when engaged in backups (rsync over ssh)... and has performed flawlessly until the kernel before last, 2.4.20-9... although prior to that it did have a few swap storms... but I attributed that to multiple periodic uses of VMware (tainted kernel)... and just to clarify... I have disabled the VMware modules and have not been using VMware since the problem got worse... with 2.4.20-9 and beyond.

Yeah, I know it could be a hardware problem... I'm not going to ask why a hardware problem happened because I know that when a lightbulb goes out, it just goes out. I would consider reverting back to RHL 8 or a pre-2.4.20-9 kernel if someone thinks that would be helpful for troubleshooting... although everything from RHL7.1 and up has gone to the 2.4.20-13 recently. Ok, I admit it... I'm not convinced it is the kernel... but what else then? Shall I just order a new motherboard/computer... or have I run across an obscure bug resulting from a certain combination of hardware and kernel modules?
Oh, forgot to mention that I switched to a 3C905b (no difference) and now I'm on a Kingston (no difference). It is *SO* cool how Kudzu detects the hardware change and migrates the settings so easily... try switching network cards four times on Windows. I give up on changing network cards as it is painfully obvious that if it is a kernel issue, it is above the network card driver.
If you think that your problem is the net driver/NIC, try a stress tool:

ttcp     http://it-div-cs.web.cern.ch/it-div-cs/public/projects/atm/ttcp.html
netpipe  http://www.scl.ameslab.gov/netpipe/
nttcp    http://www.leo.org/~elmar/nttcp/
netperf  http://www.netperf.org/
gensink  http://jes.home.cern.ch/jes/gensink/

ttcp is easiest:

# dd if=/dev/zero of=testfile bs=1048576 count=10

--
#!/bin/sh
# This runs the ttcp receiver over and over, at server_1
testing=yes
while [ "$testing" = yes ]
do
    ./ttcp -s -r
done
--

--
#!/bin/sh
# This runs the ttcp transmitter over and over, at server_2
# $1 is the hostname or address of the receiver (server_1)
testing=yes
while [ "$testing" = yes ]
do
    ./ttcp -s -t "$1" < testfile
done
--
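Since the machine leaves no log entries when it resets, it may help to record each pass of the stress loop as it runs. The following is a rough sketch, not a tested recipe for this machine: it wraps any transfer command in a bounded loop, writes a timestamped line before and after each pass, and calls sync so that the last line is more likely to survive a hard reset. The names LOGFILE, log_pass and run_passes are made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of ttcp or of the scripts above.
# LOGFILE defaults to ./stress.log unless already set in the environment.
LOGFILE="${LOGFILE:-./stress.log}"

# log_pass MESSAGE: append a timestamped line and flush it to disk.
log_pass() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LOGFILE"
    sync
}

# run_passes COUNT COMMAND [ARGS...]: run COMMAND up to COUNT times,
# logging the start and result of each pass; stop at the first failure.
run_passes() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]
    do
        log_pass "pass $i start"
        if "$@"
        then
            log_pass "pass $i ok"
        else
            log_pass "pass $i failed (exit $?)"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

On the transmitting side this could be used as `run_passes 1000 ./ttcp -s -t server_1`. After a reset, a final "start" line with no matching "ok" line in the log shows which pass was in progress and when it began.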
Ran RAM test for the remainder of the weekend... and after 44 hours and 17 minutes it has passed the extended memtest-86 v3.0 87 times with 0 errors. Attempted to boot the computer after that and got continued resets during booting. Removed all PCI cards... and only left in the network card, floppy drive, video card, and hard drive (removed all SCSI, sound, and CD-ROM drives)... still had the issue. Changed floppy drive, floppy cable and IDE cable... still had the issue. Gave up and put in a new motherboard. So far so good.
cpuburn is a good system test:

"The goal has been to maximize heat production from the CPU, putting stress on the CPU itself, cooling system, motherboard (especially voltage regulators) and power supply"

http://users.ev1.net/~redelm