90571 – Heavy network traffic resets/reboots machine

Summary: Heavy network traffic resets/reboots machine

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Garzik
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-05-09 18:42 UTC by Scott Dowdle
Modified:	2013-07-03 02:11 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-05-19 20:13:26 UTC
Embargoed:

Description Scott Dowdle 2003-05-09 18:42:07 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131

Description of problem:
I am running kernel 2.4.20-9 on an i686 [GenuineIntel Pentium III (Coppermine)
1004 MHz] and have had consistent resets during large FTP and HTTP data transfers.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. FTP a directory full of large files (20 - 300 MB each) [also happens during
some large, single-file HTTP transfers]
2. Machine resets/reboots somewhere in the middle of the transfer

Additional info:

It does not consistently reboot at a particular file or after a particular
quantity of data.  I've had about 10 resets over a period of a week or so, and on
a few occasions it has simply locked hard without resetting (won't respond to
pings or accept any input other than a manual reset/poweroff).

It seems tied to large data transfers because rsync (where there is a lot of
traffic but no really large files) does not trigger the problem. I haven't tried
scp'ing the large files but I don't see why it would matter.
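
For anyone trying to reproduce this without access to the original data, a set of large dummy files can stand in for the real ones. The following is only a sketch: the function name make_test_files, the file naming scheme, and the target directory are made up for illustration, and zero-filled files are assumed to be an acceptable substitute for real data.

```shell
#!/bin/sh
# make_test_files DIR SIZE_MB...
# Hypothetical helper: creates one zero-filled file per requested size
# (in MB) inside DIR, e.g. to build a 20 - 300 MB test set for FTP.
make_test_files() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    for mb in "$@"; do
        # bs=1048576 is one megabyte, so count is the size in MB
        dd if=/dev/zero of="$dir/test_${mb}MB.bin" bs=1048576 count="$mb" \
            2>/dev/null || return 1
    done
}

# Example (needs about 620 MB of free space):
#   make_test_files /var/tmp/ftp-test 20 100 200 300
```

The resulting directory can then be transferred with whichever client triggered the problem.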

To the best of my knowledge, this is not being caused by faulty hardware.  I
have done the memory tests as well as other tests and can find no hardware faults.

Output from lsmod:

Module                  Size  Used by    Not tainted
emu10k1                69032   1  (autoclean)
ac97_codec             13640   0  (autoclean) [emu10k1]
sound                  74228   0  (autoclean) [emu10k1]
soundcore               6404   7  (autoclean) [emu10k1 sound]
parport_pc             19076   1  (autoclean)
lp                      8996   0  (autoclean)
parport                37056   1  (autoclean) [parport_pc lp]
nfsd                   80176   8  (autoclean)
iptable_filter          2412   0  (autoclean) (unused)
ip_tables              15096   1  [iptable_filter]
autofs                 13268   0  (autoclean) (unused)
nfs                    81336   3  (autoclean)
lockd                  58704   1  (autoclean) [nfsd nfs]
sunrpc                 81564   1  (autoclean) [nfsd nfs lockd]
3c59x                  30704   1
sg                     36524   0  (autoclean)
sr_mod                 18136   0  (autoclean)
ide-scsi               12208   0
ide-cd                 35708   0
cdrom                  33728   0  [sr_mod ide-cd]
st                     31248   0  (unused)
loop                   12152   0  (autoclean)
lvm-mod                64000   0
ext3                   70784   9
jbd                    51892   9  [ext3]
aic7xxx               141204   0  (unused)
sd_mod                 13452   0  (unused)
scsi_mod              107128   6  [sg sr_mod ide-scsi st aic7xxx sd_mod]

Output from free:
             total       used       free     shared    buffers     cached
Mem:        255328     173800      81528          0      61168      54684
-/+ buffers/cache:      57948     197380
Swap:       521632      86376     435256

Output from cat /proc/pci:

PCI devices found:
  Bus  0, device   0, function  0:
    Host bridge: Intel Corp. 82815 815 Chipset Host Bridge and Memory Controller Hub (rev 4).
      Prefetchable 32 bit memory at 0xd0000000 [0xd3ffffff].
  Bus  0, device   1, function  0:
    PCI bridge: Intel Corp. 82815 815 Chipset AGP Bridge (rev 4).
      Master Capable.  Latency=32.  Min Gnt=12.
  Bus  0, device  30, function  0:
    PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 17).
      Master Capable.  No bursts.  Min Gnt=6.
  Bus  0, device  31, function  0:
    ISA bridge: Intel Corp. 82801BA ISA Bridge (LPC) (rev 17).
  Bus  0, device  31, function  1:
    IDE interface: Intel Corp. 82801BA IDE U100 (rev 17).
      I/O at 0xf000 [0xf00f].
  Bus  0, device  31, function  3:
    SMBus: Intel Corp. 82801BA/BAM SMBus (rev 17).
      IRQ 9.
      I/O at 0x5000 [0x500f].
  Bus  1, device   0, function  0:
    VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=16.Max Lat=32.
      Prefetchable 32 bit memory at 0xd4000000 [0xd5ffffff].
      Non-prefetchable 32 bit memory at 0xd6000000 [0xd6003fff].
      Non-prefetchable 32 bit memory at 0xd7000000 [0xd77fffff].
  Bus  2, device   0, function  0:
    SCSI storage controller: Adaptec AHA-7850 (rev 3).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=4.Max Lat=4.
      I/O at 0x9000 [0x90ff].
      Non-prefetchable 32 bit memory at 0xda001000 [0xda001fff].
  Bus  2, device   1, function  0:
    Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 8).
      IRQ 9.
      Master Capable.  Latency=32.  Min Gnt=2.Max Lat=20.
      I/O at 0x9400 [0x941f].
  Bus  2, device   1, function  1:
    Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 8).
      Master Capable.  Latency=32.
      I/O at 0x9800 [0x9807].
  Bus  2, device   2, function  0:
    Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 120).
      IRQ 5.
      Master Capable.  Latency=32.  Min Gnt=10.Max Lat=10.
      I/O at 0x9c00 [0x9c7f].
      Non-prefetchable 32 bit memory at 0xda000000 [0xda00007f].
  Bus  2, device   3, function  0:
    SCSI storage controller: Adaptec AHA-2940U/UW/D / AIC-7881U (rev 1).
      IRQ 11.
      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=8.
      I/O at 0xa000 [0xa0ff].
      Non-prefetchable 32 bit memory at 0xda002000 [0xda002fff].

The machine also seems significantly slower for interactive use in KDE during
rsync operations, as well as during any disk-intensive operation, compared to the
kernel from Red Hat Linux 8.0... but that's also a different KDE so who knows?

There are no /var/log/messages entries generated by the problem nor kernel panic
warnings on the screen when it happens... it's like someone just hit the reset
button during the transfer.

The default runlevel is 5... and I am usually in KDE with lots of apps open when
the problem happens.

No other operations seem to trigger the reset behavior other than the transfer
(from the trouble machine to another machine) of large files.

Will provide additional information upon request.

Comment 1 Scott Dowdle 2003-05-09 21:41:48 UTC

Did a bit more testing to attempt to eliminate a few more things.

Getting rid of the vmware modules (not listed above) really improved the
performance of the system.  That got rid of the problem with slowness... but had
nothing to do with the reset/lockup situation.

Booted into runlevel 3 and did a bunch of transfers.  It locked up... so I'm
assuming X has nothing to do with the problem.

Changed network cards from a 3Com 3c905C to a 3c905B and that didn't fix
anything... it still reset just a couple of minutes after booting... again...
only when transferring a big chunk of data via FTP.

Comment 2 Scott Dowdle 2003-05-16 22:14:13 UTC

Updated to kernel-2.4.20-13.9 and that only seemed to make the problem worse... as
NFS transfers over the LAN locked the system up in about 2 minutes flat... twice.
Am booting the previous kernel.  But that does indicate that it isn't specific to
FTP.  Oddly enough, I can seem to transfer hordes of files over the LAN via scp and
it doesn't hurt anything.  Please tell me I'm not imagining that!

I haven't done systematic testing... and it really isn't fun to lock up/reset
one's machine over and over... so I don't anticipate doing systematic testing
unless directed to do so.

Seems as if any heavy traffic (LAN only and not normal internet traffic) has the
potential to lock up or reset my machine... but if there isn't any heavy
traffic, it can go for days/weeks without a problem... although lately I've had
a lot of transfers/lockups.

I wish I knew how to better gather useful information about this problem... but
since I have no log entries or anything else that would indicate the problem when
it resets/locks... I'm stumped.  Any suggestions?  ...or is there any
further diagnostic information that I could provide that would be helpful?
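
One low-tech option, not something suggested in this report and offered only as a sketch, is a heartbeat logger that appends a timestamp and the load average to a file and forces it to disk, so that the last surviving line shows roughly when the machine went down. The function name heartbeat, its arguments, and the log path in the example are all hypothetical.

```shell
#!/bin/sh
# heartbeat LOGFILE INTERVAL_SECONDS COUNT
# Hypothetical helper: appends COUNT timestamped lines to LOGFILE,
# one every INTERVAL_SECONDS, and calls sync after each line so the
# entry has a chance to reach the disk before a sudden reset.
heartbeat() {
    log=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        load=unknown
        # /proc/loadavg is Linux-specific; fall back to "unknown" elsewhere
        [ -r /proc/loadavg ] && read -r load < /proc/loadavg
        echo "$(date '+%Y-%m-%d %H:%M:%S') load=$load" >> "$log"
        sync
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example: one line every 5 seconds for about an hour, in the background
#   heartbeat /var/tmp/heartbeat.log 5 720 &
```

After a reset, the final line of the log gives an upper bound on how long the machine had been healthy. Frequent sync calls add disk load of their own, so the interval is a trade-off.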

Comment 3 acount closed by user 2003-05-17 02:35:23 UTC

Are you sure that your problem is not hardware?
Try: http://www.memtest86.com

Comment 4 Scott Dowdle 2003-05-17 18:42:48 UTC

Ok.  I downloaded the latest iso of the memtest-86, burned a copy, booted with
it.  It has been running for about 3.5 hours now.  It passed 7 times on the
Standard test with Cache and ECC turned off with 0 errors.  I just went into the
configuration and turned on ECC and Cache and turned on the Extended test.  I'll
let that run for the remainder of the weekend.  If it passes... then what?  So
far, it appears that it will pass but I'm willing to give it an extended period
of time... just in case there are some heat issues or something with my machine.

I have checked all of the fan operations... blown out the computer... checked
all of the connections... and so far, if there is a hardware problem... it is
undetectable other than by the symptoms of my problem.

The machine is rather stocky for a workstation (two SCSI cards - one for a scanner
and one for an external VX1 tape drive), with all four IDE channels in use including
a 120GB second hard drive... it serves as a tape backup unit for a number of
servers... and gets a ton of network traffic when engaged in backups (rsync over
ssh)... and had performed flawlessly until the kernel before last, 2.4.20-9...
although prior to that it did have a few swap storms... but I attributed that to
multiple periodic uses of VMware (tainted kernel)... and just to clarify... I
have disabled the VMware modules and have not been using VMware since the
problem got worse... with 2.4.20-9 and beyond.

Yeah, I know it could be a hardware problem... I'm not going to ask why a
hardware problem happened because I know that when a lightbulb goes out, it just
goes out.

I would consider reverting back to RHL 8 or a pre-2.4.20-9 kernel if someone
thinks that would be helpful for troubleshooting... although everything from
RHL7.1 and up has gone to the 2.4.20-13 recently.

Ok, I admit it... I'm not convinced it is the kernel... but what else then?
Shall I just order a new motherboard/computer... or have I run across an obscure
bug resulting from a certain combination of hardware and kernel modules?

Comment 5 Scott Dowdle 2003-05-17 18:54:11 UTC

Oh, forgot to mention that I switched to a 3C905b (no difference) and now I'm on
a Kingston (no difference).  It is *SO* cool how Kudzu detects the hardware
change and migrates the settings so easily... try switching network cards
four times on Windows.

I give up on changing network cards as it is painfully obvious that if it is a
kernel issue, it is above the network card driver.

Comment 6 acount closed by user 2003-05-19 00:39:24 UTC

If you think that your problem is the net driver/NIC, try some stress tools:
ttcp http://it-div-cs.web.cern.ch/it-div-cs/public/projects/atm/ttcp.html
netpipe http://www.scl.ameslab.gov/netpipe/
nttcp http://www.leo.org/~elmar/nttcp/
netperf http://www.netperf.org/
gensink http://jes.home.cern.ch/jes/gensink/

ttcp is easiest:
# dd if=/dev/zero of=testfile bs=1048576 count=10

--
#!/bin/sh
#This runs the ttcp receiver over and over, at server_1

testing=yes

while [ "$testing" = yes ]
        do

        ./ttcp -s -r

        done
--

--
#!/bin/sh
#This runs the ttcp transmitter over and over, at server_2
#$1 is the host name or address of the receiver (server_1)

testing=yes

while [ "$testing" = yes ]
        do

        ./ttcp -s -t "$1" < testfile

        done
--
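
Because a reset leaves nothing in the system logs, it may help to record each test run on disk before it starts. The wrapper below is a sketch under that assumption and is not part of ttcp: run_logged, its arguments, and the log path in the example are hypothetical names. Unlike the loops above, it stops after a fixed number of runs or on the first failing run.

```shell
#!/bin/sh
# run_logged LOGFILE MAX_RUNS COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND up to MAX_RUNS times, writing a
# timestamped line to LOGFILE (followed by sync) before every run.
# Returns 1 as soon as COMMAND exits with a non-zero status.
run_logged() {
    log=$1
    max=$2
    shift 2
    n=0
    while [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        echo "$(date '+%Y-%m-%d %H:%M:%S') run $n starting: $*" >> "$log"
        sync
        "$@" || { echo "run $n failed" >> "$log"; return 1; }
    done
}

# Example, on the transmitting side:
#   run_logged /var/tmp/ttcp.log 1000 ./ttcp -s -t server_1
```

If the machine resets mid-run, the last "starting" line in the log shows how many runs completed beforehand.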

Comment 7 Scott Dowdle 2003-05-19 20:13:26 UTC

Ran RAM test for the remainder of the weekend... and after 44 hours and 17
minutes it has passed the extended memtest-86 v3.0 87 times with 0 errors.

Attempted to boot the computer after that and got continued resets during booting.
Removed all PCI cards... and only left in the network card, floppy drive, video
card, and hard drive (removed all SCSI, sound, and CD-ROM drives)... still had
the issue.  Changed floppy drive, floppy cable and IDE cable... still had the issue.

Gave up and put in a new motherboard.  So far so good.

Comment 8 acount closed by user 2003-05-20 01:52:29 UTC

cpuburn is a good system test
"The goal has been to maximize heat
production from the CPU, putting stress on the CPU itself, cooling
system, motherboard (especially voltage regulators) and power supply"
http://users.ev1.net/~redelm
