Bug 527110

Summary:	PCI network cards failing / flooding under load on older hardware
Product:	Red Hat Enterprise Linux 5	Reporter:	Steve Morgan <captainmrgn>
Component:	kernel	Assignee:	Andy Gospodarek <agospoda>
Status:	CLOSED NOTABUG	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	low
Version:	5.4	CC:	jesse.brandeburg, peterm
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-08-19 02:33:23 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Steve Morgan 2009-10-04 16:19:01 UTC

Description of problem:

On older Pentium 4 hardware, NICs of three different brands lock up and stop working completely under heavy load. Normal network browsing works fine, but as soon as I run rsync or stress test with ttcp, the interface seems to crash or become so flooded that it's unable to operate. This problem only happens on older hardware with 32-bit PCI slots. I've tried every pci= option, disabled plug and play, and tried every e1000 option. Nothing works.

Another strange thing is that this also happens with onboard NICs, as seen on the following motherboard:

Intel® Desktop Board D925XCV

NIC:
  
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 17)


Add-in PCI cards that fail:

06:02.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04)
06:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

The operating system continues to function fine, as do the other network cards in the system. Removing the ethernet module and loading it again does not solve the problem. A reboot is required to fix the problem.


Steps to Reproduce:
1. Install RHEL5 up-to-date and start running ttcp or rsync. 
2. Install Fedora 11 up-to-date and start running ttcp or rsync.
  
Actual results:

Network card dies/floods/becomes unresponsive.

Expected results:

Network card should sustain traffic and operate normally under load.

Comment 1 Andy Gospodarek 2009-10-14 14:58:41 UTC

For clarification, does this happen on only the Intel® Desktop Board D925XCV or have you seen this happen on other systems as well?

If this has happened on any other systems that happened to be made by a system vendor please let me know what those are, so I can try and find one and reproduce the problem.

It's fine if this only reproduces on systems that you have built with this motherboard, but the more information I have the more helpful I can be.

Comment 2 Steve Morgan 2009-10-14 17:57:48 UTC

I have reproduced the problem on the following motherboard as well.

 Product Name: P4M800CE-8237

Let me know if you need any more information.

Comment 3 Andy Gospodarek 2009-10-15 20:48:33 UTC

Steve, can you send me a sosreport from one of these systems?

It's odd that the sky2, via-rhine, e1000, and r8169-based devices all have problems and different chipsets seem to be locking up with your testing.

I've seen problems with sky2 from time to time, but I thought they were resolved with 5.4.

There are also some known problems with 82545 hardware, so if you are seeing tx timeouts there I'm not surprised.  Many have stopped using this hardware because of those problems.

The via-rhine and r8169 problems are new to me though.  

This is a shot in the dark, but I notice your lspci for your sky2-based system has 'ASF' in the description:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 17)

Can you check your BIOS and try to disable ASF or IPMI on that system, if available? I know rsync and ttcp can both use UDP, and I'm curious whether the motherboard's IPMI resources chewing up UDP port 623 is causing any sort of disruption. I know it's a long shot, but I'm curious.

I'd also like to know if this happens consistently enough that you can capture the traffic on the network interface when the failure happens.

Comment 4 Steve Morgan 2009-10-20 02:12:40 UTC

This looks to have been a waste of time. Although I have not tested cards other than the Intel Corporation 82545GM Gigabit Ethernet Controller, the problem has been resolved by putting a fan directly on the network card.

The system case is in a very cool room, and the CPU and hard drive temperatures are below normal. The problem is that the card hangs when there is no airflow directly over it. After thinking it over, I realized the problem resembled a heat issue. After touching the card and feeling that some of the chips were very warm, I put a fan on it. Taking the fan away causes the chips to get very hot immediately and produces the following errors.

NETDEV WATCHDOG: eth1: transmit timed out
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <af>
  TDT                  <af>
  next_to_use          <af>
  next_to_clean        <a3>
buffer_info[next_to_clean]
  time_stamp           <ffffaad8>
  next_to_watch        <a3>
  jiffies              <ffffb00c>
  next_to_watch.status <0>

There has to be a load on the interface for this to happen. If there is not enough network traffic the interface seems to operate okay. 

When the fan is returned the errors stop and network traffic resumes. I wonder if many of the famous Tx Unit Hang messages can be attributed to this.
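
The register dump above can be read mechanically. What follows is a minimal sketch in Python, not the driver's own logic: it parses the `name <hex>` pairs, then derives how long the oldest uncleaned buffer has been waiting and how many descriptors are still queued for cleanup. The helper names, the 32-bit jiffies counter, the `ring_size` of 256, and the `hz` of 1000 are assumptions for illustration and should be checked against the actual system.

```python
import re

# Dump copied from the report above.
DUMP = """\
e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <af>
  TDT                  <af>
  next_to_use          <af>
  next_to_clean        <a3>
buffer_info[next_to_clean]
  time_stamp           <ffffaad8>
  next_to_watch        <a3>
  jiffies              <ffffb00c>
  next_to_watch.status <0>
"""

# Matches lines of the form "  name   <hexvalue>".
FIELD = re.compile(r"^[ \t]*(.+?)[ \t]+<([0-9a-fA-F]+)>[ \t]*$", re.MULTILINE)


def parse_hang_dump(text):
    """Return {field name: int} for every 'name <hex>' line in the dump."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def summarize(fields, ring_size=256, hz=1000, jiffies_bits=32):
    """Derive a few readable numbers from the parsed fields.

    ring_size, hz and jiffies_bits are assumptions, not values taken
    from the report.
    """
    # Modular subtraction so a wrapped jiffies counter still gives the age.
    age = (fields["jiffies"] - fields["time_stamp"]) % (1 << jiffies_bits)
    # Descriptors handed to the NIC that the driver has not cleaned yet.
    pending = (fields["next_to_use"] - fields["next_to_clean"]) % ring_size
    return {
        "age_jiffies": age,
        "age_seconds": age / hz,
        "pending_descriptors": pending,
        "head_equals_tail": fields["TDH"] == fields["TDT"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_hang_dump(DUMP)).items():
        print(f"{key}: {value}")
```

For the dump above this reports an age of 1332 jiffies (about 1.3 seconds if HZ is 1000), 12 descriptors awaiting cleanup, and TDH equal to TDT.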

Comment 5 Steve Morgan 2009-10-20 18:25:00 UTC

I may have spoken too soon. This morning I found the card hanging again, but without the Tx Unit Hang messages. I will get the sosreport asap.

This board does not have IPMI. I'm positive that I'm only using TCP because of my firewall rules. I captured the traffic with tcpdump, and it looks normal other than pausing for a few seconds while the network is hanging, resuming, and then pausing again.
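
Pauses like these can be located in a capture by looking for large gaps between consecutive packet timestamps. A minimal sketch in Python, assuming the capture was printed with `tcpdump -tt` so that each line starts with an epoch timestamp; the function name, the 2-second threshold, and the sample lines are made up for illustration and are not taken from the actual capture.

```python
def find_pauses(lines, threshold=2.0):
    """Return (previous_ts, next_ts, gap) for packets more than `threshold` seconds apart."""
    pauses = []
    previous = None
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        try:
            stamp = float(parts[0])
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > threshold:
            pauses.append((previous, stamp, stamp - previous))
        previous = stamp
    return pauses


# Hypothetical sample lines in `tcpdump -tt` style.
SAMPLE = [
    "1256063100.000000 IP 10.0.0.1.873 > 10.0.0.2.40000: tcp 1448",
    "1256063100.500000 IP 10.0.0.1.873 > 10.0.0.2.40000: tcp 1448",
    "1256063104.500000 IP 10.0.0.1.873 > 10.0.0.2.40000: tcp 1448",
]

if __name__ == "__main__":
    for start, end, gap in find_pauses(SAMPLE):
        print(f"pause of {gap:.1f}s between {start:.6f} and {end:.6f}")
```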

Comment 6 Andy Gospodarek 2009-10-21 18:54:01 UTC

Steve, it would be great if you could post the messages from the most recent Tx Unit Hang as well as the sosreport tarball.

I don't have a cheat sheet for decoding these messages -- one would think I would have written this down by now :-/ -- but I should really develop one.  It would go a long way toward helping others with these problems.  Sometimes 'false hangs' are detected, and I think this might be one of those.

I'm sure Jesse can enlighten us.

Comment 7 Jesse Brandeburg 2009-10-23 00:21:41 UTC

Interesting about the heat issue; that case must really be poorly designed for airflow around the slots.  It was a dead giveaway that more than one kind of card was failing.  You could always glue a heatsink on top of the MAC chip.

I'd be glad to look over some of the dmesg output for the Intel card.
If you were having heat problems, that card could have been damaged.

Do you happen to have >= 4GB of memory?

If you change the driver to only advertise gigabit link it might come back faster during reset (reducing your downtime) - if you're running gig.

ethtool -s ethX advertise 0x20
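
The value passed to `advertise` is a bitmask of link modes. A small sketch in Python that decodes such a mask, using the low-order bit values listed in the ethtool man page; the function name is illustrative, and the table covers only the 10/100/1000baseT modes.

```python
# Link-mode bits for the 10/100/1000baseT modes, as listed in ethtool(8).
ADVERTISE_BITS = {
    0x001: "10baseT Half",
    0x002: "10baseT Full",
    0x004: "100baseT Half",
    0x008: "100baseT Full",
    0x010: "1000baseT Half",
    0x020: "1000baseT Full",
}


def decode_advertise(mask):
    """Return the names of the link modes set in `mask`, lowest bit first."""
    return [name for bit, name in sorted(ADVERTISE_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_advertise(0x20))
```

With the mask from the command above, `0x20`, only 1000baseT Full is advertised.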

Comment 8 Andy Gospodarek 2009-11-10 16:43:35 UTC

Steve, any updates for Jesse and me?

Comment 11 Andy Gospodarek 2011-08-19 02:33:23 UTC

Sounds like the heat on the system was causing multiple network card failures.