Bug 527110
| Summary: | PCI network cards failing / flooding under load on older hardware | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Steve Morgan <captainmrgn> |
| Component: | kernel | Assignee: | Andy Gospodarek <agospoda> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.4 | CC: | jesse.brandeburg, peterm |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-08-19 02:33:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Steve Morgan
2009-10-04 16:19:01 UTC
For clarification, does this happen only on the Intel® Desktop Board D925XCV, or have you seen it on other systems as well? If it has happened on any systems made by a system vendor, please let me know what those are so I can try to find one and reproduce the problem. It's fine if this only reproduces on systems that you have built with this motherboard, but the more information I have the more helpful I can be.

I have reproduced the problem on the following motherboard as well. Product Name: P4M800CE-8237. Let me know if you need any more information.

Steve, can you send me a sosreport from one of these systems? It's odd that the sky2, via-rhine, e1000, and r8169-based devices all have problems and that different chipsets seem to be locking up in your testing. I've seen problems with sky2 from time to time, but I thought they were resolved with 5.4. There are also some known problems with 82545 hardware, so if you are seeing tx timeouts there I'm not surprised. Many have stopped using this hardware because of those problems. The via-rhine and r8169 problems are new to me though.

This is a shot in the dark, but I notice your lspci for your sky2-based system has 'ASF' in the description:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 17)

Can you check your BIOS and try to disable ASF or IPMI on that system, if available? I know rsync and ttcp can both use UDP, and I'm curious whether the motherboard's IPMI resources chewing up UDP port 623 are interfering and causing any sort of disruption. I know it's a long shot, but I'm curious. I'd also like to know if this happens consistently enough that you can capture the traffic on the network interface when the failure happens.

This looks to be a waste of time.
Although I have not tested cards other than the Intel Corporation 82545GM Gigabit Ethernet Controller, the problem has been resolved by putting a fan directly on the network card. The system case is in a very cool room where the CPU and hard drive temperatures are below normal. The problem is that, without any airflow directly on it, the card will hang. After thinking this over and over, I realized the problem resembled a heat issue. After touching the card and feeling that some of the chips were very warm, I put a fan on it. Taking the fan away causes the chips to get very hot immediately and produces the following errors:

NETDEV WATCHDOG: eth1: transmit timed out
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue <0>
  TDH <af>
  TDT <af>
  next_to_use <af>
  next_to_clean <a3>
buffer_info[next_to_clean]
  time_stamp <ffffaad8>
  next_to_watch <a3>
  jiffies <ffffb00c>
  next_to_watch.status <0>

There has to be a load on the interface for this to happen. If there is not enough network traffic, the interface seems to operate okay. When the fan is returned, the errors stop and network traffic resumes. I wonder if many of the famous Tx Unit Hang messages can be attributed to this.

I may have spoken too soon. This morning I found the card hanging again, but without the Tx Unit Hang messages. I will get the sosreport asap. This board does not have IPMI. I'm positive that I'm only using TCP because of my firewall rules. I captured the traffic with tcpdump and it looks normal, other than pausing for a few seconds while the network is hanging, resuming, and then pausing again.

Steve, it would be great if you can post the messages from the most recent tx unit hang as well as the sosreport tar-ball. I don't have a cheat-sheet for decoding these messages -- one would think I would have written this down by now :-/ -- but I should really develop one.
It would go a long way to helping others with these problems. Sometimes there are some 'false hangs' detected and I think this might be one of those. I'm sure Jesse can enlighten us.

Interesting about the heat issue; that case must really be designed poorly for airflow around the slots. It was a dead giveaway that more than one kind of card was failing. You could always glue a heatsink on top of the MAC chip. I'd be glad to look over some of the dmesg output for the Intel card. If you were having heat problems, that card could have been damaged. Do you happen to have >= 4GB of memory? If you're running gig and you change the driver to advertise only gigabit link, it might come back faster during reset (reducing your downtime):

ethtool -s ethX advertise 0x20

Steve, any updates for Jesse and me?

Sounds like the heat on the system was causing multiple network card failures. |
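As a rough stand-in for the cheat-sheet mentioned in the thread, here is a minimal Python sketch that pulls the hex fields out of a "Detected Tx Unit Hang" report and also decodes the bitmask passed to `ethtool -s ethX advertise`. It is not part of the e1000 driver or of ethtool. The helper names (`parse_tx_hang`, `summarize_tx_hang`, `decode_advertise`) are made up for illustration, and the ring size of 256 descriptors and HZ=1000 are assumptions about this system, not values taken from the report. The advertise bit values are the ones listed in the ethtool man page.

```python
import re

# Bit values accepted by `ethtool -s <dev> advertise` (from the ethtool man page).
ADVERTISE_BITS = {
    0x001: "10baseT Half",
    0x002: "10baseT Full",
    0x004: "100baseT Half",
    0x008: "100baseT Full",
    0x010: "1000baseT Half",
    0x020: "1000baseT Full",
}

# Matches "name <hex>" pairs such as "TDH <af>" or "next_to_watch.status <0>".
FIELD_RE = re.compile(r"([A-Za-z_.]+) <([0-9a-fA-F]+)>")


def decode_advertise(mask):
    """Return the link modes selected by an ethtool advertise bitmask."""
    return [name for bit, name in sorted(ADVERTISE_BITS.items()) if mask & bit]


def parse_tx_hang(text):
    """Extract the hex fields of a Tx Unit Hang report as integers."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def summarize_tx_hang(fields, ring_size=256, hz=1000):
    """Derive a few numbers from the parsed fields.

    ring_size and hz are ASSUMPTIONS (hypothetical defaults for this sketch);
    check the actual ring size and kernel HZ on the system in question.
    """
    # TDH/TDT are the hardware head/tail registers; next_to_use and
    # next_to_clean are the driver's own indices into the same ring.
    pending = (fields["next_to_use"] - fields["next_to_clean"]) % ring_size
    # The values are printed as 32-bit hex here, so mask the difference
    # to tolerate counter wrap-around.
    elapsed = (fields["jiffies"] - fields["time_stamp"]) & 0xFFFFFFFF
    return {
        "hw_ring_empty": fields["TDH"] == fields["TDT"],
        "descriptors_awaiting_cleanup": pending,
        "jiffies_since_queued": elapsed,
        "seconds_since_queued": elapsed / hz,
    }


SAMPLE = (
    "e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang Tx Queue <0> "
    "TDH <af> TDT <af> next_to_use <af> next_to_clean <a3> "
    "buffer_info[next_to_clean] time_stamp <ffffaad8> next_to_watch <a3> "
    "jiffies <ffffb00c> next_to_watch.status <0>"
)

if __name__ == "__main__":
    print(summarize_tx_hang(parse_tx_hang(SAMPLE)))
    print(decode_advertise(0x20))
```

For the report quoted in this bug, the sketch finds TDH equal to TDT, 12 descriptors between next_to_clean and next_to_use, and 1332 jiffies between time_stamp and the report. Whether that pattern indicates a real hang or one of the 'false hangs' mentioned above is not something the script can decide. `decode_advertise(0x20)` returns only "1000baseT Full", which matches the intent of the suggested ethtool command.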