Bug 436966
| Field | Value |
|---|---|
| Summary | e1000_clean_tx_irq: Detected Tx Unit Hang - 82546EB |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Target Milestone | beta |
| Reporter | Flavio Leitner <fleitner> |
| Assignee | Andy Gospodarek <agospoda> |
| QA Contact | Martin Jenner <mjenner> |
| CC | akarlsso, bilias, hasuzuki, jcavallaro, jesse.brandeburg, k.georgiou, pasteur, peterm, syeghiay, tao |
| Doc Type | Bug Fix |
| Last Closed | 2009-01-20 19:42:59 UTC |
| Bug Blocks | 448732 |
Description
Flavio Leitner
2008-03-11 13:15:07 UTC
Created attachment 297607 [details]
ifconfig output, ethtool -k,-i outputs and others
Could you check if this issue still reproduces with the kernel available at http://people.redhat.com/agospoda/#rhel5 ? That kernel is updated and has some test patches, so it would be good to check whether the issue still reproduces with it.
thanks, Flavio

The issue is still seen with the latest kernel from gospo.
Flavio

There are probably still a few bits (watchdog timer stuff) in the rhel5 e1000 driver that are NOT upstream, though it was promised they would get there. I don't think it's worth removing them, since that will cause another bug to appear again, but we could consider removing those bits and retesting. Is there ANY chance we can get this reproduced on a non-customer system?

If this is only seen under load, this patch should apply just fine on RHEL5 and can be used along with new module parameters to work around this issue:
http://people.redhat.com/agospoda/rhel4/0019-e1000-add-module-parameter-to-set-transmit-descript.patch
Please see the following entry for how to use this new module parameter to try to work around issues with the 82545/6:
https://bugzilla.redhat.com/show_bug.cgi?id=334411#c47

My test kernels have been updated to include a patch for this bugzilla.
http://people.redhat.com/agospoda/#rhel5
Please test them and report back your results.

Anders, can you tell me what tuning parameters they used? I'd like to know what they used for TxDescPower and TxDescriptors. Thanks!

I don't think disabling TSO is a guaranteed way to prevent this problem, but if it works for the customer I would say they should continue to do that.

in kernel-2.6.18-118.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Hello, this is also happening on RHEL 4.7. Is there any available *official* fix? Thanks

There will be a fix for 4.8. See bug 334411

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html

I just had this again on 5.3, kernel 2.6.18-128.1.16.el5PAE.
TSO is disabled. I had this in the past with Fedora, and disabling TSO:
ethtool -K eth0 tso off
solved the problem. No luck now. However, the system didn't hang this time.
Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:14 localhost kernel: Tx Queue <0>
Aug 6 04:10:14 localhost kernel: TDH <2d>
Aug 6 04:10:14 localhost kernel: TDT <2d>
Aug 6 04:10:14 localhost kernel: next_to_use <2d>
Aug 6 04:10:14 localhost kernel: next_to_clean <d9>
Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69>
Aug 6 04:10:14 localhost kernel: next_to_watch <d9>
Aug 6 04:10:14 localhost kernel: jiffies <7aa845f>
Aug 6 04:10:14 localhost kernel: next_to_watch.status <1>
Aug 6 04:10:52 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:52 localhost kernel: Tx Queue <0>
Aug 6 04:10:52 localhost kernel: TDH <42>
Aug 6 04:10:52 localhost kernel: TDT <42>
Aug 6 04:10:52 localhost kernel: next_to_use <42>
Aug 6 04:10:52 localhost kernel: next_to_clean <21>
Aug 6 04:10:52 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:52 localhost kernel: time_stamp <7ab0523>
Aug 6 04:10:52 localhost kernel: next_to_watch <24>
Aug 6 04:10:52 localhost kernel: jiffies <7ab0964>
Aug 6 04:10:52 localhost kernel: next_to_watch.status <1>
Aug 6 04:10:54 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:54 localhost kernel: Tx Queue <0>
Aug 6 04:10:54 localhost kernel: TDH <ca>
Aug 6 04:10:54 localhost kernel: TDT <ca>
Aug 6 04:10:54 localhost kernel: next_to_use <ca>
Aug 6 04:10:54 localhost kernel: next_to_clean <a1>
Aug 6 04:10:54 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:54 localhost kernel: time_stamp <7ab0cfa>
Aug 6 04:10:54 localhost kernel: next_to_watch <a4>
Aug 6 04:10:54 localhost kernel: jiffies <7ab11bf>
Aug 6 04:10:54 localhost kernel: next_to_watch.status <1>
Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 50
Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
I/O ports at 3000 [size=64]
Capabilities: [dc] Power Management version 2
Capabilities: [e4] PCI-X non-bridge device
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:04:02.0
ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
regards,
Giannis
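The hang reports above all follow the same layout, so the fields can be pulled out with a short script. The sketch below is only an illustration and not part of the driver: the helper names `parse_hang_dump` and `summarize` are made up for this example, the values are assumed to be printed in hexadecimal, and `hz=1000` is an assumed tick rate that should be replaced with the HZ of the running kernel. It reports whether TDH equals TDT (a later reply in this thread treats that as a sign that the hardware finished its work) and how many jiffies the buffer at next_to_clean has been waiting.

```python
import re

# Dump copied from the 04:10:14 "Detected Tx Unit Hang" report above.
SAMPLE = """\
Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:14 localhost kernel: Tx Queue <0>
Aug 6 04:10:14 localhost kernel: TDH <2d>
Aug 6 04:10:14 localhost kernel: TDT <2d>
Aug 6 04:10:14 localhost kernel: next_to_use <2d>
Aug 6 04:10:14 localhost kernel: next_to_clean <d9>
Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69>
Aug 6 04:10:14 localhost kernel: next_to_watch <d9>
Aug 6 04:10:14 localhost kernel: jiffies <7aa845f>
Aug 6 04:10:14 localhost kernel: next_to_watch.status <1>
"""

# Matches "kernel: <name> <value>" lines; lines without <...> are skipped.
FIELD = re.compile(r"kernel:\s+([A-Za-z_. ]+?)\s+<([0-9a-fA-F]+)>")


def parse_hang_dump(text):
    """Return {field name: integer value} for one hang report.

    Assumption: every value is hexadecimal, as TDH <2d> suggests.
    """
    fields = {}
    for line in text.splitlines():
        match = FIELD.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields, hz=1000):
    """Return (TDH == TDT, age in jiffies, age in seconds).

    hz is an assumed tick rate; check the kernel configuration.
    """
    head_equals_tail = fields["TDH"] == fields["TDT"]
    age_jiffies = fields["jiffies"] - fields["time_stamp"]
    return head_equals_tail, age_jiffies, age_jiffies / hz


if __name__ == "__main__":
    same, age, seconds = summarize(parse_hang_dump(SAMPLE))
    print(f"TDH == TDT: {same}, buffer age: {age} jiffies (~{seconds:.2f}s at assumed HZ)")
```

For the 04:10:14 report this gives TDH == TDT and a buffer age of 1270 jiffies; converting that to seconds depends entirely on the assumed HZ.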
(In reply to comment #38)

Giannis, there are really only 3 ways to try to work around the known problem with this hardware.

1. Disable TSO. (You have tried this.)

2. Use the new module option described below (usually combined with an increase in the number of ring buffers so that you can keep the same amount of packet memory).

/* Transmit Descriptor Power
 *
 * Valid Range: 6-12
 * This value represents the size-order of each transmit descriptor.
 * The valid size for descriptors would be 2^6 (64) to 2^12 (4096) bytes
 * each. As this value decreases one may want to consider increasing
 * the TxDescriptors value to maintain the same amount of frame memory.
 *
 * Default Value: 12
 */
E1000_PARAM(TxDescPower, "Binary exponential size (2^X) of each transmit descriptor");

3. Effectively disable adaptive interrupt modulation by setting the module option InterruptThrottleRate=8000 for all devices.

/* Interrupt Throttle Rate (interrupts/sec)
 *
 * Valid Range: 100-100000 (0=off, 1=dynamic, 3=dynamic conservative)
 */
E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate");

Unfortunately many of our users have reported that the only method to truly stop seeing these errors is to use a different network adapter. In this

(In reply to comment #38)
> I just had this again
> on 5.3 kernel 2.6.18-128.1.16.el5PAE
>
> TSO is disabled. I had this in the past with Fedora
> and disabling TSO:
> ethtool -K eth0 tso off
> solved the problem. No luck now. However the system
> didn't hung this time.

In the past with Fedora *on this system*? What kind of system is this? Can you please attach lspci output?

> Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> Aug 6 04:10:14 localhost kernel: Tx Queue <0>
> Aug 6 04:10:14 localhost kernel: TDH <2d>
> Aug 6 04:10:14 localhost kernel: TDT <2d>

Since TDH==TDT here, this is a "false hang" where the hardware actually completed all available work and something went wrong in the writeback process. OR, if these messages were all that was in your log, i.e. there was no NETDEV_WATCHDOG message, this is actually a false hang indicating that your system is for some reason taking an extremely long time to transmit some packets: they sit in the hardware tx ring for longer than two seconds and are eventually completed.

Are you running at 10Mb or 100Mb? Do you have flow control enabled? Can you please send the output of ethtool -S eth0 after one of these messages appears in the log?

I mostly agree with what Andy said in his post, but in this case I'm not sure his statements would apply.

Created attachment 356571 [details]
lspci -vvv
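The TxDescPower comment quoted in the reply above says that lowering the value may call for a larger TxDescriptors count to keep the same amount of frame memory. The arithmetic can be sketched as follows. This is a rough illustration only: the function names are invented for this example, the starting count of 256 descriptors is an arbitrary illustrative value, and the sketch does not check whatever upper limit the driver places on TxDescriptors.

```python
def frame_memory_bytes(tx_descriptors, tx_desc_power):
    """Frame memory as described in the module comment: count * 2^power bytes."""
    if not 6 <= tx_desc_power <= 12:
        raise ValueError("TxDescPower must be in the range 6-12")
    return tx_descriptors * (1 << tx_desc_power)


def descriptors_to_keep_memory(old_descriptors, old_power, new_power):
    """Descriptor count needed at new_power to match the old frame memory.

    Rounds up so the new configuration never has less memory than the old one.
    Hypothetical helper: driver-side limits on TxDescriptors are not applied.
    """
    target = frame_memory_bytes(old_descriptors, old_power)
    per_descriptor = frame_memory_bytes(1, new_power)
    return -(-target // per_descriptor)  # ceiling division


if __name__ == "__main__":
    # Illustrative values only: 256 descriptors at the default power of 12.
    old = frame_memory_bytes(256, 12)
    new_count = descriptors_to_keep_memory(256, 12, 9)
    print(f"{old} bytes at power 12; power 9 needs {new_count} descriptors")
```

Each step down in TxDescPower halves the per-descriptor size, so the descriptor count has to double to hold the same amount of memory.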
I've attached the lspci output.

System: dual Xeon(TM) CPU 3.20GHz with 4G RAM, running as an ftp/http mirror with htb enabled. eth0 is connected to a 1 Gigabit port.

The system used to run Fedora. At that time I had run into this problem http://bugzilla.kernel.org/show_bug.cgi?id=9808 and I solved it by disabling TSO. The system had random cold hangs. I had this again (cold hangs) when I moved to CentOS, and disabling TSO solved it again.

This is the first time I've run into this with TSO disabled. This time the system didn't hang. However, it was slow...

Other system options:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.ip_conntrack_max=131072

This time it happened right after my data transfer from a backup server. I was rsyncing the data back to the server for more than one day at 700-800Mbps with no problem. 10 minutes after the transfer finished and I enabled the services, I had the Tx Unit Hang.

Bear in mind that I've updated to 2.6.18-128.4.1.el5PAE today.

If you need any more info I would be glad to help.
Giannis

Can you paste the output of:
# ethtool -a eth0
# ethtool eth0
from any time the system is in use. And also
# ethtool -S eth0
after a failure like you have seen in comment #38?

ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: off
ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes
If I have a failure I will post ethtool -S eth0
Giannis
Hi,
I had another Tx Unit Hang this morning on this same machine (5.4).
2.6.18-164.10.1.el5PAE
TSO is disabled. I will post all the details:
Jan 13 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 13 04:02:21 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 13 04:02:21 host kernel: Tx Queue <0>
Jan 13 04:02:21 host kernel: TDH <e7>
Jan 13 04:02:21 host kernel: TDT <e7>
Jan 13 04:02:21 host kernel: next_to_use <e7>
Jan 13 04:02:21 host kernel: next_to_clean <bd>
Jan 13 04:02:21 host kernel: buffer_info[next_to_clean]
Jan 13 04:02:21 host kernel: time_stamp <88b1b22>
Jan 13 04:02:21 host kernel: next_to_watch <bf>
Jan 13 04:02:21 host kernel: jiffies <88b3b60>
Jan 13 04:02:21 host kernel: next_to_watch.status <1>
Jan 13 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:04:02.0
# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: off
# ethtool -S eth0
NIC statistics:
rx_packets: 939575758
tx_packets: 1521712731
rx_bytes: 263510394579
tx_bytes: 2171648060330
rx_broadcast: 293
tx_broadcast: 292
rx_multicast: 0
tx_multicast: 13754
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 596322
rx_missed_errors: 302320
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 1
tx_restart_queue: 1933172
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 747
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 263510394579
rx_csum_offload_good: 939560507
rx_csum_offload_errors: 6022
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
04:02.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (63750ns min), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 233
Region 0: Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at 3000 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] PCI-X non-bridge device
Command: DPERE- ERO+ RBC=512 OST=1
Status: Dev=04:02.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
best regards,
Giannis
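One quick check on the `ethtool -S` counters above is whether any pause frames were ever counted, since flow control came up earlier in the thread. The sketch below is a hedged example: `parse_ethtool_stats` and `pause_frames_seen` are hypothetical helper names, and the sample is a small excerpt of the statistics posted above rather than live output.

```python
# Excerpt of the "ethtool -S eth0" output posted above.
SAMPLE = """\
NIC statistics:
rx_no_buffer_count: 596322
rx_missed_errors: 302320
tx_timeout_count: 1
tx_restart_queue: 1933172
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
"""

FLOW_CONTROL_COUNTERS = (
    "rx_flow_control_xon",
    "rx_flow_control_xoff",
    "tx_flow_control_xon",
    "tx_flow_control_xoff",
)


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict, skipping the heading line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def pause_frames_seen(stats):
    """Return only the flow-control counters that are non-zero."""
    return {
        name: stats[name]
        for name in FLOW_CONTROL_COUNTERS
        if stats.get(name, 0) != 0
    }


if __name__ == "__main__":
    stats = parse_ethtool_stats(SAMPLE)
    print("pause frames counted:", pause_frames_seen(stats) or "none")
    print("tx_timeout_count:", stats["tx_timeout_count"])
```

An empty result means no XON/XOFF frames were counted in either direction for this excerpt.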
Giannis, thank you for posting that detailed information in comment #44 and comment #45. I see that flow control is enabled, but the statistics do not indicate XOFF frames were received. Jesse, that seems to rule out flow control as the problem, right?

Yes, flow control is confirmed not to be the issue. This seems to be a slightly different report than before because we are now seeing NETDEV WATCHDOG, which means that transmits were not completed for a long time. The hangs in comment 38 and 45 show that TDH==TDT, which means that the hardware has finished processing tx packets. At this stage the only way we can figure out what is going on is to get a full descriptor ring dump from the e1000_dump function (or get a PCI-X bus trace). Typically the driver is not getting the DD bit back from the descriptor writeback in these cases, usually due to some weird race, but usually these issues aren't reported in Intel systems.

I have a prototype patch I made for upstream that I will attach. Andy, not sure if you could build a kernel or driver for Kapetanakis. The patch builds against net-next but I've not done much testing on it. He will need to either load the module with the debug_dump=2 module option or modify the sysfs parameter of the same name at runtime. This will dump a ton to dmesg/syslog, so decreasing tx/rx descriptors can sometimes be helpful in reducing the amount of data dumped (maybe 80/80 or 128/128 using ethtool -G).

btw, I have a system of that class/vintage in my office, but I don't have 4GB of RAM. The slot that you're in, in that machine, is typically a shared PCI-X slot. Is there a chance you could rearrange the adapters so the adaptec and 82546 switch slots? It might make a difference, but I realize this might not be an easy experiment on a production machine.

Andy, I don't know if you're referring to me, but this is a production server (http/ftp official mirror for tons of sites, including fedora, centos etc). Thus I cannot play with custom kernels... In addition, the network interface is on board. Also, I haven't had another issue in the last 10 days, so we can't be sure whether something makes a difference or not...
best rgds
Giannis

I was wrong. I had one more yesterday:
Jan 19 04:02:21 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 19 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Jan 19 04:05:26 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 19 04:05:26 host kernel: Tx Queue <0>
Jan 19 04:05:26 host kernel: TDH <3>
Jan 19 04:05:26 host kernel: TDT <3>
Jan 19 04:05:26 host kernel: next_to_use <3>
Jan 19 04:05:26 host kernel: next_to_clean <bb>
Jan 19 04:05:26 host kernel: buffer_info[next_to_clean]
Jan 19 04:05:26 host kernel: time_stamp <27746cb3>
Jan 19 04:05:26 host kernel: next_to_watch <bd>
Jan 19 04:05:26 host kernel: jiffies <27747376>
Jan 19 04:05:26 host kernel: next_to_watch.status <1>

These "NETDEV WATCHDOG: eth0: transmit timed out" messages are quite frequent, at least one every day. But a Tx Unit Hang does not always follow:
Jan 20 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 20 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Created attachment 385761 [details]
e1000 dump code upstream proposal (for reference)
untested upstream patch.
Giannis, Jesse,
Hi, I think it could be related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=499355#c8
Maybe it would be good to give it a try.
Flavio

Also it might be related to the htb code. I'm running QoS on this machine.
https://bugzilla.redhat.com/show_bug.cgi?id=481546

Kapetanakis, is there a chance you can try running with the patch from comment 50? If you need me to build you an e1000 driver from sourceforge with the patch applied, I can do that. Should this bug be reopened?

I could do that, but how would you know if it fixes my problem? The system has been up for 14 days with no problem... except for the "NETDEV WATCHDOG: eth0: transmit timed out" message. Anyway, do you want me to try e1000-8.0.16.tar.gz with your patch on it?

I wasn't expecting the sourceforge driver to fix your problem. I (perhaps wrongly) assumed that you would need a "stand alone" driver to replace your redhat e1000. I can actually build you a driver source *for* your kernel, from the redhat sources with my patch applied, if I know exactly what kernel you're running. Also, it would help answer some questions for me if you could attach your /proc/interrupts and your dmesg.

One of the things about this 7320 system is that the kernel won't allow interrupt affinity due to some system bug, so irqs are moving to the next processor on every interrupt. I have a 7320 system in my office. This is probably not related but worth mentioning because it can encourage some racy code behavior.

There are also some test kernels for a backport bug in https://bugzilla.redhat.com/show_bug.cgi?id=499355, but that bug is against RHEL4.

You're right, we should apply the patch to the running driver... and not to sourceforge's driver. Yes, you can send me the patch to apply to my kernel sources. Best if it only recompiles the e1000 module and not the whole monster. I will attach dmesg and /proc/interrupts.
Giannis

Created attachment 389387 [details]
/proc/interrupts
Created attachment 389388 [details]
dmesg
Sorry, I forgot to add the kernel version: PAE 2.6.18-164.11.1.el5PAE. It's in dmesg, but never mind :)