Bug 112377
Summary: | (TG3) driver stops sending packages.
---|---
Product: | Red Hat Enterprise Linux 3
Component: | kernel
Version: | 3.0
Hardware: | i686
OS: | Linux
Status: | CLOSED WONTFIX
Severity: | medium
Priority: | medium
Reporter: | Need Real Name <juha.o.ylitalo>
Assignee: | David Miller <davem>
CC: | eric.eisenhart, jgarzik, lakamine, msattler, ngaywood, pcrooker, petrides, riel, rperkins, shawn174, tao
Doc Type: | Bug Fix
Last Closed: | 2007-10-19 19:32:06 UTC
Description
Need Real Name
2003-12-18 17:09:21 UTC
I get the same problem... See also comments for bug #111250.

Created attachment 98447
tg3 timeout messages
I'm getting this problem as well. This is with both the RH9 and EL kernels:
kernel-bigmem-2.4.20-30.9 and kernel-smp-2.4.21-9.0.1.EL
As you can see from the attached log, the timeouts come in bursts, last for a
few minutes, and then go away for a day or two.
The system is a quad-processor Dell 6600 with 16G RAM on a Cisco 100Mbit switch.
We have 20-60 active LTSP X workstations on 10Mbit Cisco switches attached.
This timeout causes gdm to log them all off.
Note that the network is not necessarily loaded when this happens. It has
happened late at night when no one is around. We also quite often get through
times of very high network load with no trouble.
I'm trying to work out with the network guys whether they do anything to the
switches at the time this problem occurs. It doesn't sound like they do.

We were able to (we think) resolve this issue by both upgrading to kernel 2.4.21-9.0.1.ELsmp and turning off autoneg on the affected interface and pegging it manually to 100 full duplex.

Can you send me the full kernel log messages that get generated on this machine, not just the timeout messages? I want to see what model of tg3 chip you have in this system. Thanks.

Created attachment 98986
"grep kernel /var/log/messages" from last boot
Just had another episode of watchdog timeouts. The last one was about a week
and a half ago. The current one was the longest so far. The interface seems to
be active for a few seconds between each timeout. Here is the timing:
Mar 31 10:02:45 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:04:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:07:25 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:12:40 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:15:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:20:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:24:00 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:26:35 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:29:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Attached are the boot messages for this system.
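
To make the "few seconds between each timeout" observation concrete, the spacing of the watchdog messages above can be computed directly from the syslog timestamps. This is only a minimal sketch: the helper names are made up for illustration, and because syslog lines carry no year, the year 2004 is an assumption passed in purely so the timestamps can be parsed.

```python
from datetime import datetime

# Watchdog lines copied from the comment above.
LOG = """\
Mar 31 10:02:45 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:04:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:07:25 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:12:40 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:15:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:20:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:24:00 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:26:35 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:29:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
"""


def watchdog_times(text, year=2004):
    """Return a datetime for every NETDEV WATCHDOG line in a syslog excerpt.

    The year is an assumption: classic syslog timestamps do not include one.
    """
    times = []
    for line in text.splitlines():
        if "NETDEV WATCHDOG" not in line:
            continue
        # First three whitespace-separated fields are month, day, time.
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Seconds elapsed between each consecutive pair of timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


times = watchdog_times(LOG)
gaps = gaps_in_seconds(times)
print(gaps)
```

Run against a full "grep kernel /var/log/messages" dump, the same two helpers would show whether the bursts have a regular period or are irregular, which is hard to judge by eye.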

I stated previously in comment #3 that this problem did not seem to be workload related. I'm changing my mind. We have a local HTTP yum update mirror that we use to keep a lot of workstations updated. We run it off this troublesome server. We can now consistently cause the watchdog timeout by doing a yum update from a workstation. The timeouts in comment #6 were caused by this. I just reproduced it again this morning. Odd, because I don't think yum update loads the network anything like the 60-odd X sessions or the network backups do. We often get through those periods without problem. The yum update that triggers it is a large one, from a newly installed FC1 workstation.

Also, this timeout problem seems very rare. Google finds very few reports like this, and this bugzilla seems to be lacking "me too"s. One person who had a similar problem on their Dell laptop told me in a private email that they thought it was a H/W problem. Replacing their motherboard fixed the problem. So I'm thinking H/W problem right now. My system (comment #6) has two interfaces, one of which is unused. I'll try switching to the other interface and see how I go.

This has happened twice to me, with kernels 2.4.9-e.25smp and 2.4.21-9.0.1.ELsmp, on two different HP DL380 G3s, once on each machine. It is not triggered by load. These details are for the system running 2.4.21-9.0.1.ELsmp.
First, we rebooted a switch. I do not know if this is relevant, but in the interests of completeness:
Apr 27 17:37:53 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:37:58 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:37:58 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:16 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:17 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:17 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:20 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:21 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:21 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:22 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Notice that it eventually comes up as half duplex. I was forcing it to full with mii-tool (though I will switch to using ethtool as 2.4.21-9's tg3 driver fixes the ethtool bug). With 2.4.21-4, forcing the interface to full with mii-tool causes the autoneg to be re-enabled after losing and re-establishing link. I don't know if this happens with 2.4.21-9, as I'm usually using ethtool on these systems. (The 2.4.9 system this happened on also had the interface forced to full with mii-tool.)
The switch thinks the link is full duplex, because it's forced:
Apr 27 17:38:25 jc1tdssw1.XXX SYST: Port 39 link active 100Mbs FULL duplex
So now I have a duplex mismatch. Hours later:
Apr 27 20:51:14 ---- monitoring on another machine notices this machine off the network. It had to have not responded to a single ping for at least 5 and at most 60 seconds.
Apr 27 20:53:10 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 20:53:10 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 20:53:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 20:53:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:00:10 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:07:50 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:15:25 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:23:00 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:23:00 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:23:00 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:23:00 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:23:05 ------- monitoring on another machine notices this machine is back on the network. It responded to a ping within the last 5-60 seconds.
Apr 27 21:34:24 jc1lpm1 ntpd[1369]: synchronisation lost
Apr 27 21:41:29 jc1lpm1 ntpd[1369]: time reset 2.103136 s
Apr 27 21:41:29 jc1lpm1 ntpd[1369]: synchronisation lost
Apr 27 22:01:59 jc1lpm1 ntpd[1369]: time reset -0.435547 s
Apr 27 22:01:59 jc1lpm1 ntpd[1369]: synchronisation lost
Either the fact that this affected the clock is worrying, or the fact that my clock drifts so badly without NTP is worrying. (I presume the former.)
The second incident. Someone other than myself did an ifconfig down/up, mii-tool, possibly other steps:
Apr 28 08:45:11 ------ noticed it was down
Apr 28 08:47:42 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 28 08:47:43 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 28 08:47:45 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 28 08:47:45 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 28 08:48:12 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 28 08:48:12 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 28 08:53:07 ------- noticed it was up
Network traffic at the time:
19:10:00  IFACE  rxpck/s  txpck/s   rxbyt/s    txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
20:40:00  lo       21.81    21.81  12499.35   12499.35     0.00     0.00      0.00
20:40:00  eth0    189.15   270.01  22045.19  393116.88     0.00     0.00     14.66
20:40:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
20:50:00  lo       21.41    21.41  12353.81   12353.81     0.00     0.00      0.00
20:50:00  eth0    188.31   267.40  22034.00  389373.65     0.00     0.00     15.00
20:50:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:00:01  lo        3.27     3.27   2970.68    2970.68     0.00     0.00      0.00
21:00:01  eth0      3.61     0.13    862.18      78.31     0.00     0.00      1.51
21:00:01  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:10:00  lo        1.22     1.22    125.68     125.68     0.00     0.00      0.00
21:10:00  eth0      0.45     0.26     38.93      24.53     0.00     0.00      0.00
21:10:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:20:00  lo        1.61     1.61    106.62     106.62     0.00     0.00      0.00
21:20:00  eth0      0.11     0.26      2.29      21.40     0.00     0.00      0.00
21:20:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:30:00  lo        9.50     9.50   6532.51    6532.51     0.00     0.00      0.00
21:30:00  eth0     15.86     0.99   8331.45     250.30     0.00     0.00     14.71
21:30:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:40:00  lo       11.37    11.37   8290.04    8290.04     0.00     0.00      0.00
21:40:00  eth0     21.29     1.60  10900.85     146.77     0.00     0.00     19.14
21:40:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
21:50:00  lo       11.37    11.37   8304.49    8304.49     0.00     0.00      0.00
21:50:00  eth0     21.30     1.62  10913.45     147.93     0.00     0.00     19.12
21:50:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
...
08:30:00  lo       44.80    44.80  19329.74   19329.74     0.00     0.00      0.00
08:30:00  eth0    155.71   134.77  30437.81  191712.49     0.00     0.00     54.71
08:30:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
08:40:00  lo       54.43    54.43  23011.76   23011.76     0.00     0.00      0.00
08:40:00  eth0    165.25   133.96  35862.55  190880.63     0.00     0.00     64.90
08:40:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
08:50:00  lo       45.69    45.69  21018.99   21018.99     0.00     0.00      0.00
08:50:00  eth0    110.65    62.30  31232.17   88666.79     0.00     0.00     63.30
08:50:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00
09:00:00  lo        6.45     6.45   3719.48    3719.48     0.00     0.00      0.00
09:00:00  eth0     72.44    77.82  11557.82  113636.20     0.00     0.00     19.93
09:00:00  eth1      0.00     0.00      0.00       0.00     0.00     0.00      0.00

Hello all, I came across the same problem. Here's something interesting. With the latest kernel, mii-tool does NOT work with tg3. I went back and tried using mii-tool and it said both interfaces were at 100MB Full, yet when I used mii-tool to set full100MB the messages log did not show the device coming up at 100MB Full. Whichever tool you use, changes to the speed or duplex of an interface should always show in /var/log/messages. I then checked with ethtool, and ethtool showed both interfaces at 100MB Half. Our switches are Cisco switches and set to 100MB Full.
I believe there is an issue with mii-tool where, with the newer kernel or the newer tg3 module, it simply cannot set duplex. That would make sense: the front interface is heavily used, and with it only being at 100MB Half, it started seeing collisions and, under heavy use, failed to communicate. The box then thinks the link is down until the traffic goes away. I believe it then tries to renegotiate with auto, and for some reason has trouble getting link back. Even with it being on auto, it should still come back with some semblance of a link, but it doesn't. Since switching to ethtool I have not seen the problem any more. Hope this helps; my servers have been running with no issues since.
Thanks, Marcus

We continue to have this problem very intermittently but consistently. In my experience, though, it has nothing to do with duplex settings: we always force the interfaces to 100Mb-FD and also use static addresses (no pumpd or DHCP, which other posts have suggested as a contributing factor). This last happened with kernel 2.6.8.1-25 and tg3.c v3.8 (July 14, 2004). lspci reports the adaptor as Broadcom Corp.|NetXtreme BCM5703X Gigabit Ethernet [NETWORK_ETHERNET]. Unfortunately there is no indication in the kernel log until the "eth0: transmit timed out, resetting" error. And it doesn't actually reset; this must be done manually. Just BTW, this has also been reported as Debian bug #278119 as well as in other independent posts.

This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.
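
Several comments in this thread hinge on the interface silently coming back at half duplex after a link flap while the switch port stays forced to full. One way to spot that from the logs alone is to scan for the tg3 "Link is up" messages and flag any that do not match the expected duplex. The sketch below is illustrative only: the function name and the `expected` parameter are made up here, the regular expression is written against the message format quoted in this bug rather than taken from the driver, and the sample lines are copied from the Apr 27 excerpt above.

```python
import re

# Matches lines such as:
#   Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
# The pattern is an assumption based on the log excerpts quoted in this bug.
LINK_UP = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: tg3: "
    r"(?P<iface>\w+): Link is up at (?P<speed>\d+) Mbps, "
    r"(?P<duplex>full|half) duplex\."
)

SAMPLE = """\
Apr 27 17:37:53 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:37:58 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:16 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:17 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:20 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:21 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:22 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
"""


def duplex_mismatches(text, expected="full"):
    """Return (timestamp, interface, speed, duplex) for each link-up event
    whose negotiated duplex differs from the expected setting."""
    found = []
    for line in text.splitlines():
        m = LINK_UP.match(line.strip())
        if m and m.group("duplex") != expected:
            found.append(
                (m.group("stamp"), m.group("iface"),
                 int(m.group("speed")), m.group("duplex"))
            )
    return found


mismatches = duplex_mismatches(SAMPLE)
for stamp, iface, speed, duplex in mismatches:
    print(f"{stamp}: {iface} came up at {speed} Mbps {duplex} duplex")
```

A hit here only says what the driver logged; as noted in the comments, the setting the tools report and the setting actually in effect did not always agree, so the switch-side log is still worth checking.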