Bug 112377

Summary:

(TG3) driver stops sending packages.

Product:

Red Hat Enterprise Linux 3

Reporter:

Need Real Name <juha.o.ylitalo>

Component:

kernel

Assignee:

David Miller <davem>

Status:

CLOSED WONTFIX

Severity:

medium

Priority:

medium

Version:

3.0

CC:

eric.eisenhart, jgarzik, lakamine, msattler, ngaywood, pcrooker, petrides, riel, rperkins, shawn174, tao

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2007-10-19 19:32:06 UTC

Attachments:

Description	Flags
tg3 timeout messages	none
"grep kernel /var/log/messages" from last boot	none

Description Need Real Name 2003-12-18 17:09:21 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131

Description of problem:
Hardware: IBM x335
When I use one of these to benchmark an application on another host,
I've started seeing the following kind of messages on our syslog server:
Dec 18 17:41:34 ulises kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Dec 18 17:41:34 ulises kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Dec 18 17:41:34 ulises kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Dec 18 17:41:34 ulises kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2

After we've received a couple of these, the machine stops sending and
receiving messages. If I go to the console and check things with ifconfig,
it still seems to have an IP address and all, but won't respond to pings.

Currently the only known workaround is to run
'/etc/init.d/network restart' from the console, or to have some syslog
watcher trigger it whenever the problem occurs.
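
The syslog-watcher workaround could be sketched roughly as below. This is an illustration only, not something taken from the report: the function name, the threshold of two messages, the 60-second window and the `year` argument are all assumptions. The matched text is the driver message quoted in the description, and each message is assumed to sit on a single line.

```python
import re
from datetime import datetime, timedelta

# Driver message quoted in this report.
TG3_TIMEOUT = re.compile(r"kernel: tg3: tg3_stop_block timed out")

def should_restart_network(lines, threshold=2,
                           window=timedelta(seconds=60), year=2003):
    """Return True if `threshold` or more tg3 timeouts fall within `window`.

    threshold, window and year are illustrative assumptions; classic syslog
    timestamps carry no year, so one has to be supplied by the caller.
    """
    stamps = []
    for line in lines:
        if not TG3_TIMEOUT.search(line):
            continue
        # Classic syslog prefix, e.g. "Dec 18 17:41:34" (15 characters).
        stamps.append(datetime.strptime(f"{year} {line[:15]}",
                                        "%Y %b %d %H:%M:%S"))
    stamps.sort()
    # Slide over the sorted timestamps looking for a dense enough burst.
    for i in range(len(stamps) - threshold + 1):
        if stamps[i + threshold - 1] - stamps[i] <= window:
            return True
    return False

# Command from the description, for the caller to run as root on a match.
RESTART_COMMAND = ["/etc/init.d/network", "restart"]
```

A watcher would feed this function the most recent log lines and run `RESTART_COMMAND` only when it returns True, so that a single stray message does not bounce the interface.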


Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. In our benchmarking, we make 50-150 parallel HTTPS connections to
another host. We send and receive something like 5-20 kB for each
query. The delay between sending a query and getting a response can be
anywhere from 2 to 100+ seconds, depending on how much load we have
managed to put on the other box.

Actual Results:  Host stops responding to pings, ssh, etc.

Expected Results:  Preferred state would be that packets would be sent
and received as they should...

Additional info:

Comment 1 Shawn Stephens 2004-01-19 13:48:28 UTC

I get the same problem...  See also comments for bug# 111250.

Comment 3 Norman Gaywood 2004-03-11 00:15:07 UTC

Created attachment 98447 [details]
tg3 timeout messages

I'm getting this problem as well. This is with both the RH9 and EL kernel.
kernel-bigmem-2.4.20-30.9 and kernel-smp-2.4.21-9.0.1.EL

As you can see from the attached log, the timeouts come in bursts, last for a
few minutes, and then go away for a day or two.

The system is a quad-processor Dell 6600 with 16G RAM on a Cisco 100Mbit switch.
We have 20-60 active LTSP X workstations on 10Mbit Cisco switches attached.
This timeout causes gdm to log them all off.

Note that the network is not necessarily loaded when this happens. It has
happened late at night when no one is around. We also quite often get through
times of very high network load with no trouble.

I'm trying to work out with the network guys whether they do anything to the
switches at the time this problem occurs. Doesn't sound like they do anything.

Comment 4 Tracy White 2004-03-29 14:33:32 UTC

We were able to (we think) resolve this issue by both upgrading to 
kernel 2.4.21-9.0.1.ELsmp as well as turning off autoneg on the 
affected interface and pegging it manually to 100 full duplex.
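
For reference, pinning an interface to 100 Mbit/s full duplex with autonegotiation off is commonly done with `ethtool -s`. The sketch below is only an illustration of that step: the helper names and the `eth0` default are assumptions, not details from the commenter's setup.

```python
import subprocess

def force_link_command(iface="eth0", speed=100, duplex="full"):
    """Build the ethtool call that disables autoneg and pins speed/duplex."""
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    return ["ethtool", "-s", iface,
            "speed", str(speed), "duplex", duplex, "autoneg", "off"]

def force_link(iface="eth0", speed=100, duplex="full"):
    """Apply the setting; needs root and a driver with ethtool support."""
    subprocess.run(force_link_command(iface, speed, duplex), check=True)
```

The switch port has to be forced to the same speed and duplex, otherwise the two ends can disagree.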

Comment 5 David Miller 2004-03-29 23:16:07 UTC

Can you send me the full kernel log messages that get generated
on this machine, not just the timeout messages?

I want to see what model of tg3 chip you have in this system.
Thanks.

Comment 6 Norman Gaywood 2004-03-31 02:41:51 UTC

Created attachment 98986 [details]
"grep kernel /var/log/messages" from last boot

Just had another episode of watchdog timeouts. The last one was about a week
and a half ago. The current one was the longest so far. The interface seems to
be active for a few seconds between each timeout. Here is the timing:

Mar 31 10:02:45 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:04:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:07:25 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:12:40 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:15:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:20:20 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:24:00 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:26:35 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 31 10:29:55 turing kernel: NETDEV WATCHDOG: eth0: transmit timed out

Attached are the boot messages for this system.
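
The spacing between the watchdog timeouts can be worked out from the timestamps listed above. A minimal sketch follows; the helper name is illustrative, and the timestamps are copied from the log lines in this comment.

```python
from datetime import datetime

# Watchdog timestamps copied from the log excerpt in this comment.
TIMES = ["10:02:45", "10:04:20", "10:07:25", "10:12:40", "10:15:55",
         "10:20:20", "10:24:00", "10:26:35", "10:29:55"]

def gaps_in_seconds(times):
    """Return seconds between consecutive HH:MM:SS timestamps (same day)."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]

gaps = gaps_in_seconds(TIMES)
print(gaps)                   # [95, 185, 315, 195, 265, 220, 155, 200]
print(sum(gaps) / len(gaps))  # 203.75
```

So the timeouts in this episode recur every 1.5 to 5 minutes or so, about 3.4 minutes apart on average.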

Comment 7 Norman Gaywood 2004-04-01 23:57:41 UTC

I stated previously in comment #3 that this problem did not seem to be
workload related. I'm changing my mind.

We have a local http: yum update mirror that we use to keep a lot of
workstations updated. We run it off this troublesome server.
Consistently now we can cause the watchdog timeout by doing a yum
update from a workstation. The timeouts in comment #6 were caused by
this. I just reproduced it again this morning.

Odd, because I don't think yum update loads the network anything like
the 60 odd X sessions or the network backups. We often get through
those periods without problem. The yum update that triggers it is a
large one, from a newly installed FC1 workstation.

Also, this timeout problem seems very rare. Google turns up very few
reports like this, and this bugzilla seems to be lacking "me too"s. One
private email came from a person who had a similar problem on their Dell
laptop and thought it was a H/W problem. Replacing their motherboard fixed
the problem.

So I'm thinking H/W problem right now. My system (comment #6) has two
interfaces, one of which is unused. I'll try switching to the other
interface and see how I go.

Comment 8 Chuck Berg 2004-04-28 22:36:02 UTC

This has happened twice to me, with kernels 2.4.9-e.25smp and
2.4.21-9.0.1.ELsmp, two different HP DL380 G3s, happened once on each
machine. It is not triggered by load.

These details are for the system running 2.4.21-9.0.1.ELsmp

First, we rebooted a switch. I do not know if this is relevant, but in
the interests of completeness:

Apr 27 17:37:53 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:37:58 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:37:58 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:16 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:17 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:17 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:20 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:21 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 27 17:38:21 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 27 17:38:22 jc1lpm1 kernel: tg3: eth0: Link is down.
Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 27 17:38:24 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.

Notice that it eventually comes up as half duplex. I was forcing it to
full with mii-tool (though I will switch to using ethtool as
2.4.21-9's tg3 driver fixes the ethtool bug). With 2.4.21-4, forcing
the interface to full with mii-tool causes the autoneg to be
re-enabled after losing and re-establishing link. I don't know if this
happens with 2.4.21-9, as I'm usually using ethtool on these systems.
(The 2.4.9 system this happened on also had the interface forced to
full with mii-tool).

The switch thinks the link is full duplex, because it's forced:
Apr 27 17:38:25 jc1tdssw1.XXX SYST: Port 39 link active 100Mbs FULL duplex

So now I have a duplex mismatch.
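
A mismatch like this can be spotted from the kernel log alone. The following is a rough sketch, assuming each tg3 message sits on one line and that the switch side is known to be forced to full duplex; the function names and defaults are illustrative.

```python
import re

# Matches the tg3 "Link is up" message format shown in the log above.
LINK_UP = re.compile(
    r"tg3: (\w+): Link is up at (\d+) Mbps, (half|full) duplex")

def last_link_state(lines, iface="eth0"):
    """Return (speed_mbps, duplex) from the latest 'Link is up' line, or None."""
    state = None
    for line in lines:
        m = LINK_UP.search(line)
        if m and m.group(1) == iface:
            state = (int(m.group(2)), m.group(3))
    return state

def duplex_mismatch(lines, switch_duplex="full", iface="eth0"):
    """True if the host's last reported duplex differs from the switch port's."""
    state = last_link_state(lines, iface)
    return state is not None and state[1] != switch_duplex
```

Run against the link messages above, the last state reported for eth0 is 100 Mbps half duplex, which disagrees with the switch port forced to full.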

Hours later:
Apr 27 20:51:14 ---- monitoring on another machine notices this machine off the network. It had to have not responded to a single ping for at least 5 at most 60 seconds.
Apr 27 20:53:10 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 20:53:10 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 20:53:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 20:53:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:00:10 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:00:10 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:07:50 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:07:50 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:15:25 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 27 21:15:25 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 27 21:23:00 jc1lpm1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 27 21:23:00 jc1lpm1 kernel: tg3: eth0: transmit timed out, resetting
Apr 27 21:23:00 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Apr 27 21:23:00 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 27 21:23:05 ------- monitoring on another machine notices this machine is back on the network. It responded to a ping within the last 5-60 seconds.
Apr 27 21:34:24 jc1lpm1 ntpd[1369]: synchronisation lost
Apr 27 21:41:29 jc1lpm1 ntpd[1369]: time reset 2.103136 s
Apr 27 21:41:29 jc1lpm1 ntpd[1369]: synchronisation lost
Apr 27 22:01:59 jc1lpm1 ntpd[1369]: time reset -0.435547 s
Apr 27 22:01:59 jc1lpm1 ntpd[1369]: synchronisation lost

Either the fact that this affected the clock is worrying, or the fact
that my clock drifts so badly without NTP is worrying. (I presume the
former)

The second incident. Someone other than myself did an ifconfig down/up,
mii-tool, possibly other steps:
Apr 28 08:45:11 ------ noticed it was down
Apr 28 08:47:42 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 28 08:47:43 jc1lpm1 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Apr 28 08:47:45 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 28 08:47:45 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 28 08:48:12 jc1lpm1 kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
Apr 28 08:48:12 jc1lpm1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Apr 28 08:53:07 ------- noticed it was up

Network traffic at the time:
19:10:00        IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
20:40:00           lo     21.81     21.81  12499.35  12499.35      0.00      0.00      0.00
20:40:00         eth0    189.15    270.01  22045.19 393116.88      0.00      0.00     14.66
20:40:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
20:50:00           lo     21.41     21.41  12353.81  12353.81      0.00      0.00      0.00
20:50:00         eth0    188.31    267.40  22034.00 389373.65      0.00      0.00     15.00
20:50:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:00:01           lo      3.27      3.27   2970.68   2970.68      0.00      0.00      0.00
21:00:01         eth0      3.61      0.13    862.18     78.31      0.00      0.00      1.51
21:00:01         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:10:00           lo      1.22      1.22    125.68    125.68      0.00      0.00      0.00
21:10:00         eth0      0.45      0.26     38.93     24.53      0.00      0.00      0.00
21:10:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:20:00           lo      1.61      1.61    106.62    106.62      0.00      0.00      0.00
21:20:00         eth0      0.11      0.26      2.29     21.40      0.00      0.00      0.00
21:20:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:30:00           lo      9.50      9.50   6532.51   6532.51      0.00      0.00      0.00
21:30:00         eth0     15.86      0.99   8331.45    250.30      0.00      0.00     14.71
21:30:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:40:00           lo     11.37     11.37   8290.04   8290.04      0.00      0.00      0.00
21:40:00         eth0     21.29      1.60  10900.85    146.77      0.00      0.00     19.14
21:40:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
21:50:00           lo     11.37     11.37   8304.49   8304.49      0.00      0.00      0.00
21:50:00         eth0     21.30      1.62  10913.45    147.93      0.00      0.00     19.12
21:50:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
...
08:30:00           lo     44.80     44.80  19329.74  19329.74      0.00      0.00      0.00
08:30:00         eth0    155.71    134.77  30437.81 191712.49      0.00      0.00     54.71
08:30:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:40:00           lo     54.43     54.43  23011.76  23011.76      0.00      0.00      0.00
08:40:00         eth0    165.25    133.96  35862.55 190880.63      0.00      0.00     64.90
08:40:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:50:00           lo     45.69     45.69  21018.99  21018.99      0.00      0.00      0.00
08:50:00         eth0    110.65     62.30  31232.17  88666.79      0.00      0.00     63.30
08:50:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:00:00           lo      6.45      6.45   3719.48   3719.48      0.00      0.00      0.00
09:00:00         eth0     72.44     77.82  11557.82 113636.20      0.00      0.00     19.93
09:00:00         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Comment 9 Marcus 2004-12-29 16:44:03 UTC

Hello all,  I came across the same problem.

Here's something interesting.  With the latest kernel, mii-tool does
NOT work with tg3.  I went back and tried using mii-tool and it said
both interfaces were at 100MB Full, yet if I used mii-tool to set
full100MB the messages log did not show the device coming up at 100MB
Full.  Whichever tool you use, changes to Speed or Duplex of an
interface should always show in /var/log/messages.  I then checked
with ethtool and ethtool showed both interfaces at 100MB Half.  Our
switches are cisco switches and set to 100MB Full. 
 
I believe there is an issue with mii-tool where, with the newer kernel
or the newer tg3 module, it simply cannot set duplex.  That would make
sense: the front interface is heavily used, and with it only being at
100MB Half, it started seeing collisions and, under heavy use, became
unable to communicate.  The box then thinks the link is down until the
traffic goes away.  I believe it then tries to renegotiate with auto,
and for some reason has trouble getting link back.  Even with it being
on Auto, it should still come back with some semblance of a link, but
it doesn't.
 
Since switching to ethtool I have not seen the problem any more.

Hope this helps, my servers have been running with no issues since.

Thanks,
Marcus

Comment 10 pcrooker--at--orix-dot--com-dot--au 2005-10-10 23:37:09 UTC

We continue to have this problem very intermittently but consistently. But from
my experience it has nothing to do with duplex settings - we always force the
interfaces to 100Mb-FD and also use static addresses (no pumpd or DHCP as has
been thought to be a contributing factor in other posts). 

This has last happened with kernel 2.6.8.1-25 and tg3.c v3.8 (July 14, 2004).
lspci reports the adaptor as Broadcom Corp.|NetXtreme BCM5703X Gigabit Ethernet
[NETWORK_ETHERNET].

Unfortunately there is no indication in the kernel log until the "eth0: transmit
timed out, resetting" error. And it doesn't actually reset, this must be done
manually.

Just BTW, this has also been reported as Debian bug #278119, as well as in
other independent posts.

Comment 11 RHEL Program Management 2007-10-19 19:32:06 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.