|Summary:||RHEL4 kernel reports tg3_stop_block timed out and network interface stops responding|
|Product:||Red Hat Enterprise Linux 4||Reporter:||Marc Michelsen <marc>|
|Component:||kernel||Assignee:||John W. Linville <linville>|
|Status:||CLOSED CANTFIX||QA Contact:||Brian Brock <bbrock>|
|Version:||4.0||CC:||astokes, bkkh, clalance, davej, eric.eisenhart, jbaron, jukka.lehtonen, mchan, melo, ngaywood, riel, tao, tjb, togdon|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2006-02-16 17:55:30 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Marc Michelsen 2005-03-29 22:48:59 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Description of problem:
Under heavy network load eth0 stopped responding. I could not ping the machine or ping out from its console. Running "service network restart" fixed it, and it wasn't necessary to reboot. The following was in the log after it happened:

Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2
Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2

The machine is a Tyan 4882 quad-Opteron with 16GB of RAM, four 3ware 9500 SATA RAID cards and 40 400GB hard drives. There is one ext3 filesystem on each 3ware card.

The machine has frequently been under heavy network load for a couple of weeks with this kernel with no problems, rsyncing several terabytes of data from other machines. Most of this went to one filesystem on one 3ware card, a card that does not share a PCI bus with the onboard Broadcom gigabit Ethernet controllers. The problem occurred when about 100GB was being written to a filesystem on a 3ware card that does share a PCI bus with the onboard Broadcom controllers.

The motherboard has two gigabit Ethernet ports, and when the problem occurred only the one under heavy load, eth0, stopped; the other was still usable.

I don't know if the Broadcom and 3ware sharing the same PCI bus is part of the problem, but I did find the following report from someone who appears to have had the same problem with the 2.6.9 kernel:
http://lkml.org/lkml/2005/1/16/179
Later he follows up saying that 2.6.11-rc1 fixed his problem:
http://lkml.org/lkml/2005/1/23/77

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-6.25.EL

How reproducible:
Didn't try

Additional info:
Comment 1 Ian Neubert 2005-03-29 23:57:45 UTC
Me too! On x86_64 (Dual Opterons) with 2.6.9-5.0.3.ELsmp. I just start a big transfer with scp and I will either kernel panic (see bug # 152525) or get this bug's error. Slightly different ofs numbers though:

tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
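Since the reported ofs values differ from machine to machine, it can help to list the distinct offsets seen in a log before comparing reports. The following is a minimal sketch, assuming the syslog message format quoted in these comments; the function name stop_block_offsets is illustrative and not part of any tool mentioned in this bug.

```shell
# Print the distinct "ofs" values from tg3_stop_block timeout messages.
# Reads syslog-style text on stdin; assumes the message format quoted above.
stop_block_offsets() {
    sed -n 's/.*tg3_stop_block timed out, ofs=\([0-9a-f]*\) .*/\1/p' | sort -u
}

# Example (path is the usual RHEL location, adjust as needed):
#   stop_block_offsets < /var/log/messages
```

This only summarizes what the driver logged; it says nothing about which hardware block each offset refers to.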
Comment 2 John W. Linville 2005-04-04 15:05:54 UTC
Well, let's start by trying an update of the tg3 driver. Pre-built test kernels available here: http://people.redhat.com/linville/kernels/rhel4/ Please give them a try and post the results. Thanks!
Comment 3 John W. Linville 2005-04-04 15:10:48 UTC
Created attachment 112652 [details] jwltest-tg3-3_25-rh.patch
Comment 4 Marc Michelsen 2005-04-05 16:20:55 UTC
I've been running the new kernel for almost a day:

[root@challenger]# uptime
 08:34:25 up 23:29, 12 users, load average: 0.98, 1.29, 0.96
[root@challenger]# uname -a
Linux challenger 2.6.9-6.38.EL.jwltest.10smp #1 SMP Fri Apr 1 16:53:56 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
[root@challenger]#

and just before I got in today it did it again:

Apr 5 07:39:11 challenger kernel: tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2

After the first time it happened I added a cron job that runs once a minute. It tries to ping my gateway machine, and if it can't, it restarts the network. Well, it couldn't ping it, and it restarted the network successfully right after the above message.

[root@challenger]# cat checktg3
#!/bin/bash
/bin/ping -c 1 10.95.176.3 > /dev/null 2>&1
if [ $? -ne 0 ] ; then
    /etc/init.d/network restart
    echo | mail -s "tg3 appears to be down on `uname -n`. Restarting network." marc
fi
[root@challenger]#

I don't know for sure what was going on when it just happened, but we have a cluster of dual-Opteron machines running parallel numerical models constantly, and their output is now being written to a filesystem on this machine. The filesystem that users are writing to now is on a 3ware 9500 card that shares a PCI bus with the onboard Broadcom controllers.
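A slightly more cautious variant of the checktg3 idea is sketched below: it retries the ping a few times before restarting the network, so a single lost packet does not trigger a restart. This is only a sketch under stated assumptions. The names GATEWAY, TRIES, RESTART_CMD, gateway_reachable and check_link are illustrative and not from the original script, and the mail notification is left out for brevity.

```shell
#!/bin/bash
# Sketch of a retrying variant of the checktg3 cron job.
# GATEWAY, TRIES and RESTART_CMD are illustrative names (assumptions).
GATEWAY="${GATEWAY:-10.95.176.3}"
TRIES="${TRIES:-3}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/network restart}"

# Return 0 as soon as one single-packet ping to the gateway succeeds.
gateway_reachable() {
    i=0
    while [ "$i" -lt "$TRIES" ]; do
        ping -c 1 "$GATEWAY" > /dev/null 2>&1 && return 0
        i=$((i + 1))
    done
    return 1
}

# Restart the network only when every attempt failed.
check_link() {
    if gateway_reachable; then
        echo "ok"
    else
        echo "restart"
        $RESTART_CMD
    fi
}

# From cron, source this file and call: check_link
```

Like the original, this only works around the hang; it does nothing about whatever causes tg3_stop_block to time out.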
Comment 5 Travis Ogdon 2005-04-30 23:04:53 UTC
I'd been able to restart the networking successfully several times without issue. Just got a kernel panic:

Apr 30 15:04:01 cpq100 crond(pam_unix): session opened for user root by (uid=0)
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Apr 30 15:04:11 cpq100 syslogd: sendto: Network is unreachable
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Apr 30 15:04:11 cpq100 network: Shutting down interface eth0: succeeded
Apr 30 15:04:12 cpq100 network: Shutting down interface eth1: succeeded
Apr 30 15:04:12 cpq100 network: Shutting down loopback interface: succeeded
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.ip_forward = 0
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.conf.default.rp_filter = 1
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.conf.default.accept_source_route = 0
Apr 30 15:04:12 cpq100 sysctl: kernel.sysrq = 0
Apr 30 15:04:12 cpq100 sysctl: kernel.core_uses_pid = 1
Apr 30 15:04:12 cpq100 network: Setting network parameters: succeeded
Apr 30 15:04:14 cpq100 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000005
Apr 30 15:04:14 cpq100 kernel: printing eip:
Apr 30 15:04:14 cpq100 kernel: c0182ebb
Apr 30 15:04:14 cpq100 kernel: *pde = 00004001
Apr 30 15:04:14 cpq100 kernel: Oops: 0000 [#1]
Apr 30 15:04:14 cpq100 kernel: SMP
Apr 30 15:04:14 cpq100 kernel: Modules linked in: ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mod button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd cciss sd_mod scsi_mod
Apr 30 15:04:14 cpq100 kernel: CPU: 0
Apr 30 15:04:14 cpq100 kernel: EIP: 0060:[<c0182ebb>] Not tainted VLI
Apr 30 15:04:14 cpq100 kernel: EFLAGS: 00010286 (2.6.9-5.0.3.ELsmp)
Apr 30 15:04:14 cpq100 kernel: EIP is at remove_proc_entry+0x2f/0xe4
Apr 30 15:04:14 cpq100 kernel: eax: 00000000 ebx: 00000005 ecx: ffffffff edx: f7f87b80
Apr 30 15:04:14 cpq100 kernel: esi: c03332e0 edi: 00000005 ebp: c03b8fcc esp: c03b8f7c
Apr 30 15:04:14 cpq100 kernel: ds: 007b es: 007b ss: 0068
Apr 30 15:04:14 cpq100 kernel: Process swapper (pid: 0, threadinfo=c03b8000 task=c0312a60)
Apr 30 15:04:14 cpq100 kernel: Stack: f7f87b80 00000005 c35dd800 c03332e0 00000000 c03b8fcc f8ac2a6b c35dd800
Apr 30 15:04:14 cpq100 kernel: f8aa6f72 ee501080 dae0ea80 c0276c7f dae0ea80 c0454ea0 c33ba760 c0276a4c
Apr 30 15:04:14 cpq100 kernel: 00000000 c02769fc c01283df 00000246 c03b8fcc c03b8fcc 0000000a 00000001
Apr 30 15:04:14 cpq100 kernel: Call Trace:
Apr 30 15:04:14 cpq100 kernel: [<f8ac2a6b>] snmp6_unregister_dev+0x2f/0x3e [ipv6]
Apr 30 15:04:14 cpq100 kernel: [<f8aa6f72>] in6_dev_finish_destroy+0x71/0x80 [ipv6]
Apr 30 15:04:14 cpq100 kernel: [<c0276c7f>] dst_destroy+0x63/0xac
Apr 30 15:04:14 cpq100 kernel: [<c0276a4c>] dst_run_gc+0x50/0xd3
Apr 30 15:04:14 cpq100 kernel: [<c02769fc>] dst_run_gc+0x0/0xd3
Apr 30 15:04:14 cpq100 kernel: [<c01283df>] run_timer_softirq+0x123/0x145
Apr 30 15:04:14 cpq100 kernel: [<c0124b2c>] __do_softirq+0x4c/0xb1
Apr 30 15:04:14 cpq100 kernel: [<c0107f39>] do_softirq+0x4f/0x56
Apr 30 15:04:14 cpq100 kernel: =======================
Apr 30 15:04:14 cpq100 kernel: [<c011633f>] smp_apic_timer_interrupt+0xd9/0xdd
Apr 30 15:04:14 cpq100 kernel: [<c02c6aea>] apic_timer_interrupt+0x1a/0x20
Apr 30 15:04:14 cpq100 kernel: [<c01040e5>] mwait_idle+0x33/0x42
Apr 30 15:04:14 cpq100 kernel: [<c010409d>] cpu_idle+0x26/0x3b
Apr 30 15:04:14 cpq100 kernel: [<c0382784>] start_kernel+0x194/0x198
Apr 30 15:04:14 cpq100 kernel: Code: 56 53 55 55 89 14 24 89 44 24 04 75 13 8d 4c 24 04 89 e2 e8 11 f9 ff ff 85 c0 0f 85 b6 00 00 00 8b 5c 24 04 31 c0 83 c9 ff 89 df <f2> ae f7 d1 49 8b 04 24 89 cd 8d 70 34 83 78 34 00 0f 84 94 00
Apr 30 15:04:14 cpq100 kernel: <0>Kernel panic - not syncing: Fatal exception in interrupt

Any ideas on:
1. How long until we get a patch?
2. How to mitigate things until a patch arrives?

We're running RHEL 4 on an HP ProLiant DL380 G4 which appears to contain tg3-based Broadcom Gigabit Ethernet NICs ("NC7781" cards according to HP). I could use HP's driver:
http://h18004.www1.hp.com/support/files/server/us/download/22321.html
but I'd rather stick with the vanilla driver from RH. We've had loads of success with both RHEL 3 and 4 on a wide variety of DL320 gear.

The cron job from email@example.com has saved me several 30+ minute drives up until now... thanks Marc.
Comment 6 John W. Linville 2005-05-02 13:03:06 UTC
The oops in comment 5 would appear to be a different problem, the one reported in bug 151874. The kernels mentioned in comment 2 contain a patch for that problem. That patch has been submitted and should be available for U2.
Comment 9 John W. Linville 2005-05-11 20:22:12 UTC
I have an internal report that updating the firmware on the card resolved this issue for the RHEL3 version of this problem. Any chance you can get a firmware update for the tg3 hardware?
Comment 11 John W. Linville 2005-05-16 15:04:15 UTC
Still no solid leads...however, I have taken an update of the tg3 driver in the kernels here (same location as in comment 2): http://people.redhat.com/linville/kernels/rhel4/ Please try those and let me know the results w/ the current driver. Thanks!
Comment 13 Brian Harvey 2005-05-27 17:09:38 UTC
(In reply to comment #9)
> I have an internal report that updating the firmware on the card resolved this
> issue for the RHEL3 version of this problem. Any chance you can get a
> firmware update for the tg3 hardware?

Do you have more specifics on this internal report? Such as which firmware level and which cards? I'm seeing the same exact problem with RHEL 3.0 update3 on HP DL360G4's with HP's NC7782 card. I want to make sure that the firmware level I'm going to will resolve this issue.
Comment 14 John W. Linville 2005-05-27 17:12:49 UTC
No, sorry, nothing more specific. It was "the latest" of about 2 weeks ago, I believe. I'm pretty sure it was an HP card in that case, FWIW...
Comment 17 John W. Linville 2005-07-06 19:43:56 UTC
More test kernels w/ latest version of tg3 available at same location as in comment 11...please give them a try and post your results here...thanks!
Comment 18 Tom Christensen 2005-08-08 19:59:17 UTC
I have a Supermicro server with dual gigabit Broadcom NICs. I am having this problem as well; for me it seems to be associated with encrypted traffic. I run openvpn and ssh/sftp to this box often. Whenever I transfer a large file (50MB+) I get these same error messages, and the eth0 NIC freezes. It comes back up after 10-15 minutes on its own, as I see:

Aug 8 01:15:00 office kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 8 01:15:00 office kernel: tg3: eth0: transmit timed out, resetting
Aug 8 01:15:00 office kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Aug 8 01:15:00 office kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Aug 8 01:15:00 office kernel: tg3: eth0: Link is down.
Aug 8 01:15:02 office kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Aug 8 01:15:02 office kernel: tg3: eth0: Flow control is on for TX and on for RX.

However, that occurs after the NIC has been down for 10-15 minutes, and I see a ton of these:

Aug 8 01:13:30 office openvpn: read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Aug 8 01:13:35 office openvpn: read UDPv4 [EHOSTUNREACH]: No route to host (code=113)

I just tried your latest test kernel (with tg3 v3.33) and it still occurs. The eth0 NIC is plugged into a DSL router and has an iptables firewall on it. I don't think this has much to do with load (the DSL link is only 1.5mb and isn't getting close to saturating the 100mb link), unless it has something to do with TCP retransmits because the DSL link is saturated.
Comment 19 Jatin Nansi 2005-08-09 10:47:14 UTC
(In reply to comment #18)
Have you tried updating the firmware on the card? All reports of tg3 hangs have apparently been solved after that.
Comment 35 Thomas J. Baker 2005-09-15 16:16:39 UTC
I am experiencing this on two Dell PowerEdge 2550s that were recently upgraded to RHEL4. The 2.6.9-20.EL.jwltest.61smp kernel seems to have fixed the problem for me. A user was doing a large rsync and each time he started the tg3 would die. With this kernel, the rsync seems to be working so far.
Comment 36 John W. Linville 2005-09-15 17:42:51 UTC
Based on that info, I'll use this bug to track the update of tg3 in RHEL4 U3.
Comment 38 Thomas J. Baker 2005-09-16 15:18:57 UTC
I spoke too soon. The tg3 was dead this morning, and after a reboot another rsync run by a user killed it quickly again. These systems are dual 1.4GHz P3s with 2GB of memory. There is no firewall configured, and they are plugged into a gigabit switch with several other servers. During the install of RHEL4U1, booting with ACPI off did not help. I haven't tried it since. The errors from this morning:

Sep 16 07:15:10 crouchingtiger kernel: NETDEV WATCHDOG: eth1: transmit timed out
Sep 16 07:15:10 crouchingtiger kernel: tg3: eth1: transmit timed out, resetting
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Sep 16 07:15:11 crouchingtiger kernel: tg3: eth1: Link is down.
Sep 16 07:15:15 crouchingtiger kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 16 07:15:15 crouchingtiger kernel: tg3: eth1: Flow control is on for TX and on for RX.
Comment 39 John W. Linville 2005-09-19 15:14:53 UTC
Does a simple ifdown/ifup get traffic flowing again? Or is the reboot absolutely necessary?
Comment 40 Thomas J. Baker 2005-09-19 15:38:57 UTC
The last time this happened, I did a ifdown, rmmod tg3, ifup and it started working again but NFS mounts were still hosed. They may have recovered over time but rebooting was quicker.
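The recovery sequence described here can be sketched as a small helper. This is an illustration only: the function name reload_tg3 and the RUN switch are hypothetical, and the explicit "modprobe tg3" step is an addition that the comment does not mention (ifup would normally load the module again on its own). By default the helper only prints the commands, so it can be reviewed before anything is executed.

```shell
# Print the ifdown / rmmod / ifup recovery sequence for one interface.
# With RUN=1 in the environment the commands are executed instead (needs root).
# reload_tg3 and RUN are illustrative names; "modprobe tg3" is an added step.
reload_tg3() {
    dev=$1
    for cmd in "ifdown $dev" "rmmod tg3" "modprobe tg3" "ifup $dev"; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
}

# Example: reload_tg3 eth1
```

Note that unloading tg3 takes down every interface that uses the driver, not just the one that hung, and, as reported above, NFS mounts may not recover afterwards.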
Comment 41 Tom Christensen 2005-09-20 18:02:59 UTC
(In reply to comment #19) How do you update the firmware on the card? Where do I get it? Where do I put it? Is it like a bios update?
Comment 42 John W. Linville 2005-09-20 18:13:54 UTC
It would likely be similar to a BIOS update. If something is available, it would come from your card or system vendor.
Comment 43 Thomas J. Baker 2005-09-22 14:40:45 UTC
I just had a kernel panic with the test kernel. It seems unrelated to the tg3:

Sep 22 09:42:59 crouchingtiger kernel: Unable to handle kernel paging request at virtual address 0040b709
Sep 22 09:42:59 crouchingtiger kernel: printing eip:
Sep 22 09:42:59 crouchingtiger kernel: c0170496
Sep 22 09:42:59 crouchingtiger kernel: *pde = 00000000
Sep 22 09:42:59 crouchingtiger kernel: Oops: 0000 [#1]
Sep 22 09:42:59 crouchingtiger kernel: SMP
Sep 22 09:42:59 crouchingtiger kernel: Modules linked in: nfs nfsd exportfs lockd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac tg3 e100 mii floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm mptscsih mptbase aic7xxx sd_mod scsi_mod
Sep 22 09:42:59 crouchingtiger kernel: CPU: 0
Sep 22 09:42:59 crouchingtiger kernel: EIP: 0060:[<c0170496>] Not tainted VLI
Sep 22 09:42:59 crouchingtiger kernel: EFLAGS: 00010206 (2.6.9-20.EL.jwltest.61smp)
Sep 22 09:42:59 crouchingtiger kernel: EIP is at iput+0x25/0x61
Sep 22 09:42:59 crouchingtiger kernel: eax: 0040b6f5 ebx: c5d4494c ecx: f8bcabae edx: c5d4494c
Sep 22 09:42:59 crouchingtiger kernel: esi: dcb0d364 edi: dcb0d36c ebp: 0000007b esp: f7cf6eec
Sep 22 09:42:59 crouchingtiger kernel: ds: 007b es: 007b ss: 0068
Sep 22 09:42:59 crouchingtiger kernel: Process kswapd0 (pid: 47, threadinfo=f7cf6000 task=f7d17690)
Sep 22 09:42:59 crouchingtiger kernel: Stack: c5d4494c c016e0bc 00000000 0000008e 00000000 f7ffe9c0 c016e443 c0148718
Sep 22 09:42:59 crouchingtiger kernel: 005f1e00 00000000 00000021 00000000 0002ddaa 000000d0 00000020 c0324f00
Sep 22 09:42:59 crouchingtiger kernel: 00000002 c0324f00 0000000c c01499a4 c02cf3b4 0002ddaa f7cf6f9c 00000000
Sep 22 09:42:59 crouchingtiger kernel: Call Trace:
Sep 22 09:42:59 crouchingtiger kernel: [<c016e0bc>] prune_dcache+0x14b/0x19a
Sep 22 09:42:59 crouchingtiger kernel: [<c016e443>] shrink_dcache_memory+0x14/0x2b
Sep 22 09:42:59 crouchingtiger kernel: [<c0148718>] shrink_slab+0xf8/0x161
Sep 22 09:42:59 crouchingtiger kernel: [<c01499a4>] balance_pgdat+0x1d2/0x2f8
Sep 22 09:42:59 crouchingtiger kernel: [<c02cf3b4>] schedule+0x844/0x87a
Sep 22 09:42:59 crouchingtiger kernel: [<c011fedc>] prepare_to_wait+0x12/0x4c
Sep 22 09:42:59 crouchingtiger kernel: [<c0149b94>] kswapd+0xca/0xcc
Sep 22 09:42:59 crouchingtiger kernel: [<c011ffb1>] autoremove_wake_function+0x0/0x2d
Sep 22 09:42:59 crouchingtiger kernel: [<c02d103a>] ret_from_fork+0x6/0x14
Sep 22 09:42:59 crouchingtiger kernel: [<c011ffb1>] autoremove_wake_function+0x0/0x2d
Sep 22 09:42:59 crouchingtiger kernel: [<c0149aca>] kswapd+0x0/0xcc
Sep 22 09:42:59 crouchingtiger kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Sep 22 09:42:59 crouchingtiger kernel: Code: ff e9 e5 fe ff ff 53 85 c0 89 c3 74 58 83 bb 3c 01 00 00 20 8b 80 a4 00 00 00 8b 40 24 75 08 0f 0b 54 04 6c 8a 2e c0 85 c0 74 0b <8b> 50 14 85 d2 74 04 89 d8 ff d2 8d 43 1c ba f0 9d 32 c0 e8 66
Sep 22 09:42:59 crouchingtiger kernel: <0>Fatal exception: panic in 5 seconds

The tg3 hardware in question is listed by lspci -vv as:

03:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)

I'm running the latest Dell 2550 BIOS, and I've searched but haven't been able to find any tg3 firmware updates from Dell.
Comment 49 John W. Linville 2005-09-23 17:48:00 UTC
Created attachment 119196 [details] ifdebug Collect some handy network device debugging info... Usage: ifdebug <network device>
Comment 50 John W. Linville 2005-09-23 17:53:30 UTC
Anyone/everyone seeing this problem, please be sure to attach the output of running the "ifdebug" script from comment 49 against the failing device.
Comment 52 Thomas J. Baker 2005-09-23 18:19:23 UTC
Do you want the script to be run when the device is failing or is any time OK?
Comment 53 John W. Linville 2005-09-23 18:52:07 UTC
Specifically after the failure. I suppose a "working" sample wouldn't hurt either.
Comment 54 Tom Christensen 2005-09-28 05:46:47 UTC
Created attachment 119341 [details] ifdebug output on broken eth0 running tg3 driver
Comment 55 Tom Christensen 2005-09-28 05:51:04 UTC
Created attachment 119342 [details] ifdebug output working tg3
Comment 56 Tom Christensen 2005-09-28 05:55:45 UTC
OK, I am still running 2.6.9-11.ELsmp. (I've tried all of the test kernels on your site, John, and none of them fixes this issue for me.) I am using a Supermicro X6DAL-TB2 with dual integrated BCM5721 NICs. Supermicro does not have firmware updates. This is what I have so far:

1) Turning ACPI off definitely helps. The NIC will stay up a couple of days, vs. a couple of hours with it turned on.
2) Only the eth0 NIC ever goes down. I can dump 5+GB across my 1GB/s LAN to eth1 and it never drops; only eth0, which can never move data faster than ~1.5mbps (the speed of the DSL that eth0 is plugged into), ever drops.
3) I tried the bcm5700 driver from Broadcom and it acts the exact same way as the tg3 driver (with ACPI turned on it will stay up for a couple of hours; with ACPI off it will work for 3-5 days).
4) I don't believe it's a hardware problem, as I have 3 boxes (all with the same mobo/NICs) that all exhibit the exact same problem.

OK, so I have run the ifdebug script on the latest crashes (above); these were done with ACPI turned on. As it takes longer for the box to die with ACPI off, I'll have to post again when I get those debugs (if they would be helpful).
Comment 57 Tom Christensen 2005-10-01 18:40:50 UTC
Could this bug possibly be related to a race condition somewhere in interrupt handling? I use these 3 boxes for Asterisk, and they all have Digium hardware, which generates about 8000 interrupts per second per card (each box has 2 cards). I have been talking with Supermicro, and they cannot reproduce this bug in their lab with the exact same hardware (as far as motherboard etc. goes, not the Digium cards). The other "clue", as I mentioned above, is that the NICs only seem to fail when they are connected at, say, 100mbps but their actual available bandwidth is much less (i.e. they are transmitting data across a WAN link at 1mbps).
Comment 58 Tom Christensen 2005-10-03 01:37:58 UTC
OK, a little more info. I changed the setup of one of the boxes, putting eth0 on the inside and eth1 on the outside, and that broke my theory :). The eth1 NIC just never dies; the eth0 NIC on the inside dies. Another possible clue: ifconfig shows really weird stats on eth0:

eth0      Link encap:Ethernet  HWaddr 00:30:48:54:E5:78
          inet addr:192.168.55.1  Bcast:192.168.55.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe54:e578/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:162224 errors:0 dropped:0 overruns:0 frame:4294839772
          TX packets:1249098 errors:4294935415 dropped:0 overruns:0 carrier:0
          collisions:4294935415 txqueuelen:1000
          RX bytes:26933356 (25.6 MiB)  TX bytes:1619811485 (1.5 GiB)
          Interrupt:145

First of all, notice the number of collisions: 4.3 billion collisions on 1.4 million packets? Also, the TX bytes number jumps all over the place (517MB, then I downloaded a large file onto the internal network and it went up to ~700, then dropped back down to ~300, and then counted up to the 1.5GiB stated above). I then uploaded a large file, and the RX bytes number didn't increase at all. None of this weirdness happens on eth1.
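One possible reading of the huge error and collision counts, which nothing in this bug confirms, is that they are small negative values stored in unsigned 32-bit counters: both numbers sit just below 2^32 (4294967296). The arithmetic can be checked with a short sketch; the function name as_signed32 is illustrative.

```shell
# Reinterpret an unsigned 32-bit counter value as a signed 32-bit integer.
# Values of 2^31 or more wrap around to negative numbers.
as_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

as_signed32 4294935415   # TX errors / collisions from the output above: -31881
as_signed32 4294839772   # RX frame from the output above: -127524
```

If that reading is right, the counters were decremented past zero rather than incremented four billion times, which would point at statistics bookkeeping rather than at real collisions on the wire.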
Comment 59 Tom Christensen 2005-10-03 04:50:40 UTC
Did some more testing: the eth0 NIC is resetting itself pretty regularly. I set up a cron job to save the output of ifconfig every minute. Pretty much any time I push data over eth0, the NIC resets. eth1 doesn't reset; its stats continue counting up as they should. So there is a bug that is causing eth0 to reset, and sometimes this reset fails and locks the NIC. I have three symptoms:

1) NIC freezes, doing a network restart brings everything back
2) NIC freezes, network restart doesn't work, must restart entire box
3) Kernel panic, entire box freezes

As I stated before, this happens with both the tg3 and bcm5700 drivers, so it may be a motherboard layout issue or something with the PCI-E bus, I dunno. If anyone has an answer, at this point replacing the hardware may be preferable to sitting around waiting for a software fix. I read the tg3 source; it seems there are lots of workarounds and fixups for the various Broadcom NICs, but I notice there are none for the 5721 model, which is what I'm running. Could there be a workaround that is not being applied to this model?
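The per-minute ifconfig snapshots can be compared mechanically: if a packet counter is lower than it was in the previous snapshot, the interface was presumably reset in between. The sketch below assumes the older net-tools ifconfig output format quoted in comment 58, and it assumes, based only on the observation above, that a tg3 reset clears the counters. The function names tx_packets and counter_status are illustrative.

```shell
# Extract the "TX packets" counter from net-tools style ifconfig output on stdin.
tx_packets() {
    sed -n 's/.*TX packets:\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print "reset" if the counter went backwards between two snapshots, else "ok".
# $1 is the previous value, $2 the current one.
counter_status() {
    if [ "$2" -lt "$1" ]; then
        echo "reset"
    else
        echo "ok"
    fi
}

# Example:
#   prev=$(tx_packets < snapshot-0101.txt)
#   cur=$(tx_packets < snapshot-0102.txt)
#   counter_status "$prev" "$cur"
```

A counter that wraps past 2^32 would also look like a reset here, so a hit is a hint to check the kernel log for "Link is down" messages, not proof on its own.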
Comment 60 John W. Linville 2005-10-05 13:36:03 UTC
Have you tried the test kernels at the location in comment 11? That has a fairly up-to-date version of the driver that would be worth trying. The 5721 is fairly new. It may be that it _needs_ a workaround or two, but they may not yet have been identified/written... :-( Please let me know the results of running with the aforementioned test kernels...thanks!
Comment 61 Tom Christensen 2005-10-05 20:07:45 UTC
Yes, I ran the latest test kernel there about a week ago. Same exact problem still. I think it was version 67. I haven't tried 70 yet, but I see that it is still the same version of the tg3 driver (3.39) as the last one. I also tried 2.6.14 rc1 which has tg3 3.40. Still broken.
Comment 62 Thomas J. Baker 2005-10-18 13:17:33 UTC
Created attachment 120121 [details] working ifdebug of tg3 on a Dell 2550 i686 machine
Comment 63 Thomas J. Baker 2005-10-18 13:19:10 UTC
Created attachment 120122 [details] broken ifdebug of tg3 on Dell 2550 i686 SMP machine
Comment 64 Thomas J. Baker 2005-10-18 13:28:37 UTC
#62 is after rebooting (from hung tg3) into kernel-smp-2.6.9-22.EL, while #63 is before rebooting and is running kernel-smp-2.6.9-20.EL.jwltest.61.

#62 -> working ifdebug while running kernel-smp-2.6.9-22.EL
#63 -> broken ifdebug while running kernel-smp-2.6.9-20.EL.jwltest.61

Since the problem started, we've switched from a D-Link DGS-1008T gigabit switch to a Dell 2716 gigabit switch. That seemed to help some, but the problem persists.
Comment 65 John W. Linville 2005-11-02 14:54:11 UTC
The kernels currently at the location from comment 2 contain a very late version of the tg3 driver (based on 3.43). Please give those a try and post the results...thanks!
Comment 66 Thomas J. Baker 2005-11-03 15:30:30 UTC
Created attachment 120691 [details] ifdebug of tg3 while working running 2.6.9-22.8.EL.jwltest.80smp
Comment 67 Thomas J. Baker 2005-11-03 15:31:37 UTC
Created attachment 120692 [details] ifdebug of tg3 while broken running 2.6.9-22.8.EL.jwltest.80smp
Comment 68 Thomas J. Baker 2005-11-03 15:37:11 UTC
The kernel with the latest driver (2.6.9-22.8.EL.jwltest.80smp) doesn't seem to help. I've got three PE2550 SMP systems all with this problem, but this one doesn't display the "transmit timed out" error messages; it just stops working. I don't know if the latest errata kernel (kernel-smp-2.6.9-22.0.1.EL) and the test kernel have eliminated those messages or not. I can provide any other information you need. Anything to get this fixed!
Comment 69 Michael Chan 2005-11-03 19:04:14 UTC
(In reply to comment #68)
> I've got three PE2550 SMP systems all with this problem

It seems that you're using the 5700 B2 chip. Was this problem recently introduced? Does it happen before tg3 v3.32? Can you try running a UP (uni-processor) kernel and see if the problem goes away?
Comment 70 Thomas J. Baker 2005-11-03 19:11:46 UTC
The problem started when I went from RHEL3 to RHEL4. I don't know which revision of the driver was in RHEL3. It's difficult to try UP because these are production systems. I'll try. Details of the controller:

03:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)
    Subsystem: Dell Broadcom BCM5700
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
    Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 32 (16000ns min), Cache Line Size 08
    Interrupt: pin A routed to IRQ 193
    Region 0: Memory at feb00000 (64-bit, non-prefetchable) [size=64K]
    Capabilities:  PCI-X non-bridge device.
        Command: DPERE- ERO- RBC=0 OST=0
        Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
    Capabilities:  Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities:  Vital Product Data
    Capabilities:  Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
        Address: bede463202944c04 Data: fc9f
Comment 71 John W. Linville 2005-11-03 19:23:27 UTC
Thomas, which update of RHEL3 were you using when you switched? Which RHEL4 update were you using when it started? I should be able to pin-down the tg3 versions involved.
Comment 72 Thomas J. Baker 2005-11-03 19:48:29 UTC
I'm pretty sure it was RHEL3U5 clean installed to RHEL4U1, with an outside chance of it being RHEL3U6. I don't recall ever having a problem with any version of RHEL3 on these systems though. And I've actually got one system running the UP version of the latest errata kernel right now. Does it matter which UP kernel I run? Do you want me to try the test one?
Comment 73 Michael Chan 2005-11-03 19:54:22 UTC
(In reply to comment #72)
> Did it matter which UP kernel I ran?

It doesn't matter which UP kernel you use, as long as you use the same tg3 driver on the UP kernel that would otherwise fail on an SMP kernel. I just want to know if the failure is SMP-related or not.
Comment 74 Thomas J. Baker 2005-11-03 19:58:20 UTC
It's just failed with the UP kernel (2.6.9-22.0.1.EL to be exact.)
Comment 75 Michael Chan 2005-11-09 17:51:59 UTC
(In reply to comment #61)
> I also tried 2.6.14 rc1 which has tg3 3.40. Still broken.

Please modify tg3_tx_timeout() to call tg3_dump_state() before the reset. Or comment out the schedule_work(&tp->reset_task) line in tg3_tx_timeout() and use ethtool -d to dump the registers after transmit timeout. This way I can get a register dump in the failed state before the chip is reset.
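For the second option, the register dump has to be captured while the interface is still in the failed state. A minimal sketch of a helper that saves the output of "ethtool -d" to a timestamped file follows; the function name dump_regs and the file naming scheme are illustrative, and the sketch assumes the driver has already been patched or rebuilt so that it does not reset the chip on transmit timeout.

```shell
# Save "ethtool -d <dev>" output to a timestamped file and print the file name.
# $1 is the interface, $2 an optional output directory (defaults to ".").
dump_regs() {
    dev=$1
    dir=${2:-.}
    out="$dir/tg3-regs-$dev-$(date +%Y%m%d-%H%M%S).txt"
    ethtool -d "$dev" > "$out" 2>&1
    echo "$out"
}

# Example, run as root right after "transmit timed out" shows up in the log:
#   dump_regs eth0 /root
```

Writing to a local file matters here because the network is down at the moment the dump is useful.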
Comment 76 John W. Linville 2005-11-09 22:06:50 UTC
Created attachment 120861 [details] jwltest-tg3-debug.patch
Comment 77 John W. Linville 2005-11-09 22:10:22 UTC
Test kernels w/ the above patch are available at the same location as in comment 2. Please give those a try, and post the information Michael requested in comment 75...thanks!
Comment 79 Michael Chan 2005-11-10 19:46:27 UTC
(In reply to comment #72)
> I'm pretty sure it was RHEL3U5 clean installed to RHEL4U1

John sent me the tg3 drivers in these 2 RH kernels, and they were essentially the same version 3.22 ported to run on the 2 different kernels. I don't see anything in these 2 drivers that could explain one driver working and the other failing. So please use John's patch in comment 76 to capture the registers before the chip is reset during tx timeout.
Comment 80 Thomas J. Baker 2005-11-16 14:36:43 UTC
Which kernel should I patch?
Comment 81 John W. Linville 2005-11-16 14:49:57 UTC
The kernels here are already patched appropriately: http://people.redhat.com/linville/kernels/rhel4/
Comment 82 Thomas J. Baker 2005-11-16 14:59:59 UTC
OK, I'll try to get those kernels booted soon.
Comment 84 Thomas J. Baker 2005-11-21 15:52:51 UTC
Never mind that ethtool dump. It's not valid.
Comment 85 Thomas J. Baker 2005-11-22 19:45:47 UTC
Created attachment 121368 [details] ethtool dumb of problematic interface I'm not seeing the tg3_stop_block messages, or any error messages for that matter, but the interface seems to be locking up. Running the 2.6.9-22.17.EL.jwltest.88smp kernel.
Comment 86 Michael Chan 2005-11-23 18:48:56 UTC
(In reply to comment #85)
> ethtool dumb of problematic interface

The register dump doesn't show anything unusual. No error status in any register, and interrupts were enabled. I'll ask our QA department to try to reproduce this if they have the same machine.
Comment 88 John W. Linville 2005-12-08 17:52:12 UTC
Regarding comment 9, I only meant to indicate that someone had told me that their problem was resolved by updating the firmware. I am not privy to any specific information in that regard.
Comment 91 John W. Linville 2006-01-03 20:58:20 UTC
From bug 123218 comment 47:
> I solved our tg3 locking problem by installing the HP firmware update found
> here:
>
> http://h18004.www1.hp.com/support/files/server/us/download/23367.html
>
> It's for hpnicfwupg-1.2.2-1.i386.rpm which has been running without issue
> for a couple months now. YMMV.
>
> Steve
Comment 97 John W. Linville 2006-02-16 17:55:30 UTC
This bug has hung around for a long time...

The tg3_stop_block timed-out message is somewhat generic and not necessarily a problem. Please do not report it as a bug (or re-open this one) unless it results in an actual stop in the flow of traffic (i.e. an actual failure).

Nearly everyone who has reported this as a real issue (i.e. an actual failure) has seen the problem disappear after applying a tg3 firmware update from their vendor. Please do not report this as a bug (or re-open this one) unless you have already obtained and applied the latest tg3 firmware from your vendor.

Given the above two facts, and the fact that this bug has persisted so long that it no longer contains coherent information, I am closing this bug as "CANTFIX". If you really believe you are having a problem, then please open a new bugzilla. Thanks...