Bug 243960 - 3c905C network crashes on F7
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 7
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Andy Gospodarek
QA Contact: Brian Brock
Reported: 2007-06-12 22:01 EDT by Lane
Modified: 2014-06-29 18:58 EDT (History)
Fixed In Version: F7
Doc Type: Bug Fix
Last Closed: 2008-01-09 09:44:21 EST
Description Lane 2007-06-12 22:01:25 EDT
Upon going from FC6 to F7, my 

01:07.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 74)

network card fails within a few hours of use with the message below.  rmmod'ing
the 3c59x kernel module and reloading it does not help the problem.  The only
way I know of to get the network functional again is to reboot.  

Jun 10 04:45:14 greylock kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 10 04:45:14 greylock kernel: eth0: transmit timed out, tx_status 00 status e681.
Jun 10 04:45:14 greylock kernel:   diagnostics: net 0cfa media 8880 dma 0000003a fifo 8000
Jun 10 04:45:14 greylock kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Jun 10 04:45:14 greylock kernel:   Flags; bus-master 1, dirty 452601(9) current 452601(9)
Jun 10 04:45:14 greylock kernel:   Transmit list 00000000 vs. ffff81007d2627a0.
Jun 10 04:45:14 greylock kernel:   0: @ffff81007d262200  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   1: @ffff81007d2622a0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   2: @ffff81007d262340  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   3: @ffff81007d2623e0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   4: @ffff81007d262480  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   5: @ffff81007d262520  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   6: @ffff81007d2625c0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   7: @ffff81007d262660  length 8000002a status 8001002a
Jun 10 04:45:14 greylock kernel:   8: @ffff81007d262700  length 8000002a status 8001002a
Jun 10 04:45:14 greylock kernel:   9: @ffff81007d2627a0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   10: @ffff81007d262840  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   11: @ffff81007d2628e0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   12: @ffff81007d262980  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   13: @ffff81007d262a20  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   14: @ffff81007d262ac0  length 8000002a status 0001002a
Jun 10 04:45:14 greylock kernel:   15: @ffff81007d262b60  length 8000002a status 0001002a
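
For anyone who wants to see how often the watchdog fires on each interface, here is a minimal Python sketch. It assumes the log text uses the same "NETDEV WATCHDOG: ethX: transmit timed out" wording as the messages quoted in this report; the function name and the way the log text is obtained are illustrative only and not part of any tool mentioned here.

```python
import re
from collections import Counter

# Matches the watchdog line format quoted in this report, for example:
# "Jun 10 04:45:14 greylock kernel: NETDEV WATCHDOG: eth0: transmit timed out"
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out")


def count_watchdog_timeouts(log_text):
    """Return a dict mapping interface name -> number of watchdog timeouts.

    Hypothetical helper: log_text is the content of a syslog file that the
    caller has already read (the file location is left to the caller).
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = (
        "Jun 10 04:45:14 greylock kernel: NETDEV WATCHDOG: eth0: transmit timed out\n"
        "Jun 10 04:45:14 greylock kernel: eth0: transmit timed out, tx_status 00 status e681.\n"
    )
    print(count_watchdog_timeouts(sample))
```

Only the "NETDEV WATCHDOG" line is counted, so the driver's own follow-up "transmit timed out, tx_status ..." line does not inflate the totals.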
Comment 1 Jean-Baptiste Vignaud 2007-06-19 04:50:09 EDT
Same problem here.
I have two 3Com cards:
Jun 18 19:54:33 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 18 19:54:33 loki kernel: eth0: transmit timed out, tx_status 00 status 8601.
Jun 18 19:54:33 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a
fifo 0000
Jun 18 19:54:33 loki kernel: eth0: Interrupt posted but not delivered -- IRQ
blocked by another device?
Jun 18 19:54:33 loki kernel:   Flags; bus-master 1, dirty 80643(3) current 80643(3)
Jun 18 19:54:33 loki kernel:   Transmit list 00000000 vs. ffff81007852d3e0.
Jun 18 19:54:33 loki kernel:   0: @ffff81007852d200  length 8000004a status 0c01004a
Jun 18 19:54:33 loki kernel:   1: @ffff81007852d2a0  length 8000002a status 8001002a
Jun 18 19:54:33 loki kernel:   2: @ffff81007852d340  length 8000002a status 8001002a
Jun 18 19:54:33 loki kernel:   3: @ffff81007852d3e0  length 8000003e status 0001003e
Jun 18 19:54:33 loki kernel:   4: @ffff81007852d480  length 8000003e status 0001003e
Jun 18 19:54:33 loki kernel:   5: @ffff81007852d520  length 8000003e status 0001003e
Jun 18 19:54:33 loki kernel:   6: @ffff81007852d5c0  length 80000055 status 0c010055
Jun 18 19:54:33 loki kernel:   7: @ffff81007852d660  length 80000055 status 0c010055
Jun 18 19:54:33 loki kernel:   8: @ffff81007852d700  length 8000002a status 0001002a
Jun 18 19:54:33 loki kernel:   9: @ffff81007852d7a0  length 80000055 status 0c010055
Jun 18 19:54:33 loki kernel:   10: @ffff81007852d840  length 80000055 status
0c010055
Jun 18 19:54:33 loki kernel:   11: @ffff81007852d8e0  length 8000002a status
0001002a
Jun 18 19:54:33 loki kernel:   12: @ffff81007852d980  length 80000055 status
0c010055
Jun 18 19:54:33 loki kernel:   13: @ffff81007852da20  length 80000055 status
0c010055
Jun 18 19:54:33 loki kernel:   14: @ffff81007852dac0  length 8000002a status
0001002a
Jun 18 19:54:33 loki kernel:   15: @ffff81007852db60  length 8000004a status
0c01004a


I also got a call trace two days ago:

Jun 17 21:16:22 loki kernel: Unable to handle kernel NULL pointer dereference at
00000000000000e0 RIP:
Jun 17 21:16:22 loki kernel:  [<ffffffff881660f8>] :3c59x:boomerang_rx+0x1c0/0x509
Jun 17 21:16:22 loki kernel: PGD 36110067 PUD 3c30e067 PMD 0
Jun 17 21:16:23 loki kernel: Oops: 0000 [1] SMP
Jun 17 21:16:23 loki kernel: last sysfs file: /class/net/eth0/address
Jun 17 21:16:23 loki kernel: CPU 0
Jun 17 21:16:23 loki kernel: Modules linked in: vfat fat usb_storage hfsplus
nf_nat_ftp iptable_nat nf_nat iptable_mangle nf_conntrack_ftp autofs4 it87
hwmon_vid i2c_isa eeprom sunrpc ipv6 nf_conntrack_netbios_ns nf_conntrack_ipv4
xt_state nf_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables
x_tables dm_multipath raid1 video sbs i2c_ec button dock battery ac lp loop
snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm parport_pc snd_timer parport
serio_raw snd pcspkr k8_edac soundcore edac_mc k8temp hwmon snd_page_alloc
i2c_nforce2 3c59x i2c_core mii shpchp forcedeth sr_mod cdrom sg dm_snapshot
dm_zero dm_mirror dm_mod pata_amd ata_generic sata_nv libata sd_mod scsi_mod
raid456 xor ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Jun 17 21:16:23 loki kernel: Pid: 4824, comm: ip Not tainted 2.6.21-1.3228.fc7 #1
Jun 17 21:16:23 loki kernel: RIP: 0010:[<ffffffff881660f8>] 
[<ffffffff881660f8>] :3c59x:boomerang_rx+0x1c0/0x509
Jun 17 21:16:23 loki kernel: RSP: 0018:ffff81003dd23c98  EFLAGS: 00010246
Jun 17 21:16:23 loki kernel: RAX: 0000000000000000 RBX: ffff810065e41d80 RCX:
0000000000000000
Jun 17 21:16:23 loki kernel: RDX: ffff81007f0eb000 RSI: 0000000000000000 RDI:
ffff81007f557870
Jun 17 21:16:23 loki kernel: RBP: ffff81007f0eb700 R08: 0000000000000000 R09:
ffff810065e41d80
Jun 17 21:16:23 loki kernel: R10: 00007fff5a892ea4 R11: 0000000000000246 R12:
000000000000000c
Jun 17 21:16:23 loki kernel: R13: 000000000000003c R14: 000000000000003c R15:
000000006000803c
Jun 17 21:16:23 loki kernel: FS:  00002aaaaaab5810(0000)
GS:ffffffff8059d000(0000) knlGS:0000000000000000
Jun 17 21:16:23 loki kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 17 21:16:23 loki kernel: CR2: 00000000000000e0 CR3: 000000003c30d000 CR4:
00000000000006e0
Jun 17 21:16:23 loki kernel: Process ip (pid: 4824, threadinfo ffff81003dd22000,
task ffff810016309100)
Jun 17 21:16:23 loki kernel: Stack:  ffff81007f0eb000 0000078000001000
ffffc20000028000 0000001f7fee5d40
Jun 17 21:16:23 loki kernel:  000000006ad23812 00000000fffffff4 ffff810058115680
0000000000008601
Jun 17 21:16:23 loki kernel:  ffff81007f0eb700 ffff81007f0eb000 0000000000000010
ffffffff88166e57
Jun 17 21:16:23 loki kernel: Call Trace:
Jun 17 21:16:23 loki kernel:  [<ffffffff88166e57>]
:3c59x:boomerang_interrupt+0x13b/0x3f1
Jun 17 21:16:23 loki kernel:  [<ffffffff88166d1c>]
:3c59x:boomerang_interrupt+0x0/0x3f1
Jun 17 21:16:23 loki kernel:  [<ffffffff802ae090>] request_irq+0xcc/0xfd
Jun 17 21:16:23 loki kernel:  [<ffffffff88169033>] :3c59x:vortex_open+0x44/0x20c
Jun 17 21:16:23 loki kernel:  [<ffffffff803f0da0>] dev_open+0x2f/0x6e
Jun 17 21:16:23 loki kernel:  [<ffffffff803ef650>] dev_change_flags+0x5a/0x11a
Jun 17 21:16:23 loki kernel:  [<ffffffff80422705>] devinet_ioctl+0x235/0x59c
Jun 17 21:16:23 loki kernel:  [<ffffffff803e7f35>] sock_ioctl+0x1c8/0x1e5
Jun 17 21:16:23 loki kernel:  [<ffffffff8023d539>] do_ioctl+0x21/0x6b
Jun 17 21:16:23 loki kernel:  [<ffffffff8022deac>] vfs_ioctl+0x24e/0x267
Jun 17 21:16:23 loki kernel:  [<ffffffff8024737b>] sys_ioctl+0x59/0x78
Jun 17 21:16:23 loki kernel:  [<ffffffff8025711e>] system_call+0x7e/0x83
Jun 17 21:16:23 loki kernel:
Jun 17 21:16:23 loki kernel:
Jun 17 21:16:23 loki kernel: Code: 48 8b 80 e0 00 00 00 48 89 44 24 08 83 bb 8c
00 00 00 00 4c
Jun 17 21:16:23 loki kernel: RIP  [<ffffffff881660f8>]
:3c59x:boomerang_rx+0x1c0/0x509
Jun 17 21:16:23 loki kernel:  RSP <ffff81003dd23c98>
Jun 17 21:16:23 loki kernel: CR2: 00000000000000e0
Jun 17 21:16:34 loki kernel: BUG: soft lockup detected on CPU#1!
Jun 17 21:16:34 loki kernel:
Jun 17 21:16:34 loki kernel: Call Trace:
Jun 17 21:16:34 loki kernel:  <IRQ>  [<ffffffff802ad44c>] softlockup_tick+0xd5/0xe7
Jun 17 21:16:34 loki kernel:  [<ffffffff8028ac56>] update_process_times+0x42/0x68
Jun 17 21:16:34 loki kernel:  [<ffffffff8026e710>]
smp_local_timer_interrupt+0x34/0x55
Jun 17 21:16:34 loki kernel:  [<ffffffff8026ee36>]
smp_apic_timer_interrupt+0x43/0x5b
Jun 17 21:16:34 loki kernel:  [<ffffffff80257d56>] apic_timer_interrupt+0x66/0x70
Jun 17 21:16:34 loki kernel:  <EOI>  [<ffffffff8025d90c>]
_spin_lock_irqsave+0x12/0x24
Jun 17 21:16:34 loki kernel:  [<ffffffff88168994>] :3c59x:vortex_get_stats+0x2a/0x52
Jun 17 21:16:34 loki kernel:  [<ffffffff803eedad>] dev_seq_show+0x35/0x115
Jun 17 21:16:34 loki kernel:  [<ffffffff8023c75e>] seq_read+0x1b8/0x28c
Jun 17 21:16:34 loki kernel:  [<ffffffff8020af0a>] vfs_read+0xcb/0x173
Jun 17 21:16:34 loki kernel:  [<ffffffff80210606>] sys_read+0x45/0x6e
Jun 17 21:16:34 loki kernel:  [<ffffffff8025711e>] system_call+0x7e/0x83
Comment 2 rh 2007-06-19 05:24:10 EDT
Same here on an x86_64 machine with a 3c905C, but on a fresh install.
Comment 3 Jean-Baptiste Vignaud 2007-06-26 05:23:36 EDT
Some more information: I have three NICs, two of which are 3Com cards.
It's a fresh, up-to-date installation, and the system is running in runlevel 3.

3com cards are eth0 and eth1
onboard nVidia card is eth2.

[root@loki ~]# lspci | grep -i 3com
01:06.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
01:07.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)


[root@loki ~]# uname -r
2.6.21-1.3228.fc7

[root@loki ~]# cat /proc/interrupts 
           CPU0       CPU1       
  0:   41138893          0   IO-APIC-edge      timer
  1:         30        451   IO-APIC-edge      i8042
  7:          0          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:        142          0   IO-APIC-edge      i8042
 14:          0          0   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
 16:         94      12279   IO-APIC-fasteoi   eth0
 17:         75     102569   IO-APIC-fasteoi   eth1
 20:       5668      40663   IO-APIC-fasteoi   libata
 21:       6719      49834   IO-APIC-fasteoi   libata, HDA Intel
 22:       1642    4132391   IO-APIC-fasteoi   ehci_hcd:usb2, eth2
 23:        878      56737   IO-APIC-fasteoi   ohci_hcd:usb1, libata
NMI:          0          0 
LOC:   41131878   41131854 
ERR:          0
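
Since the driver message asks whether the IRQ is "blocked by another device", it can help to list which IRQ lines are shared. The following is a rough Python sketch, assuming the /proc/interrupts layout shown above (a header row of CPU names, then one row per IRQ with per-CPU counts, a controller type, and a comma-separated device list). The function name is hypothetical.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [device, ...]} for IRQ lines used by 2+ devices.

    Assumes the /proc/interrupts layout quoted in this report; rows such as
    NMI, LOC and ERR (non-numeric labels) are skipped.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue
        # Per-CPU counts, then controller type, then the device list.
        fields = rest.split(None, n_cpus + 1)
        if len(fields) < n_cpus + 2:
            continue
        devices = [d.strip() for d in fields[n_cpus + 1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for irq, devices in sorted(shared_irqs(f.read()).items()):
            print(irq, devices)
```

Applied to the listing above, this would flag IRQ 21, 22 and 23 as shared; eth2 shares IRQ 22 with ehci_hcd:usb2, while eth0 and eth1 each sit alone on IRQ 16 and 17.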


The problem can occur on eth0 and eth1 but never occurs on eth2.
eth0:
Jun 17 21:13:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 17 21:13:10 loki kernel: eth0: transmit timed out, tx_status 00 status 8601.
Jun 17 21:13:10 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a
fifo 0000
Jun 17 21:13:10 loki kernel: eth0: Interrupt posted but not delivered -- IRQ
blocked by another device?
Jun 17 21:13:10 loki kernel:   Flags; bus-master 1, dirty 22522(10) current
22522(10)
Jun 17 21:13:10 loki kernel:   Transmit list 00000000 vs. ffff81007d601840.
Jun 17 21:13:10 loki kernel:   0: @ffff81007d601200  length 8000003e status 0001003e
Jun 17 21:13:10 loki kernel:   1: @ffff81007d6012a0  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   2: @ffff81007d601340  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   3: @ffff81007d6013e0  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   4: @ffff81007d601480  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   5: @ffff81007d601520  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   6: @ffff81007d6015c0  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   7: @ffff81007d601660  length 8000002a status 0001002a
Jun 17 21:13:10 loki kernel:   8: @ffff81007d601700  length 8000002a status 8001002a
Jun 17 21:13:10 loki kernel:   9: @ffff81007d6017a0  length 8000002a status 8001002a
Jun 17 21:13:10 loki kernel:   10: @ffff81007d601840  length 80000042 status
00010042
Jun 17 21:13:10 loki kernel:   11: @ffff81007d6018e0  length 8000003e status
0001003e
Jun 17 21:13:10 loki kernel:   12: @ffff81007d601980  length 8000003e status
0001003e
Jun 17 21:13:10 loki kernel:   13: @ffff81007d601a20  length 8000003e status
0001003e
Jun 17 21:13:10 loki kernel:   14: @ffff81007d601ac0  length 8000002a status
0001002a
Jun 17 21:13:10 loki kernel:   15: @ffff81007d601b60  length 8000002a status
0001002a


eth1:
Jun 25 23:44:32 loki kernel: NETDEV WATCHDOG: eth1: transmit timed out
Jun 25 23:44:32 loki kernel: eth1: transmit timed out, tx_status 00 status e601.
Jun 25 23:44:32 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a
fifo 0000
Jun 25 23:44:32 loki kernel: eth1: Interrupt posted but not delivered -- IRQ
blocked by another device?
Jun 25 23:44:32 loki kernel:   Flags; bus-master 1, dirty 438800(0) current
438800(0)
Jun 25 23:44:32 loki kernel:   Transmit list 00000000 vs. ffff81007f327200.
Jun 25 23:44:32 loki kernel:   0: @ffff81007f327200  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   1: @ffff81007f3272a0  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   2: @ffff81007f327340  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   3: @ffff81007f3273e0  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   4: @ffff81007f327480  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   5: @ffff81007f327520  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   6: @ffff81007f3275c0  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   7: @ffff81007f327660  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   8: @ffff81007f327700  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   9: @ffff81007f3277a0  length 00000042 status 0c0105ea
Jun 25 23:44:32 loki kernel:   10: @ffff81007f327840  length 00000042 status
0c0105ea
Jun 25 23:44:32 loki kernel:   11: @ffff81007f3278e0  length 00000042 status
0c0105ea
Jun 25 23:44:32 loki kernel:   12: @ffff81007f327980  length 00000042 status
0c0105ea
Jun 25 23:44:32 loki kernel:   13: @ffff81007f327a20  length 00000042 status
0c0105ea
Jun 25 23:44:32 loki kernel:   14: @ffff81007f327ac0  length 00000042 status
8c0105ea
Jun 25 23:44:32 loki kernel:   15: @ffff81007f327b60  length 00000042 status
8c0105ea


eth0 is adsl
eth1 is local network
eth2 is controlling a WAP access point.

Comment 4 Chuck Ebbert 2007-06-26 12:28:26 EDT
Same as bug #231687; bug has been in Linux kernel since 2.6.19 and nobody has a
clue what is happening.

Lots of hits on Google:
http://www.google.com/search?q=%22interrupt+posted+but+not+delivered%22&start=0
Comment 5 Chuck Ebbert 2007-06-26 14:21:49 EDT
Looking through some of the reports, the card seems to work in some slots but
not others. I had this problem with one of these adapters all the way back on
kernel 2.2; moving it to another slot did cure the problem.
Comment 6 Lane 2007-06-26 15:18:37 EDT
I have both FC6 and F7 installed on my machine (dual boot).  Both are fresh
installs and completely updated.  The F7 kernel (2.6.21-1.3228.fc7) gives me
this problem within a few hours of use, and heavy network traffic seems to
increase the occurrence rate.  I never experienced this problem on the same
machine in any of the FC6 kernels (which is at 2.6.20-1.2952.fc6 right now).  I
never experienced this problem in FC5 for that matter either.

This machine is x86_64 and does contain a second ethernet card (not 3com), which
does not have the problem.  I have not tried moving the cards around, but I will
try it the next opportunity that I have.
Comment 7 Jean-Baptiste Vignaud 2007-06-29 04:24:49 EDT
OK, on a fresh installation with the latest F7 updates, I removed the two 3Com
cards and replaced them with two old spare NICs I had:

01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)

and the problem is still here...
Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 29 09:34:51 loki last message repeated 14 times
Jun 29 09:35:18 loki last message repeated 8 times

Another note: before that, I tried acpi=off with the two 3Com cards and it turned
out very badly. While I was stressing the network by copying big files from a Samba
share, the disk controllers suddenly stopped working as well, causing a total
failure of the LVM'd RAID 5 array, with no possibility of restoring the data (this
is the reason for the fresh install).

Comment 8 Philip Craig 2007-07-12 10:19:29 EDT
I am having the same problems on my amd64 Debian-based machine. 3c905tx-b card
works fine on a 2.6.18 kernel. Random lockups after heavy network use for an
hour or so with kernel 2.6.21. I can supply additional info if anyone cares to
ask for some.
Comment 9 Andy Gospodarek 2007-07-20 16:14:08 EDT
Philip, I'd be curious to see more info with the 3c905 and an upstream kernel. 
Feel free to post it to this BZ.
Comment 10 Jouni Väliaho 2007-07-23 07:17:02 EDT
Same problem here with an Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01) on my ConRoe
945G-DVI motherboard. After updating FC6 to F7, the network works for only about 30
minutes with kernels 2.6.21-1.3194.fc7 and 2.6.22.1-27.fc7. When I boot the old
kernel 2.6.20-1.2962.fc6, the network works normally with exactly the same PC and
configuration.
Comment 11 Andy Gospodarek 2007-07-23 08:59:22 EDT
Someone recently discovered that changing the .config so that 

CONFIG_DEBUG_SHIRQ=y

is 

CONFIG_DEBUG_SHIRQ=n

dropped out some code that was problematic for a particular NIC.  I'd be curious
if this also helped out in any of your situations.  I plan to address the patch
that included this today, but if anyone can try out a build with
CONFIG_DEBUG_SHIRQ=n I would certainly appreciate it.
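
To confirm how a given kernel was built, the option can be looked up in that kernel's config file. Below is a small Python sketch. It assumes the usual .config syntax, where an enabled option appears as "CONFIG_X=y" and a disabled one as the comment "# CONFIG_X is not set" (a generated .config normally uses that comment form rather than a literal "=n"). The function name is hypothetical, and the /boot/config-<release> path in the demo is only the common distribution convention, so adjust it as needed.

```python
import os


def config_option_value(config_text, option):
    """Return the value of a kernel config option, or None if it is unset.

    Assumes standard .config syntax: "CONFIG_X=value" when set and
    "# CONFIG_X is not set" (or no mention at all) when disabled.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


if __name__ == "__main__":
    # Assumed location; many distributions install the config here.
    path = "/boot/config-" + os.uname().release
    if os.path.exists(path):
        with open(path) as f:
            print("CONFIG_DEBUG_SHIRQ =", config_option_value(f.read(), "CONFIG_DEBUG_SHIRQ"))
    else:
        print("no config file found at", path)
```

A result of "y" means the shared-IRQ debug code is compiled in; None means the running kernel was built without it.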

Comment 12 Chuck Ebbert 2007-07-23 13:41:30 EDT
Should we even be building release kernels with that option set?
Debug kernels should probably have it, though.

Comment 13 Jouni Väliaho 2007-07-24 03:56:11 EDT
I downloaded kernel-2.6.22.1-27.fc7.src.rpm, changed the .config settings according
to comment 11, recompiled, installed and booted it. The network stopped working after
ten minutes while I was downloading a video stream. The error message was
"kernel: NETDEV WATCHDOG: eth0: transmit timed out". I use the r8169 network driver.
Comment 14 Michael Weiser 2007-07-26 13:07:49 EDT
I'm seeing this problem on my Gentoo system. From my point of view it is a
regression in linux-2.6.21. Everything works fine and reliably with 2.6.20 and
before. When upgrading to 2.6.21 I get the error within a few hours. I'm running
amd64 on Athlon 64 X2 with SMP and 3C905TX-C. According to my googling this is
an ACPI-related issue.
Comment 15 Andy Gospodarek 2007-07-27 09:35:31 EDT
(In reply to comment #14)
> I'm seeing this problem on my Gentoo system. From my point of view it is a
> regression in linux-2.6.21. Everything works fine and reliably with 2.6.20 and
> before. When upgrading to 2.6.21 I get the error within a few hours. I'm running

I feel the same way.

> amd64 on Athlon 64 X2 with SMP and 3C905TX-C. According to my googling this is
> an ACPI-related issue.

Interesting.  I hadn't heard anything like that yet, but I'll check it out.
Comment 16 Jouni Väliaho 2007-07-30 02:08:49 EDT
I booted kernel 2.6.22.1-27.fc7 with parameters "acpi=off noapci" three days ago
and my r8169 is still working.

Thanks!
Comment 17 Michael Weiser 2007-07-30 07:59:33 EDT
Hi again,

I'll have to revise ACPI to APIC. There was a well known (Andrew Morton calls it
infamous) problem with APIC on SMP and IO-APIC on Uniprocessor in 2.4. The
Donald Becker network drivers spot it because they explicitly check for it.

It seems to have been fixed in 2.4-ac and then pushed into mainstream. The
accepted workaround at the time was to boot with noapic.

Maybe it's cropping up in 2.6.21+ again. But maybe it's something completely
unrelated this time.

More info can be found at:

http://www.scyld.com/pipermail/vortex/2001-June/001207.html
http://www.scyld.com/pipermail/vortex/2001-August/001276.html
http://www.scyld.com/pipermail/vortex/2000-November/000771.html
http://www.uwsg.iu.edu/hypermail/linux/net/0105.1/0003.html
http://lkml.org/lkml/2007/3/25/208
http://lkml.org/lkml/2007/4/6/221

Sorry, I can't be of more help.
Comment 18 Jean-Baptiste Vignaud 2007-08-06 09:56:08 EDT
I updated my kernel to kernel-2.6.22.1-41.fc7 2/3 days ago and the uptime is 2
days, 20:00 and still no sign of "NETDEV WATCHDOG: eth0: transmit timed out" !


Over the previous weeks I was following a thread on lkml concerning a possible
IRQ problem similar to what we observed here, and I also posted a kind of copy
of this bugzilla to the existing thread:

"2.6.20->2.6.21 - networking dies after random time"
http://lkml.org/lkml/2007/6/29/58


Marcin Ślusarz apparently identified the kernel patch that may have introduced
the problem in 2.6.21: http://lkml.org/lkml/2007/7/23/13

(patch was [PATCH] genirq: do not mask interrupts by default)

This thread is still going on lkml; the developers are trying to find a proper
fix, I guess.


Then I ran rpm -q --changelog kernel-2.6.22.1-41.fc7 and found in the first
lines that this very patch had been reverted by cebbert@redhat.com:

* Sat Jul 28 2007 Chuck Ebbert <cebbert@redhat.com>
- revert upstream "genirq: do not mask interrupts by default"

So the patch that triggers the problem has been reverted, but the real cause has
not yet been found by the Linux core developers...

I'll continue to test the network, but for now it seems to me that networking
works correctly with this new kernel.
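
For checking whether an installed kernel package carries that revert, the changelog text can be searched mechanically. This is a minimal Python sketch that assumes the changelog layout quoted above (entry headers starting with "* ", items starting with "- "). The function name is hypothetical, and the changelog text is expected to be captured separately, for example from the rpm -q --changelog command mentioned above.

```python
def find_changelog_items(changelog_text, keyword):
    """Return a list of (entry_header, item) pairs whose item mentions keyword.

    Assumes RPM-style changelog text: "* <date> <author>" header lines
    followed by "- <item>" lines. The match is case-insensitive.
    """
    matches = []
    header = None
    needle = keyword.lower()
    for line in changelog_text.splitlines():
        if line.startswith("* "):
            header = line[2:].strip()
        elif line.startswith("- ") and needle in line.lower():
            matches.append((header, line[2:].strip()))
    return matches


if __name__ == "__main__":
    sample = (
        "* Sat Jul 28 2007 Chuck Ebbert <cebbert@redhat.com>\n"
        '- revert upstream "genirq: do not mask interrupts by default"\n'
    )
    for header, item in find_changelog_items(sample, "genirq"):
        print(header)
        print("  " + item)
```

An empty result for "genirq" would suggest that the kernel package in question does not include the revert.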
Comment 19 Jean-Baptiste Vignaud 2007-08-06 10:04:30 EDT
I guess that the last developments are in this thread :

http://lkml.org/lkml/2007/8/6/11
latest : http://lkml.org/lkml/2007/8/6/36

Comment 20 Jean-Baptiste Vignaud 2007-08-06 16:46:53 EDT
As posted on lkml, after 4 hours of intensive stress testing, this new kernel
failed as well...
This box had been running for 3 days with average load on the interfaces; then I
launched intensive transfers and the kernel gave me this:

Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a
fifo 8000
Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ
blocked by another device?
Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current
26085000(8)
Aug  6 22:31:09 loki kernel:   Transmit list 00000000 vs. ffff81007c807700.

...

The revert of the genirq patch is progress, but there must be something else going on.
Comment 21 David Rees 2007-08-20 13:57:47 EDT
Me too. F7 on a Sempron 3000+, kernel-2.6.22.1-41.fc7. Interestingly, the same
NICs were running just fine for weeks on an old K6-2 450 on F7 i586 kernel. The
time it takes to trigger the error seems to vary between nearly instantly and 12
hours.

# lspci|grep -i 3c
02:05.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
02:07.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)

Have tried booting with acpi=off which didn't help. I've tried moving the cards
into different slots, but that didn't help, either, but I suspect that is
because no matter what slots I use, the cards still share an interrupt with some
other device.

For now I've been able to unload any modules for devices which share interrupts
with the NICs and hopefully the machine will remain usable.
Comment 22 David Rees 2007-08-20 18:59:36 EDT
(In reply to comment #21)
> For now I've been able to unload any modules for devices which share interrupts
> with the NICs and hopefully the machine will remain usable.

Well that didn't help (unload modules so that the cards no longer share
interrupts). I will try booting with noapic and/or nolapic next, if that fails I
will try to see if I can dig up replacement NICs. The machine is totally
unreliable and unusable as it is now.
Comment 23 David Rees 2007-08-22 14:27:25 EDT
Latest update. noapic didn't affect the issue at all. I also tried an i586 kernel
which seemed to make it less likely. The machine wouldn't boot with nolapic so
no luck there.

I also tried two other Netgear NICs which used the tulip driver, but these also
were not much more stable than the 3com NICs. The tulip driver identified them as:

Aug 21 00:56:21 summit kernel: eth1: Lite-On 82c168 PNIC rev 33 at MMIO
0xfbffbc00, 00:A0:CC:3C:1A:B3, IRQ 21.
Aug 21 00:56:21 summit kernel: eth2: Lite-On 82c168 PNIC rev 32 at MMIO
0xfbffb800, 00:A0:CC:5F:1E:74, IRQ 16.

But these NICs are also fairly old as well. I dug up some newer Realtek NICs
which use the 8139too driver and I haven't been able to lock up an interface
with these even with ping floods going across both NICs for 30 minutes (which
would have locked up the other NICs within 10 minutes and usually much faster):

Aug 22 00:29:57 summit kernel: eth1: RealTek RTL8139 at 0xf8826c00,
00:50:ba:53:cc:21, IRQ 21
Aug 22 00:29:57 summit kernel: eth2: RealTek RTL8139 at 0xf894e800,
00:0d:88:21:9a:f6, IRQ 16

lspci shows them as:
02:06.0 Ethernet controller: D-Link System Inc RTL8139 Ethernet (rev 10)
02:07.0 Ethernet controller: D-Link System Inc RTL8139 Ethernet (rev 10)

So for anyone experiencing this issue, it appears that it can be worked around
by replacing those old NICs with something newer.
Comment 24 Christopher Brown 2008-01-09 09:15:15 EST
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.
Comment 25 Jean-Baptiste Vignaud 2008-01-09 09:36:16 EST
After several patches and attempts on lkml (I guess it was in August 2007), some
fixes went into 2.6.23-rc3. I tested it successfully, and I still use it
24/7 and it works perfectly.

Jean-Baptiste
Comment 26 Andy Gospodarek 2008-01-09 09:44:21 EST
Great!  

I was quite sure this was fixed, but now that I have some external confirmation
I will close this bug.
