Bug 214717

Summary: NETDEV WATCHDOG: peth0: transmit timed out -> no network connectivity
Product: [Fedora] Fedora
Component: kernel-xen
Version: 6
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Andy Bailey <andy>
Assignee: Herbert Xu <herbert.xu>
QA Contact: Virtualization Bugs <virt-bugs>
CC: xen-maint
Doc Type: Bug Fix
Last Closed: 2008-02-26 23:40:48 UTC
Attachments:
Description                          Flags
The dmesg output                     none
The dmesg for the standard kernel    none
Dmesg                                none
Output of lspci -vvv                 none
Output of ifconfig                   none
Output of lsmod                      none
xend.log                             none
assorted files                       none

Description Andy Bailey 2006-11-08 23:15:05 UTC
Description of problem:
I have FC6 installed on a 64-bit HP Pavilion with Xen enabled:
Linux haz.hazlorealidad.com 2.6.18-1.2798.fc6xen #1 SMP Mon Oct 16 14:59:01 EDT
2006 x86_64 x86_64 x86_64 GNU/Linux

I was running pup, plus wireshark to get feedback that pup was doing something.

After about 30 minutes of network activity the network died, and a network
restart didn't cure it. A reboot did the trick.

I have a hunch that it has to do with Xen (I also couldn't find networking in the
components list, hence my assigning it to xend).

The log shows:

Nov  8 16:13:20 haz kernel: device eth0 entered promiscuous mode
Nov  8 16:15:09 haz gconfd (root-3733): starting (version 2.14.0), pid 3733 user
'root'
Nov  8 16:15:10 haz gconfd (root-3733): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only configuration
source at position 0
Nov  8 16:15:10 haz gconfd (root-3733): Resolved address
"xml:readwrite:/root/.gconf" to a writable configuration source at position 1
Nov  8 16:15:10 haz gconfd (root-3733): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source
at position 2
Nov  8 16:15:13 haz kernel: SoftMAC: ASSERTION FAILED (0) at:
net/ieee80211/softmac/ieee80211softmac_wx.c:306:ieee80211softmac_wx_get_rate()
Nov  8 16:15:14 haz last message repeated 2 times
Nov  8 16:57:34 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:57:37 haz kernel: peth0: link up, 100Mbps, full-duplex, lpa 0x41E1
Nov  8 16:57:37 haz kernel: peth0: Promiscuous mode enabled.
Nov  8 16:57:49 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:57:52 haz kernel: peth0: link up, 100Mbps, full-duplex, lpa 0x41E1
Nov  8 16:57:52 haz kernel: peth0: Promiscuous mode enabled.
Nov  8 16:58:04 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
---
repeat 1000 times

A ping to the router on the other end of the cable got no response.

ifconfig eth0 shows
eth0      Link encap:Ethernet  HWaddr 00:0F:B0:07:75:7D
          inet addr:192.168.0.3  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20f:b0ff:fe07:757d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14603 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19113869 (18.2 MiB)  TX bytes:873580 (853.1 KiB)

ifconfig peth0 shows
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:14604 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10940 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:19114067 (18.2 MiB)  TX bytes:865056 (844.7 KiB)
          Interrupt:17 Base address:0xc800

I ran "service network restart"; the interface came up, but that didn't solve
the problem.

How reproducible:
It has happened twice since installing into fresh partitions.

I'm not sure what other information would be useful.
Nothing is in the Xen logs at the time of the problem.

Comment 1 Stephen Tweedie 2006-11-09 15:28:43 UTC
Nov  8 16:15:13 haz kernel: SoftMAC: ASSERTION FAILED (0) at:
net/ieee80211/softmac/ieee80211softmac_wx.c:306:ieee80211softmac_wx_get_rate()
Nov  8 16:15:14 haz last message repeated 2 times
Nov  8 16:57:34 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:57:37 haz kernel: peth0: link up, 100Mbps, full-duplex, lpa 0x41E1
Nov  8 16:57:37 haz kernel: peth0: Promiscuous mode enabled.
Nov  8 16:57:49 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:57:52 haz kernel: peth0: link up, 100Mbps, full-duplex, lpa 0x41E1
Nov  8 16:57:52 haz kernel: peth0: Promiscuous mode enabled.
Nov  8 16:58:04 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out

This certainly looks like a kernel problem, and networking problems related to Xen
usually occur at boot time.

Can you reproduce this on non-xen kernels too?

Comment 2 Andy Bailey 2006-11-09 16:45:45 UTC
I had FC5 installed on the same machine and never had a network
problem like this. Today I installed the non-Xen kernel and have been
connected for over an hour and a half without a problem, with the same two
programs (pup and wireshark) running.
So the answer to your question is no.

However, I did a Google search for "peth0: transmit timed out" and found:

http://lists.xensource.com/archives/html/xen-users/2006-05/msg00753.html

http://lists.xensource.com/archives/cgi-bin/mesg.cgi?a=xen-users&i=44759305.30301%40freemail.hu

http://www.linuxquestions.org/questions/showthread.php?p=2137801

http://blog.gmane.org/gmane.comp.emulators.xen.user/page=1?set_skin=leftmenu

Not sure if they are related

Andy

Comment 3 Stephen Tweedie 2006-11-09 21:49:13 UTC
OK, but how often were you having the trouble under Xen?

The peth0 timeout isn't my main concern; I'd *expect* the driver to show further
problems after an initial assert failure.  It's the "SoftMAC: ASSERTION FAILED"
that we need to get to the bottom of first.


Comment 4 Andy Bailey 2006-11-10 01:58:36 UTC
I just installed FC6 yesterday and it happened several times, in fact every time
the system was up for a reasonable period of time.

Would it have anything to do with a Broadcom wireless card that was present but
not configured (it uses the bcwl43xxx driver if I remember rightly)? I thought
it wouldn't, so I didn't send the whole ifconfig.

It seems strange that the MAC assertion happened almost 45 minutes before the
problem presented itself.
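
For reference, the gap between the assertion and the first watchdog timeout can be read straight off the timestamps already quoted in this report. The following is only a minimal sketch: the helper names are made up for illustration, the log lines are trimmed copies of the ones above, and the year 2006 is assumed from the report date because syslog timestamps do not carry one.

```python
from datetime import datetime

# Trimmed copies of the log lines quoted earlier in this report.
LOG = """\
Nov  8 16:15:13 haz kernel: SoftMAC: ASSERTION FAILED (0) at:
Nov  8 16:57:34 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:57:49 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
Nov  8 16:58:04 haz kernel: NETDEV WATCHDOG: peth0: transmit timed out
"""


def stamp(line, year=2006):
    """Parse the 15-character syslog timestamp at the start of a line.

    The year is an assumption (syslog omits it); 2006 matches the report date.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def times_matching(lines, needle):
    """Return the timestamps of all lines containing the given text."""
    return [stamp(line) for line in lines if needle in line]


lines = LOG.splitlines()
assertion = times_matching(lines, "SoftMAC: ASSERTION FAILED")[0]
timeouts = times_matching(lines, "NETDEV WATCHDOG")

# Time from the SoftMAC assertion to the first transmit timeout.
gap = timeouts[0] - assertion
# Spacing between consecutive watchdog timeouts, in seconds.
intervals = [(b - a).total_seconds() for a, b in zip(timeouts, timeouts[1:])]

print("assertion -> first timeout:", gap)
print("seconds between timeouts:", intervals)
```

With the quoted lines this gives a gap of a little over 42 minutes and a steady 15 seconds between watchdog timeouts.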

What information do you need? 

Comment 5 Herbert Xu 2006-11-14 07:04:17 UTC
Please start by attaching the full dmesg output to show all kernel messages
since boot.  Thanks.

Comment 6 Andy Bailey 2006-11-15 19:27:03 UTC
Created attachment 141298 [details]
The dmesg output 

I upgraded to the latest xen kernel available via yum

uname -a
Linux haz.hazlorealidad.com 2.6.18-1.2849.fc6xen #1 SMP Fri Nov 10 12:57:36 EST
2006 x86_64 x86_64 x86_64 GNU/Linux

The same problem happened again, this time 20 minutes after boot
the network was working fine up to that point

uptime
 20:13:33 up 20 min,  1 user,  load average: 0.15, 0.18, 0.23

Comment 7 Andy Bailey 2006-11-15 19:31:06 UTC
Forgot to add that I used neat to down eth0 and then up it.


Comment 8 Herbert Xu 2006-11-16 07:00:30 UTC
Andy, could you email me a copy of the dmesg? I'm having problems downloading it
from Bugzilla.  Thanks.

Comment 9 Herbert Xu 2006-11-16 07:46:21 UTC
Never mind, I've got the dmesg now.

Comment 10 Herbert Xu 2006-11-16 07:53:07 UTC
Thanks for the dmesg.  Could you please get a dmesg from a bare-metal kernel as
well for comparison?

Comment 11 Andy Bailey 2006-11-16 20:13:56 UTC
Created attachment 141409 [details]
The dmesg for the standard kernel

uname -a for this kernel gives

Linux haz.hazlorealidad.com 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:34:46 EST
2006 x86_64 x86_64 x86_64 GNU/Linux

Comment 12 Herbert Xu 2006-11-17 06:07:51 UTC
When the problem occurs can you check /proc/interrupts to see whether you're
getting any more interrupts on the IRQ for your NIC? Also, are there any
messages in "xm dmesg" when it happens? Thanks.
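
One possible way to run the /proc/interrupts check is to save the file twice, a few seconds apart, and compare the counters. The sketch below is only an illustration: the function names are hypothetical, and the two snapshots are invented sample data, not taken from the attachments on this bug.

```python
def parse_interrupts(text):
    """Map IRQ label -> (total count across CPUs, description)."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = fields[:ncpu]
        if not per_cpu or not all(f.isdigit() for f in per_cpu):
            continue                      # skip rows without numeric counters
        counts[label.strip()] = (sum(map(int, per_cpu)), " ".join(fields[ncpu:]))
    return counts


def stalled_irqs(before, after, device):
    """IRQs whose description mentions the device and whose count did not move."""
    return [
        irq
        for irq, (count, name) in after.items()
        if device in name and irq in before and before[irq][0] == count
    ]


# Invented sample snapshots (illustrative values only).
BEFORE = """\
           CPU0
  0:      90210   Phys-irq  timer
 17:      48211   Phys-irq  peth0
"""
AFTER = """\
           CPU0
  0:      93544   Phys-irq  timer
 17:      48211   Phys-irq  peth0
"""

print(stalled_irqs(parse_interrupts(BEFORE), parse_interrupts(AFTER), "peth0"))
```

On a live system the two snapshots would come from reading /proc/interrupts while generating traffic (for example, a ping to the router). An unchanged counter for the NIC while the timer keeps advancing suggests the device has stopped raising interrupts.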

Comment 13 Micah Quinn 2006-12-07 00:28:22 UTC
Created attachment 143006 [details]
Dmesg

Comment 14 Micah Quinn 2006-12-07 00:31:48 UTC
I've attached my dmesg to this bug report because I believe I am experiencing
similar issues.

1.  After booting into the FC6 Xen kernel, the network functions for 60 seconds
to 30 minutes, then hangs.
2.  Pinging from outside the box fails; pinging from the box to the outside fails.
3.  Restarting the network does not fix the issue.
4.  rmmod'ing the forcedeth NIC module and modprobe'ing it does not fix the issue.
5.  Interrupts do appear to stop incrementing when the network hangs.
6.  Nothing but a reboot appears to bring the network back.

I'll attach some additional information that may be useful.

Comment 15 Micah Quinn 2006-12-07 00:32:53 UTC
Created attachment 143007 [details]
Output of lspci -vvv

Comment 16 Micah Quinn 2006-12-07 00:34:34 UTC
Created attachment 143008 [details]
Output of ifconfig

Comment 17 Micah Quinn 2006-12-07 00:35:09 UTC
Created attachment 143009 [details]
Output of lsmod

Comment 18 Micah Quinn 2006-12-07 00:35:42 UTC
Created attachment 143010 [details]
xend.log

Comment 19 Micah Quinn 2006-12-07 13:12:14 UTC
I turned off TX checksum offload ("tx crc's") on ETH0, not PETH0.  PETH0 said the
operation was not supported on that device.  I'll follow up here to let you know
if it still locks up after some period of time.

Below is my xm dmesg after the hang.  I don't see anything obvious:

 __  __            _____  ___   _____             ____    __       __
 \ \/ /___ _ __   |___ / / _ \ |___ /    _ __ ___| ___|  / _| ___ / /_
  \  // _ \ '_ \    |_ \| | | |  |_ \ __| '__/ __|___ \ | |_ / __| '_ \
  /  \  __/ | | |  ___) | |_| | ___) |__| | | (__ ___) ||  _| (__| (_) |
 /_/\_\___|_| |_| |____(_)___(_)____/   |_|  \___|____(_)_|  \___|\___/

 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 3.0.3-rc5-1.2849.fc6 (brewbuilder.com) (gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)) Fri Nov 10 12:30:42 EST 2006
 Latest ChangeSet: unavailable

(XEN) Command line: /xen.gz-2.6.18-1.2849.fc6
(XEN) Physical RAM map:
(XEN)  0000000000000000 - 000000000009fc00 (usable)
(XEN)  000000000009fc00 - 00000000000a0000 (reserved)
(XEN)  00000000000e7000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 000000003ffc0000 (usable)
(XEN)  000000003ffc0000 - 000000003ffd0000 (ACPI data)
(XEN)  000000003ffd0000 - 0000000040000000 (ACPI NVS)
(XEN)  00000000fec00000 - 00000000fec01000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000ff7c0000 - 0000000100000000 (reserved)
(XEN) System RAM: 1023MB (1047932kB)
(XEN) Xen heap: 14MB (14420kB)
(XEN) found SMP MP-table at 000ff780
(XEN) DMI 2.3 present.
(XEN) Using APIC driver default
(XEN) ACPI: RSDP (v000 ACPIAM                                ) @ 0x00000000000fa6d0
(XEN) ACPI: RSDT (v001 A M I  OEMRSDT  0x08000431 MSFT 0x00000097) @ 0x000000003ffc0000
(XEN) ACPI: FADT (v002 A M I  OEMFACP  0x08000431 MSFT 0x00000097) @ 0x000000003ffc0200
(XEN) ACPI: MADT (v001 A M I  OEMAPIC  0x08000431 MSFT 0x00000097) @ 0x000000003ffc0390
(XEN) ACPI: OEMB (v001 A M I  OEMBIOS  0x08000431 MSFT 0x00000097) @ 0x000000003ffd0040
(XEN) ACPI: DSDT (v001  N8XLA N8XLA308 0x00000308 INTL 0x02002026) @ 0x0000000000000000
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
(XEN) Processor #0 15:4 APIC version 16
(XEN) ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-23
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) ACPI: IRQ14 used by override.
(XEN) ACPI: IRQ15 used by override.
(XEN) Enabling APIC mode:  Flat.  Using 1 I/O APICs
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Initializing CPU#0
(XEN) Detected 2194.523 MHz processor.
(XEN) CPU0: AMD Flush Filter disabled
(XEN) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
(XEN) CPU: L2 Cache: 1024K (64 bytes/line)
(XEN) Intel machine check architecture supported.
(XEN) Intel machine check reporting enabled on CPU#0.
(XEN) CPU0: AMD Athlon(tm) 64 Processor 3400+ stepping 08
(XEN) Total of 1 processors activated.
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) ..MP-BIOS bug: 8254 timer not connected to IO-APIC
(XEN) ...trying to set up timer (IRQ0) through the 8259A ...  failed.
(XEN) ...trying to set up timer as Virtual Wire IRQ... failed.
(XEN) ...trying to set up timer as ExtINT IRQ...spurious 8259A interrupt: IRQ7.
(XEN)  works.
(XEN) Platform timer is 1.193MHz PIT
(XEN) Brought up 1 CPUs
(XEN) Machine check exception polling timer started.
(XEN) *** LOADING DOMAIN 0 ***
(XEN) Domain 0 kernel supports features = { 0000001f }.
(XEN) Domain 0 kernel requires features = { 00000000 }.
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000003000000->0000000004000000 (234879 pages to be allocated)
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff80200000->ffffffff806b8264
(XEN)  Init. ramdisk: ffffffff806b9000->ffffffff80c7ca00
(XEN)  Phys-Mach map: ffffffff80c7d000->ffffffff80e4fbf8
(XEN)  Start info:    ffffffff80e50000->ffffffff80e5049c
(XEN)  Page tables:   ffffffff80e51000->ffffffff80e5c000
(XEN)  Boot stack:    ffffffff80e5c000->ffffffff80e5d000
(XEN)  TOTAL:         ffffffff80000000->ffffffff81000000
(XEN)  ENTRY ADDRESS: ffffffff80200000
(XEN) Dom0 has maximum 1 VCPUs
(XEN) Initrd len 0x5c3a00, start at 0xffffffff806b9000
(XEN) Scrubbing Free RAM: ...........done.
(XEN) Xen trace buffers: disabled
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen).
(XEN) (file=io_apic.c, line=2085)
(XEN) ioapic_guest_write: apic=0, pin=2, old_irq=-1, new_irq=0
(XEN) ioapic_guest_write: old_entry=00010000, new_entry=000009f0
(XEN) ioapic_guest_write: Installing bogus unmasked IO-APIC entry!

Comment 20 Micah Quinn 2006-12-07 13:42:14 UTC
Just an update: as I mentioned before, I booted the FC6 Xen kernel and, as soon as I
could, typed "/usr/sbin/ethtool -K eth0 tx off".  The network stayed alive for
27 minutes, during which time I was able to get roughly 763 pings to my local
router.
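
To confirm that a change like "ethtool -K eth0 tx off" actually took effect, the offload state printed by "ethtool -k eth0" can be checked. The sketch below only parses text that has already been captured; the function name is hypothetical and the sample output is illustrative, not copied from any attachment here. The exact wording of ethtool's output varies between versions, so treat the key names as an assumption.

```python
def parse_offloads(text):
    """Turn 'name: on/off' lines from 'ethtool -k' output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue                      # header line or blank line
        # Keep only the first word so suffixes such as "[fixed]" are dropped.
        settings[key.strip()] = value.split()[0]
    return settings


# Illustrative sample only; real output depends on the ethtool version and NIC.
SAMPLE = """\
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
"""

offloads = parse_offloads(SAMPLE)
print("tx checksumming is", offloads.get("tx-checksumming", "unknown"))
```

On a real machine the text could be captured with subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True).stdout and passed to the same function.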

Restarting the network seems to do nothing.  Interrupts on peth0 are static at
171755.  Neither dmesg nor xm dmesg show anything different from above.

Interestingly enough, when I "/sbin/rmmod forcedeth" and then "/sbin/modprobe
forcedeth", dmesg shows the module attaching to device "ETH1" instead of "ETH0".
Running dhclient on eth1 ("/sbin/dhclient eth1") does not yield an IP address;
it simply fails to get a DHCPOFFER.  xm dmesg shows nothing different at
this point.

What other information can I provide?

Comment 21 Herbert Xu 2006-12-08 02:22:53 UTC
Hi Micah, please use #218733 for your problem.

Comment 22 Andy Bailey 2006-12-10 18:37:40 UTC
Created attachment 143250 [details]
assorted files

Shortly before the network locked up completely, I had a problem with an NFS-mounted
filesystem. I was using mplayer to play some music; at first it worked
fine, then it would only play 2 seconds or so and stop. I tried it on several
MP3s with the same result each time.

The contents of the zip are
 Length     Date   Time    Name
 --------    ----   ----    ----
     2166  12-10-06 12:12   ifconfig
     5141  12-10-06 12:10   xm-dmesg
       74  12-10-06 12:27   ethtool-i_peth0
      847  12-10-06 12:10   int1
      847  12-10-06 12:10   int2
    37043  12-10-06 12:13   messages
       63  12-10-06 12:11   uptime
    26796  12-10-06 12:14   dmesg
       60  12-10-06 12:30   ethtool-K_peth0_tx_off
      320  12-10-06 12:28   ethtool-k_peth0
      847  12-10-06 12:13   int3
    14431  12-10-06 12:17   lspci-vvv
     3313  12-10-06 12:17   lsmod
      116  12-10-06 12:11   uname-a
 --------		    -------
    92064		    14 files
The only files whose names are not self-explanatory are int1, int2 and int3; these are
snapshots of /proc/interrupts taken at different times.

I rebooted and tried
ethtool -K eth0 tx off
and have been up for 60 minutes with no network problem yet (a record). However,
yum update is stuck (I don't think it's related, but who knows).

It does seem to be reproducible with the Xen kernel: it has happened every time
so far.
I have never had a network problem with the non-Xen kernel.

Let me know what more information is needed

Andy Bailey

Comment 23 Herbert Xu 2006-12-12 03:04:48 UTC
Hi Andy, could you please try disabling xend while running the Xen kernel to see
if the problem persists? If it does please attach the dmesg for it.  Thanks.

Comment 24 Red Hat Bugzilla 2007-07-25 01:35:19 UTC
change QA contact

Comment 25 Chris Lalancette 2008-02-26 23:40:48 UTC
This report targets FC6, which is now end-of-life.

Please re-test against Fedora 7 or later, and if the issue persists, open a new bug.

Thanks