Bug 693442

Summary: NETDEV WATCHDOG: em1 (sky2): transmit queue 0 timed out
Product: [Fedora] Fedora Reporter: Steven Haigh <netwiz>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 15CC: agospoda, awilliam, collura, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, sumancse, tflink
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: RejectedBlocker, RejectedNTH
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-27 12:49:02 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
Output of lspci -v on affected system. none

Description Steven Haigh 2011-04-04 13:22:19 EDT
Description of problem:
When booting, even though the sky2 ethernet adapter comes up, manages to get a DHCP address, ipv6 autoconfig'ed, it seems no traffic works properly over the ethernet adapter until the watchdog fires. After this, all traffic flows normally.

Version-Release number of selected component (if applicable):
kernel-2.6.38.2-9.fc15.x86_64

How reproducible:
Every boot.

Crash log in dmesg:
[  120.966968] ------------[ cut here ]------------
[  120.966978] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x108/0x17c()
[  120.966981] Hardware name: System Product Name
[  120.966984] NETDEV WATCHDOG: em1 (sky2): transmit queue 0 timed out
[  120.966986] Modules linked in: fuse sunrpc rfcomm cpufreq_ondemand acpi_cpufreq freq_table mperf sco bnep l2cap ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables pl2303 usbserial snd_usb_audio gspca_zc3xx gspca_main snd_usbmidi_lib videodev snd_rawmidi v4l2_compat_ioctl32 joydev snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd btusb bluetooth microcode rfkill iTCO_wdt sky2 i7core_edac i2c_i801 edac_core xhci_hcd iTCO_vendor_support soundcore snd_page_alloc asus_atk0110 wmi uinput ipv6 raid0 firewire_ohci firewire_core crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[  120.967043] Pid: 0, comm: swapper Not tainted 2.6.38.2-9.fc15.x86_64 #1
[  120.967046] Call Trace:
[  120.967048]  <IRQ>  [<ffffffff81055066>] ? warn_slowpath_common+0x83/0x9b
[  120.967058]  [<ffffffff81055121>] ? warn_slowpath_fmt+0x46/0x48
[  120.967062]  [<ffffffff813da9e9>] ? netif_tx_lock+0x4a/0x7c
[  120.967066]  [<ffffffff813dab77>] ? dev_watchdog+0x108/0x17c
[  120.967071]  [<ffffffff81076974>] ? timekeeping_get_ns+0x18/0x3a
[  120.967077]  [<ffffffff81061274>] ? run_timer_softirq+0x1a4/0x266
[  120.967083]  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  120.967087]  [<ffffffff813daa6f>] ? dev_watchdog+0x0/0x17c
[  120.967092]  [<ffffffff8105ad98>] ? __do_softirq+0xd2/0x19d
[  120.967096]  [<ffffffff810ac961>] ? handle_IRQ_event+0x58/0x11f
[  120.967101]  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  120.967105]  [<ffffffff8100aadc>] ? call_softirq+0x1c/0x30
[  120.967109]  [<ffffffff8100c101>] ? do_softirq+0x46/0x81
[  120.967113]  [<ffffffff8105af1c>] ? irq_exit+0x49/0x8b
[  120.967119]  [<ffffffff8147b286>] ? do_IRQ+0x8e/0xa5
[  120.967124]  [<ffffffff81475193>] ? ret_from_intr+0x0/0x15
[  120.967126]  <EOI>  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  120.967135]  [<ffffffff812743fb>] ? intel_idle+0xdb/0x100
[  120.967139]  [<ffffffff812743da>] ? intel_idle+0xba/0x100
[  120.967145]  [<ffffffff81398a28>] ? cpuidle_idle_call+0xe7/0x166
[  120.967151]  [<ffffffff81008321>] ? cpu_idle+0xa5/0xdf
[  120.967156]  [<ffffffff8145424e>] ? rest_init+0x72/0x74
[  120.967161]  [<ffffffff81b58c2a>] ? start_kernel+0x3f2/0x3fe
[  120.967167]  [<ffffffff81b582c4>] ? x86_64_start_reservations+0xaf/0xb3
[  120.967171]  [<ffffffff81b58140>] ? early_idt_handler+0x0/0x71
[  120.967176]  [<ffffffff81b583ca>] ? x86_64_start_kernel+0x102/0x111
[  120.967179] ---[ end trace c7cb32492134a36e ]---
[  120.967183] sky2 0000:05:00.0: em1: tx timeout
[  120.967189] sky2 0000:05:00.0: em1: transmit ring 73 .. 32 report=73 done=73
[  124.125835] sky2 0000:05:00.0: em1: Link is up at 1000 Mbps, full duplex, flow control rx

Not too sure how to debug this further?
Comment 1 Steven Haigh 2011-04-04 13:24:12 EDT
Created attachment 489815 [details]
Output of lspci -v on affected system.
Comment 2 Steven Haigh 2011-04-04 21:54:07 EDT
Starting a ping after boot / login.

From 203.56.246.83 icmp_seq=66 Destination Host Unreachable
From 203.56.246.83 icmp_seq=70 Destination Host Unreachable
From 203.56.246.83 icmp_seq=72 Destination Host Unreachable
From 203.56.246.83 icmp_seq=73 Destination Host Unreachable
From 203.56.246.83 icmp_seq=74 Destination Host Unreachable
64 bytes from 203.23.237.200: icmp_req=75 ttl=60 time=2056 ms
64 bytes from 203.23.237.200: icmp_req=76 ttl=60 time=1058 ms
64 bytes from 203.23.237.200: icmp_req=77 ttl=60 time=59.8 ms
64 bytes from 203.23.237.200: icmp_req=78 ttl=60 time=51.9 ms
64 bytes from 203.23.237.200: icmp_req=79 ttl=60 time=51.9 ms
64 bytes from 203.23.237.200: icmp_req=80 ttl=60 time=51.9 ms

DHCP had completed and the interface did have the IP of 203.56.246.83 the entire time. As soon as the watchdog fired, pings started working again.
Comment 3 Steven Haigh 2011-04-10 19:26:42 EDT
Adding this to the F15Blocker - as I don't believe the final release should have no network connectivity until a kernel watchdog crash event fires. While I couldn't find any specific Alpha/Beta/Final clauses to cover this, I shall also refer to adamw to seek his thoughts.
Comment 4 Steven Haigh 2011-04-10 19:34:54 EDT
I seem to have found a more reliable way to reproduce this. After logging in, the network connection comes up as it should. I can get traffic to hosts OUTSIDE the local nets (via the default gateway).

After starting an ssh connection to a host on the local net, all traffic going through the adapter will halt until the crash noted above occurs.

For reference:
# ifconfig em1
em1       Link encap:Ethernet  HWaddr BC:AE:C5:12:73:7E  
          inet addr:203.56.246.83  Bcast:203.56.246.95  Mask:255.255.255.240
          inet6 addr: 2002:cb38:f71b:1:beae:c5ff:fe12:737e/64 Scope:Global
          inet6 addr: fe80::beae:c5ff:fe12:737e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:4000  Metric:1
          RX packets:2901 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3649 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:773830 (755.6 KiB)  TX bytes:430380 (420.2 KiB)
          Interrupt:18

The MTU is configured at 4000 via the DHCP option:
        option interface-mtu            4000;
Comment 5 Steven Haigh 2011-04-10 19:48:49 EDT
Related info from dmesg:
[    9.903332] sky2: driver version 1.28
[    9.903367] sky2 0000:05:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[    9.903379] sky2 0000:05:00.0: setting latency timer to 64
[    9.903415] sky2 0000:05:00.0: Yukon-2 EC Ultra chip revision 3
[    9.903510] sky2 0000:05:00.0: irq 76 for MSI/MSI-X
[    9.903807] sky2 0000:05:00.0: eth0: addr bc:ae:c5:12:73:7e
...
[   14.716939] sky2 0000:05:00.0: em1: enabling interface
[   14.717447] ADDRCONF(NETDEV_UP): em1: link is not ready
...
[   17.202068] sky2 0000:05:00.0: em1: Link is up at 1000 Mbps, full duplex, flow control rx
[   17.203858] ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
...
[  131.950152] ------------[ cut here ]------------
[  131.950161] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x108/0x17c()
[  131.950164] Hardware name: System Product Name
[  131.950167] NETDEV WATCHDOG: em1 (sky2): transmit queue 0 timed out
[  131.950169] Modules linked in: fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables gspca_zc3xx snd_usb_audio gspca_main snd_usbmidi_lib videodev joydev snd_rawmidi v4l2_compat_ioctl32 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device microcode btusb snd_pcm bluetooth iTCO_wdt iTCO_vendor_support snd_timer sky2 asus_atk0110 rfkill i2c_i801 snd i7core_edac xhci_hcd wmi edac_core soundcore snd_page_alloc uinput ipv6 raid0 firewire_ohci firewire_core crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[  131.950219] Pid: 0, comm: swapper Not tainted 2.6.38.2-9.fc15.x86_64 #1
[  131.950221] Call Trace:
[  131.950223]  <IRQ>  [<ffffffff81055066>] ? warn_slowpath_common+0x83/0x9b
[  131.950233]  [<ffffffff81055121>] ? warn_slowpath_fmt+0x46/0x48
[  131.950237]  [<ffffffff813da9e9>] ? netif_tx_lock+0x4a/0x7c
[  131.950241]  [<ffffffff813dab77>] ? dev_watchdog+0x108/0x17c
[  131.950246]  [<ffffffff810615d7>] ? mod_timer+0x8c/0x93
[  131.950251]  [<ffffffff81061274>] ? run_timer_softirq+0x1a4/0x266
[  131.950256]  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  131.950260]  [<ffffffff813daa6f>] ? dev_watchdog+0x0/0x17c
[  131.950266]  [<ffffffff8105ad98>] ? __do_softirq+0xd2/0x19d
[  131.950270]  [<ffffffff810ac961>] ? handle_IRQ_event+0x58/0x11f
[  131.950274]  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  131.950278]  [<ffffffff8100aadc>] ? call_softirq+0x1c/0x30
[  131.950282]  [<ffffffff8100c101>] ? do_softirq+0x46/0x81
[  131.950286]  [<ffffffff8105af1c>] ? irq_exit+0x49/0x8b
[  131.950292]  [<ffffffff8147b286>] ? do_IRQ+0x8e/0xa5
[  131.950297]  [<ffffffff81475193>] ? ret_from_intr+0x0/0x15
[  131.950299]  <EOI>  [<ffffffff8100fc75>] ? paravirt_read_tsc+0x9/0xd
[  131.950308]  [<ffffffff812743fb>] ? intel_idle+0xdb/0x100
[  131.950312]  [<ffffffff812743da>] ? intel_idle+0xba/0x100
[  131.950317]  [<ffffffff81398a28>] ? cpuidle_idle_call+0xe7/0x166
[  131.950323]  [<ffffffff81008321>] ? cpu_idle+0xa5/0xdf
[  131.950329]  [<ffffffff8145424e>] ? rest_init+0x72/0x74
[  131.950334]  [<ffffffff81b58c2a>] ? start_kernel+0x3f2/0x3fe
[  131.950339]  [<ffffffff81b582c4>] ? x86_64_start_reservations+0xaf/0xb3
[  131.950343]  [<ffffffff81b58140>] ? early_idt_handler+0x0/0x71
[  131.950348]  [<ffffffff81b583ca>] ? x86_64_start_kernel+0x102/0x111
[  131.950351] ---[ end trace 01ccfccde7df1532 ]---
[  131.950355] sky2 0000:05:00.0: em1: tx timeout
[  131.950361] sky2 0000:05:00.0: em1: transmit ring 78 .. 37 report=78 done=78
[  134.745823] sky2 0000:05:00.0: em1: Link is up at 1000 Mbps, full duplex, flow control rx
Comment 6 Adam Williamson 2011-04-15 15:39:17 EDT
This hits the implied 'networking must work' criterion, but is mitigated by the fact the network does start working after a bit, and is subject to the clause:

"There may be times where a requirement is unmet only in a particular configuration, such as with some keyboard layouts but not others, or if a particular character is used in a username, password or passphrase. In such cases, the release team should use their judgment and refer to precedent to determine whether or not the issue should be considered to block the release. They should consider the number of users likely to be affected by the issue, the severity of the case when the issue is encountered, and the ease or otherwise with which the issue can be avoided by both informed and uninformed users. "

which makes bugs that only happen in specific configurations a judgement call. So, Steven, can you provide more details on exactly what configuration you need to hit this bug? Thanks.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 7 Steven Haigh 2011-04-15 18:22:56 EDT
Hi Adam,

The desktop configuration is all default. The thing that triggers the watchdog is accessing an IPv6 host on the local LAN. The protocol doesn't seem to matter. If the destination host is not on the local LAN (ie out the default gateway), things work fine. I have currently reproduced this with ssh from the terminal, and Thunderbird connecting to a mail server on the local LAN.

The desktop is set to configure IPv6 automatically as per the defaults (which gets done via radvd from my Fedora-based router).

This is where things get interesting - as technically, networking works - however it hangs until the watchdog fires when you access a resource on the local network.
Comment 8 Adam Williamson 2011-04-18 12:39:20 EDT
when you say 'hangs until the watchdog fires' - how long is that? and do you mean that each separate operation on the individual network hits this problem, or just the first one?
Comment 9 Steven Haigh 2011-04-18 19:11:54 EDT
The watchdog seems to take between 60 and 90 seconds - I haven't physically timed it. The first access will cause the hang until watchdog fires, then the adapter will work as expected after the watchdog fires until the system is rebooted.

Subsequent accesses to hosts on the local network do not hang or fire the watchdog.
Comment 10 Tim Flink 2011-04-21 14:43:59 EDT
Discussed at the 2011-04-21 blocker bug review meeting. Since this doesn't prevent the network adapter from working, updates could fix this issue and it is only on the local network anyways. Therefore, it does not hit any release criteria and is not a release blocker.

This seems to only affect a small number of users and isn't a system-breaking issue, so it was also not deemed NTH.

A tested fix would be accepted before the final freeze, though.
Comment 11 Josh Boyer 2011-09-27 15:20:44 EDT
Does this still happen with the 2.6.40.x kernels now in F15?
Comment 12 Dave Jones 2011-12-14 19:57:19 EST
*** Bug 767451 has been marked as a duplicate of this bug. ***
Comment 13 Josh Boyer 2012-02-27 12:49:02 EST
No update in 5 months.  If you can recreate this on the 2.6.42.7 or newer kernel, please reopen.