Description of problem: After 2 to 3 minutes of work network adapter (Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1)) loses network connection. Version-Release number of selected component (if applicable): $ lspci | awk '/[nN]et/{print $1}' | xargs -i lspci -knns {} 01:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1) Subsystem: Lenovo Device [17aa:3979] Kernel driver in use: atl1c $ uname -a Linux ses-laptop.localnet 3.8.4-202.fc18.i686.PAE #1 SMP Thu Mar 21 17:14:12 UTC 2013 i686 i686 i386 GNU/Linux How reproducible: Steps to Reproduce: 1. Turn on computer and plug-in network cable 2. Ping remote host Actual results: $ ping utro.ru .............. .............. 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=168 ttl=248 time=41.5 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=169 ttl=248 time=43.4 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=170 ttl=248 time=44.7 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=171 ttl=248 time=70.6 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=172 ttl=248 time=45.7 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=173 ttl=248 time=42.4 ms 64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=174 ttl=248 time=44.4 ms From ses-laptop.localnet (192.168.1.79) icmp_seq=200 Destination Host Unreachable From ses-laptop.localnet (192.168.1.79) icmp_seq=201 Destination Host Unreachable From ses-laptop.localnet (192.168.1.79) icmp_seq=202 Destination Host Unreachable Expected results: Continuous sequence of echo replies. Additional info: If I unplug/replug the network cable network connection is restored for some time
I can confirm atl1c driver bug in 3.7+ kernels, works fine in 3.6.x kernels.
Kernel 3.8.5-201.fc18.i686.PAE The same problem :(
Kernel 3.8.6-203.fc18.i686.PAE When I switch off wireless adapter (Atheros Communications Inc. AR9285 Wireless Network Adapter) and then turn on my laptop, Ethernet adapter works fine for a long time (more then 2 hours). Then I switch on wireless adapter (don't connect to AP). 2..3 minutes late the wired connection breaks.
Kernel 3.8.7-201.fc18.i686.PAE The same problem.
Additional information: $ lspci | awk '/[nN]et/{print $1}' | xargs -i lspci -knns {} 01:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1) Subsystem: Lenovo Device [17aa:3979] Kernel driver in use: atl1c 02:00.0 Network controller [0280]: Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) [168c:002b] (rev 01) Subsystem: Lenovo Device [17aa:30a1] Kernel driver in use: ath9k If you need some additional information, I'll try give it.
Kernel 3.8.8-202.fc18.i686.PAE The same problem. How can I help?
You can give more information in upstream bug http://bugzilla.kernel.org/show_bug.cgi?id=54021
Kernel 3.8.11-200.fc18.i686.PAE works fine
(In reply to comment #9) > Kernel 3.8.11-200.fc18.i686.PAE works fine Excuse me. NOT work.
Kernel 3.9.4-200.fc18.i686.PAE Wired connection cuts off 5 minutes later after switch on wireless adapter.
What do you mean by "cuts off"? Can you be more detailed? does NM report a disconnect or do you just see a failure in connections that were going over that interface?
Created attachment 757085 [details] Before wireless adapter switched on.
Created attachment 757086 [details] After wireless adapter is switched on
1) I switch off wireless adapter (by hardware switch) and turn on my laptop. Commands I type in console and screen output are in attached file "before.txt". 2) I switch on wireless adapter (configured to don't connect automatically) by hardware switch. Commands I type in console and screen output are in attached file "after.txt". If You need some more experiments I'll be glad to be useful. (Excuse me for poor English)
After I switch off wi-fi, to rectote connection I need to unplug and replug the network cable.
Thats strange. Can you attach a copy of /var/log/message from before and after you toggle the rfkill switch? I'm guessing, given that your routing tables hasn't changed, that the switch must be doing something to the wired phy state
Created attachment 757124 [details] /var/log/messages Wireless adapter switched on at 12:54:46
Network cable unplugged/replugged approximately at 13:05
hmm, doesn't show anything suspicious. Can you do the following: 1) enable the wireless interface so the problem reproduces 2) check /sys/class/net/<if>/carrier, where <if> is the name of your wired interface, to see if the carrier state changes after reproducing 3) Check on the peer switch port that your wired interface is connected to to see if any link state changes are noted.
Before: ------- [root@ses-laptop ~]# cat /sys/class/net/eth0/carrier 1 After switch on Wi-Fi: --------------------- [root@ses-laptop ~]# cat /sys/class/net/eth0/carrier 1 LED on switch blinking after as before. When link is broken, I see on gate continuous arp requests: arp who-has 192.168.1.254 tell 192.168.1.79 192.168.1.79 -- my IP 192.168.1.254 -- gate IP
very odd, what does your neighbor table (/proc/net/arp) look like before and after you enable the wireless interface?
Before: [root@ses-laptop ~]# cat /proc/net/arp IP address HW type Flags HW address Mask Device 192.168.1.250 0x1 0x2 00:26:18:b7:c1:e4 * eth0 192.168.1.254 0x1 0x2 00:0d:87:c4:22:3e * eth0 After: [root@ses-laptop ~]# cat /proc/net/arp IP address HW type Flags HW address Mask Device 192.168.1.250 0x1 0x2 00:26:18:b7:c1:e4 * eth0 192.168.1.254 0x1 0x2 00:0d:87:c4:22:3e * eth0
Ok, so I'm officially lost, There shouldn't be any reason we start seeing ARP requests go out for the gateway when the wireless interface is activated. It doesn't appear to be configured to get an ip address so it shouldn't arp, and we already have an arp entry for that ip anyway, so we know where to send it on wired device. At this point I think we should: 1) Enable dynamic debug on the atl1c nl80211 drivers and and set your console level to 7 to catch any debug messages we might be missing. Let me know if you're not familiar with the process to do so. 2) What model laptop is this? It seems that what ever is happening here, the rfkill switch is doing more than just enabling the wireless interface. Perhaps the vendor has some documentation that would clarify some of this behavior.
1) How I can do that? 2) Lenovo G570
Before I had Fedora-18 on this laptop. Kernels 3.6.x worked normally.
Not Fedora-18 but Fedora-16
If I don't use hardware switch (it's continuosly on) and disable wireless by "rmmod ath9k" wired connection isn't lost.
to enable dynamic debug do the following: 1) cd /sys/kernel/debug/dynamic_debug (you may need to mount debugfs on /sys/kernel/debug first) 2) echo "module atl1c +p" > ./control 3) echo "module ath9k +p" > ./control 4) echo "7 7 7 7" > /proc/sys/kernel/printk then reproduce the problem. Lenovo is pretty bad about hiding past releases of bios updates, and not expanding upon what the latest bios updates fix in aggregate. That said, there are some mentinos of an rfkill switch fix in a bios update floating around, you may want to check your bios level (but I wouldn't invest alot of time on it). Could you also send in the output of the following command: rfkill list Thanks!
Created attachment 757989 [details] Dynamic debug result [root@ses-laptop ~]# cd /sys/kernel/debug/dynamic_debug [root@ses-laptop dynamic_debug]# echo "module atl1c +p" > ./control [root@ses-laptop dynamic_debug]# echo "module ath9k +p" > ./control [root@ses-laptop dynamic_debug]# echo "7 7 7 7" > /proc/sys/kernel/printk [root@ses-laptop dynamic_debug]# rfkill list 0: ideapad_wlan: Wireless LAN Soft blocked: no Hard blocked: no 1: phy0: Wireless LAN Soft blocked: no Hard blocked: no
Ten minutes after this manipulanions wired connection self-restored. The file control doesn't changed. Some time after link interrupted again. Two minutes later -- new self-restore.
I leave wireless enabled for a long time and got interlaced periods link/unlink approximately each 2 to 3 minutes long.
Anything dumped out in /var/log/messages now that you have dynamic debug enabled?
Only^=: [root@ses-laptop dynamic_debug]# tail -F /var/log/messages Jun 7 10:49:46 ses-laptop chronyd[542]: Source 89.221.207.113 online Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> WiFi now enabled by radio killswitch Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): bringing up device. Jun 7 14:57:08 ses-laptop kernel: [18305.611539] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: starting -> ready Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42] Jun 7 14:57:08 ses-laptop NetworkManager[536]: <warn> Trying to remove a non-existant call id. Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: ready -> inactive Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs Jun 7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds) Jun 7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex> Jun 7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100)
And so on every 2-3 minutes
[root@ses-laptop dynamic_debug]# tail -F /var/log/messages Jun 7 10:49:46 ses-laptop chronyd[542]: Source 89.221.207.113 online Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> WiFi now enabled by radio killswitch Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): bringing up device. Jun 7 14:57:08 ses-laptop kernel: [18305.611539] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: starting -> ready Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42] Jun 7 14:57:08 ses-laptop NetworkManager[536]: <warn> Trying to remove a non-existant call id. Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: ready -> inactive Jun 7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs link off at 14:59 ================= Jun 7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds) Jun 7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex> Jun 7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100) now link on =========== link off at 15:03 ================= Jun 7 15:09:26 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds) Jun 7 15:09:26 ses-laptop kernel: [19043.146975] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex> Jun 7 15:09:26 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100) now link on ============
Well, at least at the end there, that at least shows some level of link flap on the wired NIC (I presume you're not plugging/unplugging the cable there). this message: Jun 7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex> Is generated in atl1c_check_link_status when the link speed/duplex changes (or when the link was down but comes up again. that suggests that some MAC layer renegotiation is going on. We may want to instrument the kernel, or write a stap script to track what the MAC is reporting tot he driver exactly here. Do you feel comfortable doing that? Note that I'm going on vacation here for a few days, so if you want me to write it, I will, but it will be next week before you get anything.
1) > I presume you're not plugging/unplugging the cable there No, I didn't. I'm not in a hurry. I can work with wireless switched off. But I want to help resolve this problem :)
Created attachment 760272 [details] debug patch heres the patch I promised you, it should show us whats going on when we get a link change notification. If you could build and run this, recreating your problem please, that would be great. Also, if you could dump out the ethtool stats for the wired interface after the problem occurs, I'd appreciate it. I'm starting to suspect that the atl1c driver is getting a carrier error and its taking it a few minutes to recover for some reason.
Excuse me for my stupidity, but can you give me detailed (step-by-step) instructions what namely I have to do? I have no experience in driver compilation :(
I'll just build you a kernel: http://koji.fedoraproject.org/koji/taskinfo?taskID=5499289 Just install the kernel from that url when it finishes building and try reproduce the problem. If you send me the /var/log/messages and ethtool stats from that exercise, and we should be good. Thanks!
3.9.5 kernel with debug patch... #ip link set dev cable up [ 329.489424] atl1c 0000:02:00.0: irq 45 for MSI/MSI-X [ 329.489967] IN ATL1C_CHECK_LINK_STATUS [ 329.489975] phy_data = 796d [ 329.489996] SPEED/DUPLEX RETREVAL ERR = 0 [ 329.490003] SPEED: 65535 : 100 [ 329.490008] DUPLEX: 2 : 2 [ 329.490026] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex> [ 329.490032] CARRIER IS DOWN [ 329.490413] IN ATL1C_CHECK_LINK_STATUS [ 329.490431] phy_data = 796d [ 329.490492] SPEED/DUPLEX RETREVAL ERR = 0 [ 329.490498] SPEED: 100 : 100 [ 329.490504] DUPLEX: 2 : 2 [ 329.490510] CARRIER IS UP # start pppoe over atl1c [ 598.644586] PPP generic driver version 2.4.2 [ 604.491895] NET: Registered protocol family 24 [ 640.815739] sit: IPv6 over IPv4 tunneling driver # connect wifi [ 708.720798] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement) [ 708.720816] brcmsmac bcma0:0: brcms_ops_config: change power-save mode: false (implement) [ 708.720979] IPv6: ADDRCONF(NETDEV_UP): wifi: link is not ready [ 711.100939] wifi: authenticate with 14:a9:e3:01:f2:a8 [ 711.101029] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3) [ 711.103228] wifi: authenticated [ 711.110683] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3) [ 711.113136] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1) [ 711.113653] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated [ 711.113668] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement) [ 711.113690] wifi: associated [ 711.113820] IPv6: ADDRCONF(NETDEV_CHANGE): wifi: link becomes ready [ 711.407990] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement) [ 714.494946] FS-Cache: Loaded [ 714.504377] FS-Cache: Netfs 'nfs' registered for caching [ 775.928873] wifi: deauthenticating from 14:a9:e3:01:f2:a8 by local choice (reason=3) [ 775.930377] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: disassociated [ 775.930408] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement) [ 775.930418] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement) [ 775.932814] cfg80211: Calling CRDA for country: US [ 996.752751] wifi: authenticate with 14:a9:e3:01:f2:a8 [ 996.753544] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3) [ 996.754780] wifi: authenticated [ 996.755819] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3) [ 996.758647] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1) [ 996.759121] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated [ 996.759136] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement) [ 996.759165] wifi: associated [ 997.257176] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement) [ 1114.977252] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 [ 1114.977265] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1) [ 1114.977776] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 # waiting ))) [ 8805.651818] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 [ 8805.651838] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1) [ 8805.652435] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 [ 8805.652455] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1) # atl1c freeze!!! pppd: No response to 3 echo-requests # Serial link appears to be disconnected. # Connect time 133.2 minutes. # mii-tool -r cable [ 8847.388818] atl1c 0000:02:00.0: atl1c: cable NIC Link is Down [ 8847.389305] IN ATL1C_CHECK_LINK_STATUS [ 8847.389314] phy_data = 7949 [ 8848.996401] IN ATL1C_CHECK_LINK_STATUS [ 8848.996418] phy_data = 796d [ 8848.996600] SPEED/DUPLEX RETREVAL ERR = 0 [ 8848.996606] SPEED: 65535 : 100 [ 8848.996611] DUPLEX: 2 : 2 [ 8848.996705] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex> [ 8848.996710] CARRIER IS DOWN [ 8871.649211] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 [ 8871.649232] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1) # atl1c working 2-3 min after "mii-tool -r" and "CARRIER IS DOWN" forewer. But # ip link 2: cable: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000 link/ether f4:6d:04:57:10:f3 brd ff:ff:ff:ff:ff:ff
Information for task build (f18-candidate, kernel-3.9.5-201.fc18.src.rpm) ID 5499289 Method build Parameters Source: cli-build/1371132405.122448.yryyYuFL/kernel-3.9.5-201.fc18.src.rpm Build Target: f18-candidate Options: scratch = True State failed Created Thu, 13 Jun 2013 14:12:38 UTC Started Thu, 13 Jun 2013 14:12:39 UTC Completed Thu, 13 Jun 2013 14:36:30 UTC Owner nhorman Channel default Host buildvm-24.phx2.fedoraproject.org Arch noarch Parent Descendants build buildArch (kernel-3.9.5-201.fc18.src.rpm, x86_64) buildArch (kernel-3.9.5-201.fc18.src.rpm, noarch) buildArch (kernel-3.9.5-201.fc18.src.rpm, i686) Waiting? yes Awaited? no Priority 20 Weight 0.20 Result BuildError: error building package (arch x86_64), mock exited with status 1; see build.log for more information Output
(In reply to Neil Horman from comment #39) > Created attachment 760272 [details] > debug patch WARN need 2 argument!! :) --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c @@ -258,6 +258,8 @@ static void atl1c_check_link_status(struct atl1c_adapter *adapter) atl1c_read_phy_reg(hw, MII_BMSR, &phy_data); spin_unlock_irqrestore(&adapter->mdio_lock, flags); + WARN("IN ATL1C_CHECK_LINK_STATUS\n");
Thank you for fixing that Mihail, I didnt' have time to test out the debug code myself yesterday. So looking at your debug output I'm starting to think that the hardware isn't behaving as well as it should here Lets annotate the debug log output: [ 8805.652435] brcmsmac bcma0:0: phyerr 0x1, rate 0x37 [ 8805.652455] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1) >> NH - This is the last moment your wired interface is working. Sometime between this moment and about 40 seconds later below, something goes wrong. The fact that we don't get any driver level notification about it is troubling. # atl1c freeze!!! pppd: No response to 3 echo-requests # Serial link appears to be disconnected. # Connect time 133.2 minutes. # mii-tool -r cable [ 8847.388818] atl1c 0000:02:00.0: atl1c: cable NIC Link is Down >> NH - Here we finally discover that the link is down. This message specifically is printed in the function atl1c_link_chg_event, which is only called from the drivers interrupt routine, in response to hardware notification that the phy state has changed. [ 8847.389305] IN ATL1C_CHECK_LINK_STATUS [ 8847.389314] phy_data = 7949 >> NH - Its a bit troubling here that one of the bits set in this register is marked as reserved, but theres not much we can do about that. Its also set above when the driver initial worked, and its software read here, not written, so theres we need to assume that its ok. the important bit here is the the toggle between the 9 and the d in the low nibble of the phy_data. It indicates link state change. This pass above indicates the link has gone down, while the one below indicates its comming up. [ 8848.996401] IN ATL1C_CHECK_LINK_STATUS [ 8848.996418] phy_data = 796d [ 8848.996600] SPEED/DUPLEX RETREVAL ERR = 0 [ 8848.996606] SPEED: 65535 : 100 >> NH - This indicates we're restarting the link negotiation, the 65535 is from the down event above resetting the link speed to 'unknown'. The trigger for what brought the link back isn't clear here, as we would need to get another interrupt to trigger this operation, but I bet its the hardware error interrupt. If you re-run this test one more time with driver logging enabled (you can turn that on with the command: ethtool -s <ifname> msglevel 0xffff ). If you look at /var/log/messages after the above occurs, I expect you'll see: "atl1c hardware error (status = <some number>" in there somewhere [ 8848.996611] DUPLEX: 2 : 2 [ 8848.996705] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex> [ 8848.996710] CARRIER IS DOWN >> NH - And we're back. So the bottom line (it would seem) from this log, is that the phy status got out of sync with what the driver beleived it to be, and the hardware took almost a minute to inform the driver of that situation. Its possible that interrupts may have been disabled erroneously for that long, but I don't see how that could occur in the driver currently. I think the best thing to do is do the ethtool message log test I describe above, and we'll see if it confirms my analysis here. If thats the case though, I think you may just need to replace your card, as I don't see how we can fix it.
I would agree with You that problem is in hardware. I have not another network card to test in my notebook. But I can test another driver. I installed Fedora 16 (kernel 3.6.11-4.fc16.i686) on external USB drive. Network card works fine without link losses with wireless enabled and with wireless disabled.
(In reply to Neil Horman from comment #45) > ethtool -s <ifname> msglevel 0xffff > ). If you look at /var/log/messages after the above occurs, I expect you'll > see: > "atl1c hardware error (status = <some number>" > in there somewhere > > [ 5924.158862] atl1c 0000:02:00.0: TX/RX overflow (status = 0x8)
hmm, If you are able to run with an older driver, that would suggest its not a hardware error. Mihail, can you send the full message log in please?
Created attachment 763065 [details] full log driver detect faulure but unable to recover (( Jun 19 09:52:20 [kernel] [24398.696410] atl1c 0000:02:00.0: TX/RX overflow (status = 0x8) Jun 19 09:52:20 [kernel] [24398.696509] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400) Jun 19 09:52:20 [kernel] [24398.722209] IN ATL1C_CHECK_LINK_STATUS Jun 19 09:52:20 [kernel] [24398.722216] phy_data = 796d Jun 19 09:52:20 [kernel] [24398.722236] SPEED/DUPLEX RETREVAL ERR = 0 Jun 19 09:52:20 [kernel] [24398.722242] SPEED: 65535 : 100 Jun 19 09:52:20 [kernel] [24398.722248] DUPLEX: 65535 : 2 Jun 19 09:52:20 [kernel] [24398.722265] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex> Jun 19 09:52:20 [kernel] [24398.722271] CARRIER IS DOWN Jun 19 09:52:20 [pppd] PPP session is 6853 Jun 19 09:52:20 [pppd] Connected to 00:90:1a:a3:db:3d via interface cable Jun 19 09:52:20 [pppd] Using interface ppp0 Jun 19 09:52:20 [pppd] Connect: ppp0 <--> cable
you did get the hardware error code! Why didn't you indicate that before? It was right under all the TX/RX overflow messages: un 19 09:52:20 [kernel] [24398.696509] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400) That indicates that the card needs to be reset due to a DMA error. It looks like the NIC just took a few minutes to figure out that it hit a DMA error, as is evidenced by the fact that for every frame we sent we hit a tx/rx overflow error.
Created attachment 763133 [details] patch to reset adapter on TX/RX overrun Could you please test this patch out? Ithink we probably need to be resetting the adapter on TX/RX overruns faster.
Hi all, Sorry for my poor english... (In reply to Neil Horman from comment #51) > Created attachment 763133 [details] > patch to reset adapter on TX/RX overrun > > Could you please test this patch out? Ithink we probably need to be > resetting the adapter on TX/RX overruns faster. Thanks Neil for your patch. I tried it on a 3.9.6 kernel (debian unstable). It's a big improvement since I have no longer to manually unload/reload the atl1c kernel module to reset the NIC adapter. But the "TX/RX overflow" error is still happening, causing a network freeze for few seconds. (probably the NIC reset duration ?) Jun 20 01:00:25 bunny kernel: [119984.224089] atl1c 0000:05:00.0: TX/RX overflow (status = 0x8) Jun 20 01:00:25 bunny NetworkManager[3201]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds) Jun 20 01:00:25 bunny NetworkManager[3201]: <info> (eth0): carrier now ON (device state 100) Jun 20 01:00:25 bunny kernel: [119984.248673] atl1c 0000:05:00.0: irq 50 for MSI/MSI-X Jun 20 01:00:25 bunny kernel: [119984.248857] IN ATL1C_CHECK_LINK_STATUS Jun 20 01:00:25 bunny kernel: [119984.248866] phy_data = 796d Jun 20 01:00:25 bunny kernel: [119984.248892] SPEED/DUPLEX RETREVAL ERR = 0 Jun 20 01:00:25 bunny kernel: [119984.248898] SPEED: 65535 : 1000 Jun 20 01:00:25 bunny kernel: [119984.248903] DUPLEX: 65535 : 2 Jun 20 01:00:25 bunny kernel: [119984.249027] atl1c 0000:05:00.0: atl1c: eth0 NIC Link is Up<1000 Mbps Full Duplex> Jun 20 01:00:25 bunny kernel: [119984.249033] CARRIER IS DOWN I don't think the patch is solving the good problem... Why do we have this "TX/RX overflow" error ? I'm not even sure the problem comes from the atl1c module. I tried to compile the atl1c module from linux 3.6.11 on my 3.9.6 kernel and run it. I still had the same problem. Changes on the atl1c between the 3.6.11 and the 3.7.x are very small... diff atl1c-3.6.11/atl1c_main.c atl1c-3.7/atl1c_main.c 152c152 < pci_write_config_word(pdev, pci_pcie_cap(pdev) + PCI_EXP_DEVSTA, --- > pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, 705c705 < static const struct atl1c_platform_patch plats[] __devinitdata = { --- > static const struct atl1c_platform_patch plats[] __devinitconst = { 2688c2688 < static struct pci_error_handlers atl1c_err_handler = { --- > static const struct pci_error_handlers atl1c_err_handler = { If the problem really appears for the first time in the 3.7.x branch, is it possible that it is the result of a side effect of a kernel change or another module change ?
(In reply to Vinz from comment #52) > If the problem really appears for the first time in the 3.7.x branch, is it > possible that it is the result of a side effect of a kernel change or > another module change ? Change bisected in kernel bugzilla net: use bigger pages in __netdev_alloc_frag https://bugzilla.kernel.org/show_bug.cgi?id=54021#c14 (In reply to Neil Horman from comment #50) > you did get the hardware error code! Why didn't you indicate that before? > It was right under all the TX/RX overflow messages: Never waiting long time ofter freeze before.
Thank you for the bisect Mihail. That commit makes sense to me, in that it aligns with the error we receive. I think the the TX/RX error indicates that the the ring buffer that holds dma descriptors gets overrun by the additional fragments created by the netdev_alloc_frag changes. We certainly could do something like linearize the skb prior to transmission so that it only requires one descriptor, but the possibility still remains that this error can occur. I think the right thing to do is reset the adapter when it does. We can certainly take other action to try prevent its happening, but it seems once it does, its a fatal condition. My first thought would be to turn the carrier off on the device when the ring buffer filled. That would prevent the TX/RX error, but if the condition doesn't clear quickly, the same result will occur in that we will encounter a netdev watchdog timeout, that will result in a reset of the device. Any other thoughts?
I think that Vinz (Comment 52) is right: > it is the result of a side effect of a kernel change or another module change If I remove module ath9k (by command rmmod) wired connection works fine without disconnections. So not sole atl1c cause the error.
This is true, and we should probably dig into that, but the fact remains, the atl1c driver can produce this error, whatever the reason, and it appears fatal, in that once it occurs, the system doesn't recover, or do you think if it were not for whatever influence ath9k may have on this situation, if it did not occur, the TX/RX problem may just be transient?
(In reply to Eugene Saenko from comment #55) > I think that Vinz (Comment 52) is right: > > it is the result of a side effect of a kernel change or another module change > > If I remove module ath9k (by command rmmod) wired connection works fine > without disconnections. So not sole atl1c cause the error. If I unload ath9k, I still got the "RX/TX overflow" error, but less frequently. Neil, you are right. atl1c driver can produce this error and it's a good thing to recover in this case. So, the patch is definitely a noticeable improvement.
> or do you think if it were not for whatever influence ath9k may have on this situation, if it did not occur, the TX/RX problem may just be transient? No I don't. I want to say that we haven't to limit search area by sole atl1c, but look, what happens when two these modules work together
Neil Horman TX/RX problem did not occur, if wifi (brcmsmac) not using.
Something just occured to me. Can you all run the fedora debug kernel? Wtih that we should be able to track debug problems with the ath9k driver, and I expect an issue like this will be related to dma.
I have installed kernel-debug-modules-extra-3.9.6-200.fc18.i686, kernel-debug-devel-3.9.6-200.fc18.i686, kernel-debug-3.9.6-200.fc18.i686 What I have to do now?
recreate the problem, and send me your /var/log/messages file. If this is a DMA problem the dma-debug code (which is enabled in the -debug kernel variant) should spit some errors about what your wireless driver might be doing to affect your wired driver
Created attachment 765624 [details] messages with 3.9.6-200.fc18.i686.PAEdebug kernel From 17:55 up to 18:13 I did no manipulations with Wi-Fi switch
It looks like you edited the log heavily, I was hoping to see all those debug messages. Please don't exclude things like that
Created attachment 765683 [details] full file messages In previous attachment I only removed lines before current boot and added my comments to indicate moments of link break and restore. This one is untouched.
ok, I'm sorry, with the large time gaps between your edits, I had presumed you removed some messages. I've verified that the DMA-API debug code is enabled, but its not showing anything, so if it is a dma problem, its not one thats caught by the debug code. I'm not really sure where to take debugging here at this point. It certainly seems like the wireless driver is doing something incorrectly, but I have no visibility into what, nor am I familiar with the wireless code. I would suggest we do this: 1) I'll push a change for this driver to reset the phy if we get too many TX/RX overflows, so the immediately problem can be handled. 2) you open a new bz against the wireless driver, indicating that turning it on breaks the connection of the wired interface, and I'll have our wireless person (CC'ed) look into it further. Please post the bz number that you open here, and I'll make sure it gets into the right hands. sound good?
OK. I'll try do all you suggest.
Created attachment 765753 [details] [PATCH] atl1c: stop tx queue when a TX overflow occurs This bug was reported recently: https://bugzilla.redhat.com/show_bug.cgi?id=928220 The root cause is still being investigated, but an interesting behavior was observed during the process. Its possible for the atl1c driver to recieve a multitude of interrupts indicating a TX overflow. In isolation, a single instance of this event isn't problematic, but during this reproducer 1000's of these events were encountered in rapid succession. This led to the NIC being unable to transmit frames for a long period of time, without the OS being aware of the condition. A reset of teh adapter would bring it back, but no such reset was forthcomming, because the TX queue was still on. Fix this by calling netif_queue_stop when an overrun is encountered, and there are subsequent frames pending for tx in the hardware queue. Having additional frames pending ensures that subsequent interrupts will be generated, giving the irq handler a chance to wake the OS tx queue again if the overflow status clears on its own. If its persistent however, this give the net watchdog a chance to expire and reset the adapter through normal channels Signed-off-by: Neil Horman <nhorman> CC: "David S. Miller" <davem> CC: Jay Cliburn <jcliburn> CC: xiong <xiong.com> CC: Cloud Ren <cjren.com> --- drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 41 +++++++++++++++++++++++-- 1 file changed, 39 insertions(+), 2 deletions(-)
I'm sorry to ask, but after having looked at it some, I like the solution I have in comment 68 a bit better than my origional patch, it lets the net watchdog task handle the device reset rather than making the driver decide when to do it, which I think is cleaner. Can one of you test it out and make sure I didn't break anything in the process. Thanks!
(In reply to Neil Horman from comment #60) > I expect an issue like this will be related to dma. [ 594.349211] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement) [ 594.349237] brcmsmac bcma0:0: brcms_ops_config: change power-save mode: false (implement) [ 594.352114] IPv6: ADDRCONF(NETDEV_UP): wifi: link is not ready [ 594.623795] ------------[ cut here ]------------ [ 594.623819] WARNING: at lib/dma-debug.c:937 check_unmap+0x5a8/0x70d() [ 594.623824] Hardware name: 1215B [ 594.623832] bcma-pci-bridge 0000:01:00.0: DMA-API: device driver failed to check map error[device address=0x00000000a40bc800] [size=1876 bytes] [mapped as single] [ 594.623837] Modules linked in: [ 594.623842] sit tunnel4 pppoe pppox ppp_generic slhc ipv6 bnep 8021q garp stp llc snd_hda_codec_realtek uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core usb_storage acpi_cpufreq mperf brcmsmac cordic brcmutil snd_hda_codec_hdmi dm_mod radeon fbcon bitblit softcursor font tileblit cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit drm_kms_helper microcode ttm snd_hda_intel snd_hda_codec drm snd_hwdep xhci_hcd pcspkr snd_pcm snd_page_alloc snd_timer ohci_hcd snd ehci_pci ehci_hcd atl1c bcma soundcore [ 594.623963] Pid: 1622, comm: kwin Not tainted 3.9.5-gentoo #2 [ 594.623967] Call Trace: [ 594.623972] <IRQ> [<ffffffff8103752e>] warn_slowpath_common+0x7e/0x99 [ 594.623991] [<ffffffff810375dd>] warn_slowpath_fmt+0x41/0x43 [ 594.624014] [<ffffffff8121b79e>] check_unmap+0x5a8/0x70d [ 594.624023] [<ffffffff8121ba51>] debug_dma_unmap_page+0x5b/0x65 [ 594.624059] [<ffffffffa02c644a>] dma_unmap_single_attrs.constprop.21+0x55/0x5e [brcmsmac] [ 594.624085] [<ffffffffa02c64f2>] dma64_getnextrxp+0x9f/0xde [brcmsmac] [ 594.624112] [<ffffffffa02c7324>] dma_rx+0x7b/0x302 [brcmsmac] [ 594.624123] [<ffffffff810819f5>] ? __module_address+0x9a/0xc6 [ 594.624172] [<ffffffffa029f8a9>] brcms_c_dpc+0x632/0x88a [brcmsmac] [ 594.624182] [<ffffffff81075489>] ? get_lock_stats+0x19/0x4f [ 594.624209] [<ffffffffa0295017>] ? spin_lock_bh+0x9/0xb [brcmsmac] [ 594.624229] [<ffffffffa02965d5>] brcms_dpc+0x5d/0xb4 [brcmsmac] [ 594.624239] [<ffffffff8103d66a>] tasklet_action+0x82/0xd5 [ 594.624247] [<ffffffff8103dac3>] __do_softirq+0xc3/0x187 [ 594.624256] [<ffffffff8103dca2>] irq_exit+0x5c/0xa1 [ 594.624265] [<ffffffff81003414>] do_IRQ+0x88/0x9f [ 594.624275] [<ffffffff81471eef>] common_interrupt+0x6f/0x6f [ 594.624279] <EOI> [<ffffffff814727b7>] ? sysret_check+0x1b/0x56 [ 594.624291] ---[ end trace 4dbab6f1180a7ce8 ]--- [ 594.624295] Mapped at: [ 594.624299] [<ffffffff8121bd2d>] debug_dma_map_page+0x4d/0x10b [ 594.624306] [<ffffffffa02c66b9>] dma_map_single_attrs.constprop.22+0xc5/0xd8 [brcmsmac] [ 594.624327] [<ffffffffa02c776a>] dma_rxfill+0x1bf/0x266 [brcmsmac] [ 594.624346] [<ffffffffa02a03d0>] brcms_c_init+0x8cf/0xda7 [brcmsmac] [ 594.624365] [<ffffffffa02964f6>] brcms_init+0x53/0x57 [brcmsmac] [ 597.016662] wifi: authenticate with 14:a9:e3:01:f2:a8 [ 597.018120] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3) [ 597.023421] wifi: authenticated [ 597.044000] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3) [ 597.046626] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1) [ 597.047187] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated [ 597.047202] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement) [ 597.047998] wifi: associated
In reply to comment 70, that indicates the brcms driver is not checking the return code of a dma_map operation. While thats not a smoking gun, it could mean that we're using an invalid memory address for a dma operation, and that could result in the behavior we are seeing. Please open a separate bug for that and give me the number here, I can get that fixed. Any results from my patch in comment 68?
(In reply to Neil Horman from comment #71) > Any results from my patch in comment 68? Yes. Patch in comment 68 did not help. Flooding log by messages "atl1c 0000:02:00.0: TX/RX overflow (status = 0x8)", and break data transmission/recive after shitch wifi on.
Ok, I need to clean up the message flood, but after about 5 seconds the netdev watchdog timer kicked in and should have reset the device, returning it to normal operation. did it?
Device did not reset after overflow. But reset after Jun 28 04:25:28 [kernel] [ 9557.832510] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400) Jun 28 04:25:28 [kernel] [ 9557.857023] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>
what the output of "tc qdisc show" and "uname -a" on your system?
# tc qdisc show qdisc pfifo_fast 0: dev cable root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc mq 0: dev wifi root qdisc pfifo_fast 0: dev ppp0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 # uname -a Linux red 3.9.6-gentoo #1 SMP PREEMPT Fri Jun 28 00:19:08 MSK 2013 x86_64 AMD E-350 Processor AuthenticAMD GNU
gentoo? you sure you're running the patched kernel? All the interfaces have a qdisc on them, so this should be working, I'll get you an instrumented kernel to run this down...
(In reply to Neil Horman from comment #77) > gentoo? little patched vanilla. I can use pure vanilla, if you preffer. >you sure you're running the patched kernel? Yes, I am patch and build kernel manually. https://bugzilla.kernel.org/show_bug.cgi?id=60261 for tracking warning "brcmsmac device driver failed to check map error"
Created attachment 767380 [details] updated patch to stop tx queue when a TX overflow occurs sorry, I think I found the problem, I forgot to remove the code that re-wakes the queue in the clean_tx_irq portion of the interrupt handler. This updated version should fix it. It also has some debug code if we run into subsequent problems. Please give it a test and confirm that it fixes the issue for you. Thanks
Created attachment 767637 [details] log for patch in attachment 767380 [details] New patch try to reset controller. But controller freeze again shortly after reset. It better than nothing, but ethernet still unusable if wifi is active.
That makes no sense at all. It looks like as soon as we reset the device now, we hit exactly the same problem again. So this fix is useless. And I have no visibility as to what the ath9k driver might be doing to cause this problem. I'm about ready to just start bisecting this issue. Can any of you tell me if this problem occured in Fedora 17, or point to an earlier kernel where the problem did not occur?
I had no problems in Fedora-16 and earlier releases. But I didn't try Fedora-17 (omit it).
Here is a workaround. This the one I have tested in Fedora 18 x86-64 with kernel 3.9.6-200 Get some little old driver. wget https://www.kernel.org/pub/linux/kernel/projects/backports/2012/12/19/compat-drivers-2012-12-19-u.tar.gz Select driver as atl1c. ./scripts/driver-select atl1c make sudo make install And reboot. I add more information here. http://kasunc.blogspot.com/2013/07/workaround-for-atheros-communications.html
*** Bug 980806 has been marked as a duplicate of this bug. ***
I'm going to track this upstream for now, as several others are reporting this problem there.
looks like the upstream fix for this was pulled in as part of the 3.10.7 rebase