928220 – Wired network diskonnects after 2 to 3 minutes

Bug 928220 - Wired network diskonnects after 2 to 3 minutes

Summary: Wired network diskonnects after 2 to 3 minutes

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	18
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Neil Horman
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	980806 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-03-27 07:56 UTC by Eugene Saenko
Modified:	2013-08-17 08:23 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-08-16 17:25:53 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Before wireless adapter switched on. (5.77 KB, text/plain) 2013-06-05 09:17 UTC, Eugene Saenko	no flags	Details
After wireless adapter is switched on (12.11 KB, text/plain) 2013-06-05 09:20 UTC, Eugene Saenko	no flags	Details
/var/log/messages (127.96 KB, text/plain) 2013-06-05 10:58 UTC, Eugene Saenko	no flags	Details
Dynamic debug result (218.31 KB, text/plain) 2013-06-07 06:07 UTC, Eugene Saenko	no flags	Details
debug patch (1.51 KB, patch) 2013-06-12 17:46 UTC, Neil Horman	no flags	Details \| Diff
full log (64.44 KB, application/octet-stream) 2013-06-19 17:12 UTC, Mihail	no flags	Details
patch to reset adapter on TX/RX overrun (751 bytes, patch) 2013-06-19 20:06 UTC, Neil Horman	no flags	Details \| Diff
messages with 3.9.6-200.fc18.i686.PAEdebug kernel (97.03 KB, text/plain) 2013-06-26 14:38 UTC, Eugene Saenko	no flags	Details
full file messages (725.08 KB, text/plain) 2013-06-26 16:08 UTC, Eugene Saenko	no flags	Details
[PATCH] atl1c: stop tx queue when a TX overflow occurs (3.96 KB, patch) 2013-06-26 19:34 UTC, Neil Horman	no flags	Details \| Diff
updated patch to stop tx queue when a TX overflow occurs (2.75 KB, patch) 2013-07-01 12:25 UTC, Neil Horman	no flags	Details \| Diff
log for patch in attachment 767380 (7.26 KB, text/plain) 2013-07-02 06:34 UTC, Mihail	no flags	Details
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	54021	0	None	None	None	Never
Red Hat Bugzilla	980806	0	unspecified	CLOSED	Wireless adapter obstructs wired connection	2021-02-22 00:41:40 UTC

Internal Links: 980806

Description Eugene Saenko 2013-03-27 07:56:28 UTC

Description of problem:
After 2 to 3 minutes of work network adapter (Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1)) loses network connection.


Version-Release number of selected component (if applicable):

$ lspci | awk '/[nN]et/{print $1}' | xargs -i lspci -knns {}
01:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1)
        Subsystem: Lenovo Device [17aa:3979]
        Kernel driver in use: atl1c

$ uname -a
Linux ses-laptop.localnet 3.8.4-202.fc18.i686.PAE #1 SMP Thu Mar 21 17:14:12 UTC 2013 i686 i686 i386 GNU/Linux



How reproducible:


Steps to Reproduce:
1. Turn on computer and plug-in network cable
2. Ping remote host

  
Actual results:
$ ping utro.ru
..............
..............
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=168 ttl=248 time=41.5 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=169 ttl=248 time=43.4 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=170 ttl=248 time=44.7 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=171 ttl=248 time=70.6 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=172 ttl=248 time=45.7 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=173 ttl=248 time=42.4 ms
64 bytes from www-martan.utro.ru (194.186.36.190): icmp_seq=174 ttl=248 time=44.4 ms
From ses-laptop.localnet (192.168.1.79) icmp_seq=200 Destination Host Unreachable
From ses-laptop.localnet (192.168.1.79) icmp_seq=201 Destination Host Unreachable
From ses-laptop.localnet (192.168.1.79) icmp_seq=202 Destination Host Unreachable

Expected results:
Continuous sequence of echo replies.

Additional info:
If I unplug/replug the network cable network connection is restored for some time

Comment 1 nucleo 2013-03-27 11:37:31 UTC

I can confirm atl1c driver bug in 3.7+ kernels, works fine in 3.6.x kernels.

Comment 2 Eugene Saenko 2013-04-04 14:05:24 UTC

Kernel 3.8.5-201.fc18.i686.PAE
The same problem :(

Comment 3 Eugene Saenko 2013-04-12 08:26:12 UTC

Kernel 3.8.6-203.fc18.i686.PAE
When I switch off wireless adapter (Atheros Communications Inc. AR9285 Wireless Network Adapter) and then turn on my laptop, Ethernet adapter works fine for a long time (more then 2 hours).

Then I switch on wireless adapter (don't connect to AP). 2..3 minutes late the wired connection breaks.

Comment 4 Eugene Saenko 2013-04-16 10:12:21 UTC

Kernel 3.8.7-201.fc18.i686.PAE 
The same problem.

Comment 5 Eugene Saenko 2013-04-16 10:51:51 UTC

Additional information:
$ lspci | awk '/[nN]et/{print $1}' | xargs -i lspci -knns {}
01:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8152 v2.0 Fast Ethernet [1969:2062] (rev c1)
        Subsystem: Lenovo Device [17aa:3979]
        Kernel driver in use: atl1c
02:00.0 Network controller [0280]: Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) [168c:002b] (rev 01)
        Subsystem: Lenovo Device [17aa:30a1]
        Kernel driver in use: ath9k

If you need some additional information, I'll try give it.

Comment 6 Eugene Saenko 2013-04-23 05:05:10 UTC

Kernel 3.8.8-202.fc18.i686.PAE
The same problem.
How can I help?

Comment 7 Eugene Saenko 2013-04-23 05:12:46 UTC

Kernel 3.8.8-202.fc18.i686.PAE
The same problem.
How can I help?

Comment 8 nucleo 2013-04-23 12:17:43 UTC

You can give more information in upstream bug http://bugzilla.kernel.org/show_bug.cgi?id=54021

Comment 9 Eugene Saenko 2013-05-08 10:59:23 UTC

Kernel 3.8.11-200.fc18.i686.PAE works fine

Comment 10 Eugene Saenko 2013-05-08 11:07:15 UTC

(In reply to comment #9)
> Kernel 3.8.11-200.fc18.i686.PAE works fine

Excuse me. NOT work.

Comment 11 Eugene Saenko 2013-05-30 06:37:27 UTC

Kernel 3.9.4-200.fc18.i686.PAE
Wired connection cuts off 5 minutes later after switch on wireless adapter.

Comment 12 Neil Horman 2013-06-03 20:37:15 UTC

What do you mean by "cuts off"?  Can you be more detailed?  does NM report a disconnect or do you just see a failure in connections that were going over that interface?

Comment 13 Eugene Saenko 2013-06-05 09:17:58 UTC

Created attachment 757085 [details]
Before wireless adapter switched on.

Comment 14 Eugene Saenko 2013-06-05 09:20:14 UTC

Created attachment 757086 [details]
After wireless adapter is switched on

Comment 15 Eugene Saenko 2013-06-05 09:32:46 UTC

1) I switch off wireless adapter (by hardware switch) and turn on my laptop.
Commands I type in console and screen output are in attached file "before.txt".

2) I switch on wireless adapter (configured to don't connect automatically) by hardware switch.
Commands I type in console and screen output are in attached file "after.txt".

If You need some more experiments I'll be glad to be useful. (Excuse me for poor English)

Comment 16 Eugene Saenko 2013-06-05 09:44:13 UTC

After I switch off wi-fi, to rectote connection I need to unplug and replug the network cable.

Comment 17 Neil Horman 2013-06-05 10:42:14 UTC

Thats strange.  Can you attach a copy of /var/log/message from before and after you toggle the rfkill switch?  I'm guessing, given that your routing tables hasn't changed, that the switch must be doing something to the wired phy state

Comment 18 Eugene Saenko 2013-06-05 10:58:59 UTC

Created attachment 757124 [details]
/var/log/messages

Wireless adapter switched on at 12:54:46

Comment 19 Eugene Saenko 2013-06-05 11:06:56 UTC

Network cable unplugged/replugged approximately at 13:05

Comment 20 Neil Horman 2013-06-05 13:24:38 UTC

hmm, doesn't show anything suspicious.  Can you do the following:

1) enable the wireless interface so the problem reproduces

2) check /sys/class/net/<if>/carrier, where <if> is the name of your wired interface, to see if the carrier state changes after reproducing

3) Check on the peer switch port that your wired interface is connected to to see if any link state changes are noted.

Comment 21 Eugene Saenko 2013-06-05 13:50:16 UTC

Before:
-------
[root@ses-laptop ~]# cat /sys/class/net/eth0/carrier 
1

After switch on Wi-Fi:
---------------------
[root@ses-laptop ~]# cat /sys/class/net/eth0/carrier 
1

LED on switch blinking after as before.

When link is broken, I see on gate continuous arp requests:
arp who-has 192.168.1.254 tell 192.168.1.79

192.168.1.79 -- my IP
192.168.1.254 -- gate IP

Comment 22 Neil Horman 2013-06-05 14:20:04 UTC

very odd, what does your neighbor table (/proc/net/arp) look like before and after you enable the wireless interface?

Comment 23 Eugene Saenko 2013-06-06 07:13:44 UTC

Before:
[root@ses-laptop ~]# cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.250    0x1         0x2         00:26:18:b7:c1:e4     *        eth0
192.168.1.254    0x1         0x2         00:0d:87:c4:22:3e     *        eth0

After:
[root@ses-laptop ~]# cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.250    0x1         0x2         00:26:18:b7:c1:e4     *        eth0
192.168.1.254    0x1         0x2         00:0d:87:c4:22:3e     *        eth0

Comment 24 Neil Horman 2013-06-06 10:58:19 UTC

Ok, so I'm officially lost, There shouldn't be any reason we start seeing ARP requests go out for the gateway when the wireless interface is activated.  It doesn't appear to be configured to get an ip address so it shouldn't arp, and we already have an arp entry for that ip anyway, so we know where to send it on wired device.  At this point I think we should:

1) Enable dynamic debug on the atl1c nl80211 drivers and and set your console level to 7 to catch any debug messages we might be missing.  Let me know if you're not familiar with the process to do so.

2) What model laptop is this?  It seems that what ever is happening here, the rfkill switch is doing more than just enabling the wireless interface.  Perhaps the vendor has some documentation that would clarify some of this behavior.

Comment 25 Eugene Saenko 2013-06-06 11:06:56 UTC

1) How I can do that?

2) Lenovo G570

Comment 26 Eugene Saenko 2013-06-06 12:15:19 UTC

Before I had Fedora-18 on this laptop. Kernels 3.6.x worked normally.

Comment 27 Eugene Saenko 2013-06-06 12:16:39 UTC

Not Fedora-18 but Fedora-16

Comment 28 Eugene Saenko 2013-06-06 12:32:55 UTC

If I don't use hardware switch (it's continuosly on) and disable wireless by "rmmod ath9k" wired connection isn't lost.

Comment 29 Neil Horman 2013-06-06 14:07:59 UTC

to enable dynamic debug do the following:

1) cd /sys/kernel/debug/dynamic_debug (you may need to mount debugfs on /sys/kernel/debug first)

2) echo "module atl1c +p" > ./control

3) echo "module ath9k +p" > ./control

4) echo "7 7 7 7" > /proc/sys/kernel/printk

then reproduce the problem.


Lenovo is pretty bad about hiding past releases of bios updates, and not expanding upon what the latest bios updates fix in aggregate.  That said, there are some mentinos of an rfkill switch fix in a bios update floating around, you may want to check your bios level (but I wouldn't invest alot of time on it).

Could you also send in the output of the following command:

rfkill list

Thanks!

Comment 30 Eugene Saenko 2013-06-07 06:07:47 UTC

Created attachment 757989 [details]
Dynamic debug result

[root@ses-laptop ~]# cd /sys/kernel/debug/dynamic_debug
[root@ses-laptop dynamic_debug]# echo "module atl1c +p" > ./control
[root@ses-laptop dynamic_debug]# echo "module ath9k +p" > ./control
[root@ses-laptop dynamic_debug]# echo "7 7 7 7" > /proc/sys/kernel/printk
[root@ses-laptop dynamic_debug]# rfkill list
0: ideapad_wlan: Wireless LAN
        Soft blocked: no
        Hard blocked: no
1: phy0: Wireless LAN
        Soft blocked: no
        Hard blocked: no

Comment 31 Eugene Saenko 2013-06-07 06:17:08 UTC

Ten minutes after this manipulanions wired connection self-restored. The file control doesn't changed. Some time after link interrupted again. Two minutes later -- new self-restore.

Comment 32 Eugene Saenko 2013-06-07 06:48:57 UTC

I leave wireless enabled for a long time and got interlaced periods link/unlink approximately each 2 to 3 minutes long.

Comment 33 Neil Horman 2013-06-07 10:23:19 UTC

Anything dumped out in /var/log/messages now that you have dynamic debug enabled?

Comment 34 Eugene Saenko 2013-06-07 11:02:01 UTC

Only^=:
[root@ses-laptop dynamic_debug]# tail -F /var/log/messages
Jun  7 10:49:46 ses-laptop chronyd[542]: Source 89.221.207.113 online
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> WiFi now enabled by radio killswitch
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): bringing up device.
Jun  7 14:57:08 ses-laptop kernel: [18305.611539] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: starting -> ready
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42]
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <warn> Trying to remove a non-existant call id.
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: ready -> inactive
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs
Jun  7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds)
Jun  7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
Jun  7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100)

Comment 35 Eugene Saenko 2013-06-07 11:04:05 UTC

And so on every 2-3 minutes

Comment 36 Eugene Saenko 2013-06-07 11:13:38 UTC

[root@ses-laptop dynamic_debug]# tail -F /var/log/messages
Jun  7 10:49:46 ses-laptop chronyd[542]: Source 89.221.207.113 online
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> WiFi now enabled by radio killswitch
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): bringing up device.
Jun  7 14:57:08 ses-laptop kernel: [18305.611539] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: starting -> ready
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42]
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <warn> Trying to remove a non-existant call id.
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0): supplicant interface state: ready -> inactive
Jun  7 14:57:08 ses-laptop NetworkManager[536]: <info> (wlan0) supports 4 scan SSIDs

link off at 14:59
=================

Jun  7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds)
Jun  7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
Jun  7 15:01:11 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100)

now link on
===========

link off at 15:03
=================

Jun  7 15:09:26 ses-laptop NetworkManager[536]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds)
Jun  7 15:09:26 ses-laptop kernel: [19043.146975] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
Jun  7 15:09:26 ses-laptop NetworkManager[536]: <info> (eth0): carrier now ON (device state 100)

now link on
============

Comment 37 Neil Horman 2013-06-07 11:18:40 UTC

Well, at least at the end there, that at least shows some level of link flap on the wired NIC (I presume you're not plugging/unplugging the cable there).  this message:

Jun  7 15:01:11 ses-laptop kernel: [18548.450191] atl1c 0000:01:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>

Is generated in atl1c_check_link_status when the link speed/duplex changes (or when the link was down but comes up again.  that suggests that some MAC layer renegotiation is going on.  We may want to instrument the kernel, or write a stap script to track what the MAC is reporting tot he driver exactly here.  Do you feel comfortable doing that?  Note that I'm going on vacation here for a few days, so if you want me to write it, I will, but it will be next week before you get anything.

Comment 38 Eugene Saenko 2013-06-07 11:32:14 UTC

1)
> I presume you're not plugging/unplugging the cable there
No, I didn't.



I'm not in a hurry. I can work with wireless switched off. But I want to help resolve this problem :)

Comment 39 Neil Horman 2013-06-12 17:46:11 UTC

Created attachment 760272 [details]
debug patch

heres the patch I promised you, it should show us whats going on when we get a link change notification.  If you could build and run this, recreating your problem please, that would be great.  Also, if you could dump out the ethtool stats for the wired interface after the problem occurs, I'd appreciate it.  I'm starting to suspect that the atl1c driver is getting a carrier error and its taking it a few minutes to recover for some reason.

Comment 40 Eugene Saenko 2013-06-13 06:23:31 UTC

Excuse me for my stupidity, but can you give me detailed (step-by-step) instructions what namely I have to do?

I have no experience in driver compilation :(

Comment 41 Neil Horman 2013-06-13 14:14:44 UTC

I'll just build you a kernel:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5499289

Just install the kernel from that url when it finishes building and try reproduce the problem.  If you send me the /var/log/messages and ethtool stats from that exercise, and we should be good.  Thanks!

Comment 42 Mihail 2013-06-13 20:50:34 UTC

3.9.5 kernel with debug patch...

#ip link set dev cable up

[  329.489424] atl1c 0000:02:00.0: irq 45 for MSI/MSI-X
[  329.489967] IN ATL1C_CHECK_LINK_STATUS
[  329.489975] phy_data = 796d
[  329.489996] SPEED/DUPLEX RETREVAL ERR = 0
[  329.490003] SPEED: 65535 : 100
[  329.490008] DUPLEX: 2 : 2
[  329.490026] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>
[  329.490032] CARRIER IS DOWN
[  329.490413] IN ATL1C_CHECK_LINK_STATUS
[  329.490431] phy_data = 796d
[  329.490492] SPEED/DUPLEX RETREVAL ERR = 0
[  329.490498] SPEED: 100 : 100
[  329.490504] DUPLEX: 2 : 2
[  329.490510] CARRIER IS UP

# start pppoe over atl1c

[  598.644586] PPP generic driver version 2.4.2
[  604.491895] NET: Registered protocol family 24
[  640.815739] sit: IPv6 over IPv4 tunneling driver

# connect wifi

[  708.720798] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement)
[  708.720816] brcmsmac bcma0:0: brcms_ops_config: change power-save mode: false (implement)
[  708.720979] IPv6: ADDRCONF(NETDEV_UP): wifi: link is not ready
[  711.100939] wifi: authenticate with 14:a9:e3:01:f2:a8
[  711.101029] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3)
[  711.103228] wifi: authenticated
[  711.110683] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3)
[  711.113136] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1)
[  711.113653] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated
[  711.113668] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement)
[  711.113690] wifi: associated
[  711.113820] IPv6: ADDRCONF(NETDEV_CHANGE): wifi: link becomes ready
[  711.407990] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement)
[  714.494946] FS-Cache: Loaded
[  714.504377] FS-Cache: Netfs 'nfs' registered for caching
[  775.928873] wifi: deauthenticating from 14:a9:e3:01:f2:a8 by local choice (reason=3)
[  775.930377] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: disassociated
[  775.930408] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement)
[  775.930418] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement)
[  775.932814] cfg80211: Calling CRDA for country: US
[  996.752751] wifi: authenticate with 14:a9:e3:01:f2:a8
[  996.753544] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3)
[  996.754780] wifi: authenticated
[  996.755819] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3)
[  996.758647] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1)
[  996.759121] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated
[  996.759136] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement)
[  996.759165] wifi: associated
[  997.257176] brcmsmac bcma0:0: brcms_ops_bss_info_changed: arp filtering: 1 addresses (implement)
[ 1114.977252] brcmsmac bcma0:0: phyerr 0x1, rate 0x37
[ 1114.977265] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1)
[ 1114.977776] brcmsmac bcma0:0: phyerr 0x1, rate 0x37

# waiting )))
[ 8805.651818] brcmsmac bcma0:0: phyerr 0x1, rate 0x37
[ 8805.651838] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1)
[ 8805.652435] brcmsmac bcma0:0: phyerr 0x1, rate 0x37
[ 8805.652455] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1)


# atl1c freeze!!! pppd: No response to 3 echo-requests
# Serial link appears to be disconnected.
# Connect time 133.2 minutes.

# mii-tool -r cable

[ 8847.388818] atl1c 0000:02:00.0: atl1c: cable NIC Link is Down
[ 8847.389305] IN ATL1C_CHECK_LINK_STATUS
[ 8847.389314] phy_data = 7949
[ 8848.996401] IN ATL1C_CHECK_LINK_STATUS
[ 8848.996418] phy_data = 796d
[ 8848.996600] SPEED/DUPLEX RETREVAL ERR = 0
[ 8848.996606] SPEED: 65535 : 100
[ 8848.996611] DUPLEX: 2 : 2
[ 8848.996705] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>
[ 8848.996710] CARRIER IS DOWN
[ 8871.649211] brcmsmac bcma0:0: phyerr 0x1, rate 0x37
[ 8871.649232] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1)

# atl1c working 2-3 min after "mii-tool -r" and "CARRIER IS DOWN" forewer. But
# ip link
2: cable: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000 link/ether f4:6d:04:57:10:f3 brd ff:ff:ff:ff:ff:ff

Comment 43 Eugene Saenko 2013-06-14 06:11:10 UTC

Information for task build (f18-candidate, kernel-3.9.5-201.fc18.src.rpm)
ID	5499289
Method	build
Parameters 	Source: cli-build/1371132405.122448.yryyYuFL/kernel-3.9.5-201.fc18.src.rpm
Build Target: f18-candidate
Options:
  scratch = True
State 	failed
Created	Thu, 13 Jun 2013 14:12:38 UTC
Started	Thu, 13 Jun 2013 14:12:39 UTC
Completed	Thu, 13 Jun 2013 14:36:30 UTC
Owner 	nhorman
Channel 	default
Host 	buildvm-24.phx2.fedoraproject.org
Arch	noarch
Parent 	
Descendants 	build

    buildArch (kernel-3.9.5-201.fc18.src.rpm, x86_64)
    buildArch (kernel-3.9.5-201.fc18.src.rpm, noarch)
    buildArch (kernel-3.9.5-201.fc18.src.rpm, i686)

Waiting?	yes
Awaited?	no
Priority	20
Weight	0.20
Result 	

BuildError: error building package (arch x86_64), mock exited with status 1; see build.log for more information
          

Output

Comment 44 Mihail 2013-06-14 09:24:03 UTC

(In reply to Neil Horman from comment #39)
> Created attachment 760272 [details]
> debug patch

WARN need 2 argument!! :)

--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -258,6 +258,8 @@ static void atl1c_check_link_status(struct atl1c_adapter *adapter)
 	atl1c_read_phy_reg(hw, MII_BMSR, &phy_data);
 	spin_unlock_irqrestore(&adapter->mdio_lock, flags);
 
+	WARN("IN ATL1C_CHECK_LINK_STATUS\n");

Comment 45 Neil Horman 2013-06-14 14:02:47 UTC

Thank you for fixing that Mihail, I didnt' have time to test out the debug code myself yesterday.

So looking at your debug output I'm starting to think that the hardware isn't behaving as well as it should here  Lets annotate the debug log output:

[ 8805.652435] brcmsmac bcma0:0: phyerr 0x1, rate 0x37
[ 8805.652455] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x1)
>> NH - This is the last moment your wired interface is working.  Sometime between this moment and about 40 seconds later below, something goes wrong.  The fact that we don't get any driver level notification about it is troubling.

# atl1c freeze!!! pppd: No response to 3 echo-requests
# Serial link appears to be disconnected.
# Connect time 133.2 minutes.

# mii-tool -r cable

[ 8847.388818] atl1c 0000:02:00.0: atl1c: cable NIC Link is Down
>> NH - Here we finally discover that the link is down.  This message specifically is printed in the function atl1c_link_chg_event, which is only called from the drivers interrupt routine, in response to hardware notification that the phy state has changed.

[ 8847.389305] IN ATL1C_CHECK_LINK_STATUS
[ 8847.389314] phy_data = 7949
>> NH - Its a bit troubling here that one of the bits set in this register is marked as reserved, but theres not much we can do about that.  Its also set above when the driver initial worked, and its software read here, not written, so theres we need to assume that its ok.  the important bit here is the the toggle between the 9 and the d in the low nibble of the phy_data. It indicates link state change.  This pass above indicates the link has gone down, while the one below indicates its comming up.

[ 8848.996401] IN ATL1C_CHECK_LINK_STATUS
[ 8848.996418] phy_data = 796d
[ 8848.996600] SPEED/DUPLEX RETREVAL ERR = 0
[ 8848.996606] SPEED: 65535 : 100
>> NH - This indicates we're restarting the link negotiation, the 65535 is from the down event above resetting the link speed to 'unknown'.  The trigger for what brought the link back isn't clear here, as we would need to get another interrupt to trigger this operation, but I bet its the hardware error interrupt.  If you re-run this test one more time with driver logging enabled (you can turn that on with the command:
ethtool -s <ifname> msglevel 0xffff
).  If you look at /var/log/messages after the above occurs, I expect you'll see:
 "atl1c hardware error (status = <some number>"
in there somewhere




[ 8848.996611] DUPLEX: 2 : 2
[ 8848.996705] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>
[ 8848.996710] CARRIER IS DOWN
>> NH - And we're back.

So the bottom line (it would seem) from this log, is that the phy status got out of sync with what the driver beleived it to be, and the hardware took almost a minute to inform the driver of that situation.  Its possible that interrupts may have been disabled erroneously for that long, but I don't see how that could occur in the driver currently.  I think the best thing to do is do the ethtool message log test I describe above, and we'll see if it confirms my analysis here.  If thats the case though, I think you may just need to replace your card, as I don't see how we can fix it.

Comment 46 Eugene Saenko 2013-06-15 09:44:39 UTC

I would agree with You that problem is in hardware. I have not another network card to test in my notebook. But I can test another driver. I installed Fedora 16 (kernel 3.6.11-4.fc16.i686) on external USB drive.
Network card works fine without link losses with wireless enabled and with wireless disabled.

Comment 47 Mihail 2013-06-16 12:13:33 UTC

(In reply to Neil Horman from comment #45)
> ethtool -s <ifname> msglevel 0xffff
> ).  If you look at /var/log/messages after the above occurs, I expect you'll
> see:
>  "atl1c hardware error (status = <some number>"
> in there somewhere
> 
> 

[ 5924.158862] atl1c 0000:02:00.0: TX/RX overflow (status = 0x8)

Comment 48 Neil Horman 2013-06-18 19:19:42 UTC

hmm, If you are able to run with an older driver, that would suggest its not a hardware error.  Mihail, can you send the full message log in please?

Comment 49 Mihail 2013-06-19 17:12:28 UTC

Created attachment 763065 [details]
full log

driver detect faulure but unable to recover (( 

Jun 19 09:52:20 [kernel] [24398.696410] atl1c 0000:02:00.0: TX/RX overflow (status = 0x8)
Jun 19 09:52:20 [kernel] [24398.696509] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400)
Jun 19 09:52:20 [kernel] [24398.722209] IN ATL1C_CHECK_LINK_STATUS
Jun 19 09:52:20 [kernel] [24398.722216] phy_data = 796d
Jun 19 09:52:20 [kernel] [24398.722236] SPEED/DUPLEX RETREVAL ERR = 0
Jun 19 09:52:20 [kernel] [24398.722242] SPEED: 65535 : 100
Jun 19 09:52:20 [kernel] [24398.722248] DUPLEX: 65535 : 2
Jun 19 09:52:20 [kernel] [24398.722265] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>
Jun 19 09:52:20 [kernel] [24398.722271] CARRIER IS DOWN
Jun 19 09:52:20 [pppd] PPP session is 6853
Jun 19 09:52:20 [pppd] Connected to 00:90:1a:a3:db:3d via interface cable
Jun 19 09:52:20 [pppd] Using interface ppp0
Jun 19 09:52:20 [pppd] Connect: ppp0 <--> cable

Comment 50 Neil Horman 2013-06-19 19:29:13 UTC

you did get the hardware error code!  Why didn't you indicate that before? It was right under all the TX/RX overflow messages:

 un 19 09:52:20 [kernel] [24398.696509] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400)

That indicates that the card needs to be reset due to a DMA error.  It looks like the NIC just took a few minutes to figure out that it hit a DMA error, as is evidenced by the fact that for every frame we sent we hit a tx/rx overflow error.

Comment 51 Neil Horman 2013-06-19 20:06:50 UTC

Created attachment 763133 [details]
patch to reset adapter on TX/RX overrun

Could you please test this patch out?  Ithink we probably need to be resetting the adapter on TX/RX overruns faster.

Comment 52 Vinz 2013-06-19 23:13:47 UTC

Hi all,

Sorry for my poor english...


(In reply to Neil Horman from comment #51)
> Created attachment 763133 [details]
> patch to reset adapter on TX/RX overrun
> 
> Could you please test this patch out?  Ithink we probably need to be
> resetting the adapter on TX/RX overruns faster.

Thanks Neil for your patch. I tried it on a 3.9.6 kernel (debian unstable).
It's a big improvement since I have no longer to manually unload/reload the atl1c kernel module to reset the NIC adapter.

But the "TX/RX overflow" error is still happening, causing a network freeze for few seconds. (probably the NIC reset duration ?)


Jun 20 01:00:25 bunny kernel: [119984.224089] atl1c 0000:05:00.0: TX/RX overflow (status = 0x8)
Jun 20 01:00:25 bunny NetworkManager[3201]: <info> (eth0): carrier now OFF (device state 100, deferring action for 4 seconds)
Jun 20 01:00:25 bunny NetworkManager[3201]: <info> (eth0): carrier now ON (device state 100)
Jun 20 01:00:25 bunny kernel: [119984.248673] atl1c 0000:05:00.0: irq 50 for MSI/MSI-X
Jun 20 01:00:25 bunny kernel: [119984.248857] IN ATL1C_CHECK_LINK_STATUS
Jun 20 01:00:25 bunny kernel: [119984.248866] phy_data = 796d
Jun 20 01:00:25 bunny kernel: [119984.248892] SPEED/DUPLEX RETREVAL ERR = 0
Jun 20 01:00:25 bunny kernel: [119984.248898] SPEED: 65535 : 1000
Jun 20 01:00:25 bunny kernel: [119984.248903] DUPLEX: 65535 : 2
Jun 20 01:00:25 bunny kernel: [119984.249027] atl1c 0000:05:00.0: atl1c: eth0 NIC Link is Up<1000 Mbps Full Duplex>
Jun 20 01:00:25 bunny kernel: [119984.249033] CARRIER IS DOWN


I don't think the patch is solving the good problem...
Why do we have this "TX/RX overflow" error ?

I'm not even sure the problem comes from the atl1c module.


I tried to compile the atl1c module from linux 3.6.11 on my 3.9.6 kernel and run it. I still had the same problem.

Changes on the atl1c between the 3.6.11 and the 3.7.x are very small...
diff atl1c-3.6.11/atl1c_main.c atl1c-3.7/atl1c_main.c
152c152
< 	pci_write_config_word(pdev, pci_pcie_cap(pdev) + PCI_EXP_DEVSTA,
---
> 	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA,
705c705
< static const struct atl1c_platform_patch plats[] __devinitdata = {
---
> static const struct atl1c_platform_patch plats[] __devinitconst = {
2688c2688
< static struct pci_error_handlers atl1c_err_handler = {
---
> static const struct pci_error_handlers atl1c_err_handler = {


If the problem really appears for the first time in the 3.7.x branch, is it possible that it is the result of a side effect of a kernel change or another module change ?

Comment 53 Mihail 2013-06-20 04:40:54 UTC

(In reply to Vinz from comment #52)
> If the problem really appears for the first time in the 3.7.x branch, is it
> possible that it is the result of a side effect of a kernel change or
> another module change ?

Change bisected in kernel bugzilla
net: use bigger pages in __netdev_alloc_frag
https://bugzilla.kernel.org/show_bug.cgi?id=54021#c14

(In reply to Neil Horman from comment #50)
> you did get the hardware error code!  Why didn't you indicate that before?
> It was right under all the TX/RX overflow messages:

Never waiting long time ofter freeze before.

Comment 54 Neil Horman 2013-06-20 13:56:37 UTC

Thank you for the bisect Mihail.  That commit makes sense to me, in that it aligns with the error we receive.  I think the the TX/RX error indicates that the the ring buffer that holds dma descriptors gets overrun by the additional fragments created by the netdev_alloc_frag changes.  We certainly could do something like linearize the skb prior to transmission so that it only requires one descriptor, but the possibility still remains that this error can occur.  I think the right thing to do is reset the adapter when it does.  We can certainly take other action to try prevent its happening, but it seems once it does, its a fatal condition.  My first thought would be to turn the carrier off on the device when the ring buffer filled.  That would prevent the TX/RX error, but if the condition doesn't clear quickly, the same result will occur in that we will encounter a netdev watchdog timeout, that will result in a reset of the device.  

Any other thoughts?

Comment 55 Eugene Saenko 2013-06-20 14:36:25 UTC

I think that Vinz (Comment 52) is right:
> it is the result of a side effect of a kernel change or another module change

If I remove module ath9k (by command rmmod) wired connection works fine without disconnections. So not sole atl1c cause the error.

Comment 56 Neil Horman 2013-06-20 14:39:51 UTC

This is true, and we should probably dig into that, but the fact remains, the atl1c driver can produce this error, whatever the reason, and it appears fatal, in that once it occurs, the system doesn't recover, or do you think if it were not for whatever influence ath9k may have on this situation, if it did not occur, the TX/RX problem may just be transient?

Comment 57 Vinz 2013-06-20 14:49:11 UTC

(In reply to Eugene Saenko from comment #55)
> I think that Vinz (Comment 52) is right:
> > it is the result of a side effect of a kernel change or another module change
> 
> If I remove module ath9k (by command rmmod) wired connection works fine
> without disconnections. So not sole atl1c cause the error.

If I unload ath9k, I still got the "RX/TX overflow" error, but less frequently.

Neil, you are right. atl1c driver can produce this error and it's a good thing to recover in this case. So, the patch is definitely a noticeable improvement.

Comment 58 Eugene Saenko 2013-06-20 15:05:51 UTC

> or do you think if it were not for whatever influence ath9k may have on this situation, if it did not occur, the TX/RX problem may just be transient?
No I don't. I want to say that we haven't to limit search area by sole atl1c, but look, what happens when two these modules work together

Comment 59 Mihail 2013-06-22 05:59:00 UTC

Neil Horman
TX/RX problem did not occur, if wifi (brcmsmac) not using.

Comment 60 Neil Horman 2013-06-25 19:24:18 UTC

Something just occured to me.  Can you all run the fedora debug kernel?  Wtih that we should be able to track debug problems with the ath9k driver, and I expect an issue like this will be related to dma.

Comment 61 Eugene Saenko 2013-06-26 06:53:52 UTC

I have installed kernel-debug-modules-extra-3.9.6-200.fc18.i686, kernel-debug-devel-3.9.6-200.fc18.i686, kernel-debug-3.9.6-200.fc18.i686

What I have to do now?

Comment 62 Neil Horman 2013-06-26 12:38:02 UTC

recreate the problem, and send me your /var/log/messages file.  If this is a DMA problem the dma-debug code (which is enabled in the -debug kernel variant) should spit some errors about what your wireless driver might be doing to affect your wired driver

Comment 63 Eugene Saenko 2013-06-26 14:38:32 UTC

Created attachment 765624 [details]
messages with 3.9.6-200.fc18.i686.PAEdebug kernel

From 17:55 up to 18:13 I did no manipulations with Wi-Fi switch

Comment 64 Neil Horman 2013-06-26 15:10:15 UTC

It looks like you edited the log heavily, I was hoping to see all those debug messages. Please don't exclude things like that

Comment 65 Eugene Saenko 2013-06-26 16:08:21 UTC

Created attachment 765683 [details]
full file messages

In previous attachment I only removed lines before current boot and added my comments to indicate moments of link break and restore.
This one is untouched.

Comment 66 Neil Horman 2013-06-26 17:15:00 UTC

ok, I'm sorry, with the large time gaps between your edits, I had presumed you removed some messages.

I've verified that the DMA-API debug code is enabled, but its not showing anything, so if it is a dma problem, its not one thats caught by the debug code.

I'm not really sure where to take debugging here at this point.  It certainly seems like the wireless driver is doing something incorrectly, but I have no visibility into what, nor am I familiar with the wireless code.

I would suggest we do this:

1) I'll push a change for this driver to reset the phy if we get too many TX/RX overflows, so the immediately problem can be handled.

2) you open a new bz against the wireless driver, indicating that turning it on breaks the connection of the wired interface, and I'll have our wireless person (CC'ed) look into it further.  Please post the bz number that you open here, and I'll make sure it gets into the right hands.

sound good?

Comment 67 Eugene Saenko 2013-06-26 17:36:16 UTC

 OK. I'll try do all you suggest.

Comment 68 Neil Horman 2013-06-26 19:34:16 UTC

Created attachment 765753 [details]
[PATCH] atl1c: stop tx queue when a TX overflow occurs


This bug was reported recently:
https://bugzilla.redhat.com/show_bug.cgi?id=928220

The root cause is still being investigated, but an interesting behavior was
observed during the process.  Its possible for the atl1c driver to recieve a
multitude of interrupts indicating a TX overflow.  In isolation, a single
instance of this event isn't problematic, but during this reproducer 1000's of
these events were encountered in rapid succession.  This led to the NIC being
unable to transmit frames for a long period of time, without the OS being aware
of the condition.  A reset of teh adapter would bring it back, but no such reset
was forthcomming, because the TX queue was still on.

Fix this by calling netif_queue_stop when an overrun is encountered, and there
are subsequent frames pending for tx in the hardware queue.  Having additional
frames pending ensures that subsequent interrupts will be generated, giving the
irq handler a chance to wake the OS tx queue again if the overflow status clears
on its own.  If its persistent however, this give the net watchdog a chance to
expire and reset the adapter through normal channels

Signed-off-by: Neil Horman <nhorman>
CC: "David S. Miller" <davem>
CC: Jay Cliburn <jcliburn>
CC: xiong <xiong.com>
CC: Cloud Ren <cjren.com>
---
 drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 41 +++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

Comment 69 Neil Horman 2013-06-26 19:41:49 UTC

I'm sorry to ask, but after having looked at it some, I like the solution I have in comment 68 a bit better than my origional patch, it lets the net watchdog task handle the device reset rather than making the driver decide when to do it, which I think is cleaner.  Can one of you test it out and make sure I didn't break anything in the process.

Thanks!

Comment 70 Mihail 2013-06-26 21:18:40 UTC

(In reply to Neil Horman from comment #60)
> I expect an issue like this will be related to dma.

[  594.349211] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: false (implement)
[  594.349237] brcmsmac bcma0:0: brcms_ops_config: change power-save mode: false (implement)
[  594.352114] IPv6: ADDRCONF(NETDEV_UP): wifi: link is not ready
[  594.623795] ------------[ cut here ]------------
[  594.623819] WARNING: at lib/dma-debug.c:937 check_unmap+0x5a8/0x70d()
[  594.623824] Hardware name: 1215B
[  594.623832] bcma-pci-bridge 0000:01:00.0: DMA-API: device driver failed to check map error[device address=0x00000000a40bc800] [size=1876 bytes] [mapped as single]
[  594.623837] Modules linked in:
[  594.623842]  sit tunnel4 pppoe pppox ppp_generic slhc ipv6 bnep 8021q garp stp llc snd_hda_codec_realtek uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core usb_storage acpi_cpufreq mperf brcmsmac cordic brcmutil snd_hda_codec_hdmi dm_mod radeon fbcon bitblit softcursor font tileblit cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit drm_kms_helper microcode ttm snd_hda_intel snd_hda_codec drm snd_hwdep xhci_hcd pcspkr snd_pcm snd_page_alloc snd_timer ohci_hcd snd ehci_pci ehci_hcd atl1c bcma soundcore
[  594.623963] Pid: 1622, comm: kwin Not tainted 3.9.5-gentoo #2
[  594.623967] Call Trace:
[  594.623972]  <IRQ>  [<ffffffff8103752e>] warn_slowpath_common+0x7e/0x99
[  594.623991]  [<ffffffff810375dd>] warn_slowpath_fmt+0x41/0x43
[  594.624014]  [<ffffffff8121b79e>] check_unmap+0x5a8/0x70d
[  594.624023]  [<ffffffff8121ba51>] debug_dma_unmap_page+0x5b/0x65
[  594.624059]  [<ffffffffa02c644a>] dma_unmap_single_attrs.constprop.21+0x55/0x5e [brcmsmac]
[  594.624085]  [<ffffffffa02c64f2>] dma64_getnextrxp+0x9f/0xde [brcmsmac]
[  594.624112]  [<ffffffffa02c7324>] dma_rx+0x7b/0x302 [brcmsmac]
[  594.624123]  [<ffffffff810819f5>] ? __module_address+0x9a/0xc6
[  594.624172]  [<ffffffffa029f8a9>] brcms_c_dpc+0x632/0x88a [brcmsmac]
[  594.624182]  [<ffffffff81075489>] ? get_lock_stats+0x19/0x4f
[  594.624209]  [<ffffffffa0295017>] ? spin_lock_bh+0x9/0xb [brcmsmac]
[  594.624229]  [<ffffffffa02965d5>] brcms_dpc+0x5d/0xb4 [brcmsmac]
[  594.624239]  [<ffffffff8103d66a>] tasklet_action+0x82/0xd5
[  594.624247]  [<ffffffff8103dac3>] __do_softirq+0xc3/0x187
[  594.624256]  [<ffffffff8103dca2>] irq_exit+0x5c/0xa1
[  594.624265]  [<ffffffff81003414>] do_IRQ+0x88/0x9f
[  594.624275]  [<ffffffff81471eef>] common_interrupt+0x6f/0x6f
[  594.624279]  <EOI>  [<ffffffff814727b7>] ? sysret_check+0x1b/0x56
[  594.624291] ---[ end trace 4dbab6f1180a7ce8 ]---
[  594.624295] Mapped at:
[  594.624299]  [<ffffffff8121bd2d>] debug_dma_map_page+0x4d/0x10b
[  594.624306]  [<ffffffffa02c66b9>] dma_map_single_attrs.constprop.22+0xc5/0xd8 [brcmsmac]
[  594.624327]  [<ffffffffa02c776a>] dma_rxfill+0x1bf/0x266 [brcmsmac]
[  594.624346]  [<ffffffffa02a03d0>] brcms_c_init+0x8cf/0xda7 [brcmsmac]
[  594.624365]  [<ffffffffa02964f6>] brcms_init+0x53/0x57 [brcmsmac]
[  597.016662] wifi: authenticate with 14:a9:e3:01:f2:a8
[  597.018120] wifi: send auth to 14:a9:e3:01:f2:a8 (try 1/3)
[  597.023421] wifi: authenticated
[  597.044000] wifi: associate with 14:a9:e3:01:f2:a8 (try 1/3)
[  597.046626] wifi: RX AssocResp from 14:a9:e3:01:f2:a8 (capab=0x411 status=0 aid=1)
[  597.047187] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated
[  597.047202] brcmsmac bcma0:0: brcms_ops_bss_info_changed: qos enabled: true (implement)
[  597.047998] wifi: associated

Comment 71 Neil Horman 2013-06-27 10:58:53 UTC

In reply to comment 70, that indicates the brcms driver is not checking the return code of a dma_map operation.  While thats not a smoking gun, it could mean that we're using an invalid memory address for a dma operation, and that could result in the behavior we are seeing.  Please open a separate bug for that  and give me the number here, I can get that fixed.

Any results from my patch in comment 68?

Comment 72 Mihail 2013-06-28 04:51:44 UTC

(In reply to Neil Horman from comment #71)
> Any results from my patch in comment 68?

Yes. Patch in comment 68 did not help.
Flooding log by messages "atl1c 0000:02:00.0: TX/RX overflow (status = 0x8)", and break data transmission/recive after shitch wifi on.

Comment 73 Neil Horman 2013-06-28 14:20:18 UTC

Ok, I need to clean up the message flood, but after about 5 seconds the netdev watchdog timer kicked in and should have reset the device, returning it to normal operation.  did it?

Comment 74 Mihail 2013-06-28 18:46:09 UTC

Device did not reset after overflow.

But reset after
Jun 28 04:25:28 [kernel] [ 9557.832510] atl1c 0000:02:00.0: atl1c hardware error (status = 0x400)
Jun 28 04:25:28 [kernel] [ 9557.857023] atl1c 0000:02:00.0: atl1c: cable NIC Link is Up<100 Mbps Full Duplex>

Comment 75 Neil Horman 2013-06-28 19:27:27 UTC

what the output of "tc qdisc show" and "uname -a" on your system?

Comment 76 Mihail 2013-06-28 20:47:41 UTC

# tc qdisc show
qdisc pfifo_fast 0: dev cable root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev wifi root
qdisc pfifo_fast 0: dev ppp0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1


# uname -a
Linux red 3.9.6-gentoo #1 SMP PREEMPT Fri Jun 28 00:19:08 MSK 2013 x86_64 AMD E-350 Processor AuthenticAMD GNU

Comment 77 Neil Horman 2013-06-30 23:37:37 UTC

gentoo?  you sure you're running the patched kernel?

All the interfaces have a qdisc on them, so this should be working, I'll get you an instrumented kernel to run this down...

Comment 78 Mihail 2013-07-01 08:18:16 UTC

(In reply to Neil Horman from comment #77)
> gentoo?  
little patched vanilla. I can use pure vanilla, if you preffer.

>you sure you're running the patched kernel?
Yes, I am patch and build kernel manually.

https://bugzilla.kernel.org/show_bug.cgi?id=60261 for tracking warning  "brcmsmac device driver failed to check map error"

Comment 79 Neil Horman 2013-07-01 12:25:02 UTC

Created attachment 767380 [details]
updated patch to stop tx queue when a TX overflow occurs

sorry, I think I found the problem, I forgot to remove the code that re-wakes the queue in the clean_tx_irq portion of the interrupt handler.  This updated version should fix it. It also has some debug code if we run into subsequent problems.  Please give it a test and confirm that it fixes the issue for you.  Thanks

Comment 80 Mihail 2013-07-02 06:34:46 UTC

Created attachment 767637 [details]
log for patch in attachment 767380 [details]

New patch try to reset controller.
But controller freeze again shortly after reset.

It better than nothing, but ethernet still unusable if wifi is active.

Comment 81 Neil Horman 2013-07-02 14:24:45 UTC

That makes no sense at all.  It looks like as soon as we reset the device now, we hit exactly the same problem again.  So this fix is useless.  And I have no visibility as to what the ath9k driver might be doing to cause this problem.  I'm about ready to just start bisecting this issue.  Can any of you tell me if this problem occured in Fedora 17, or point to an earlier kernel where the problem did not occur?

Comment 82 Eugene Saenko 2013-07-02 16:41:59 UTC

I had no problems in Fedora-16 and earlier releases.
But I didn't try Fedora-17 (omit it).

Comment 83 Kasun Chathuranga 2013-07-03 07:14:08 UTC

Here is a workaround. This the one I have tested in Fedora 18 x86-64 with kernel 3.9.6-200

Get some little old driver.
wget https://www.kernel.org/pub/linux/kernel/projects/backports/2012/12/19/compat-drivers-2012-12-19-u.tar.gz

Select driver as atl1c.
./scripts/driver-select atl1c

make
sudo make install

And reboot.

I add more information here.
http://kasunc.blogspot.com/2013/07/workaround-for-atheros-communications.html

Comment 84 Neil Horman 2013-07-03 14:19:37 UTC

*** Bug 980806 has been marked as a duplicate of this bug. ***

Comment 85 Neil Horman 2013-07-03 14:43:42 UTC

I'm going to track this upstream for now, as several others are reporting this problem there.

Comment 86 Neil Horman 2013-08-16 17:25:53 UTC

looks like the upstream fix for this was pulled in as part of the 3.10.7 rebase

Note You need to log in before you can comment on or make changes to this bug.