Bug 928968 - [brcmsmac] phyerr 0x10 spams dmeg, wireless stops working after a while
Summary: [brcmsmac] phyerr 0x10 spams dmeg, wireless stops working after a while
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: John Greene
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-03-28 21:03 UTC by Elad Alfassa
Modified: 2014-06-16 13:17 UTC (History)
12 users (show)

Fixed In Version: v3.11
Clone Of:
Environment:
Last Closed: 2014-01-17 17:19:30 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
dmesg output (184.91 KB, text/plain)
2013-03-28 21:03 UTC, Elad Alfassa
no flags Details
lspci -nn output (2.45 KB, text/plain)
2013-06-11 19:43 UTC, Elad Alfassa
no flags Details
WORKAROUND to reduce log spam for to errors (2.27 KB, patch)
2013-07-26 15:39 UTC, John Greene
no flags Details | Diff

Description Elad Alfassa 2013-03-28 21:03:02 UTC
Created attachment 717851 [details]
dmesg output

I get a lot of
[ 4810.333836] brcmsmac bcma0:0: phyerr 0x10, rate 0x37
[ 4810.333840] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x10)

in my dmesg log.

Sometimes wifi stops working altogether until the module is unloaded and loaded again or the system is rebooted.

kernel-3.9.0-0.rc4.git0.2.fc20.x86_64 from the rawhide-nodebug repository

03:00.0 Network controller: Broadcom Corporation BCM4313 802.11b/g/n Wireless LAN Controller (rev 01)
	Subsystem: Dell Inspiron M5010 / XPS 8300
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at f0500000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: bcma-pci-bridge

Suspicious lines just before wifi disconnected:

[ 4060.896679] brcmsmac bcma0:0: wl0: PSM microcode watchdog fired at 3914 (seconds). Resetting.
[ 4060.896772] brcms_c_dpc : PSM Watchdog, chipid 0x4313, chiprev 0x1
[ 4060.896813] brcmsmac bcma0:0: wl0: fatal error, reinitializing
[ 4060.898821] ieee80211 phy0: Hardware restart was requested
[ 4060.957003] brcmsmac bcma0:0: brcms_ops_config: change power-save mode: false (implement)
[ 4060.957137] brcmsmac bcma0:0: brcmsmac: brcms_ops_bss_info_changed: associated
[ 4060.957141] brcmsmac bcma0:0: changing basic rates failed: -22

Comment 1 John Greene 2013-06-03 14:47:09 UTC
Wow..a lot going on in this log.  Where to begin..
It might be a number of things: firmware/driver/AP.

Is this really repeatable or intermittent?
Appears this may be a single AP: does it occur on all APs or just this location?

Comment 2 Arend van Spriel 2013-06-03 16:31:06 UTC
PSM ucode watchdog is never a good sign. It seems the hardware restart mechanism with mac80211 behaves different than expected. Need to look into mac80211 code or ask Johannes about this. Also the phy error messages are a concern and I am wondering whether these were introduced in 3.9.

Could you try a 3.8 kernel and see how that behaves?
Can you disable bluetooth (rfkill block bluetooth)?

Comment 3 Elad Alfassa 2013-06-03 16:54:23 UTC
1) repeatable
2) It occurs on more than one AP.
3) bluetooth is disabled in the BIOS
4) Since this bug was reported I've given the machine to my brother, so testing on 3.8 will have to wait at least few days

Also, if it's relevant, I logged to the machine remotely now, and it seems that the dmesg log buffer is completely filled with "[20388.400584] brcmsmac bcma0:0: START: tid 2 is not agg'able" and nothing else.

Comment 4 John Greene 2013-06-11 19:36:51 UTC
Elad,  
A couple more bits might help. 

Can you post the output of 
lspci -nn 

-nn will give us the device and vendor ID of the chip that might help.  Appears to be "PSM Watchdog, chipid 0x4313, chiprev 0x1"

Can you say what signal levels are when the problems occur: seems from logs you describe they occur nearly all the time.  Do you notice a low signal level in these locations?

Comment 5 Elad Alfassa 2013-06-11 19:43:38 UTC
Created attachment 759790 [details]
lspci -nn output

Comment 6 John Greene 2013-06-11 20:00:00 UTC
Thanks Elad..
03:00.0 Network controller [0280]: Broadcom Corporation BCM4313 802.11b/g/n Wireless LAN Controller [14e4:4727] (rev 01)

Comment 7 Arend van Spriel 2013-06-11 20:16:16 UTC
Did you have the opportunity to test with a 3.8 kernel? Could you also provide the details found in /sys/kernel/debug/brcmsmac/bcma*/hardware?

Comment 8 Elad Alfassa 2013-06-11 20:30:01 UTC
board vendor: 1028
board type: 10
board revision: 2211
board flags: 8002a00
board flags2: 800
firmware revision: 262032c


I still only have remote access to the machine, hopefully I'll be able to test 3.8 during the weekend. Note that recently when logging in to check the logs I noticed no "phyerr" messages, but they might be there and just hiding because "brcmsmac bcma0:0: START: tid 1 is not agg'able" is flooding the kernel log buffer so much that it's the only think I can see in dmesg. According to journactl -ba those messages are constantly logged every few seconds (it varies, but it seems that 3 messages per second is the average rate in which those are filling the logs. That is 3 messages per second since connecting to wifi, so it's pretty much constant)

Also, scanning the dmesg log from today, apart from all those tid message I see few "brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: Pkt tx suppressed, illegal channel possibly 7" messages, when the channel number changes (noticed 7, 5, 2 and 11). I don't know if this is relevant tho.

Comment 9 John Greene 2013-06-20 13:30:33 UTC
Arend: 
Thanks for stepping in..
I have a question: this area has a number of messages in this driver (~4 places in this source file) that create significant log spam.  While good to know about them they seem to need to be toned down.  I've posted a patch for one to do that, is this the right tact from your point of view.  Most are not really an issue with driver performance.  Is just reducing the volume in line with upstream view?  
I'll do so and post upstream, just want to get your take first.

Comment 10 Joshua 2013-06-22 16:24:44 UTC
I'm running 3.9.6-301.fc19.x86_64 and the messages have grown to 16MB in 2 hours.

Same chip - Broadcom Corporation BCM4313
----------------------------------------------

[ 1535.308928] brcmsmac bcma0:0: phyerr 0x10, rate 0x14
[ 1535.308945] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x10)
[ 1535.309889] brcmsmac bcma0:0: phyerr 0x10, rate 0x14
[ 1535.309904] brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x10)

Comment 11 John Greene 2013-06-25 15:58:08 UTC
So, here's a roll up of messages of high log spammers I've seen here and other BZs: 

 brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: Pkt tx suppressed, illegal channel possibly 7"
(Patched: wireless-next 
99e9494 brcmsmac: Reduce log spam in heavy tx, make err print in debug

Other candidates for a same workaround patch are..wish I'd have got 'em all with a single patch..
 brcmsmac bcma0:0: START: tid 2 is not agg'able" and nothing else.

BZ 918396 has this on F19
 brcmsmac bcma0:0: phyerr 0x10, rate 0x14
 brcmsmac bcma0:0: brcms_c_ampdu_dotxstatus_complete: ampdu tx phy error (0x10)

I'll propose upstream to make this stuff only debug visible (a la 99e9494 mentioned above)  before to eliminate huge log issue elaborated here..till a better cause of the issue appearing in 3.9..

Comment 12 John Greene 2013-06-26 17:11:14 UTC
Arend, 

Looking through the 3.8 to 3.9 update, this appears to be the only realistic thing that be causing all these PHY issues.

Added at 3.8 was 
commit b6fc28a158076ca2764edc9a6d1e1402f56e1c0c

    brcmsmac: support 4313iPA
    
    Add support for 4313 iPA variant.
    It is a variant of already supported 4313 ePA
    and needs some PHY changes to work properly.

Was reverted in 3.9

commit 54683441a92ebe20c5282465ea6f21e5e74d2974
Author: John W. Linville <linville>
Date:   Wed Mar 27 10:52:11 2013 -0400

    Revert "brcmsmac: support 4313iPA"
    
    This reverts commit b6fc28a158076ca2764edc9a6d1e1402f56e1c0c.
    
    This commit is reported to cause a regression in the support for some
    revisions of 4313 ePA devices.
    
    http://marc.info/?l=linux-wireless&m=136360340200943&w=2

So, having looked though the entire list of changes: does this change for 4313iPA support look to be part of the problem, whether included or not?

It would appear to be a problem for some if in, others if not.  
Or is there something else at work here?

Working on the patch to mute the other offenders in Comment 11.  Will be testing shortly.

Comment 13 John Greene 2013-07-12 15:08:06 UTC
Patch submitted upstream to move this message to debug.  Doesn't fix the problem, but keeps non-debug logs cleaner.

Comment 14 Matthias Tingelhoff 2013-07-14 15:38:44 UTC
Hello, I've got the exact same problem, with the differece that I use Arch Linux and an vanilla kernel, but this is the only bugreport I've found on this issue.
Also my Notebook is not by Dell, I own an Lenovo E535. However, the used wi-fi controller is the same: 02:00.0 Network controller [0280]: Broadcom Corporation BCM4313 802.11b/g/n Wireless LAN Controller [14e4:4727] (rev 01)


To keep it short, I've found a workaround: My router uses mode 11bgn mixed, but if I switch to 11g only or 11bg mixed, the problem is no more appearing!
As soon as "n" comes into play (tested 11n only too), the problem occur.
Even the dmesg-spam disappears (which was only present on 3.9.x kernels, on 3.8.x there's the same problem, but nothing in dmesg).

I hope this could be helpfull information.
If you need more details, just ask.

Comment 15 John Greene 2013-07-26 15:36:51 UTC
(In reply to Matthias Tingelhoff from comment #14)
> To keep it short, I've found a workaround: My router uses mode 11bgn mixed,
> but if I switch to 11g only or 11bg mixed, the problem is no more appearing!
> As soon as "n" comes into play (tested 11n only too), the problem occur.
> Even the dmesg-spam disappears (which was only present on 3.9.x kernels, on

Thanks for the info Matthias. Yes, this is true: the ampdu is a 802.11n enhancement and doing as you suggest will disable this code and fix the issue.  Lower thruputs will result, which may be ok, but I suspect most want 802.11 for that reason.

Will upload my upstream patch here, hardware vendor investigating these issues. Will fix the spam issue, but underlying problem still will be there.

Comment 16 John Greene 2013-07-26 15:39:33 UTC
Created attachment 778800 [details]
WORKAROUND to reduce log spam for to errors

Submitted upstream, not included there as yet.

Comment 17 tod.jackson 2013-08-29 00:28:52 UTC
In brcmsmac's main.c, here is another recent source of spam for me:

   if (rate == 0) {
                brcms_err(wlc->hw->d11core, "wl%d: WAR: using rate of 1 mbps\n",
                          wlc->pub->unit);
                rate = BRCM_RATE_1M;
        }

I guess this has been around for a while according to http://permalink.gmane.org/gmane.linux.kernel.wireless.general/98898 - but I never got spammed by this until upgrading to kernel 3.10.9 (or so).

I am using the patch just above, but since this new message above is still quite verbose, I have commented out the brcms_err above too. However, my BCM43224 seems to be a lot more stable than what I've read from BCM4313 users, so suppressing it isn't as worrisome. Bandwidth intensive processes still give me tid %d is not agg'able errors.

If somebody knows of a way to set the verbosity of these messages when invoking `dmesg` or can write a patch to do so, it would be greatly appreciated. I apologize if this was slightly off-topic.

Comment 18 John Greene 2013-08-29 12:53:34 UTC
(In reply to tod.jackson from comment #17)
> In brcmsmac's main.c, here is another recent source of spam for me:
> 
>    if (rate == 0) {
>                 brcms_err(wlc->hw->d11core, "wl%d: WAR: using rate of 1
> mbps\n",
>                           wlc->pub->unit);
>                 rate = BRCM_RATE_1M;
>         }
> 
> I guess this has been around for a while according to
> http://permalink.gmane.org/gmane.linux.kernel.wireless.general/98898 - but I
> never got spammed by this until upgrading to kernel 3.10.9 (or so).
> 
> I am using the patch just above, but since this new message above is still
> quite verbose, I have commented out the brcms_err above too. However, my
> BCM43224 seems to be a lot more stable than what I've read from BCM4313
> users, so suppressing it isn't as worrisome. Bandwidth intensive processes
> still give me tid %d is not agg'able errors.

Patches I've done simply move these from brcms_err to brcms_dbg.  They will still be logged, but invisible to "normal" logs.

> If somebody knows of a way to set the verbosity of these messages when
> invoking `dmesg` or can write a patch to do so, it would be greatly
> appreciated. I apologize if this was slightly off-topic.

Look at man page for dmesg and dmesg --help.  
dmesg -n <1 to 8> will select the *display level* of messages. 1 is less 8 is more.

Comment 19 Fedora End Of Life 2013-09-16 13:22:41 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 20 development cycle.
Changing version to '20'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora20

Comment 20 John Greene 2014-01-17 17:19:30 UTC
I see the workaround patch changes here are now in kernel 3.11 (there are a couple patches that fix these noise makers: one is in 3.10 other in 3.11 ), so I'm closing this one as Fixed..Please feel free to reopen or open a new issue if you have other problems.

Comment 21 Arend van Spriel 2014-06-16 13:17:25 UTC
The needinfo flag was set a while ago. Guess it is not really relevant any more.


Note You need to log in before you can comment on or make changes to this bug.