Bug 748548 - Unrecognized RXON value logged as a warning, but marked as an error.
Summary: Unrecognized RXON value logged as a warning, but marked as an error.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 15
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-10-24 19:15 UTC by Bill C. Riemers
Modified: 2012-06-07 15:06 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2012-06-07 15:06:03 UTC
Type: ---
Embargoed:


Attachments
Abort log. (3.27 KB, text/plain)
2011-10-24 19:16 UTC, Bill C. Riemers
patch that disables the abort (2.68 KB, patch)
2011-10-27 07:13 UTC, Bill C. Riemers
Johannes Berg's patch to disable powersave and reset aid value (1.58 KB, patch)
2011-11-02 04:18 UTC, Bill C. Riemers

Description Bill C. Riemers 2011-10-24 19:15:24 UTC
Description of problem:

When I try to connect my laptop to the wifi network at the Comfort Inn hotel, I get the following error:

WARNING: at drivers/net/wireless/iwlwifi/iwl-core.c:482 iwl_check_rxon_cmd+0x211/0x21f [iwlagn]()
time:           Sun Oct 23 20:32:04 2011

backtrace:
:WARNING: at drivers/net/wireless/iwlwifi/iwl-core.c:482 iwl_check_rxon_cmd+0x211/0x21f [iwlagn]()
:Hardware name: 4384BP8
:Invalid RXON (0x40), channel 6


My guess is that the WARNING is correct, in that there is probably a value here that my hardware does not support. The problem is that the code reads:

	if ((rxon->flags & (RXON_FLG_CCK_MSK | RXON_FLG_SHORT_SLOT_MSK))
			== (RXON_FLG_CCK_MSK | RXON_FLG_SHORT_SLOT_MSK)) {
		IWL_WARN(priv, "CCK and short slot\n");
		errors |= BIT(7);
	}

So even though this is intended to be a warning that would probably not even prevent a successful wifi connection, it is recorded as an error. Consequently, between NetworkManager and the kernel, I get an endless loop of errors that I can only stop by physically turning off my wifi card. (Even then it takes about 10 minutes for the abort messages to stop appearing after I turn my wifi card off.)
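
To illustrate the distinction being drawn here, the following is a minimal sketch and not the iwlwifi code. The flag values, struct, and function names are all hypothetical. It shows a checker that records suspicious but survivable conditions in a separate warnings mask, so the caller aborts only when a hard error was recorded.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_FLG_CCK        (1u << 0)
#define SKETCH_FLG_SHORT_SLOT (1u << 1)
#define SKETCH_BIT(n)         (1u << (n)) /* bits are numbered from 0 */

static_assert(SKETCH_BIT(7) == 0x80u, "bit 7 is 0x80");

struct sketch_check_result {
	uint32_t errors;   /* conditions that must abort the command */
	uint32_t warnings; /* conditions that are only worth logging */
};

/* Check a flags word; a suspicious combination is recorded as a warning. */
static struct sketch_check_result sketch_check_flags(uint32_t flags)
{
	struct sketch_check_result res = { 0, 0 };
	const uint32_t both = SKETCH_FLG_CCK | SKETCH_FLG_SHORT_SLOT;

	if ((flags & both) == both) {
		fprintf(stderr, "warning: CCK and short slot\n");
		res.warnings |= SKETCH_BIT(7);
	}
	return res;
}

/* The caller proceeds unless a hard error was recorded. */
static int sketch_should_abort(struct sketch_check_result res)
{
	return res.errors != 0;
}
```

With this split, a flags word that has both hypothetical bits set produces a log message but `sketch_should_abort()` still returns 0, which is the behavior asked for under "Expected results".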

Version-Release number of selected component (if applicable):

kernel-2.6.40.4-5.fc15.x86_64

How reproducible:

100%
Steps to Reproduce:
1. Check in to the Comfort Inn on Sanderson, in Raleigh NC
2. Select the comfortinn network from the list of available wifi networks.
3. Watch the errors.
  
Actual results:

Errors stream over and over, you never connect to the wifi network.

Expected results:

A fairly harmless warning is added to the system log, and the connection proceeds as normal.


Additional info:

I tested under Windows, and I was able to connect to the comfortinn wifi, so this is not a hardware issue. However, it could be a limitation in the Linux driver for the hardware.

Comment 1 Bill C. Riemers 2011-10-24 19:16:41 UTC
Created attachment 529949 [details]
Abort log.

  I tried to submit this with report-gtk, but that utility consistently failed.

Comment 2 Bill C. Riemers 2011-10-27 07:13:11 UTC
Created attachment 530432 [details]
patch that disables the abort

This is a patch that simply disables returning the error.  I'm using the wifi connection right now with absolutely no noticeable problems to submit this bugzilla.   So this proves that the messages really should be warnings.

Possibly, though, this patch is not the correct way to solve the problem. It is probably reasonable for the function to return a status indicating that there is a *potential* problem. But once that status is returned, it is incorrect to abort and generate a stack trace rather than still try to connect.

What is the correct site to submit this problem upstream?

Comment 3 Bill C. Riemers 2011-10-27 07:18:24 UTC
In passing, I mentioned this problem to two people at the office yesterday. One of them said he has seen the same problem with some of the wifi connections he has tried to use. So based on that anecdotal evidence, I would say the problem is actually fairly common.

Comment 4 Bill C. Riemers 2011-10-27 07:37:05 UTC
Koji build in progress:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3465467

http://koji.fedoraproject.org/koji/taskinfo?taskID=3465466

When I installed the version of the kernel I built with mock, it gave me all sorts of warnings about missing firmware. Since I'm not using any of the hardware it was warning about, I can probably safely ignore those warnings. Still, I'm hoping the koji build does not have the same problem, so others can use the kernel build.

Comment 5 John W. Linville 2011-10-27 14:01:33 UTC
Short slot is an 802.11g feature, while CCK is an 802.11b modulation.  So the presence of both would seem to be a contradiction.  That said, I'm not sure what would cause that indication or what it would really mean.  Your experience suggests that the indication in the RXON command can be ignored.

Wey-yi, the current upstream code seems about the same as the code Bill is patching.  Obviously there are problems with that particular patch, but perhaps there is something we can learn here to apply upstream?

Comment 6 Bill C. Riemers 2011-10-27 15:09:32 UTC
Oops, it looks like the reason I had to comment out all the errors is that I was referencing the wrong block. The block of code where the failure occurs is:

	if (le16_to_cpu(rxon->assoc_id) > 2007) {
		IWL_WARN(priv, "aid > 2007\n");
		errors |= BIT(6);
	}

So the association ID is what causes the problem. Chances are that if I were to comment out just this one line of code, I would achieve equally positive results. I guess the thing to understand is why ignoring the association ID works, and whether there is a less restrictive test that would detect only the cases that really would cause an abort later in the code anyway.

Comment 7 wey-yi.w.guy 2011-10-27 19:42:15 UTC
(In reply to comment #5)
> Short slot is an 802.11g feature, while CCK is an 802.11b modulation.  So the
> presence of both would seem to be a contradiction.  That said, I'm not sure
> what would cause that indication or what it would really mean.  Your experience
> suggests that the indication in the RXON command can be ignored.
> Wey-yi, the current upstream code seems about the same as the code Bill is
> patching.  Obviously there are problems with that particular patch, but perhaps
> there is something we can learn here to apply upstream?

Yes, it is definitely an issue in the code. Sorry about that; I will make sure it is addressed.

Thanks
Wey

Comment 8 Bill C. Riemers 2011-10-27 22:15:10 UTC
This information from the kern.log file will probably help:

Oct 27 18:02:34 briemersw kernel: [10717.032031] wlan0: RX AssocResp from 5c:0e:8b:85:e7:20 (capab=0x401 status=0 aid=16383)


16383 > 2007, which is why the test is failing. I searched through the code to try to figure out how this value is actually used, but I couldn't find anything else that looked relevant.
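
As a rough sketch of why this value trips the check: 16383 is 0x3fff, i.e. all 14 low bits set. The code below assumes the usual 802.11 convention that the two top bits of the 16-bit AID field are masked off before the value is used and that usable AIDs run from 1 to 2007; that masking step is an assumption about how the logged value was produced, not something shown in this log. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_AID 2007u /* same bound as the check in the driver */

static_assert(16383 == 0x3fff, "the logged aid has all 14 low bits set");

/* Strip the two most significant bits of a 16-bit AID field (assumed step). */
static uint16_t sketch_aid_from_field(uint16_t field)
{
	return (uint16_t)(field & 0x3fffu);
}

/* Return 1 if the AID lies in the range 1..2007, else 0. */
static int sketch_aid_is_valid(uint16_t aid)
{
	return aid >= 1 && aid <= SKETCH_MAX_AID;
}
```

Here `sketch_aid_is_valid(16383)` returns 0, which corresponds to the "aid > 2007" message.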

Comment 9 Bill C. Riemers 2011-10-27 22:18:38 UTC
BTW, I've been discussing this with Johannes Berg via e-mail, since his e-mail address is listed in the code. Otherwise, I would not have known to look for the assoc_id in the kernel logs, or have realized that the bits are counted from 0, not 1, so 0x40 is bit 6, not bit 7.
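
The bit numbering can be checked with a small sketch. The macro and helper names are hypothetical; the only facts relied on are that bits are numbered from 0 and that the log reported "Invalid RXON (0x40)".

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIT(n) (1u << (n)) /* bits are numbered from 0 */

static_assert(SKETCH_BIT(6) == 0x40u, "0x40 is bit 6");
static_assert(SKETCH_BIT(7) == 0x80u, "bit 7 would be 0x80");

/* Return the index of the lowest set bit in mask, or -1 if mask is 0. */
static int sketch_lowest_set_bit(uint32_t mask)
{
	int n;

	for (n = 0; n < 32; n++) {
		if (mask & SKETCH_BIT(n))
			return n;
	}
	return -1;
}
```

Calling `sketch_lowest_set_bit(0x40)` yields 6, which points at the "aid > 2007" block rather than the "CCK and short slot" block.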

Comment 10 Bill C. Riemers 2011-10-28 00:30:53 UTC
I've been tracing through the ieee80211 code to try to determine how the connection can work with a bogus value. First off, I noticed what looks like an error in the test: the value it should compare against is 0x2007, not 2007. For example:

ieee80211_softmac.c: 

        assoc->aid = cpu_to_le16(ieee->assoc_id);
        if (ieee->assoc_id == 0x2007) ieee->assoc_id=0;
        else ieee->assoc_id++;


This of course only has an impact if softmac is used.   Is that always true for wireless?

Next, I see when the value is actually used it is not all the bits:

        hdr->aid = cpu_to_le16(ieee->assoc_id | 0xc000);
 
So in this case a value of 16383 has 1 added to it and becomes 16384 = 0x4000.  When this is assigned to the header it becomes:

0x4000 | 0xc000 => 0xc000

That is, equivalent to what would have been used with an assoc_id value of 0x2007.

So a better test might be:

if (
  (le16_to_cpu(rxon->assoc_id) != 0x2007)
  && (((le16_to_cpu(rxon->assoc_id)+1)&0x3fff) > 0x2007) )
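
The arithmetic in this comment can be checked with the sketch below. It only mirrors the expressions quoted here, using hypothetical function names; comment 11 points out that ieee80211_softmac.c is not the relevant code path, so this says nothing about what the driver actually does. The mask is parenthesized explicitly because `>` binds more tightly than `&` in C.

```c
#include <assert.h>
#include <stdint.h>

/* 0x2007 (hex) and 2007 (decimal) are different numbers. */
static_assert(0x2007 == 8199, "0x2007 is 8199 in decimal");
static_assert(16383 + 1 == 0x4000, "16383 + 1 is 0x4000");

/* Mirror the quoted header arithmetic: increment, then OR in 0xc000. */
static uint16_t sketch_hdr_aid_after_increment(uint16_t assoc_id)
{
	return (uint16_t)((assoc_id + 1u) | 0xc000u);
}

/* The test proposed in this comment; returns 1 if the value is rejected. */
static int sketch_proposed_test(uint16_t assoc_id)
{
	return assoc_id != 0x2007 &&
	       ((assoc_id + 1u) & 0x3fffu) > 0x2007;
}
```

Under these expressions, an assoc_id of 16383 gives a header value of 0xc000 and is not rejected by the proposed test.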

Comment 11 Johannes Berg 2011-10-28 10:04:51 UTC
I guess we should keep it on the bug ...

As I said to Bill in email, he was looking at the wrong code (ieee80211_softmac.c AP side code? where does that even exist?).

I sent a patch to mac80211 to make it not send down invalid AID values and disable powersave since there's no way PS can work with this bogus AID.

http://mid.gmane.org/1319795987.8931.7.camel@jlt3.sipsolutions.net

Comment 12 Bill C. Riemers 2011-10-28 12:28:35 UTC
I'm doing a build of Johannes' patch right now.

http://koji.fedoraproject.org/koji/taskinfo?taskID=3468180

I'll be checking out of my hotel in a few minutes, but maybe I can come back to the hotel lobby to test it at lunch time. If not, I have a colleague who has been experiencing a similar-sounding problem. If it turns out to be the same problem, he should be able to test it.

Comment 13 Bill C. Riemers 2011-10-28 18:10:48 UTC
It looks like I rebuilt the wrong source RPM.  I should have added a build number or such to it, so I could tell the difference...

Comment 14 Bill C. Riemers 2011-11-02 04:15:19 UTC
I just built the correct rpm and I'm posting with it now. I was not 100% positive my colleague was seeing the same problem, as my patch to comment out the errors had commented out all of them. So I changed the value 2007 to 1, so that my home network would produce the same error. It looks like the patch successfully resolves the issue.

Comment 15 Bill C. Riemers 2011-11-02 04:18:44 UTC
Created attachment 531239 [details]
Johannes Berg's patch to disable powersave and reset aid value

Johannes Berg's patch to disable powersave and reset the aid value to 0 when it is greater than the maximum allowed value.
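
The behavior described above can be sketched roughly as follows. This is not the attached patch: the struct, field, and function names are hypothetical, and the sketch only reflects what the description says, namely that an AID above the maximum is reset to 0 and powersave is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MAX_AID 2007u

/* Hypothetical per-association state, for illustration only. */
struct sketch_assoc_state {
	uint16_t aid;      /* AID value passed down to the driver */
	bool powersave_ok; /* whether powersave may be enabled */
};

/* Record the AID from an association response, sanitizing bogus values. */
static void sketch_handle_assoc_aid(struct sketch_assoc_state *st, uint16_t aid)
{
	assert(st != NULL);

	st->powersave_ok = true;
	if (aid > SKETCH_MAX_AID) {
		fprintf(stderr, "invalid AID value %u, turning off powersave\n",
			(unsigned int)aid);
		aid = 0;                  /* do not send the bogus value down */
		st->powersave_ok = false; /* PS cannot work with this AID */
	}
	st->aid = aid;
}
```

With an input of 16383, the state ends up with `aid == 0` and powersave disabled, so the driver's "aid > 2007" check is never reached with the bogus value.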

Comment 16 Johannes Berg 2011-11-02 10:40:25 UTC
Thanks Bill!

Comment 18 Josh Boyer 2012-06-07 15:06:03 UTC
This was fixed in 3.2.

