Bug 588021

Summary: iwlagn hangs the system randomly due to a rate scaling bug
Product: [Fedora] Fedora
Reporter: Adel Gadllah <adel.gadllah>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: low
Version: 13
CC: anton, asavva, awilliam, dougsland, gansalmon, itamar, jonathan, kernel-maint, mishu, reinette.chatre
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.33.3-82.fc13
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 595845 (view as bug list)
Environment:
Last Closed: 2010-05-19 00:07:19 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 507681, 595845    
Attachments:
Description | Flags
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON | none
0001-iwlagn-Change-the-TPT-calculations-sanity-check-to-W.patch | none
Recalculate tpt if not current | none

Description Adel Gadllah 2010-05-02 08:43:25 UTC
Created attachment 410771 [details]
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON

Description of problem:

My system hangs randomly on F-13. I had no clue what was causing it, but yesterday it happened while I was outside of X, and I saw the iwlagn oops that triggers it.

Version-Release number of selected component (if applicable):
2.6.33.2-57.fc13.x86_64

How reproducible:

Random

Steps to Reproduce:
1. Connect to a wireless network (802.11n only?)
2. Wait
  
Actual results:

BUG_ON() gets triggered.

Expected results:

The system should not die.

Additional info:

The oops: http://193.200.113.196/apache2-default/oops.jpg

I have attached a patch that at least makes it a WARN_ON() as per Johannes Berg's suggestion.
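
For context: in the kernel, BUG_ON() stops the offending context when its condition is true, while WARN_ON() logs a warning with a backtrace and lets execution continue. The following is a minimal user-space sketch of that difference only. `MY_BUG_ON`, `MY_WARN_ON`, `warn_count`, `check_tpt_strict`, and `check_tpt_lenient` are hypothetical names made up for illustration; this is not the driver's code and not the attached patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space analogue of BUG_ON(): abort if the condition
 * holds. (assert() is compiled out when NDEBUG is defined.) */
#define MY_BUG_ON(cond) assert(!(cond))

/* Number of warnings reported so far (illustration only). */
static int warn_count;

/* Hypothetical user-space analogue of WARN_ON(): report the failed
 * check and keep running. Returns 1 if the condition was true. */
static int my_warn_on(int cond, const char *expr, const char *file, int line)
{
    if (cond) {
        warn_count++;
        fprintf(stderr, "WARNING: at %s:%d: %s\n", file, line, expr);
    }
    return cond != 0;
}
#define MY_WARN_ON(cond) my_warn_on((cond) != 0, #cond, __FILE__, __LINE__)

/* Strict variant: a mismatch takes the whole program down. */
static void check_tpt_strict(int stored, int expected)
{
    MY_BUG_ON(stored != expected);
}

/* Lenient variant: a mismatch is logged and reported to the caller.
 * Returns 0 when the values agree, -1 otherwise. */
static int check_tpt_lenient(int stored, int expected)
{
    return MY_WARN_ON(stored != expected) ? -1 : 0;
}
```

With the lenient variant, a stale value produces a log line and an error return that the caller can handle, which is the behaviour the WARN_ON() change is aiming for.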

Comment 1 John W. Linville 2010-05-03 17:27:25 UTC
Created attachment 411067 [details]
0001-iwlagn-Change-the-TPT-calculations-sanity-check-to-W.patch

Comment 2 reinette chatre 2010-05-03 17:42:22 UTC
Created attachment 411078 [details]
Recalculate tpt if not current

I can see a potential race condition here in the calculation of the average throughput, so a BUG_ON seems extreme.

I looked at the history of this code, and it seems as though the BUG_ON was added as a side note to a patch implementing something else.

The patch adding this BUG_ON is:

commit 3110bef78cb4282c58245bc8fd6d95d9ccb19749
Author: Guy Cohen <guy.cohen>
Date:   Tue Sep 9 10:54:54 2008 +0800

    iwlwifi: Added support for 3 antennas


It thus seems that this BUG_ON was added along the way while doing something else, especially considering that the comment describing the original code has not been removed yet. The current code still contains:

        /* Else we have enough samples; calculate estimate of
         * actual average throughput */

... which is obviously not what the code does right now.

I looked at the original code and think we can revert the portion of that patch that added the BUG_ON. I assume the author considered a BUG_ON warranted because users had not encountered the error, but we now know that users do encounter it, so we should restore the original code.

Could you please try the attached patch instead? If this works then we can send it upstream.
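
To illustrate the idea of recalculating instead of crashing, here is a minimal user-space sketch. It assumes a simplified window structure and a fixed-point estimate with a scale factor of 128; `struct rate_window`, `estimate_tpt`, and `refresh_average_tpt` are hypothetical names, and the formula is an assumption for illustration rather than a quote from the driver or from the attachment, whose contents are not reproduced in this report.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a per-rate statistics window. */
struct rate_window {
    int success_ratio; /* success percentage scaled by 128 (assumption) */
    int average_tpt;   /* cached average throughput estimate */
};

/* Illustrative fixed-point estimate: scale the expected throughput by
 * the success ratio, rounding to nearest. */
static int estimate_tpt(int success_ratio, int expected_tpt)
{
    return (success_ratio * expected_tpt + 64) / 128;
}

/* If the cached estimate does not match a fresh calculation (for
 * example because another context updated success_ratio in between),
 * log it and recalculate instead of halting.
 * Returns 1 if the cached value was stale, 0 if it was current. */
static int refresh_average_tpt(struct rate_window *w, int expected_tpt)
{
    int fresh;

    assert(w != NULL); /* precondition of this sketch only */
    fresh = estimate_tpt(w->success_ratio, expected_tpt);
    if (w->average_tpt == fresh)
        return 0;

    fprintf(stderr, "average_tpt stale (%d != %d), recalculating\n",
            w->average_tpt, fresh);
    w->average_tpt = fresh;
    return 1;
}
```

In this sketch a stale cached value costs one log line and one recomputation, so a lost race leaves the estimate correct and the system running.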

Comment 3 John W. Linville 2010-05-03 19:02:56 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863

Give this a try?

Comment 4 Adel Gadllah 2010-05-03 21:13:16 UTC
(In reply to comment #3)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863
> 
> Give this a try?    

I cannot say whether it fixes the problem or not yet, as it is hard to trigger, but from my quick testing it does not seem to introduce a regression.

The connection seems stable and throughput is good.

Comment 5 Adam Williamson 2010-05-06 17:14:30 UTC
This seems like a no-brainer blocker to me. Let's get the fix in for final. There are several threads on the forums where people report 'mysterious random hangs', which I suspect are this issue. I will ask them to try kernel -82 or later and report back. Thanks.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 7 Andrew 2010-05-13 06:17:05 UTC
I'm still seeing a lot of kernel oopses on my Fedora 13 machine with the latest kernel (i.e. I turn on wireless, wait less than 30 minutes, and I'm pretty much guaranteed to get a crash):

Linux loso 2.6.33.3-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

------------[ cut here ]------------
WARNING: at drivers/net/wireless/iwlwifi/iwl-scan.c:658 iwl_fill_probe_req+0x75/0x99 [iwlcore]()
Hardware name: VGN-SZ691N
Modules linked in: snd_seq_dummy vboxnetadp vboxnetflt vboxdrv aes_x86_64 aes_generic fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table nf_conntrack_ipv6 ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables ipv6 uinput nvidia(P) snd_hda_codec_idt snd_hda_intel arc4 snd_hda_codec ecb snd_hwdep uvcvideo snd_seq iwlagn snd_seq_device iwlcore sony_laptop videodev snd_pcm btusb v4l1_compat snd_timer v4l2_compat_ioctl32 bluetooth mac80211 iTCO_wdt tifm_7xx1 snd iTCO_vendor_support tifm_core i2c_i801 joydev cfg80211 soundcore snd_page_alloc rfkill sky2 microcode usb_storage firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core [last unloaded: vboxdrv]
Pid: 880, comm: iwlagn Tainted: P        W  2.6.33.3-85.fc13.x86_64 #1
Call Trace:
[<ffffffff8104b558>] warn_slowpath_common+0x77/0x8f
[<ffffffff8104b57f>] warn_slowpath_null+0xf/0x11
[<ffffffffa0239690>] iwl_fill_probe_req+0x75/0x99 [iwlcore]
[<ffffffffa023a721>] iwl_bg_request_scan+0x97a/0x1081 [iwlcore]
[<ffffffffa02227aa>] ? iwl_set_tx_power+0xe2/0x11d [iwlcore]
[<ffffffff81060d3d>] worker_thread+0x1a4/0x232
[<ffffffffa0239da7>] ? iwl_bg_request_scan+0x0/0x1081 [iwlcore]
[<ffffffff81064817>] ? autoremove_wake_function+0x0/0x34
[<ffffffff81060b99>] ? worker_thread+0x0/0x232
[<ffffffff810643c7>] kthread+0x7a/0x82
[<ffffffff8100a924>] kernel_thread_helper+0x4/0x10
[<ffffffff8106434d>] ? kthread+0x0/0x82
[<ffffffff8100a920>] ? kernel_thread_helper+0x0/0x10

I've noticed on some other threads that using a Cisco router seems to trigger the bug. I am using a Cisco router and haven't had the crash when using other modems.

http://www.gossamer-threads.com/lists/linux/kernel/1221699

Is this meant to be fixed? Is there a test kernel that I can try?

Comment 8 John W. Linville 2010-05-13 12:58:34 UTC
Andrew, that is a completely different issue -- please open a new bug.  Feel free to Cc me on it.  Thanks!

Comment 9 Andrew 2010-05-13 17:11:08 UTC
Created new report for my issue here: https://bugzilla.redhat.com/show_bug.cgi?id=592011

Comment 10 Adam Williamson 2010-05-19 00:07:19 UTC
Let's close this one; it looks fixed.


