Bug 588021 - iwlagn hangs the system randomly due to a rate scaling bug
Summary: iwlagn hangs the system randomly due to a rate scaling bug
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: F13Blocker, F13FinalBlocker 595845
Reported: 2010-05-02 08:43 UTC by Adel Gadllah
Modified: 2010-05-25 18:41 UTC

Fixed In Version: kernel-
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 595845
Last Closed: 2010-05-19 00:07:19 UTC

Attachments
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON (1.16 KB, patch)
2010-05-02 08:43 UTC, Adel Gadllah
0001-iwlagn-Change-the-TPT-calculations-sanity-check-to-W.patch (1.27 KB, patch)
2010-05-03 17:27 UTC, John W. Linville
Recalculate tpt if not current (1.45 KB, patch)
2010-05-03 17:42 UTC, reinette chatre

Description Adel Gadllah 2010-05-02 08:43:25 UTC
Created attachment 410771 [details]
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON

Description of problem:

My system hangs randomly on F-13. I had no clue what was causing it, but yesterday it happened while I was outside of X, and I saw the iwlagn oops that triggers it.

Version-Release number of selected component (if applicable):

How reproducible:


Steps to Reproduce:
1. Connect to a wireless network (802.11n only?)
2. Wait
Actual results:

BUG_ON() gets triggered.

Expected results:

system should not die.

Additional info:

The oops

I have attached a patch that at least makes it a WARN_ON() as per Johannes Berg's suggestion.

Comment 1 John W. Linville 2010-05-03 17:27:25 UTC
Created attachment 411067 [details]
0001-iwlagn-Change-the-TPT-calculations-sanity-check-to-W.patch

Comment 2 reinette chatre 2010-05-03 17:42:22 UTC
Created attachment 411078 [details]
Recalculate tpt if not current

I can see a potential race condition here in the calculation of the average throughput, so a BUG_ON seems extreme.

I looked at the history of this code, and it seems as though the BUG_ON was added as a side note to a patch implementing something else.

The patch adding this BUG_ON is:

commit 3110bef78cb4282c58245bc8fd6d95d9ccb19749
Author: Guy Cohen <guy.cohen@intel.com>
Date:   Tue Sep 9 10:54:54 2008 +0800

    iwlwifi: Added support for 3 antennas

... and it thus seems as though this BUG_ON was added along the way while doing something else, especially considering that the comments describing the original code have not been removed yet. The current code still contains:

        /* Else we have enough samples; calculate estimate of
         * actual average throughput */

... which is obviously not done right now.

I looked at the original code and think we can revert the portion of this patch that added the BUG_ON. I assume the author considered a BUG_ON warranted because users had not encountered the error, but now we know that users do indeed encounter it, so we should restore the original code.

Could you please try the attached patch instead? If this works then we can send it upstream.
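
For readers following along, the difference between the two approaches (halting on a stale cached value versus warning and recalculating it) can be sketched in plain userspace C. This is only an illustration, not the attached patch: the struct, the function names, and the fixed-point scaling (128 meaning a success ratio of 1.0) are all hypothetical stand-ins chosen for the example.

```c
#include <stdio.h>

/* Hypothetical stand-in for the statistics kept for one rate. */
struct rate_window {
    int success_ratio; /* assumed fixed-point scale: 128 means 1.0 */
    int average_tpt;   /* cached average throughput estimate */
};

/* Illustrative estimate: expected throughput scaled by the success
 * ratio, with rounding. The exact formula is an assumption. */
static int estimate_tpt(const struct rate_window *w, int expected_tpt)
{
    return (w->success_ratio * expected_tpt + 64) / 128;
}

/* Return the current throughput estimate. If the cached value does not
 * match a fresh calculation (for example because another code path
 * updated success_ratio in between), print a warning and recalculate
 * instead of halting, which is what a BUG_ON-style check would do.
 * *was_stale is set to 1 if a recalculation was needed, else 0. */
static int current_tpt(struct rate_window *w, int expected_tpt,
                       int *was_stale)
{
    int fresh = estimate_tpt(w, expected_tpt);

    *was_stale = (w->average_tpt != fresh);
    if (*was_stale) {
        fprintf(stderr,
                "warning: cached tpt %d is stale, recalculating to %d\n",
                w->average_tpt, fresh);
        w->average_tpt = fresh;
    }
    return w->average_tpt;
}
```

With this shape, a mismatch costs one warning line and one extra multiplication, and the caller continues with a consistent value rather than taking the whole system down.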

Comment 3 John W. Linville 2010-05-03 19:02:56 UTC

http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863
Give this a try?

Comment 4 Adel Gadllah 2010-05-03 21:13:16 UTC
(In reply to comment #3)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863
> Give this a try?    

I cannot say whether it fixes the problem or not yet, as it is hard to trigger, but from my quick testing it does not seem to introduce a regression.

The connection seems stable and throughput is good.

Comment 5 Adam Williamson 2010-05-06 17:14:30 UTC
This seems like a no-brainer blocker to me. Let's get the fix in for final. There are several threads on the forums where people report 'mysterious random hangs' which I suspect are this issue. I will ask them to try kernel -82 or later and report. thanks.

Fedora Bugzappers volunteer triage team

Comment 7 Andrew 2010-05-13 06:17:05 UTC
I'm still seeing a lot of kernel oopses on my Fedora 13 machine with the latest kernel: if I turn on wireless and wait less than 30 minutes, I'm pretty much guaranteed to get a crash.

Linux loso #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

------------[ cut here ]------------
WARNING: at drivers/net/wireless/iwlwifi/iwl-scan.c:658 iwl_fill_probe_req+0x75/0x99 [iwlcore]()
Hardware name: VGN-SZ691N
Modules linked in: snd_seq_dummy vboxnetadp vboxnetflt vboxdrv aes_x86_64 aes_generic fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table nf_conntrack_ipv6 ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables ipv6 uinput nvidia(P) snd_hda_codec_idt snd_hda_intel arc4 snd_hda_codec ecb snd_hwdep uvcvideo snd_seq iwlagn snd_seq_device iwlcore sony_laptop videodev snd_pcm btusb v4l1_compat snd_timer v4l2_compat_ioctl32 bluetooth mac80211 iTCO_wdt tifm_7xx1 snd iTCO_vendor_support tifm_core i2c_i801 joydev cfg80211 soundcore snd_page_alloc rfkill sky2 microcode usb_storage firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core [last unloaded: vboxdrv]
Pid: 880, comm: iwlagn Tainted: P        W #1
Call Trace:
[<ffffffff8104b558>] warn_slowpath_common+0x77/0x8f
[<ffffffff8104b57f>] warn_slowpath_null+0xf/0x11
[<ffffffffa0239690>] iwl_fill_probe_req+0x75/0x99 [iwlcore]
[<ffffffffa023a721>] iwl_bg_request_scan+0x97a/0x1081 [iwlcore]
[<ffffffffa02227aa>] ? iwl_set_tx_power+0xe2/0x11d [iwlcore]
[<ffffffff81060d3d>] worker_thread+0x1a4/0x232
[<ffffffffa0239da7>] ? iwl_bg_request_scan+0x0/0x1081 [iwlcore]
[<ffffffff81064817>] ? autoremove_wake_function+0x0/0x34
[<ffffffff81060b99>] ? worker_thread+0x0/0x232
[<ffffffff810643c7>] kthread+0x7a/0x82
[<ffffffff8100a924>] kernel_thread_helper+0x4/0x10
[<ffffffff8106434d>] ? kthread+0x0/0x82
[<ffffffff8100a920>] ? kernel_thread_helper+0x0/0x10

I've noticed on some other threads that using a Cisco router seems to trigger the bug. I am using a Cisco router and haven't had the crash when using other modems.


Is this meant to be fixed? Is there a test kernel that I can try?

Comment 8 John W. Linville 2010-05-13 12:58:34 UTC
Andrew, that is a completely different issue -- please open a new bug.  Feel free to Cc me on it.  Thanks!

Comment 9 Andrew 2010-05-13 17:11:08 UTC
Created new report for my issue here: https://bugzilla.redhat.com/show_bug.cgi?id=592011

Comment 10 Adam Williamson 2010-05-19 00:07:19 UTC
let's close this one, it looks fixed.

Fedora Bugzappers volunteer triage team
