Created attachment 410771
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON
Description of problem:
My system hangs randomly on F-13. I had no clue what was causing it, but yesterday it happened while I was outside of X, and I saw the iwlagn oops that triggers it.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Connect to a wireless network (802.11n only?)
Actual results:
BUG_ON() gets triggered.
Expected results:
The system should not die.
The oops: http://188.8.131.52/apache2-default/oops.jpg
I have attached a patch that at least turns it into a WARN_ON(), as suggested by Johannes Berg.
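What matters for this bug is the difference between the two macros: BUG_ON() treats its condition as fatal, while WARN_ON() logs the problem and lets the caller continue. The sketch below is a userspace approximation under stated assumptions. MOCK_BUG_ON(), MOCK_WARN_ON(), mock_warn_on() and check_cached_value() are hypothetical names, not kernel APIs, and the real kernel macros do considerably more (oops handling, backtraces, taint flags).

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-ins for the kernel's BUG_ON() and WARN_ON().
 *  - MOCK_BUG_ON() aborts the process, much as a BUG_ON() oops kills the
 *    kernel context that hit it (which can hang the whole machine).
 *  - MOCK_WARN_ON() only logs and returns the truth value of its condition,
 *    mirroring how the kernel's WARN_ON() evaluates to its condition.
 */
#define MOCK_BUG_ON(cond)                                          \
	do {                                                           \
		if (cond) {                                                \
			fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); \
			abort();                                               \
		}                                                          \
	} while (0)

static int mock_warn_on(int cond, const char *file, int line)
{
	if (cond)
		fprintf(stderr, "WARNING at %s:%d\n", file, line);
	return cond;
}

#define MOCK_WARN_ON(cond) mock_warn_on(!!(cond), __FILE__, __LINE__)

/*
 * Hypothetical sanity check: a cached value should match a freshly computed
 * one. With WARN_ON-style semantics a mismatch is reported, and the caller
 * gets an error code instead of the system going down.
 */
static int check_cached_value(int cached, int fresh)
{
	if (MOCK_WARN_ON(cached != fresh))
		return -1; /* inconsistency reported; execution continues */
	return 0;
}
```

Because a WARN_ON()-style check returns its condition, a caller can write `if (WARN_ON(cond)) return ...;` and recover, whereas a BUG_ON() on the same condition takes the system down.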
Created attachment 411067
Created attachment 411078
Recalculate tpt if not current
I can see a potential race condition in the calculation of the average throughput, so a BUG_ON seems extreme.
I looked at the history of this code, and it seems the BUG_ON was added as a side note to a patch implementing something else.
The patch adding this BUG_ON is:
Author: Guy Cohen <email@example.com>
Date: Tue Sep 9 10:54:54 2008 +0800
iwlwifi: Added support for 3 antennas
... so it seems this BUG_ON was added along the way while doing something else, especially since the comments describing the original code have not been removed yet. The current code still contains:
/* Else we have enough samples; calculate estimate of
* actual average throughput */
... which is obviously not being done right now.
I looked at the original code and think we can revert the portion of this patch that adds the BUG_ON. I assume the author considered a BUG_ON warranted on the assumption that users would never hit this condition. We now know that users do hit it, so we should restore the original code.
Could you please try the attached patch instead? If this works then we can send it upstream.
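The "Recalculate tpt if not current" idea can be sketched as follows. This is a simplified userspace model under stated assumptions: struct hyp_window, its fields, the fixed-point scale in which 128 means 100%, and the helpers hyp_estimate_tpt() and hyp_refresh_average_tpt() are all hypothetical and are not taken from the iwlagn source or the attached patch. It shows only the pattern of treating a stale cached throughput estimate as recoverable (log it, recompute it, continue) rather than halting with BUG_ON().

```c
#include <stdio.h>

/*
 * Hypothetical, simplified model of a rate-scaling history window.
 * Field names and units are illustrative assumptions, not driver definitions.
 *   success_ratio: fixed-point success ratio, where 128 means 100%
 *   average_tpt:   cached throughput estimate derived from success_ratio
 */
struct hyp_window {
	int success_ratio;
	int average_tpt;
};

/*
 * Estimate average throughput from the success ratio and the expected
 * throughput of the current rate. Adding 64 (half of the 128 scale)
 * before dividing rounds non-negative results to the nearest integer.
 */
static int hyp_estimate_tpt(int success_ratio, int expected_tpt)
{
	return (success_ratio * expected_tpt + 64) / 128;
}

/*
 * Instead of a BUG_ON() when the cached estimate disagrees with a fresh
 * one, log the inconsistency, store the fresh value, and carry on.
 * Returns 1 if the cached value was recalculated, 0 if it was current.
 */
static int hyp_refresh_average_tpt(struct hyp_window *w, int expected_tpt)
{
	int fresh = hyp_estimate_tpt(w->success_ratio, expected_tpt);

	if (w->average_tpt != fresh) {
		fprintf(stderr, "average_tpt stale (%d != %d), recalculating\n",
			w->average_tpt, fresh);
		w->average_tpt = fresh;
		return 1;
	}
	return 0;
}
```

With a 50% ratio (64) and an expected throughput of 200, the estimate is (64 * 200 + 64) / 128 = 100, so a window caching 100 is left alone. A window whose ratio has moved to 128 while it still caches 100 is logged and recomputed to 200, which is the kind of mismatch a race between updating and checking the window could produce.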
Give this a try?
(In reply to comment #3)
> Give this a try?
I cannot say whether it fixes the problem or not yet, as it is hard to trigger, but from my quick testing it does not seem to introduce a regression.
The connection seems stable and throughput is good.
This seems like a no-brainer blocker to me. Let's get the fix in for final. There are several threads on the forums where people report 'mysterious random hangs', which I suspect are this issue. I will ask them to try kernel -82 or later and report back. Thanks.
Fedora Bugzappers volunteer triage team
I'm still seeing a lot of kernel oopses on my Fedora 13 machine with the latest kernel (if I turn on wireless and wait less than 30 minutes, I'm pretty much guaranteed to get a crash):
Linux loso 184.108.40.206-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
------------[ cut here ]------------
WARNING: at drivers/net/wireless/iwlwifi/iwl-scan.c:658 iwl_fill_probe_req+0x75/0x99 [iwlcore]()
Hardware name: VGN-SZ691N
Modules linked in: snd_seq_dummy vboxnetadp vboxnetflt vboxdrv aes_x86_64 aes_generic fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table nf_conntrack_ipv6 ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables ipv6 uinput nvidia(P) snd_hda_codec_idt snd_hda_intel arc4 snd_hda_codec ecb snd_hwdep uvcvideo snd_seq iwlagn snd_seq_device iwlcore sony_laptop videodev snd_pcm btusb v4l1_compat snd_timer v4l2_compat_ioctl32 bluetooth mac80211 iTCO_wdt tifm_7xx1 snd iTCO_vendor_support tifm_core i2c_i801 joydev cfg80211 soundcore snd_page_alloc rfkill sky2 microcode usb_storage firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core [last unloaded: vboxdrv]
Pid: 880, comm: iwlagn Tainted: P W 220.127.116.11-85.fc13.x86_64 #1
[<ffffffffa0239690>] iwl_fill_probe_req+0x75/0x99 [iwlcore]
[<ffffffffa023a721>] iwl_bg_request_scan+0x97a/0x1081 [iwlcore]
[<ffffffffa02227aa>] ? iwl_set_tx_power+0xe2/0x11d [iwlcore]
[<ffffffffa0239da7>] ? iwl_bg_request_scan+0x0/0x1081 [iwlcore]
[<ffffffff81064817>] ? autoremove_wake_function+0x0/0x34
[<ffffffff81060b99>] ? worker_thread+0x0/0x232
[<ffffffff8106434d>] ? kthread+0x0/0x82
[<ffffffff8100a920>] ? kernel_thread_helper+0x0/0x10
I've noticed on some other threads that using a Cisco router seems to trigger the bug. I am using a Cisco router and haven't had the crash when using other modems.
Is this meant to be fixed? Is there a test kernel that I can try?
Andrew, that is a completely different issue -- please open a new bug. Feel free to Cc me on it. Thanks!
Created a new report for my issue here: https://bugzilla.redhat.com/show_bug.cgi?id=592011
Let's close this one; it looks fixed.