Bug 595845

Summary: iwlagn hangs the system randomly due to a rate scaling bug
Product: Red Hat Enterprise Linux 6 Reporter: John W. Linville <linville>
Component: kernel Assignee: John W. Linville <linville>
Status: CLOSED CURRENTRELEASE QA Contact: desktop-bugs <desktop-bugs>
Severity: urgent Docs Contact:
Priority: low    
Version: 6.0 CC: anton, cmeadors, dougsland, jlaska, jonathan, kernel-maint, reinette.chatre, vbenes, zcerza
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 588021 Environment:
Last Closed: 2010-11-15 14:26:49 UTC Type: ---
Bug Depends On: 588021    
Bug Blocks: 534148, 601290    

Description John W. Linville 2010-05-25 18:41:18 UTC
+++ This bug was initially created as a clone of Bug #588021 +++

Created an attachment (id=410771)
[PATCH] iwlagn: Change the TPT calculations sanity-check to WARN_ON

Description of problem:

My system hangs randomly on F-13. I had no clue what was causing it, but yesterday it happened while I was outside of X and I saw the iwlagn oops that triggers it.

Version-Release number of selected component (if applicable):
2.6.33.2-57.fc13.x86_64

How reproducible:

Random

Steps to Reproduce:
1. Connect to a wireless network (802.11n only?)
2. Wait
  
Actual results:

BUG_ON() gets triggered.

Expected results:

system should not die.

Additional info:

The oops: http://193.200.113.196/apache2-default/oops.jpg

I have attached a patch that at least makes it a WARN_ON() as per Johannes Berg's suggestion.
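
To illustrate the difference the patch makes, here is a minimal user-space sketch of a warn-only check. It is not the attached patch: `sketch_warn_on` and `check_tpt` are hypothetical names, and the only behaviour borrowed from the kernel is that a WARN_ON-style check reports the problem and hands the condition back to the caller instead of halting the machine.

```c
#include <stdio.h>

/* Hypothetical stand-in for a warn-only check: report the problem and
 * return the condition so the caller can decide how to recover. */
static int sketch_warn_on(int cond, const char *what)
{
    if (cond)
        fprintf(stderr, "WARNING: %s\n", what);
    return cond;
}

/* Hypothetical caller: on a mismatch, skip this round instead of dying.
 * Returns 0 when the values agree, -1 when the warning fired. */
static int check_tpt(int cached_tpt, int expected_tpt)
{
    if (sketch_warn_on(cached_tpt != expected_tpt, "average_tpt mismatch"))
        return -1;
    return 0;
}
```

With a fatal check the second case would stop the system; here the caller only sees an error return and a message on stderr.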

--- Additional comment from linville on 2010-05-03 13:27:25 EDT ---

Created an attachment (id=411067)
0001-iwlagn-Change-the-TPT-calculations-sanity-check-to-W.patch

--- Additional comment from reinette.chatre on 2010-05-03 13:42:22 EDT ---

Created an attachment (id=411078)
Recalculate tpt if not current

I can see a potential race condition here in the calculation of the average throughput so a BUG_ON seems extreme.

I looked at the history of this code and it seems as though the BUG_ON was added as a sidenote to a patch implementing something else. 

The patch adding this BUG_ON is:

commit 3110bef78cb4282c58245bc8fd6d95d9ccb19749
Author: Guy Cohen <guy.cohen>
Date:   Tue Sep 9 10:54:54 2008 +0800

    iwlwifi: Added support for 3 antennas


... and it thus seems as though this BUG_ON was added along the way while doing something else, especially considering that the comment describing the original code has not been removed yet. The current code still contains:

        /* Else we have enough samples; calculate estimate of
         * actual average throughput */

... which is obviously not done right now.

I looked at the original code and think we can revert the portion of this patch that added the BUG_ON. I assume the author considered a BUG_ON warranted because users had not encountered the error, but now we know that users do encounter it, so we should restore the original code.

Could you please try the attached patch instead? If this works then we can send it upstream.
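
The "recalculate if not current" idea can be sketched in user space as follows. This is an illustration under stated assumptions, not the attached patch: the struct, the function names, the success-ratio scale, and the rounding formula are all assumed here for the example and are not taken from this report.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-in for a per-rate statistics window. */
struct rate_window {
    int success_ratio; /* assumed scale: percent * 128, so 6400 = 50% */
    int average_tpt;   /* cached throughput estimate */
};

/* Assumed estimate: scale the expected throughput by the success
 * ratio, with +64 so the division by 128 rounds to nearest. */
static int estimate_tpt(int success_ratio, int expected_tpt)
{
    return (success_ratio * expected_tpt + 64) / 128;
}

/* If the cached average does not match a fresh estimate, log it and
 * recalculate instead of asserting. Returns 1 if the cache was stale. */
static int refresh_average_tpt(struct rate_window *w, int expected_tpt)
{
    int current = estimate_tpt(w->success_ratio, expected_tpt);

    if (w->average_tpt == current)
        return 0;

    fprintf(stderr, "average_tpt stale (%d != %d), recalculating\n",
            w->average_tpt, current);
    w->average_tpt = current;
    return 1;
}
```

The point of the design is that a stale cached value, for example one left behind by a race, is treated as recoverable state rather than as a reason to halt.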

--- Additional comment from linville on 2010-05-03 15:02:56 EDT ---

http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863

Give this a try?

--- Additional comment from adel.gadllah on 2010-05-03 17:13:16 EDT ---

(In reply to comment #3)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=2158863
> 
> Give this a try?    

I cannot say whether it fixes the problem or not yet, as it is hard to trigger, but from my quick testing it does not seem to introduce a regression.

The connection seems stable and throughput is good.

--- Additional comment from awilliam on 2010-05-06 13:14:30 EDT ---

This seems like a no-brainer blocker to me. Let's get the fix in for final. There are several threads on the forums where people report 'mysterious random hangs' which I suspect are this issue. I will ask them to try kernel -82 or later and report. thanks.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

--- Additional comment from awilliam on 2010-05-06 13:15:37 EDT ---

Forum threads:

http://forums.fedoraforum.org/showthread.php?t=241974
http://forums.fedoraforum.org/showthread.php?t=244481

--- Additional comment from asavva on 2010-05-13 02:17:05 EDT ---

I'm still seeing a lot of kernel oopses on my Fedora 13 machine with the latest kernel (i.e. turn on wireless and then wait less than 30 mins and I'm pretty much guaranteed to get a crash):

Linux loso 2.6.33.3-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

------------[ cut here ]------------
WARNING: at drivers/net/wireless/iwlwifi/iwl-scan.c:658 iwl_fill_probe_req+0x75/0x99 [iwlcore]()
Hardware name: VGN-SZ691N
Modules linked in: snd_seq_dummy vboxnetadp vboxnetflt vboxdrv aes_x86_64 aes_generic fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table nf_conntrack_ipv6 ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables ipv6 uinput nvidia(P) snd_hda_codec_idt snd_hda_intel arc4 snd_hda_codec ecb snd_hwdep uvcvideo snd_seq iwlagn snd_seq_device iwlcore sony_laptop videodev snd_pcm btusb v4l1_compat snd_timer v4l2_compat_ioctl32 bluetooth mac80211 iTCO_wdt tifm_7xx1 snd iTCO_vendor_support tifm_core i2c_i801 joydev cfg80211 soundcore snd_page_alloc rfkill sky2 microcode usb_storage firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core [last unloaded: vboxdrv]
Pid: 880, comm: iwlagn Tainted: P        W  2.6.33.3-85.fc13.x86_64 #1
Call Trace:
[<ffffffff8104b558>] warn_slowpath_common+0x77/0x8f
[<ffffffff8104b57f>] warn_slowpath_null+0xf/0x11
[<ffffffffa0239690>] iwl_fill_probe_req+0x75/0x99 [iwlcore]
[<ffffffffa023a721>] iwl_bg_request_scan+0x97a/0x1081 [iwlcore]
[<ffffffffa02227aa>] ? iwl_set_tx_power+0xe2/0x11d [iwlcore]
[<ffffffff81060d3d>] worker_thread+0x1a4/0x232
[<ffffffffa0239da7>] ? iwl_bg_request_scan+0x0/0x1081 [iwlcore]
[<ffffffff81064817>] ? autoremove_wake_function+0x0/0x34
[<ffffffff81060b99>] ? worker_thread+0x0/0x232
[<ffffffff810643c7>] kthread+0x7a/0x82
[<ffffffff8100a924>] kernel_thread_helper+0x4/0x10
[<ffffffff8106434d>] ? kthread+0x0/0x82
[<ffffffff8100a920>] ? kernel_thread_helper+0x0/0x10

I've noticed on some other threads that using a Cisco router seems to trigger the bug. I am using a Cisco router and haven't had the crash when I've been using other modems.

http://www.gossamer-threads.com/lists/linux/kernel/1221699

Is this meant to be fixed? Is there a test kernel that I can try?

--- Additional comment from linville on 2010-05-13 08:58:34 EDT ---

Andrew, that is a completely different issue -- please open a new bug.  Feel free to Cc me on it.  Thanks!

--- Additional comment from asavva on 2010-05-13 13:11:08 EDT ---

Created new report for my issue here: https://bugzilla.redhat.com/show_bug.cgi?id=592011

--- Additional comment from awilliam on 2010-05-18 20:07:19 EDT ---

let's close this one, it looks fixed.




Comment 1 RHEL Program Management 2010-05-25 18:46:38 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 James Laska 2010-05-25 19:01:54 UTC
Removing from F13Blocker list

Comment 3 Adam Williamson 2010-05-25 19:20:29 UTC
please be careful when cloning Fedora bugs to RHEL, a lot of info is carried over that really shouldn't be, like blocker status (see jlaska's adjustment) and CC list (I'm adjusting that).




Comment 4 John W. Linville 2010-05-25 19:40:04 UTC
Almost none of that stuff can be changed until after the bug is cloned, at least AFAICT.  Perhaps you could open a Bugzilla bug asking that stuff like that not be carried over to a clone?

Comment 8 Aristeu Rozanski 2010-07-01 16:21:50 UTC
Patch(es) available on kernel-2.6.32-42.el6

Comment 11 Vladimir Benes 2010-09-16 15:32:02 UTC
As no errors are seen, marking this SanityOnly.

http://patchwork.usersys.redhat.com/patch/26049/   [ OK ]

Comment 12 releng-rhel@redhat.com 2010-11-15 14:26:49 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.