Bug 595847 - iwl3945: drops connection when under load
iwl3945: drops connection when under load
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.0
All Linux
low Severity medium
: rc
: ---
Assigned To: John W. Linville
desktop-bugs@redhat.com
:
Depends On: 589777
Blocks: 534148 601290
  Show dependency treegraph
 
Reported: 2010-05-25 14:46 EDT by John W. Linville
Modified: 2010-11-12 08:44 EST (History)
14 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 589777
Environment:
Last Closed: 2010-11-12 08:44:41 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description John W. Linville 2010-05-25 14:46:24 EDT
+++ This bug was initially created as a clone of Bug #589777 +++

This seems to be a known issue: under relatively heavy load (the easiest way to reproduce is to run few HD streaming flash videos from YouTube/Vimeo), the iwl3945 drops connection and does not reconnect. Cycling through rmmod/modprobe seems to help.

The quick search https://bugzilla.redhat.com/buglist.cgi?quicksearch=REPLY_RXON seems to produce few related/duplicate issues, but as they all are reported for the Fedora 12 or earlier, I am opening this ticket. I had the same issue with Fedora 12, on pre-release kernel for 13 issue is still present.

Kernel version: 2.6.33.3-72.fc13.x86_64

Adding the /var/log/messages for the iwl3945:


May  6 12:36:39 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:39 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:41 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:41 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:47 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:47 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:48 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:48 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).

[rmmod]

May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: PCI INT A disabled

[modprobe]

May  6 12:36:51 miracle kernel: iwl3945: Intel(R) PRO/Wireless 3945ABG/BG Network Connection driver for Linux, 2.6.33.3-72.fc13.x86_64-kds
May  6 12:36:51 miracle kernel: iwl3945: Copyright(c) 2003-2009 Intel Corporation
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: Tunable channels: 13 802.11bg, 23 802.11a channels
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: Detected Intel Wireless WiFi Link 3945ABG
...
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: firmware: requesting iwlwifi-3945-2.ucode
...
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: loaded firmware version 15.32.2.9
May  6 12:36:51 miracle kernel: ADDRCONF(NETDEV_UP): wlan0: link is not ready
...
May  6 12:36:54 miracle kernel: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

--- Additional comment from kyryl.bilokurov@gmail.com on 2010-05-06 17:06:54 EDT ---

There seems to be the related thread in linux-wireless

http://marc.info/?l=linux-wireless&m=127151315515173&w=2

--- Additional comment from linville@redhat.com on 2010-05-10 13:53:03 EDT ---

Please create a file named /etc/modprobe.d/iwl3945.conf with this one line:

    options iwl3945 disable_hw_scan=1

After that, either reboot or execute this sequence:

   modprobe -r iwl3945
   modprobe iwl3945

Does that produce any better results?

--- Additional comment from kyryl.bilokurov@gmail.com on 2010-05-13 06:16:05 EDT ---

This seems to enhance stability a bit, but the issue still happens:


May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).

[skipped repeating rows]

May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:19 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.

[skipped repeating rows]

May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: No space in command queue
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-28).



Also, the new log messages started appearing in syslog with disable_hw_scan=1 :

May 13 01:38:13 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 100 [0]
May 13 01:38:14 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 104 [0]

[skipped repeating rows]

May 13 01:48:17 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 136 [0]
May 13 01:48:18 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0]

--- Additional comment from kyryl.bilokurov@gmail.com on 2010-05-13 08:30:55 EDT ---

There is an improvement - after same hiccup, iwl3945 reported restart of adapter and then managed to reconnect. IIRC, before it was not able to reconnect. Log below:


May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).

[skipped repeating rows]

May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: No space in command queue
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-28).
May 13 14:24:00 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0]

[skipped some rows]

May 13 14:24:09 miracle dhclient[16668]: Listening on LPF/wlan0/00:13:02:b8:20:df
May 13 14:24:09 miracle dhclient[16668]: Sending on   LPF/wlan0/00:13:02:b8:20:df
May 13 14:24:09 miracle dhclient[16668]: Sending on   Socket/fallback
May 13 14:24:09 miracle dhclient[16668]: DHCPREQUEST on wlan0 to 255.255.255.255 port 67
May 13 14:24:09 miracle dhclient[16668]: DHCPACK from 10.0.0.138
May 13 14:24:09 miracle dhclient[16668]: bound to 10.0.0.2 -- renewal in 35749 seconds.

--- Additional comment from reinette.chatre@intel.com on 2010-05-19 14:35:24 EDT ---

Looks like the hardware is stuck here ... the kernel you are running (as you discovered) has some safety in it to detect a stuck queue when it is full. In later kernels we have a new mechanism to hopefully do this detection earlier.

The patch that accomplishes this is (taken from wireless-next-2.6):

commit b74e31a9bc1013e69b85b139072485dc153453dd
Author: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Date:   Mon Mar 1 17:23:50 2010 -0800

    iwlwifi: Recover TX flow stall due to stuck queue
    
    Monitors the internal TX queues periodically.  When a queue is stuck
    for some unknown conditions causing the throughput to drop and the
    transfer is stop, the driver will force firmware reload and bring the
    system back to normal operational state.
    
    The iwlwifi devices behave differently in this regard so this feature is
    made part of the ops infrastructure so we can have more control on how to
    monitor and recover from tx queue stall case per device.
    
    Signed-off-by: Trieu 'Andrew' Nguyen <trieux.t.nguyen@intel.com>
    Signed-off-by: Wey-Yi Guy <wey-yi.w.guy@intel.com>
    Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>


Note that this patch requires firmware restart to be supported ("iwl_force_reset(priv, IWL_FW_RESET);"), but I believe you already ported this to FC13. 

One small additional change needed on top of this patch is to actually enable this feature for iwl3945 since we have not yet seen the need for this on iwl3945, but that obviously changed.

I think it will be something like this (whitespace broken copy&pasted and will not apply to your repo ... just to get idea):

diff --git a/drivers/net/wireless/iwlwifi/iwl-3945.c b/drivers/net/wireless/iwlwifi/iwl-3945.c
index 7cdda5b..8e68026 100644
--- a/drivers/net/wireless/iwlwifi/iwl-3945.c
+++ b/drivers/net/wireless/iwlwifi/iwl-3945.c
@@ -2700,6 +2700,7 @@ static struct iwl_lib_ops iwl3945_lib = {
        .config_ap = iwl3945_config_ap,
        .manage_ibss_station = iwl3945_manage_ibss_station,
        .check_plcp_health = iwl3945_good_plcp_health,
+       .recover_from_tx_stall = iwl_bg_monitor_recover,
 
        .debugfs_ops = {
                .rx_stats_read = iwl3945_ucode_rx_stats_read,
@@ -2744,6 +2745,7 @@ static struct iwl_cfg iwl3945_bg_cfg = {
        .monitor_recover_period = IWL_MONITORING_PERIOD,
        .max_event_log_size = 512,
        .tx_power_by_driver = true,
+       .monitor_recover_period = IWL_MONITORING_PERIOD,
 };
 
 static struct iwl_cfg iwl3945_abg_cfg = {

--- Additional comment from linville@redhat.com on 2010-05-19 16:28:11 EDT ---

"iwlwifi: Recover TX flow stall due to stuck queue" is already in F-13.  Also, it looks like we already have the '.monitor_recover_period = IWL_MONITORING_PERIOD,' bits, but I've added the '.recover_from_tx_stall = iwl_bg_monitor_recover,' bit to the test kernels here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2197986

Kyryl, when the build finishes can you give those a try?

--- Additional comment from kyryl.bilokurov@gmail.com on 2010-05-21 17:01:04 EDT ---

Had this new kernel running for about two days, the following was logged into the /var/log/messages:


May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload

....

May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload


All the above is for the case when option "disable_hw_scan=1" is not passed to module (i.e. hw scan is enabled).

I didn't notice any wifi connection interruptions when this was logged, so things look very good!

Thanks.
Comment 1 RHEL Product and Program Management 2010-05-25 15:06:35 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 5 Aristeu Rozanski 2010-07-01 12:21:47 EDT
Patch(es) available on kernel-2.6.32-42.el6
Comment 8 Vladimir Benes 2010-11-12 04:59:42 EST
iwl3945 was heavily tested. Download of a large file and download of large amount of files was tested.. no problems found any more.
-> VERIFIED
Comment 9 releng-rhel@redhat.com 2010-11-12 08:44:41 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.