589777 – iwl3945: drops connection when under load

Bug 589777 - iwl3945: drops connection when under load

Summary: iwl3945: drops connection when under load

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	13
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	John W. Linville
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	595847
TreeView+	depends on / blocked

Reported:	2010-05-06 20:56 UTC by Kyryl Bilokurov
Modified:	2010-08-07 05:55 UTC (History)
CC List:	13 users (show)
Fixed In Version:	kernel-2.6.33.4-107.fc13
Clone Of:
Clones:	595847 (view as bug list)
Environment:
Last Closed:	2010-08-07 05:55:54 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Kyryl Bilokurov 2010-05-06 20:56:49 UTC

This seems to be a known issue: under relatively heavy load (the easiest way to reproduce is to run few HD streaming flash videos from YouTube/Vimeo), the iwl3945 drops connection and does not reconnect. Cycling through rmmod/modprobe seems to help.

The quick search https://bugzilla.redhat.com/buglist.cgi?quicksearch=REPLY_RXON seems to produce few related/duplicate issues, but as they all are reported for the Fedora 12 or earlier, I am opening this ticket. I had the same issue with Fedora 12, on pre-release kernel for 13 issue is still present.

Kernel version: 2.6.33.3-72.fc13.x86_64

Adding the /var/log/messages for the iwl3945:


May  6 12:36:39 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:39 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:40 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:41 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:41 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:42 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:43 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:47 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:47 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May  6 12:36:48 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:48 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).

[rmmod]

May  6 12:36:49 miracle kernel: iwl3945 0000:02:00.0: PCI INT A disabled

[modprobe]

May  6 12:36:51 miracle kernel: iwl3945: Intel(R) PRO/Wireless 3945ABG/BG Network Connection driver for Linux, 2.6.33.3-72.fc13.x86_64-kds
May  6 12:36:51 miracle kernel: iwl3945: Copyright(c) 2003-2009 Intel Corporation
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: Tunable channels: 13 802.11bg, 23 802.11a channels
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: Detected Intel Wireless WiFi Link 3945ABG
...
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: firmware: requesting iwlwifi-3945-2.ucode
...
May  6 12:36:51 miracle kernel: iwl3945 0000:02:00.0: loaded firmware version 15.32.2.9
May  6 12:36:51 miracle kernel: ADDRCONF(NETDEV_UP): wlan0: link is not ready
...
May  6 12:36:54 miracle kernel: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

Comment 1 Kyryl Bilokurov 2010-05-06 21:06:54 UTC

There seems to be the related thread in linux-wireless

http://marc.info/?l=linux-wireless&m=127151315515173&w=2

Comment 2 John W. Linville 2010-05-10 17:53:03 UTC

Please create a file named /etc/modprobe.d/iwl3945.conf with this one line:

    options iwl3945 disable_hw_scan=1

After that, either reboot or execute this sequence:

   modprobe -r iwl3945
   modprobe iwl3945

Does that produce any better results?

Comment 3 Kyryl Bilokurov 2010-05-13 10:16:05 UTC

This seems to enhance stability a bit, but the issue still happens:


May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).

[skipped repeating rows]

May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:19 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.

[skipped repeating rows]

May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110).
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms.
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: No space in command queue
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-28).



Also, the new log messages started appearing in syslog with disable_hw_scan=1 :

May 13 01:38:13 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 100 [0]
May 13 01:38:14 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 104 [0]

[skipped repeating rows]

May 13 01:48:17 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 136 [0]
May 13 01:48:18 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0]

Comment 4 Kyryl Bilokurov 2010-05-13 12:30:55 UTC

There is an improvement - after same hiccup, iwl3945 reported restart of adapter and then managed to reconnect. IIRC, before it was not able to reconnect. Log below:


May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).

[skipped repeating rows]

May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms.
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110).
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: No space in command queue
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-28).
May 13 14:24:00 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0]

[skipped some rows]

May 13 14:24:09 miracle dhclient[16668]: Listening on LPF/wlan0/00:13:02:b8:20:df
May 13 14:24:09 miracle dhclient[16668]: Sending on   LPF/wlan0/00:13:02:b8:20:df
May 13 14:24:09 miracle dhclient[16668]: Sending on   Socket/fallback
May 13 14:24:09 miracle dhclient[16668]: DHCPREQUEST on wlan0 to 255.255.255.255 port 67
May 13 14:24:09 miracle dhclient[16668]: DHCPACK from 10.0.0.138
May 13 14:24:09 miracle dhclient[16668]: bound to 10.0.0.2 -- renewal in 35749 seconds.

Comment 5 reinette chatre 2010-05-19 18:35:24 UTC

Looks like the hardware is stuck here ... the kernel you are running (as you discovered) has some safety in it to detect a stuck queue when it is full. In later kernels we have a new mechanism to hopefully do this detection earlier.

The patch that accomplishes this is (taken from wireless-next-2.6):

commit b74e31a9bc1013e69b85b139072485dc153453dd
Author: Wey-Yi Guy <wey-yi.w.guy>
Date:   Mon Mar 1 17:23:50 2010 -0800

    iwlwifi: Recover TX flow stall due to stuck queue
    
    Monitors the internal TX queues periodically.  When a queue is stuck
    for some unknown conditions causing the throughput to drop and the
    transfer is stop, the driver will force firmware reload and bring the
    system back to normal operational state.
    
    The iwlwifi devices behave differently in this regard so this feature is
    made part of the ops infrastructure so we can have more control on how to
    monitor and recover from tx queue stall case per device.
    
    Signed-off-by: Trieu 'Andrew' Nguyen <trieux.t.nguyen>
    Signed-off-by: Wey-Yi Guy <wey-yi.w.guy>
    Signed-off-by: Reinette Chatre <reinette.chatre>


Note that this patch requires firmware restart to be supported ("iwl_force_reset(priv, IWL_FW_RESET);"), but I believe you already ported this to FC13. 

One small additional change needed on top of this patch is to actually enable this feature for iwl3945 since we have not yet seen the need for this on iwl3945, but that obviously changed.

I think it will be something like this (whitespace broken copy&pasted and will not apply to your repo ... just to get idea):

diff --git a/drivers/net/wireless/iwlwifi/iwl-3945.c b/drivers/net/wireless/iwlwifi/iwl-3945.c
index 7cdda5b..8e68026 100644
--- a/drivers/net/wireless/iwlwifi/iwl-3945.c
+++ b/drivers/net/wireless/iwlwifi/iwl-3945.c
@@ -2700,6 +2700,7 @@ static struct iwl_lib_ops iwl3945_lib = {
        .config_ap = iwl3945_config_ap,
        .manage_ibss_station = iwl3945_manage_ibss_station,
        .check_plcp_health = iwl3945_good_plcp_health,
+       .recover_from_tx_stall = iwl_bg_monitor_recover,
 
        .debugfs_ops = {
                .rx_stats_read = iwl3945_ucode_rx_stats_read,
@@ -2744,6 +2745,7 @@ static struct iwl_cfg iwl3945_bg_cfg = {
        .monitor_recover_period = IWL_MONITORING_PERIOD,
        .max_event_log_size = 512,
        .tx_power_by_driver = true,
+       .monitor_recover_period = IWL_MONITORING_PERIOD,
 };
 
 static struct iwl_cfg iwl3945_abg_cfg = {

Comment 6 John W. Linville 2010-05-19 20:28:11 UTC

"iwlwifi: Recover TX flow stall due to stuck queue" is already in F-13.  Also, it looks like we already have the '.monitor_recover_period = IWL_MONITORING_PERIOD,' bits, but I've added the '.recover_from_tx_stall = iwl_bg_monitor_recover,' bit to the test kernels here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2197986

Kyryl, when the build finishes can you give those a try?

Comment 7 Kyryl Bilokurov 2010-05-21 21:01:04 UTC

Had this new kernel running for about two days, the following was logged into the /var/log/messages:


May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload

....

May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload
May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload.
May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload


All the above is for the case when option "disable_hw_scan=1" is not passed to module (i.e. hw scan is enabled).

I didn't notice any wifi connection interruptions when this was logged, so things look very good!

Thanks.

Comment 8 Fedora Update System 2010-05-27 21:25:01 UTC

kernel-2.6.33.5-112.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/kernel-2.6.33.5-112.fc13

Comment 9 Fedora Update System 2010-05-31 18:30:14 UTC

kernel-2.6.33.5-112.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.