Bug 589777
Summary: | iwl3945: drops connection when under load | |||
---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Kyryl Bilokurov <kyryl.bilokurov> | |
Component: | kernel | Assignee: | John W. Linville <linville> | |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | |
Severity: | medium | Docs Contact: | ||
Priority: | low | |||
Version: | 13 | CC: | anton, dougsland, gansalmon, ikke, itamar, jdenemar, jonathan, kernel-maint, kyryl.bilokurov, linville, reinette.chatre, thenzl, vbotka | |
Target Milestone: | --- | |||
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-2.6.33.4-107.fc13 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 595847 (view as bug list) | Environment: | ||
Last Closed: | 2010-08-07 05:55:54 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 595847 |
Description
Kyryl Bilokurov
2010-05-06 20:56:49 UTC
There seems to be the related thread in linux-wireless http://marc.info/?l=linux-wireless&m=127151315515173&w=2 Please create a file named /etc/modprobe.d/iwl3945.conf with this one line: options iwl3945 disable_hw_scan=1 After that, either reboot or execute this sequence: modprobe -r iwl3945 modprobe iwl3945 Does that produce any better results? This seems to enhance stability a bit, but the issue still happens: May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 01:30:10 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110). May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 01:30:11 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110). [skipped repeating rows] May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 01:30:18 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110). May 13 01:30:19 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms. [skipped repeating rows] May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 01:30:31 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-110). May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_TX_PWR_TABLE_CMD: time out after 500ms. May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: No space in command queue May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28 May 13 01:30:32 miracle kernel: iwl3945 0000:02:00.0: Error clearing ASSOC_MSK on current configuration (-28). Also, the new log messages started appearing in syslog with disable_hw_scan=1 : May 13 01:38:13 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 100 [0] May 13 01:38:14 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 104 [0] [skipped repeating rows] May 13 01:48:17 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 136 [0] May 13 01:48:18 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0] There is an improvement - after same hiccup, iwl3945 reported restart of adapter and then managed to reconnect. IIRC, before it was not able to reconnect. Log below: May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110). May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 14:23:45 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110). [skipped repeating rows] May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110). May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: time out after 500ms. May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-110). May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: No space in command queue May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Restarting adapter due to queue full May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28 May 13 14:23:58 miracle kernel: iwl3945 0000:02:00.0: Error setting new configuration (-28). May 13 14:24:00 miracle kernel: iwl3945 0000:02:00.0: Failed to get channel info for channel 140 [0] [skipped some rows] May 13 14:24:09 miracle dhclient[16668]: Listening on LPF/wlan0/00:13:02:b8:20:df May 13 14:24:09 miracle dhclient[16668]: Sending on LPF/wlan0/00:13:02:b8:20:df May 13 14:24:09 miracle dhclient[16668]: Sending on Socket/fallback May 13 14:24:09 miracle dhclient[16668]: DHCPREQUEST on wlan0 to 255.255.255.255 port 67 May 13 14:24:09 miracle dhclient[16668]: DHCPACK from 10.0.0.138 May 13 14:24:09 miracle dhclient[16668]: bound to 10.0.0.2 -- renewal in 35749 seconds. Looks like the hardware is stuck here ... the kernel you are running (as you discovered) has some safety in it to detect a stuck queue when it is full. In later kernels we have a new mechanism to hopefully do this detection earlier. The patch that accomplishes this is (taken from wireless-next-2.6): commit b74e31a9bc1013e69b85b139072485dc153453dd Author: Wey-Yi Guy <wey-yi.w.guy> Date: Mon Mar 1 17:23:50 2010 -0800 iwlwifi: Recover TX flow stall due to stuck queue Monitors the internal TX queues periodically. When a queue is stuck for some unknown conditions causing the throughput to drop and the transfer is stop, the driver will force firmware reload and bring the system back to normal operational state. The iwlwifi devices behave differently in this regard so this feature is made part of the ops infrastructure so we can have more control on how to monitor and recover from tx queue stall case per device. Signed-off-by: Trieu 'Andrew' Nguyen <trieux.t.nguyen> Signed-off-by: Wey-Yi Guy <wey-yi.w.guy> Signed-off-by: Reinette Chatre <reinette.chatre> Note that this patch requires firmware restart to be supported ("iwl_force_reset(priv, IWL_FW_RESET);"), but I believe you already ported this to FC13. One small additional change needed on top of this patch is to actually enable this feature for iwl3945 since we have not yet seen the need for this on iwl3945, but that obviously changed. I think it will be something like this (whitespace broken copy&pasted and will not apply to your repo ... just to get idea): diff --git a/drivers/net/wireless/iwlwifi/iwl-3945.c b/drivers/net/wireless/iwlwifi/iwl-3945.c index 7cdda5b..8e68026 100644 --- a/drivers/net/wireless/iwlwifi/iwl-3945.c +++ b/drivers/net/wireless/iwlwifi/iwl-3945.c @@ -2700,6 +2700,7 @@ static struct iwl_lib_ops iwl3945_lib = { .config_ap = iwl3945_config_ap, .manage_ibss_station = iwl3945_manage_ibss_station, .check_plcp_health = iwl3945_good_plcp_health, + .recover_from_tx_stall = iwl_bg_monitor_recover, .debugfs_ops = { .rx_stats_read = iwl3945_ucode_rx_stats_read, @@ -2744,6 +2745,7 @@ static struct iwl_cfg iwl3945_bg_cfg = { .monitor_recover_period = IWL_MONITORING_PERIOD, .max_event_log_size = 512, .tx_power_by_driver = true, + .monitor_recover_period = IWL_MONITORING_PERIOD, }; static struct iwl_cfg iwl3945_abg_cfg = { "iwlwifi: Recover TX flow stall due to stuck queue" is already in F-13. Also, it looks like we already have the '.monitor_recover_period = IWL_MONITORING_PERIOD,' bits, but I've added the '.recover_from_tx_stall = iwl_bg_monitor_recover,' bit to the test kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2197986 Kyryl, when the build finishes can you give those a try? Had this new kernel running for about two days, the following was logged into the /var/log/messages: May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload. May 20 15:19:06 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload .... May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: queue 2 stuck 3 time. Fw reload. May 21 22:42:16 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload. May 21 22:42:25 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload. May 21 22:43:10 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: queue 4 stuck 3 time. Fw reload. May 21 22:44:22 miracle kernel: iwl3945 0000:02:00.0: On demand firmware reload All the above is for the case when option "disable_hw_scan=1" is not passed to module (i.e. hw scan is enabled). I didn't notice any wifi connection interruptions when this was logged, so things look very good! Thanks. kernel-2.6.33.5-112.fc13 has been submitted as an update for Fedora 13. http://admin.fedoraproject.org/updates/kernel-2.6.33.5-112.fc13 kernel-2.6.33.5-112.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report. |