527824 – Wireless dies, followed by kernel DMA API warning when unloading the module

Summary: Wireless dies, followed by kernel DMA API warning when unloading the module

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	12
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	John W. Linville
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	581996

Reported:	2009-10-07 19:55 UTC by Jonathan Blandford
Modified:	2013-04-02 04:24 UTC
CC List:	11 users
Fixed In Version:	kernel-2.6.32.11-102.fc12
Clone Of:
Clones:	581996
Environment:
Last Closed:	2010-05-21 17:15:35 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
microcode debug output (29.25 KB, text/plain) 2010-02-23 23:46 UTC, Derek Atkins

Description Jonathan Blandford 2009-10-07 19:55:41 UTC

My wireless died mid-use.  I unloaded the module and then reloaded it to try to get it working again.  The wireless card is an Intel Corporation Wireless WiFi Link 5300.  Here's the output from dmesg:

iwlagn 0000:03:00.0: Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
wlan0: disassociating by local choice (reason=3)
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
iwlagn 0000:03:00.0: Aborted scan still in progress after 100ms
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
iwlagn 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
iwlagn 0000:03:00.0: Error setting new RXON (-110)
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
iwlagn 0000:03:00.0: PCI INT A disabled
------------[ cut here ]------------
WARNING: at lib/dma-debug.c:687 dma_debug_device_change+0x14b/0x192() (Not tainted)
Hardware name: 2777CTO
pci 0000:03:00.0: DMA-API: device driver has pending DMA allocations while released from device [count=42]
Modules linked in: tun fuse rfcomm sco bridge stp llc bnep l2cap sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_multipath sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt uinput arc4 ecb iwlagn(-) snd_hda_codec_conexant iwlcore uvcvideo snd_hda_intel videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec mac80211 snd_hwdep joydev snd_pcm i2c_i801 cfg80211 snd_timer thinkpad_acpi hwmon iTCO_wdt iTCO_vendor_support btusb bluetooth e1000e snd rfkill soundcore snd_page_alloc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: microcode]
Pid: 13472, comm: rmmod Not tainted 2.6.31.1-56.fc12.x86_64 #1
Call Trace:
 [<ffffffff8106422c>] warn_slowpath_common+0x95/0xc3
 [<ffffffff810642e7>] warn_slowpath_fmt+0x50/0x66
 [<ffffffff8128e771>] ? dma_debug_device_change+0xd6/0x192
 [<ffffffff8128e7e6>] dma_debug_device_change+0x14b/0x192
 [<ffffffff81086db1>] ? __blocking_notifier_call_chain+0x4c/0x8e
 [<ffffffff81509a35>] notifier_call_chain+0x72/0xba
 [<ffffffff81086db1>] ? __blocking_notifier_call_chain+0x4c/0x8e
 [<ffffffff81086dc8>] __blocking_notifier_call_chain+0x63/0x8e
 [<ffffffff81086e1a>] blocking_notifier_call_chain+0x27/0x3d
 [<ffffffff81351bd7>] __device_release_driver+0xc3/0xde
 [<ffffffff81351ca2>] driver_detach+0xb0/0xe6
 [<ffffffff813509d6>] bus_remove_driver+0xb8/0x10d
 [<ffffffff81352524>] driver_unregister+0x7b/0x9a
 [<ffffffff81298032>] pci_unregister_driver+0x57/0xb7
 [<ffffffffa025b1a8>] iwl_exit+0x28/0x43 [iwlagn]
 [<ffffffff810a2667>] sys_delete_module+0x1e3/0x279
 [<ffffffff81505e96>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff81011f42>] system_call_fastpath+0x16/0x1b
---[ end trace 0e80a0cf85f72dbf ]---
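
The key line in the trace above is the DMA-API warning, which reports how many DMA allocations were still pending when the driver was released. A minimal Python sketch for pulling the device and count out of such a line follows. The regular expression and the function name are assumptions based on this one message, not a kernel-defined format.

```python
import re

# Line copied from the dmesg output above.
LINE = ("pci 0000:03:00.0: DMA-API: device driver has pending DMA "
        "allocations while released from device [count=42]")

# Assumed layout: "<bus> <device>: DMA-API: <text> [count=N]".
PATTERN = re.compile(
    r"^(?P<bus>\w+) (?P<dev>[0-9a-f:.]+): DMA-API: .*\[count=(?P<count>\d+)\]"
)


def pending_dma_allocations(line):
    """Return (device, count) for a DMA-API leak warning, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("dev"), int(match.group("count"))


print(pending_dma_allocations(LINE))
```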

Comment 1 Dan Williams 2009-10-07 20:03:02 UTC

Reinette thinks this is a dupe of:

http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037

Comment 2 Bug Zapper 2009-11-16 13:23:18 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Derek Atkins 2010-02-23 22:35:53 UTC

There appear to be a bunch of issues with the Intel 5300 on F12.  I'm having tons of problems staying online.   I can unload and reload the module and sometimes it'll stay up for a little while and other times it'll die almost immediately.  I get a bunch of different errors along the way, such as the one listed above, and sometimes:

iwlagn 0000:03:00.0: No space for Tx
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_CMD: enqueue_hcmd failed: -28
iwlagn 0000:03:00.0: No space for Tx
iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
iwlagn 0000:03:00.0: Error setting new RXON (-28)
iwlagn 0000:03:00.0: No space for Tx
iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
iwlagn 0000:03:00.0: Error setting new RXON (-28)
iwlagn 0000:03:00.0: No space for Tx
iwlagn 0000:03:00.0: Error sending REPLY_TX_POWER_DBM_CMD: enqueue_hcmd failed: -28


I tried building the compat-wireless drivers (2010-02-23) but then I lost access to my a/n access point so I backed down to the drivers in kernel 2.6.31.12-174.2.22.fc12.x86_64

Comment 4 Derek Atkins 2010-02-23 22:43:12 UTC

Also seeing:

iwlagn 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
iwlagn 0000:03:00.0: Error setting new RXON (-110)
iwlagn 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
iwlagn 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
iwlagn 0000:03:00.0: Error setting new RXON (-110)
...

Which error I see seems to vary from run to run.  But it all winds up in the same condition -- I have to unload/reload the driver and then hope it will stay up long enough to do something before it dies again.
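
The numeric codes in these messages are negative Linux errno values: -110 is ETIMEDOUT (consistent with "time out after 500ms") and -28 is ENOSPC (consistent with "No space for Tx"). A minimal Python sketch for decoding them from log lines follows. The table is hard-coded with the Linux values so that it does not depend on the host OS, and the pattern and function name are assumptions based on the lines quoted in this bug.

```python
import re

# Linux errno numbers (hard-coded; other operating systems may number them differently).
LINUX_ERRNO = {28: "ENOSPC", 110: "ETIMEDOUT"}

# Matches a trailing "(-110)" or a trailing "failed: -28".
CODE = re.compile(r"\((-\d+)\)\s*$|failed: (-\d+)\s*$")


def decode_errno(line):
    """Return the errno name for a trailing error code, or None if the line has none."""
    match = CODE.search(line)
    if match is None:
        return None
    code = int(match.group(1) or match.group(2))
    return LINUX_ERRNO.get(-code, "unknown")


for sample in (
    "iwlagn 0000:03:00.0: Error setting new RXON (-110)",
    "iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28",
    "iwlagn 0000:03:00.0: No space for Tx",
):
    print(decode_errno(sample), "<-", sample)
```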

Comment 5 Derek Atkins 2010-02-23 22:47:09 UTC

And then, following that error, I got:

iwlagn 0000:03:00.0: Microcode SW error detected.  Restarting 0x2000000.
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX

and the network is unresponsive.  I'm reloading with modprobe iwlagn debug50=0x40000 to see if I can get more data.

Comment 6 Derek Atkins 2010-02-23 22:58:48 UTC

... but of course this hasn't given me any data.  :-(  It's crashed twice so far and I've seen no firmware error messages.

Comment 7 Derek Atkins 2010-02-23 23:46:20 UTC

Created attachment 395852
microcode debug output

Aha, I finally got a firmware crash.  Here's the dump I got in my dmesg log.  Hope this helps debug this issue; I'm tired of restarting my network every 5-10 minutes!

Comment 8 John W. Linville 2010-02-24 14:58:43 UTC

This has been plaguing us to varying degrees for several kernel releases (both in Fedora and upstream)... :-(

Comment 9 Derek Atkins 2010-02-24 15:09:01 UTC

Well, I've certainly got a good test-case scenario at home... It's kinda keeping me from getting work done, to the point where I'm considering buying a USB 802.11 token or running a very long Ethernet cable.  My home AP is a WRT610N running dd-wrt and I'm connecting via 802.11(a/n) using WPA2/TKIP.  I'm certainly willing to help debug this any way I can, including running test kernels or running in debug mode.  I seem to have found ways to tickle the bug pretty consistently, such that I can cause it to happen within about 15-30 minutes (at most -- sometimes even as short as 5min!)

So, John, is there anything I can do to help?

Comment 10 John W. Linville 2010-02-24 15:31:40 UTC

I hope so, but I'll rely on Reinette to advise.  The logs in comment 7 seem like they might be helpful.

Any reason you are using TKIP?  CCMP is generally better (i.e. more secure).  I wonder if using CCMP has any effect on reproducibility?  FWIW I see this irregularly on a WEP network here.

Comment 11 reinette chatre 2010-02-24 17:13:11 UTC

(In reply to comment #10)
> I hope so, but I'll rely on Reinette to advise.  The logs in comment 7 seem
> like they might be helpful.

The logs still point to a bug that is haunting us also, http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037. The logs in comment 7 mention that the ucode error follows some other errors, but that log does not contain those other errors. Once there is a problem it does not really help much to trace ucode errors that occur when we are already in problem state.

Comment 12 reinette chatre 2010-03-19 22:33:13 UTC

(In reply to comment #11)
> The logs still point to a bug that is haunting us also,
> http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037. 

This bug report has been updated with some patches that address this issue. See http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 

Also see your own bug report https://bugzilla.redhat.com/show_bug.cgi?id=573029 which may be a duplicate of this one. I did update that bug report with some details similar to the update to the intellinuxwireless.org bug. See https://bugzilla.redhat.com/show_bug.cgi?id=573029#c16

Comment 13 John W. Linville 2010-03-22 21:11:50 UTC

Please try the test kernels here (when the build completes):

http://koji.fedoraproject.org/koji/taskinfo?taskID=2068739

These contain backports of the patches Reinette identified.  Do these improve the situation?

Comment 14 reinette chatre 2010-03-23 20:17:34 UTC

(In reply to comment #13)
> Please try the test kernels here (when the build completes):
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=2068739
> 
> These contain backports of the patches Reinette identified.  Do these improve
> the situation?    

In addition to this Zhu Yi just created a patch to address the DMA warnings. This has not been pushed upstream yet, but if you are interested you can try out http://git.kernel.org/?p=linux/kernel/git/iwlwifi/iwlwifi-2.6.git;a=commit;h=7c9e64c19c02ab9f9450cceb2c2372143d3fa38e

Comment 15 John W. Linville 2010-03-31 16:37:51 UTC

Any word on the kernels from comment 13?  I don't know how much longer Koji will keep them available...

Comment 16 Nitin Kumar Bansal 2010-04-09 16:49:28 UTC

These messages started to appear in messages when laptop lost wireless:

Apr  9 21:15:34 localhost kernel: iwlagn 0000:0c:00.0: Error sending REPLY_RXON: time out after 500ms.
Apr  9 21:15:34 localhost kernel: iwlagn 0000:0c:00.0: Error setting new RXON (-110)


Then these:


Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: Error setting new RXON (-28)
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: No space for Tx
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: Error setting new RXON (-28)
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: No space for Tx
Apr  9 21:20:30 localhost kernel: iwlagn 0000:0c:00.0: Error sending REPLY_TX_POWER_DBM_CMD: enqueue_hcmd failed: -28
Apr  9 21:22:30 localhost kernel: iwlagn 0000:0c:00.0: No space for Tx


And after that:

Apr  9 21:44:26 localhost kernel: iwlagn 0000:0c:00.0: Error setting new RXON (-28)
Apr  9 21:44:26 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
Apr  9 21:44:26 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
Apr  9 21:44:26 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF

Do we still have a test kernel? Please let me know if full logs and hardware details are needed.

Comment 17 John W. Linville 2010-04-09 17:14:07 UTC

Nitin, what kernel are you using?  Did you try the ones from comment 13?

Comment 18 Nitin Kumar Bansal 2010-04-11 07:24:50 UTC

John, I am using:

kernel-firmware-2.6.32.10-90.fc12.noarch
kernel-2.6.32.10-90.fc12.x86_64

Looks like koji does not have test kernels any more.

Comment 19 John W. Linville 2010-04-12 14:20:55 UTC

Updated test kernels building now:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2110994

Comment 20 Nitin Kumar Bansal 2010-04-12 16:58:39 UTC

John, koji is displaying an error "BuildError: error building package (arch noarch), mock exited with status 1; see build.log for more information"

Comment 21 John W. Linville 2010-04-12 17:57:15 UTC

Ugh -- Koji hiccup...

I think this one will make it:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2111335

Comment 22 John W. Linville 2010-04-12 20:23:55 UTC

Build completed -- please test! :-)

Comment 23 Nitin Kumar Bansal 2010-04-13 14:20:18 UTC

John, so far it's good. I was running on wireless the whole day and it did not disconnect (though after locking (ctrl+alt+L), when I tried to resume, system response was very sluggish; I think it was due to compiz, since after disabling compiz I can't reproduce that behavior). Let me know if you need some information from this system.

Comment 24 Fedora Update System 2010-04-13 18:47:19 UTC

kernel-2.6.32.11-102.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.11-102.fc12

Comment 25 Fedora Update System 2010-04-15 03:15:24 UTC

kernel-2.6.32.11-102.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/kernel-2.6.32.11-102.fc12

Comment 26 John W. Linville 2010-04-19 18:07:01 UTC

Adel, can you detail the problems you saw w/ kernel-2.6.32.11-104.fc12 (almost certainly would be the same w/kernel-2.6.32.11-102.fc12) ?

FWIW, I've been using -102.fc12 on several boxes w/ no apparent problems over the past few days.

Comment 27 Adel Gadllah 2010-04-19 18:28:28 UTC

(In reply to comment #26)
> Adel, can you detail the problems you saw w/ kernel-2.6.32.11-104.fc12 (almost
> certainly would be the same w/kernel-2.6.32.11-102.fc12) ?
> 
> FWIW, I've been using -102.fc12 on several boxes w/ no apparent problems over
> the past few days.    

OK; my setup is:

HP 6930p; iwlagn 5300; F-12 x86_64

When using -99 there are no issues at all (i.e. everything is fine).

With -104 I get two problems:

1) After ~1-2 hours of 802.11n usage the driver restarts the firmware 3-4 times; after that the connection is either _very_ slow or comes to a complete halt (no data being sent). After disconnecting I can no longer connect to the AP; the only way to get back a working connection is to reload the module (or reboot).

2) When resuming from suspend the card simply can't scan (no errors in dmesg though); scanning just returns that the device is busy. I tried disable_hw_scan=1, which seemed to fix it at first, but after a second suspend / resume cycle it happened again. After that I played with it for a while, and it seems that it is not 100% but something like 95% reproducible.

Again, neither 1) nor 2) happens with -99 (the former seems to be caused by the firmware restart patches, while I have no idea what could have caused the latter).

If you need any more information feel free to ask. (I don't have the exact dmesg output at hand right now).

Comment 28 John W. Linville 2010-04-19 19:41:56 UTC

http://koji.fedoraproject.org/koji/taskinfo?taskID=2126607

This kernel contains "mac80211: fix deferred hardware scan requests", which I suspect might address some part of the scan-related issues you are experiencing.  Could you give those a try (once the build completes)?

Comment 29 Adel Gadllah 2010-04-19 21:36:19 UTC

(In reply to comment #28)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=2126607
> 
> This kernel contains "mac80211: fix deferred hardware scan requests", which I
> suspect might address some part of the scan-related issues you are
> experiencing.  Could you give those a try (once the build completes)?    

I just downloaded the x86_64 build and tested with it.

I could not reproduce the scan issue even after 6 suspend / resume cycles.

(wpa_supplicant segfaulted once but it seems unrelated).

Comment 30 Nitin Kumar Bansal 2010-04-20 04:27:42 UTC

I tried kernel-2.6.32.11-102.fc12, and though less frequent I am still seeing:

Apr 19 23:03:40 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
Apr 19 23:03:40 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
Apr 19 23:03:40 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
Apr 19 23:03:45 localhost kernel: iwlagn 0000:0c:00.0: Could not load the INST uCode section
Apr 19 23:03:45 localhost kernel: iwlagn 0000:0c:00.0: Unable to set up bootstrap uCode: -110
Apr 19 23:03:45 localhost kernel: iwlagn 0000:0c:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF

and wireless disconnects afterwards; the only solution is to reboot the machine.

Comment 31 Fedora Update System 2010-04-20 17:29:52 UTC

kernel-2.6.32.11-105.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.11-105.fc12

Comment 32 John W. Linville 2010-04-20 17:36:39 UTC

Nitin, I'm not sure that the "MAC is in deep sleep!" issue is the same as the "Error sending REPLY_RXON" issue.  You may want to submit a separate bug for that one.

Comment 33 reinette chatre 2010-04-20 17:56:41 UTC

(In reply to comment #32)
> Nitin, I'm not sure that the "MAC is in deep sleep!" issue is the same as the
> "Error sending REPLY_RXON" issue.  You may want to submit a separate bug for
> that one.    

Yes ... different problem ... when the driver starts getting all 1s when reading from the device, it means that the device has disconnected itself from the PCI bus. We have one patch floating around addressing the issue, please see http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c112 ... you could maybe respond with your test results of that patch when you submit a new bug report.
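
The all-1s symptom can be spotted mechanically in a log. A minimal Python sketch, assuming the "CSR_GP_CNTRL = 0x..." message format seen in comment 30, follows. The function name is hypothetical and this is log triage only, not driver code.

```python
import re

# A 32-bit register read that returns every bit set.
ALL_ONES_32 = 0xFFFFFFFF

# Assumed message layout, taken from the lines in comment 30.
REGISTER = re.compile(r"CSR_GP_CNTRL = (0x[0-9A-Fa-f]+)")


def register_reads_all_ones(line):
    """Return True/False for a CSR_GP_CNTRL line, or None if the line has no such value."""
    match = REGISTER.search(line)
    if match is None:
        return None
    return int(match.group(1), 16) == ALL_ONES_32


sample = ("Apr 19 23:03:40 localhost kernel: iwlagn 0000:0c:00.0: "
          "MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF")
print(register_reads_all_ones(sample))
```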

Comment 34 Adel Gadllah 2010-04-20 22:03:40 UTC

(In reply to comment #27)
> (In reply to comment #26)
> > Adel, can you detail the problems you saw w/ kernel-2.6.32.11-104.fc12 (almost
> > certainly would be the same w/kernel-2.6.32.11-102.fc12) ?
> > 
> > FWIW, I've been using -102.fc12 on several boxes w/ no apparent problems over
> > the past few days.    
> 
> OK; my setup is:
> 
> HP 6930p; iwlagn 5300; F-12 x86_64
> 
> When using -99 there are no issues at all (i.e everything is fine).
> 
> With -104 I get two problems:
> 
> 1) After ~1-2 hours of 80211n usage the driver restarts the firmware 3-4 times;
> after that the connection is either _very_ slow or comes to a complete halt (no
> data being sent). Disconnecting means I can no longer connect to the AP, the
> only way to get back a working connection is to reload the module (or reboot).

Here is the log output when this happens:

------------
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
--------------

Comment 35 John W. Linville 2010-04-21 13:32:26 UTC

No timestamps... how often are those firmware restarts?  After the restarts, do you recover connectivity?  Or is it lost forever?

The whole point of the patchset introduced here is to allow for restarting the firmware rather than just dying on the "Error sending REPLY_RXON" or similar errors.

Comment 36 Adel Gadllah 2010-04-21 14:25:04 UTC

(In reply to comment #35)
> Not timestamps...how often are those firmware restarts? 

Here are some timestamps:

-------------
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: low ack count detected, restart firmware
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:18:38 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:18:38 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:18:38 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:18:38 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:19:14 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:19:22 localhost kernel: iwlagn 0000:02:00.0: low ack count detected, restart firmware
Apr 20 21:19:22 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:19:22 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:19:22 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:19:22 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:19:22 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:19:22 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:19:22 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:19:35 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:19:46 localhost kernel: iwlagn 0000:02:00.0: low ack count detected, restart firmware
Apr 20 21:19:46 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:19:46 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:19:46 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:19:46 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:19:46 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:19:46 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:19:46 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:20:43 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:20:55 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:21:03 localhost kernel: iwlagn 0000:02:00.0: low ack count detected, restart firmware
Apr 20 21:21:03 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:21:03 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:21:03 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:21:03 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:21:03 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:21:03 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:21:03 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:21:20 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:21:42 localhost kernel: iwlagn 0000:02:00.0: queue 10 stuck 3 time. Fw reload.
Apr 20 21:21:42 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:21:43 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:21:43 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:21:43 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:21:43 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:21:43 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:21:43 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:21:50 localhost kernel: iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
Apr 20 21:22:15 localhost kernel: iwlagn 0000:02:00.0: queue 10 stuck 3 time. Fw reload.
Apr 20 21:22:15 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:22:15 localhost kernel: Registered led device: iwl-phy0::radio
Apr 20 21:22:15 localhost kernel: Registered led device: iwl-phy0::assoc
Apr 20 21:22:15 localhost kernel: Registered led device: iwl-phy0::RX
Apr 20 21:22:15 localhost kernel: Registered led device: iwl-phy0::TX
Apr 20 21:22:15 localhost kernel: iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
Apr 20 21:22:15 localhost kernel: iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
Apr 20 21:22:21 localhost kernel: iwlagn 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
Apr 20 21:22:21 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
----------------

> After the restarts, do you recover connectivity?  Or is it lost forever?

It becomes very slow and then comes to a complete halt (transfers no data). When I try to reconnect to the AP, the driver is pretty much in a dead state (I have to reload the module to get it working again).

> The whole point of the patchset introduced here is to allow for restarting the
> firmware rather than just dying on the "Error sending REPLY_RXON" or similar
> errors.    

It seems to be a bit too aggressive when deciding whether a restart is needed or not (and it seems restarts aren't "free", so they should only be done when really needed).
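
To put a number on "how often", the gaps between the "On demand firmware reload" lines in the log above can be computed directly. A minimal Python sketch follows, using only the seven reload lines from this comment. The year is an assumption (syslog timestamps omit it) and the function name is hypothetical.

```python
from datetime import datetime

# The "On demand firmware reload" lines from the log in this comment.
LOG = """\
Apr 20 21:18:38 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:19:22 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:19:46 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:21:03 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:21:42 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:22:15 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
Apr 20 21:22:21 localhost kernel: iwlagn 0000:02:00.0: On demand firmware reload
"""


def reload_intervals(log, year=2010):
    """Return the seconds between consecutive firmware reloads in a syslog excerpt."""
    times = []
    for line in log.splitlines():
        if "On demand firmware reload" not in line:
            continue
        stamp = line[:15]  # e.g. "Apr 20 21:18:38"
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(reload_intervals(LOG))
```

For this excerpt the gaps come out as 44, 24, 77, 39, 33 and 6 seconds, i.e. seven reloads in under four minutes.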

Comment 37 John W. Linville 2010-04-21 14:29:18 UTC

Ugh... :-(

Anyone want to review my backports of those patches to see if I screwed-up something?

Comment 38 Fedora Update System 2010-04-21 22:02:17 UTC

kernel-2.6.32.11-105.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/kernel-2.6.32.11-105.fc12

Comment 39 reinette chatre 2010-04-21 23:29:09 UTC

(In reply to comment #36)
 
> It seems to be a bit too aggressive when deciding whether a restart is needed
> or not (and it seems restart aren't "free" so they should only be done when
> really needed.)    

If you load your module with "debug=0x80" it will print the details of what is actually used to decide the low ack count and give an idea of what problem in the environment is causing the system to reset itself.

Comment 40 Adel Gadllah 2010-04-22 16:34:20 UTC

(In reply to comment #39)
> (In reply to comment #36)
> 
> > It seems to be a bit too aggressive when deciding whether a restart is needed
> > or not (and it seems restart aren't "free" so they should only be done when
> > really needed.)    
> 
> If you load your module with "debug=0x80" it will print the details of what is
> actually used to decide the low ack count and give an idea of what problem in
> environment is causing the system to reset itself.    

---------------
ieee80211 phy0: I iwl_good_ack_health agg ba_timeout delta = 6
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 7, expected_ack_cnt = 29
ieee80211 phy0: I iwl_good_ack_health agg ba_timeout delta = 6
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 0, expected_ack_cnt = 96
ieee80211 phy0: I iwl_good_ack_health agg ba_timeout delta = 51
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 0, expected_ack_cnt = 32
ieee80211 phy0: I iwl_good_ack_health agg ba_timeout delta = 16
iwlagn 0000:02:00.0: low ack count detected, restart firmware
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 1 time
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 2 time
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 3 time
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 1 time
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 2 time
ieee80211 phy0: I iwl_check_stuck_queue queue 10, not read 3 time
iwlagn 0000:02:00.0: queue 10 stuck 3 time. Fw reload.
iwlagn 0000:02:00.0: On demand firmware reload
Registered led device: iwl-phy0::radio
Registered led device: iwl-phy0::assoc
Registered led device: iwl-phy0::RX
Registered led device: iwl-phy0::TX
iwlagn 0000:02:00.0: Stopping AGG while state not ON or starting
iwlagn 0000:02:00.0: queue number out of range: 0, must be 10 to 19
ieee80211 phy0: I iwl_check_stuck_queue queue 2, not read 1 time
ieee80211 phy0: I iwl_check_stuck_queue queue 2, not read 2 time
ieee80211 phy0: I iwl_check_stuck_queue queue 2, not read 3 time
iwlagn 0000:02:00.0: queue 2 stuck 3 time. Fw reload.
iwlagn 0000:02:00.0: iwl_tx_agg_start on ra = 00:24:b2:d8:20:82 tid = 0
-------------

Also, it seems it is not triggered by time but by opening and closing the lid (though it is not 100% reproducible; I have it configured to only blank the screen on lid close, so it does not suspend the system).
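
The iwl_good_ack_health lines above pair an actual ACK count with an expected one, so the shortfall can be tabulated. A minimal Python sketch follows, using the three such pairs from this log. The pattern and function name are assumptions based on these lines, and the sketch deliberately does not guess at the threshold the driver itself applies.

```python
import re

# The three actual/expected pairs from the debug output in this comment.
LOG = """\
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 7, expected_ack_cnt = 29
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 0, expected_ack_cnt = 96
ieee80211 phy0: I iwl_good_ack_health actual_ack_cnt delta = 0, expected_ack_cnt = 32
"""

ACK = re.compile(r"actual_ack_cnt delta = (\d+), expected_ack_cnt = (\d+)")


def ack_ratios(log):
    """Return (actual, expected, percent) per matching line; percent is None if expected is 0."""
    rows = []
    for match in ACK.finditer(log):
        actual, expected = int(match.group(1)), int(match.group(2))
        percent = actual * 100 // expected if expected else None
        rows.append((actual, expected, percent))
    return rows


for actual, expected, percent in ack_ratios(LOG):
    print(f"actual={actual} expected={expected} percent={percent}")
```

In this excerpt the best case is 7 of 29 expected ACKs (24%), and the other two samples saw none at all.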

Comment 41 Adel Gadllah 2010-04-24 22:39:48 UTC

I have been using 2.6.32.12-110.rc2.fc12.x86_64 (checked out from cvs and built locally) for a day now and the problem has not shown up until now.

I don't know why, though (the patches / changes seem unrelated); it might be a coincidence, or an unrelated patch fixed it.

I will keep using this kernel and report back whether it happens again or not.

Comment 42 Adel Gadllah 2010-04-24 22:41:13 UTC

(In reply to comment #41)
> I have been using 2.6.32.12-110.rc2.fc12.x86_64 (checked out from cvs and built
> locally) for a day now and the problem has not show up until now.

Err... badly worded; it has not shown up at all (yet?).

Comment 43 Adel Gadllah 2010-04-25 15:42:37 UTC

(In reply to comment #42)
> (In reply to comment #41)
> > I have been using 2.6.32.12-110.rc2.fc12.x86_64 (checked out from cvs and built
> > locally) for a day now and the problem has not show up until now.
> 
> Err.. badly worded it has not shown up at all (yet?).    

No problems today either so I'd say 2.6.32.12-110.rc2.fc12.x86_64 is fine.

Comment 44 Chuck Ebbert 2010-04-28 12:30:28 UTC

kernel-2.6.32.12-114.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.12-114.fc12
