Bug 523075

Summary:

iwlwifi driver - warn_slowpath_fmt warning messages - 2.6.30.5-43.fc11.x86_64

Product:

[Fedora] Fedora

Reporter:

gene c <gjunk>

Component:

kernel

Assignee:

Stanislaw Gruszka <sgruszka>

Status:

CLOSED NEXTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

low

Version:

CC:

itamar, kernel-maint, linville, madko, reinette.chatre

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-06-21 06:26:04 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
dmesg from fresh boot	none
messages log file including traceback	none

Description gene c 2009-09-13 22:13:32 UTC

Description of problem: After updating to 2.6.30.5 kernel I am getting the warning messages below from the iwl driver. It seems to be load related - ie when i do backups - tho I need more testing to verify this. Tainting is due to rpmfusion nvidia driver.

Version-Release number of selected component (if applicable):
kernel -  2.6.30.5-43.fc11.x86_64
wifi hardware is 4965 AGN

How reproducible:
Seems to be load related.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info: Log Inlcuded
Sep 11 06:46:18 f11 kernel: ------------[ cut here ]------------
Sep 11 06:46:18 f11 kernel: WARNING: at drivers/net/wireless/iwlwifi/iwl-tx.c:1127 iwl_tx_cmd_complete+0x72/0x2e2 [iwlcore]() (Tainted: P          )
Sep 11 06:46:18 f11 kernel: Hardware name: 6459CTO
Sep 11 06:46:18 f11 kernel: wrong command queue 0, sequence 0x0 readp=182 writep=184
Sep 11 06:46:18 f11 kernel: Modules linked in: fuse hidp ipt_MASQUERADE rfcomm bridge stp llc bnep sco l2cap sunrpc autofs4 iptable_nat nf_nat iptable_mangle ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_multipath kvm_intel kvm uinput snd_hda_codec_analog nvidia(P) arc4 ecb iwlagn iwlcore snd_hda_intel snd_hda_codec thinkpad_acpi firewire_ohci video rfkill firewire_core snd_hwdep hwmon joydev crc_itu_t pcspkr btusb snd_pcm output sdhci_pci snd_timer yenta_socket sdhci lib80211 ricoh_mmc mac80211 e1000e snd cfg80211 mmc_core bluetooth i2c_i801 wmi iTCO_wdt soundcore iTCO_vendor_support rsrc_nonstatic snd_page_alloc i2c_core sha256_generic cbc cryptd aes_x86_64 aes_generic dm_crypt [last unloaded: microcode]
Sep 11 06:46:18 f11 kernel: Pid: 0, comm: swapper Tainted: P           2.6.30.5-43.fc11.x86_64 #1
Sep 11 06:46:18 f11 kernel: Call Trace:
Sep 11 06:46:18 f11 kernel: <IRQ>  [<ffffffff81057104>] warn_slowpath_common+0x95/0xc3
Sep 11 06:46:18 f11 kernel: [<ffffffff810571bf>] warn_slowpath_fmt+0x50/0x66
Sep 11 06:46:18 f11 kernel: [<ffffffffa01f3087>] iwl_tx_cmd_complete+0x72/0x2e2 [iwlcore]
Sep 11 06:46:18 f11 kernel: [<ffffffff81235a97>] ? swiotlb_unmap_page+0x2a/0x7d
Sep 11 06:46:18 f11 kernel: [<ffffffffa0218c57>] iwl_rx_handle+0x2c8/0x3b3 [iwlagn]
Sep 11 06:46:18 f11 kernel: [<ffffffff8106c502>] ? queue_work_on+0x59/0x74
Sep 11 06:46:18 f11 kernel: [<ffffffff8106c68d>] ? queue_work+0x32/0x48
Sep 11 06:46:18 f11 kernel: [<ffffffffa021ad75>] iwl_irq_tasklet+0x4c8/0x6ff [iwlagn]
Sep 11 06:46:18 f11 kernel: [<ffffffffa01ec527>] ? iwl_isr+0x1e3/0x22c [iwlcore]
Sep 11 06:46:18 f11 kernel: [<ffffffff810748e2>] ? ktime_get_ts+0x61/0x7d
Sep 11 06:46:18 f11 kernel: [<ffffffff8105d12b>] tasklet_action+0x9f/0x11f
Sep 11 06:46:18 f11 kernel: [<ffffffff8105db54>] __do_softirq+0xd2/0x1d2
Sep 11 06:46:18 f11 kernel: [<ffffffff8101323c>] call_softirq+0x1c/0x30
Sep 11 06:46:18 f11 kernel: [<ffffffff81014ba3>] do_softirq+0x5f/0xd7
Sep 11 06:46:18 f11 kernel: [<ffffffff8105d633>] irq_exit+0x66/0xb7
Sep 11 06:46:18 f11 kernel: [<ffffffff814986d8>] ? trace_hardirqs_off_thunk+0x3a/0x6c
Sep 11 06:46:18 f11 kernel: [<ffffffff810143f8>] do_IRQ+0xc1/0xee
Sep 11 06:46:18 f11 kernel: [<ffffffff81012a93>] ret_from_intr+0x0/0x16
Sep 11 06:46:18 f11 kernel: <EOI>  [<ffffffff812a5dc3>] ? acpi_idle_enter_bm+0x26a/0x2b2
Sep 11 06:46:18 f11 kernel: [<ffffffff812a5dbc>] ? acpi_idle_enter_bm+0x263/0x2b2
Sep 11 06:46:18 f11 kernel: [<ffffffff813b34ef>] ? cpuidle_idle_call+0xa0/0xed
Sep 11 06:46:18 f11 kernel: [<ffffffff81010f96>] ? cpu_idle+0xbf/0x10a
Sep 11 06:46:18 f11 kernel: [<ffffffff814809da>] ? rest_init+0x7e/0x94
Sep 11 06:46:18 f11 kernel: [<ffffffff81758034>] ? start_kernel+0x412/0x433
Sep 11 06:46:18 f11 kernel: [<ffffffff817572d0>] ? x86_64_start_reservations+0xbb/0xd6
Sep 11 06:46:18 f11 kernel: [<ffffffff817573f0>] ? x86_64_start_kernel+0x105/0x128
Sep 11 06:46:18 f11 kernel: ---[ end trace 54777fc7f280841f ]---

Comment 1 John W. Linville 2009-09-14 15:13:06 UTC

Here is the warning in the code:

        /* If a Tx command is being handled and it isn't in the actual
         * command queue then there a command routing bug has been introduced
         * in the queue management code. */
        if (WARN(txq_id != IWL_CMD_QUEUE_NUM,
                 "wrong command queue %d, sequence 0x%X readp=%d writep=%d\n",
                  txq_id, sequence,
                  priv->txq[IWL_CMD_QUEUE_NUM].q.read_ptr,
                  priv->txq[IWL_CMD_QUEUE_NUM].q.write_ptr)) {
                iwl_print_hex_dump(priv, IWL_DL_INFO , rxb, 32);
                return;
        }

Comment 2 gene c 2009-09-27 19:41:20 UTC

I ran some further tests - and it is indeed triggered by load - seems to be either direction too. I'm still running 2.6.30.5-43.fc11.x86_64.

Comment 3 Stanislaw Gruszka 2009-10-06 12:56:53 UTC

I'm trying to reproduce this on my computer, as far no luck.

Gene c, what is your network configuration (abgn, encryption) ? How much data need to be transferred to trigger a warning?  And just in case, do you use the newest firmware?

If you think there is also any other information relevant to reproduce this bug, please provide it.

Comment 4 gene c 2009-10-07 00:09:54 UTC

Respinding to https://bugzilla.redhat.com/show_bug.cgi?id=523075#c3

Network is N 
Encryption is wpa2 personal
firmware is iwl4965-firmware.noarch  228.61.2.24-1.fc11 

How much data - not sure - I run rdiff backup to do backups.

I do recall similar bug discussed on LKML (I think) and have vague recollections it may be fixed in 2.6.32.

Comment 5 gene c 2009-10-07 01:37:22 UTC

could this be related or no ?

http://patchwork.kernel.org/patch/50172/

Comment 6 gene c 2009-10-07 01:39:29 UTC

no sorry for noise - that trace looks quite different ..

Comment 7 gene c 2010-02-19 05:14:27 UTC

I can confirm this being fixed in 2.6.32.8 - at least after several days of testing I am unable to produce this with the newer kernel.

Comment 8 gene c 2010-02-20 00:35:36 UTC

Please reopen this - I have just got the same traceback with 2.6.32.8.58 (I did not see this with the 2.6.32.8-55 build tho I see no wifi changes between the 2)

Course this happens the moment I closed the bug!!!

Sigh ... I really thought it was fixed.

The trace now is different however - included here:



Feb 19 19:22:29 lap1 kernel: Pid: 2837, comm: chrome Not tainted 2.6.32.8-58.fc12.x86_64 #1
Feb 19 19:22:29 lap1 kernel: Call Trace:
Feb 19 19:22:29 lap1 kernel: <IRQ>  [<ffffffff81056348>] warn_slowpath_common+0x7c/0x94
Feb 19 19:22:29 lap1 kernel: [<ffffffff81056374>] warn_slowpath_null+0x14/0x16
Feb 19 19:22:29 lap1 kernel: [<ffffffffa0191ae2>] ___ieee80211_stop_tx_ba_session+0x7b/0x80 [mac80211]
Feb 19 19:22:29 lap1 kernel: [<ffffffffa0191ccf>] sta_addba_resp_timer_expired+0x64/0x75 [mac80211]
Feb 19 19:22:29 lap1 kernel: [<ffffffff810650b4>] run_timer_softirq+0x1c4/0x268
Feb 19 19:22:29 lap1 kernel: [<ffffffff81026b8e>] ? apic_write+0x16/0x18
Feb 19 19:22:29 lap1 kernel: [<ffffffff8105d964>] __do_softirq+0xe5/0x1a9
Feb 19 19:22:29 lap1 kernel: [<ffffffff810804ce>] ? tick_program_event+0x2a/0x2c
Feb 19 19:22:29 lap1 kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30
Feb 19 19:22:29 lap1 kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86
Feb 19 19:22:29 lap1 kernel: [<ffffffff8105d7a2>] irq_exit+0x3b/0x7d
Feb 19 19:22:29 lap1 kernel: [<ffffffff81459c1a>] smp_apic_timer_interrupt+0x86/0x94
Feb 19 19:22:29 lap1 kernel: [<ffffffff81012833>] apic_timer_interrupt+0x13/0x20
Feb 19 19:22:29 lap1 kernel: <EOI> 
Feb 19 19:22:29 lap1 kernel: ---[ end trace 4944414491987e6e ]---

Comment 9 gene c 2010-02-20 04:06:46 UTC

Once this occurs the network in not useable - rmmod iwlagn ; modprobe iwlagn does bring te network back to life.

Comment 10 Stanislaw Gruszka 2010-02-22 13:36:51 UTC

Hi gene c

Could you provide full log from system start (/var/log/messages or dmesg if it contains everything)? What is your AP model/firmware?

Comment 11 gene c 2010-02-22 14:03:26 UTC

Created attachment 395474 [details]
dmesg from fresh boot

As requested - this is the dmesg from a fresh boot.

AP is a Cisco-Linksys WRT160N ( Version 1 Hardware ) running the latest firmware which is

01/10/2008
Ver.1.02.2

Thanks

Comment 12 Stanislaw Gruszka 2010-02-22 14:49:43 UTC

Oops sorry I was not clear, I meant full logs when this issue happens.

I have access to WRT160 (not sure which version), I'm going to start use it. I know I asked you before, but want to do this again, any ideas how to reproduce that bug?

Comment 13 gene c 2010-02-22 15:22:27 UTC

Created attachment 395489 [details]
messages log file including traceback

1) Log file attached. Please note that in this case, I rmmod' iwlagn and modprobe'd it to get it working again - you will see that in the log - The network was not useable right after the warn_slowpath_xxx stuff.

2) The traceback I'm getting now (with 2.6.32.8-58) is similar but not the same as the original I saw in 2.6.30. They both begin warn_slow_path_common().

3) In the past - load (such as running a backup using rdiff-backup) would trigger this.

4) In this case reflected in this log (2.6.32.8) it was merely booting, logging in - nm connected - start up google-chrome and use it. Chrome had several tabs so may have caused a some network activity for sure - but less than a backup.

5) I had successfully run backups using 2.6.32.8-55 build and it did NOT trigger - as soon as I installed the -58 build the above problem happened (coincidence?)

I'd suggest putting the wifi under headvy load - not sure if the encryption type plays a roll - i belive it is WPA2/personal AES (im not in front of the AP at the moment).

Let me know how else I can help.

These may be related maybe ? Googling warn_slowpath_common() brings up lots but many look different traces

 https://bugzilla.redhat.com/show_bug.cgi?id=556990 
 https://bugzilla.redhat.com/show_bug.cgi?id=502329
 http://lkml.indiana.edu/hypermail/linux/kernel/0910.1/00250.html

Comment 14 Stanislaw Gruszka 2010-02-22 16:03:20 UTC

Here is a bit longer except from logs.

>Feb 19 19:22:28 lap1 kernel: iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:21:29:a3:3c:d0 tid = 0
>Feb 19 19:22:28 lap1 kernel: iwlagn 0000:03:00.0: Hardware error detected.  Restarting.
>Feb 19 19:22:29 lap1 kernel: Registered led device: iwl-phy0::radio
>Feb 19 19:22:29 lap1 kernel: Registered led device: iwl-phy0::assoc
>Feb 19 19:22:29 lap1 kernel: Registered led device: iwl-phy0::RX
>Feb 19 19:22:29 lap1 kernel: Registered led device: iwl-phy0::TX
>Feb 19 19:22:29 lap1 kernel: iwlagn 0000:03:00.0: Stopping AGG while state not ON or starting
>Feb 19 19:22:29 lap1 kernel: iwlagn 0000:03:00.0: queue number out of range: 0, must be 7 to 14
>Feb 19 19:22:29 lap1 kernel: ------------[ cut here ]------------
>Feb 19 19:22:29 lap1 kernel: WARNING: at net/mac80211/agg-tx.c:150 ___ieee80211_stop_tx_ba_session+0x7b/0x80 [mac80211]()
>Feb 19 19:22:29 lap1 kernel: Hardware name: 6459CTO
>Feb 19 19:22:29 lap1 kernel: Modules linked in: fuse hidp rfcomm sco bridge stp llc bnep l2cap autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table iptable_mangle ipt_MASQUERADE iptable_nat nf_nat ip6t_RE
>JECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm uinput snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep arc4 snd_seq snd_seq_device snd_pcm thinkpad_acpi ecb sdhci_pci sdhci mm
>c_core ricoh_mmc i2c_i801 snd_timer snd iwlagn btusb iwlcore bluetooth mac80211 e1000e cfg80211 rfkill joydev soundcore iTCO_wdt snd_page_alloc wmi iTCO_vendor_support sha256_generic cryptd aes_x86_64 aes_ge
>neric cbc dm_crypt dm_multipath firewire_ohci yenta_socket rsrc_nonstatic firewire_core crc_itu_t video output nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: microcode]
>Feb 19 19:22:29 lap1 kernel: Pid: 2837, comm: chrome Not tainted 2.6.32.8-58.fc12.x86_64 #1
>Feb 19 19:22:29 lap1 kernel: Call Trace:
>Feb 19 19:22:29 lap1 kernel: <IRQ>  [<ffffffff81056348>] warn_slowpath_common+0x7c/0x94 
>Feb 19 19:22:29 lap1 kernel: [<ffffffff81056374>] warn_slowpath_null+0x14/0x16
>Feb 19 19:22:29 lap1 kernel: [<ffffffffa0191ae2>] ___ieee80211_stop_tx_ba_session+0x7b/0x80 [mac80211]
>Feb 19 19:22:29 lap1 kernel: [<ffffffffa0191ccf>] sta_addba_resp_timer_expired+0x64/0x75 [mac80211]
>Feb 19 19:22:29 lap1 kernel: [<ffffffff810650b4>] run_timer_softirq+0x1c4/0x268
>Feb 19 19:22:29 lap1 kernel: [<ffffffff81026b8e>] ? apic_write+0x16/0x18
>Feb 19 19:22:29 lap1 kernel: [<ffffffff8105d964>] __do_softirq+0xe5/0x1a9
>Feb 19 19:22:29 lap1 kernel: [<ffffffff810804ce>] ? tick_program_event+0x2a/0x2c
>Feb 19 19:22:29 lap1 kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30
>Feb 19 19:22:29 lap1 kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86
>Feb 19 19:22:29 lap1 kernel: [<ffffffff8105d7a2>] irq_exit+0x3b/0x7d
>Feb 19 19:22:29 lap1 kernel: [<ffffffff81459c1a>] smp_apic_timer_interrupt+0x86/0x94
>Feb 19 19:22:29 lap1 kernel: [<ffffffff81012833>] apic_timer_interrupt+0x13/0x20
>Feb 19 19:22:29 lap1 kernel: <EOI> 
>Feb 19 19:22:29 lap1 kernel: ---[ end trace 4944414491987e6e ]---

First we have, hardware error, then driver try to recover it, but firmware/driver is in some inconsistent state, and firmware give driver bad queue number.

Regarding this bug, I think we have to do better recovery (firmware reset) mechanism. I going to look at related code maybe I will have some propositions. Then I think we have to report this bug upstream in inelllinuxwireless.org bugzilla.

Comment 15 gene c 2010-02-22 16:58:14 UTC

Thanks for your help - 

 Is the 'hardware error' mean my hardware is bad - or this kind of thing happens ? Do i need to have the hardware replaced - which would be tricky unless windows logs an error to make lenovo happy enough to replace the parts.
 
 I'd be happy to test a new firmware too if that helps?


  gene

Comment 16 Stanislaw Gruszka 2010-02-23 13:52:56 UTC

There is something wrong with the hardware, but we should be able to recover that with firmware/driver. If for example windows driver cope well with the same hardware, linux should cope with it as well.

I don't know if new testing firmware is available publicly. I thought you are using the newest f/w release, if no please update, http://intellinuxwireless.org/?n=Downloads (newest should be the same as in rpm).

warn_slowpath_common() is part of every linux kernel warning, at least with some configuration options enabled, just ignore this part of calltrace.

Comment 17 gene c 2010-02-23 14:00:32 UTC

I seem unable to see this in windows - but perhaps I need to run windows longer.

I am running the latest relesaed firmware - I found nothing newer than the fedora rpm.

Thanks for your help.

Comment 18 Stanislaw Gruszka 2010-04-02 12:43:33 UTC

Hi

Intel created a dozen of patches for driver recovery from fimware errors. I backported them to fedora. Could you please test below kernel and see if it helps in that issue:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2090837

Comment 19 gene c 2010-04-04 03:22:03 UTC

I will try your build - what I have found fairly reliable is run a backup using rdiff-backup to a server over the wireless network. It usually lasts a while ... could take 15 mins up to an hour to die ... so make sure there's lots to backup.

My disk is encrypted, so I suspect a backup from a non-encrypted disk would go a lot faster and possibly show the problem sooner - tho I do not know.

g

Comment 20 Bug Zapper 2010-04-28 10:20:36 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 21 gene c 2010-06-21 03:09:31 UTC

i am no longer able to produce this on f13
kernel 2.6.33.5-124.fc13.x86_64

Comment 22 Stanislaw Gruszka 2010-06-21 06:26:04 UTC

Ok gene c, thanks for info.