Bug 457154

Summary: iwl4965 oops: kernel BUG at drivers/net/wireless/iwlwifi/iwl-tx.c:1163!
Product: [Fedora] Fedora
Reporter: Richard Henderson <rth>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED NEXTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: medium
Version: 9
CC: clarkbw, cra, csmith, dkelson, fedoraproject, jarod, johannes, jonstanley, kernel-maint, kmcmartin, kwizart, marcus, pbrobinson, yi.zhu
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-12-21 08:24:07 UTC
Bug Blocks: 438944    
Attachments:
Description                          Flags
Oops Text                            none
output of "iwlist wlan0 scanning"    none

Description Richard Henderson 2008-07-29 21:44:33 UTC
Description of problem:
Oops in iwl4965 driver, text to follow.

Version-Release number of selected component (if applicable):
kernel-2.6.25.11-97.fc9.x86_64

How reproducible:
Eventually, always.  The exact timing is variable.

Additional info:
This oops is new, and did not happen in 2.6.25.10-86.

Comment 1 Richard Henderson 2008-07-29 21:44:33 UTC
Created attachment 312939 [details]
Oops Text

Comment 2 Chuck Ebbert 2008-07-29 23:10:17 UTC
Maybe we should revert to the stock drivers when F8 and F9 get updated to 2.6.26?

iwl4965 has been broken for a few weeks now with the bleeding-edge drivers that
are in there now.

Comment 3 John W. Linville 2008-07-30 12:58:52 UTC
Please don't start that.  We have a plan for wireless in Fedora, so we will
stick to it.  FWIW, my t61 is working just fine with -97, both WEP and WPA.

We will only have _different_ bugs (and regressions at that) if we simply revert.

Comment 7 Charles R. Anderson 2008-09-03 15:49:36 UTC
I have had this same crash with the last 5 F9 kernel versions I tried on a particular open WiFi network that I think is running Linksys access points with DD-WRT firmware.  I haven't seen the crash anywhere else before or since.  Unfortunately, I don't have access to that WiFi hotspot anymore.

At least these kernels have this bug:

kernel-2.6.25.6-55.fc9.x86_64
kernel-2.6.25.9-76.fc9.x86_64
kernel-2.6.25.10-86.fc9.x86_64
kernel-2.6.25.11-97.fc9.x86_64
kernel-2.6.25.14-108.fc9.x86_64

Comment 8 Richard Henderson 2008-09-03 15:56:52 UTC
I "solved" the problem for myself by having the AP restrict itself to the 2GHz bands, and not allowing the card to pop up into the 5GHz bands.

Comment 9 James Cassell 2008-09-18 05:10:43 UTC
I have been having this same problem.  It is quite annoying.  It seems to happen most often when I'm in an area with a heavy concentration of access points, somewhere in the range 9-18 or so.

I came across this bug before reporting it myself.  It is 100% reproducible, only the timing varies, as the OP (bug reporter?) mentioned.

Comment 10 James Cassell 2008-09-25 00:38:46 UTC
For what it's worth, I went and installed some older kernels in an attempt to find where this started happening.

2.6.25.10-86 gets this same oops
2.6.25.5-49 gets a different one:
     iwl4965:iwl_rx_handle+0x2fc/0x4ad
2.6.25-14 gets yet another:
     iwl4965:iwl4965_irq_tasklet+0x943/0xd01


Based on this, it seems that every Fedora 9 kernel is affected by some form of this problem.

Comment 11 James Cassell 2008-09-25 09:28:47 UTC
I just tested the kernel currently in rawhide: 2.6.27-0.352.rc7.git1.fc10
It also has this problem.

Comment 12 John W. Linville 2008-09-26 21:09:30 UTC
I've never seen this on my t61.  Can you describe a reliable way to recreate it?

Comment 13 Chuck Ebbert 2008-09-27 15:24:40 UTC
(In reply to comment #12)
> I've never seen this on my t61.  Can you describe a reliable way to recreate
> it?

From comment #9:

> It seems to
> happen most often when I'm in an area with a heavy concentration of access
> points, somewhere in the range 9-18 or so.

Comment 14 James Cassell 2008-09-27 21:10:30 UTC
Created attachment 317871 [details]
output of "iwlist wlan0 scanning"

Here is the output of "iwlist wlan0 scanning"

I ran this at the point where the problem happens most often.  Every time I have wireless enabled, NetworkManager connects, and within 5-30 minutes (or sometimes less), I get the kernel panic.


I don't know how to reliably "recreate" this, except to turn on wireless in this area.  This is just speculation, but even if one of the wireless networks has some sort of bug, I don't think it should be able to cause a kernel panic.

What would be a good way to find out how to reliably recreate the kernel panic?
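
Since the crash seems to correlate with the number of visible access points (comment #9), one rough way to compare locations is to count the cells in a saved scan.  This is a minimal sketch, assuming the scan output marks each access point with a line of the form "Cell NN - Address: ..."; the function name count_cells and the file name scan.txt are hypothetical.

```sh
# count_cells: read "iwlist <iface> scanning" output on stdin and print
# the number of access points (cells) it lists.
# Assumption: each access point starts with a "Cell NN - Address:" line.
count_cells() {
    grep -c 'Cell [0-9][0-9]* - Address:'
}

# Example (hypothetical file name):
#   iwlist wlan0 scanning > scan.txt
#   count_cells < scan.txt
```

Comparing this count between a location that panics and one that does not would only hint at a correlation; it says nothing about which access point, if any, triggers the oops.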

Comment 15 Jon Stanley 2008-09-29 03:29:27 UTC
Again I've got an R61 - I can reliably reproduce it in a certain area (not sure that it's this exact oops - the R61 has no serial port to capture the output). It happened to me 3 times today in this area...nothing logged each time.

Maybe I could use a crash kernel to capture something (this is a hang with a blinking caps lock - would the kernel respond to a SysRQ-c in such a state?)

Comment 16 James Cassell 2008-09-29 03:38:44 UTC
(In reply to comment #15)
> Again I've got an R61 - I can reliably reproduce it in a certain area (not sure
> that it's this exact oops - the R61 has no serial port to capture the output).
> It happened to me 3 times today in this area...nothing logged each time.

Maybe if you attach the output of "iwlist wlan0 scanning" from the area where it happens, someone might be able to find something in common with the one that I attached.  (or maybe someone knowledgeable in this area could ask for whatever would help most in finding the problem)

> 
> Maybe I could use a crash kernel to capture something (this is a hang with a
> blinking caps lock - would the kernel respond to a SysRQ-c in such a state?)

My caps lock blinks when this happens, too

Comment 17 Jarod Wilson 2008-09-29 18:21:26 UTC
I'm getting pretty regular panics with an iwl4965 myself, took a photo of the on-screen traceback:

http://wilsonet.com/jarod/junk/IMG_0035.JPG
http://wilsonet.com/jarod/junk/IMG_0036.JPG

The panic there is with a 2.6.25.16 kernel of my own building, which is simply 2.6.25.14-108.fc9 with the 2.6.25.14 stable patch replaced with the 2.6.25.16 patch. However, I'm seeing identical behavior with the latest rawhide kernel as well (just haven't captured the trace there yet).

So far, this only happens at home, where I have a D-Link DIR-655 802.11n access point, set up for WPA2 encryption, and roughly 6 other wireless client devices that are on most of the time. Two or three other nearby base stations also pop into range pretty regularly. No problems using the wifi here in the office.

Perhaps of interest is that with the latest rawhide kernel, another machine of mine, which has an iwl5350 card in it, is rock-solid, where this device w/the iwl4965 goes south in a hurry.

Comment 18 John W. Linville 2008-09-29 19:01:42 UTC
Line 1163 of iwl-tx.c (and a few lines before it):

        /* If a Tx command is being handled and it isn't in the actual
         * command queue then there a command routing bug has been introduced
         * in the queue management code. */
        if (txq_id != IWL_CMD_QUEUE_NUM)
                IWL_ERROR("Error wrong command queue %d command id 0x%X\n",
                          txq_id, pkt->hdr.cmd);
        BUG_ON(txq_id != IWL_CMD_QUEUE_NUM);

Comment 19 Bryan Clark (:clarkbw) 2008-10-01 16:11:38 UTC
I'm seeing this pretty regularly, let me know if there is any more information I could provide.  I'm running on an x86 T61.  With the wireless on I can get an oops pretty reliably, however I don't know what specifically is triggering it.  My suspicion is that it has something to do with certain network types because on my home network I don't see this, however at work I do.  Similar results for other networks, no recognizable pattern as of yet.

Comment 20 Chuyee 2008-10-06 06:50:12 UTC
We are working on root-causing the bug at http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703.

Here is a workaround http://git.kernel.org/?p=linux/kernel/git/linville/wireless-testing.git;a=commitdiff;h=98fa49592bea0fe5749973ab9fc6b3f00ce2ea82

John, please make sure it is included in 2.6.27 if we cannot provide a final fix by the time 2.6.27 comes out.

Comment 21 Jarod Wilson 2008-10-06 13:38:27 UTC
I took the liberty of tacking this patch onto the rawhide kernel, will double-check that I no longer panic when using my 802.11n base station at home this evening.

Comment 22 John W. Linville 2008-10-06 15:23:58 UTC
Honestly, I'm not too comfortable with that work-around -- I'm probably going to revert it in wireless-testing.

The author has reported a number of DMA problems with that hardware.  Apparently it (i.e. the driver/hardware combination) is scribbling all over memory.  I think I would rather BUG than continue on and trash someone's filesystem.

Comment 23 Jarod Wilson 2008-10-06 16:17:44 UTC
I'll see what blows up w/my laptop, no problem if you want to revert it in rawhide too, I can wait for a proper fix.

Comment 24 James Cassell 2008-10-09 02:27:51 UTC
Just when I thought it had gone away in the latest rawhide kernel (I hadn't seen it in a day), it happened again: http://www.kerneloops.org/raw.php?rawid=77730&msgid=

(I see that there are 2 others as well http://www.kerneloops.org/guilty.php?guilty=iwl_tx_cmd_complete&version=2.6.27-rc&start=1769472&end=1802239&class=warn )

Comment 25 Dax Kelson 2008-10-14 20:57:53 UTC
Two co-workers and I have this problem daily on our T61ps. We are using wired for now.

Comment 26 Christopher M. Smith 2008-10-14 23:42:52 UTC
I'm not sure if it helps, but (from reading the bug reports on both the Intel site and associated Bugzillas) I have been unable to replicate the oops using WEP.  WPA / WPA2 personal yes, but not straight WEP.  The same usage patterns that would ordinarily cause the panic do not.  Uptime is multiple hours and counting, under very high network utilization and tests.

CMS

Comment 27 Jarod Wilson 2008-10-15 04:20:37 UTC
Interestingly enough, since updating to the kernel where I tacked on the work-around in comment #20, I've not even hit the WARN condition, and I'm using my iwl4965 w/WPA2 on an 802.11n base station here at home right now, been using it for hours at a time lately w/o a problem. Earlier kernels, I'd hit a panic within about five minutes of uptime.

Comment 28 Christopher M. Smith 2008-10-27 12:15:00 UTC
Side note: Workaround no longer applies cleanly to latest released F9 kernel as of this weekend: kernel-2.6.26.6-79.

Comment 29 John W. Linville 2008-10-27 19:03:02 UTC
Can you recreate this issue using a rawhide kernel?

Comment 30 Dax Kelson 2008-10-27 21:13:16 UTC
The current rawhide kernel requires a new mkinitrd, nash, plymouth, plus a bunch of other deps (whose dep tree I have not fully explored).

Do you have an F9 version of the rawhide kernel that uses F9 deps?

My F9 laptop is my main production laptop and (if possible) I'd rather not do the following:

# yum --enablerepo=rawhide update kernel

Dependencies Resolved

================================================================================
 Package                    Arch   Version                   Repository   Size 
================================================================================
Installing:
 kernel                     x86_64 2.6.27.4-51.fc10          rawhide       21 M
 plymouth                   x86_64 0.6.0-0.2008.10.24.2.fc10 rawhide       44 k
     replacing  rhgb.x86_64 1:9.0.0-6.fc9

Updating:
 fedora-logos               noarch 10.0.0-2.fc10             rawhide      1.8 M
 initscripts                x86_64 8.84-1                    rawhide      1.9 M
 libbdevid-python           x86_64 6.0.67-1.fc10             rawhide       64 k
 mkinitrd                   x86_64 6.0.67-1.fc10             rawhide      112 k
 nash                       x86_64 6.0.67-1.fc10             rawhide      163 k
Removing:
 kernel                     x86_64 2.6.25.14-108.fc9         installed     70 M
Installing for dependencies:
 kernel-firmware            noarch 2.6.27.4-51.fc10          rawhide      344 k
 plymouth-libs              x86_64 0.6.0-0.2008.10.24.2.fc10 rawhide       62 k
 plymouth-plugin-label      x86_64 0.6.0-0.2008.10.24.2.fc10 rawhide       15 k
 plymouth-plugin-spinfinity x86_64 0.6.0-0.2008.10.24.2.fc10 rawhide       37 k
 plymouth-scripts           x86_64 0.6.0-0.2008.10.24.2.fc10 rawhide       13 k

Comment 31 John W. Linville 2008-10-27 21:26:58 UTC
No, sorry.  In fact, my concern is that the wireless bits in F9 are out of sync with what is in rawhide.

I certainly understand your concern about the yum stuff -- FWIW that doesn't look too dangerous to me (no glibc update)! :-)

Maybe Jarod is willing to take the plunge?

Comment 32 Christopher M. Smith 2008-10-27 21:40:44 UTC
I'll give it a whirl.  Give me a bit...

CMS

Comment 33 Jarod Wilson 2008-10-27 21:52:52 UTC
(In reply to comment #31)
> Maybe Jarod is willing to take the plunge?

What plunge? I'm already running latest rawhide... :)

(Apologies for not making that clear, I've been running nothing but rawhide kernels since throwing in that patch referenced in comment #20).

Comment 34 Marcus 2008-10-28 18:25:06 UTC
Heh, wow, it's been so long since I wrote this original bug that I forgot to check its status... :)

Updating to rawhide to see if that fixes the issue now that I've reverted back to (x86 arch).

To follow up, since this only appears to affect the iwl4965 card: I stopped using the internal iwl and only use a Linksys USB adapter.  Of course, NetworkManager keeps the same settings for any wireless AP, so I dropped in that Linksys USB adapter and have had no problems since.  It appears that it is only a problem in Fedora, as Debian's dist of the same stuff works just fine.

Comment 35 Christopher M. Smith 2008-10-29 16:24:04 UTC
I have been using rawhide kernel 2.6.27.4-51.fc10.i686 and I have had 1 panic of the same type in the last 24 hours.  Same panic string (from what I can tell).  Much lower frequency, but I did manage to reproduce it.

CMS

Comment 36 Chuck Ebbert 2008-10-31 07:44:58 UTC
(In reply to comment #31)
> No, sorry.  In fact, my concern is that the wireless bits in F9 are out of sync
> with what is in rawhide.
> 

They're in sync as of today.

2.6.27.4 kernel for f9 is here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=68158

Comment 37 John W. Linville 2008-11-13 15:19:03 UTC
Do you continue to see this oops w/ the 2.6.27-based kernels?

Comment 38 Marcus 2008-11-13 15:34:16 UTC
Happened last night with latest f10 kernel (and all latest updates).  Had to drop back to the linksys wireless usb card (didn't have time to fumble with what happened).

Comment 39 Charles R. Anderson 2008-11-18 01:18:00 UTC
*** Bug 464559 has been marked as a duplicate of this bug. ***

Comment 40 Johannes Berg 2008-11-18 01:25:18 UTC
Try http://marc.info/?l=linux-wireless&m=122696931311854&w=2

Comment 41 John W. Linville 2008-12-10 19:14:16 UTC
Is this still happening with the latest available F10 kernel?

Comment 42 James Cassell 2008-12-12 04:14:28 UTC
Yes, it's still happening in the latest kernel (as of now)

Comment 43 Charles R. Anderson 2008-12-13 03:37:07 UTC
I'm NOT seeing the iwl_tx_cmd_complete oops on the -132 or -134 F10 kernels, nor apparently on any kernels all the way back to November 18 (as far back as my rotated messages files go).  I am, however, seeing a different oops: rs_get_rate, which is being tracked over in bug #470225.  Please double check that you are reporting on the correct oops here, or add comments to #470225 if appropriate.  Thanks.
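
One way to check which of the two oopses a machine is actually hitting is to search the saved logs for the function names mentioned in this bug.  This is a minimal sketch, assuming only that the function name appears somewhere in the logged trace; the helper name oops_signatures is hypothetical, and the /var/log/messages* path in the example is an assumption to adjust to wherever the system keeps its kernel logs.

```sh
# oops_signatures: read kernel log text on stdin and print, one per line,
# which of the two function names discussed here appear in it.
# Prints nothing if neither is present.
oops_signatures() {
    grep -o -E 'iwl_tx_cmd_complete|rs_get_rate' | sort -u
}

# Example (log path is an assumption):
#   cat /var/log/messages* | oops_signatures
```

If only rs_get_rate is printed, the report probably belongs in bug #470225 rather than here.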

Comment 44 James Cassell 2008-12-15 04:01:01 UTC
http://kerneloops.org/version.php?start=1802240&end=1835007&version=27-release

It's the second-most-reported WARN_ON on that page (scroll down), and yes, it's still happening with the latest kernel for me.

Comment 45 Chuyee 2008-12-15 05:08:24 UTC
Please make sure you are using the 228.57.2.23 firmware:
http://intellinuxwireless.org/iwlwifi/downloads/iwlwifi-4965-ucode-228.57.2.23.tgz

Comment 46 James Cassell 2008-12-15 07:02:10 UTC
$ rpm -qa iwl4965-firmware
iwl4965-firmware-228.57.2.21-3.noarch

is there an RPM of the newer firmware somewhere?
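
Whether an installed firmware package is at least the version requested in comment #45 can be checked with a small version comparison.  This is a minimal sketch, assuming GNU sort with its -V (version sort) option is available; the helper name version_at_least is hypothetical.

```sh
# version_at_least HAVE WANT: succeed (exit 0) if dotted version HAVE
# is greater than or equal to WANT.
# Assumption: "sort -V" (GNU coreutils version sort) is available.
version_at_least() {
    # After a version sort, the first line is the lower of the two versions.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example, using the package name from this bug:
#   installed=$(rpm -q --qf '%{VERSION}' iwl4965-firmware)
#   version_at_least "$installed" 228.57.2.23 && echo "firmware is new enough"
```

With the 228.57.2.21 version shown above the check fails, which is consistent with the request for a newer package.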

Comment 47 Fedora Update System 2008-12-15 16:18:20 UTC
iwl4965-firmware-228.57.2.23-1.fc8 has been submitted as an update for Fedora 8.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-1.fc8

Comment 48 Fedora Update System 2008-12-15 16:23:34 UTC
iwl4965-firmware-228.57.2.23-1.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-1.fc9

Comment 49 Nicolas Chauvet (kwizart) 2008-12-15 16:38:44 UTC
iwl4965-firmware-228.57.2.23-2 has been submitted as an update for Fedora 10.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-2

Comment 50 Fedora Update System 2008-12-18 00:37:31 UTC
iwl4965-firmware-228.57.2.23-1.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing-newkey update iwl4965-firmware'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-11287

Comment 51 Fedora Update System 2008-12-18 00:40:16 UTC
iwl4965-firmware-228.57.2.23-1.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing-newkey update iwl4965-firmware'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11406

Comment 52 Charles R. Anderson 2008-12-18 20:57:24 UTC
adding detail to subject to keep all these different iwl bugs straight in my head...

Comment 53 Fedora Update System 2008-12-21 08:24:01 UTC
iwl4965-firmware-228.57.2.23-1.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.