Bug 457154
Summary: | iwl4965 oops: kernel BUG at drivers/net/wireless/iwlwifi/iwl-tx.c:1163!
---|---
Product: | [Fedora] Fedora
Reporter: | Richard Henderson <rth>
Component: | kernel
Assignee: | John W. Linville <linville>
Status: | CLOSED NEXTRELEASE
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | urgent
Priority: | medium
Version: | 9
CC: | clarkbw, cra, csmith, dkelson, fedoraproject, jarod, johannes, jonstanley, kernel-maint, kmcmartin, kwizart, marcus, pbrobinson, yi.zhu
Hardware: | x86_64
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2008-12-21 08:24:07 UTC
Bug Blocks: | 438944
Description
Richard Henderson
2008-07-29 21:44:33 UTC
Created attachment 312939 [details]
Oops Text
Maybe we should revert to the stock drivers when F8 and F9 get updated to 2.6.26? iwl4965 has been broken for a few weeks now with the bleeding-edge drivers that are in there now.

Please don't start that. We have a plan for wireless in Fedora, so we will stick to it. FWIW, my t61 is working just fine with -97, both WEP and WPA.

We will only have _different_ bugs (and regressions at that) if we simply revert.

I have had this same crash with the last 5 F9 kernel versions I tried on a particular open WiFi network that I think is running Linksys access points with DD-WRT firmware. I hadn't seen the crash anywhere else before or since. Unfortunately, I don't have access to that WiFi hotspot anymore. At least these kernels have this bug:
kernel-2.6.25.6-55.fc9.x86_64
kernel-2.6.25.9-76.fc9.x86_64
kernel-2.6.25.10-86.fc9.x86_64
kernel-2.6.25.11-97.fc9.x86_64
kernel-2.6.25.14-108.fc9.x86_64

I "solved" the problem for myself by having the AP restrict itself to the 2GHz bands, and not allowing the card to pop up into the 5GHz bands.

I have been having this same problem. It is quite annoying. It seems to happen most often when I'm in an area with a heavy concentration of access points, somewhere in the range 9-18 or so. I came across this bug before reporting it myself. It is 100% reproducible, only the timing varies, as the OP (bug reporter?) mentioned.

For what it's worth, I went and installed some older kernels in an attempt to find where this started happening.
2.6.25.10-86 gets this same oops
2.6.25.5-49 gets a different one: iwl4965:iwl_rx_handle+0x2fc/0x4ad
2.6.25-14 gets yet another: iwl4965:iwl4965_irq_tasklet+0x943/0xd01
Based on this, it seems that every Fedora 9 kernel is affected by some form of this problem.

I just tested the kernel currently in rawhide: 2.6.27-0.352.rc7.git1.fc10
It also has this problem.

I've never seen this on my t61. Can you describe a reliable way to recreate it?

(In reply to comment #12)
> I've never seen this on my t61. Can you describe a reliable way to recreate
> it?

From comment #9:
> It seems to
> happen most often when I'm in an area with a heavy concentration of access
> points, somewhere in the range 9-18 or so.

Created attachment 317871 [details]
output of "iwlist wlan0 scanning"
Here is the output of "iwlist wlan0 scanning"
I ran this at the spot where the problem happens most often. Every time I have wireless enabled, NetworkManager connects, and within 5-30 minutes (or sometimes less), I get the kernel panic.
I don't know how to describe how to reliably "recreate" this, except to turn on wireless in this area. This is just speculation, but even if one of the wireless networks has some sort of bug, I don't think it should be able to cause a kernel panic.
What would be a good way to find out how to reliably recreate the kernel panic?
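One way to compare scans taken at different locations is to reduce each "iwlist wlan0 scanning" dump to a count of visible access points per band, since the reports above mention both AP density and the 5GHz band as possible factors. The following is a minimal sketch, not a diagnostic tool: the sample text is hypothetical (invented addresses and channels in the general shape of iwlist output), and the 4.9 GHz cutoff used to separate the bands is an assumption made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample in the general shape of `iwlist <iface> scanning` output;
# the addresses, ESSIDs and channels are invented for illustration.
SAMPLE_SCAN = """\
wlan0     Scan completed :
          Cell 01 - Address: 00:11:22:33:44:55
                    ESSID:"example-a"
                    Frequency:2.437 GHz (Channel 6)
          Cell 02 - Address: 00:11:22:33:44:66
                    ESSID:"example-b"
                    Frequency:5.18 GHz (Channel 36)
          Cell 03 - Address: 00:11:22:33:44:77
                    ESSID:"example-c"
                    Frequency:2.412 GHz (Channel 1)
"""

CELL_RE = re.compile(r"^\s*Cell \d+ - Address: ", re.MULTILINE)
FREQ_RE = re.compile(r"Frequency:(\d+(?:\.\d+)?) GHz")


def summarize_scan(text):
    """Return (number of cells, Counter of cells per band) for one scan dump."""
    cells = len(CELL_RE.findall(text))
    bands = Counter()
    for freq in FREQ_RE.findall(text):
        # Assumed cutoff: anything at or above 4.9 GHz is counted as 5GHz.
        bands["5GHz" if float(freq) >= 4.9 else "2.4GHz"] += 1
    return cells, bands


if __name__ == "__main__":
    cells, bands = summarize_scan(SAMPLE_SCAN)
    print(f"{cells} access points visible; per band: {dict(bands)}")
```

Running the same summary on a scan from a location where the panic occurs and on one where it does not would show whether AP count or band differs between them; it says nothing about the root cause by itself.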
Again I've got an R61 - I can reliably reproduce it in a certain area (not sure that it's this exact oops - the R61 has no serial port to capture the output). It happened to me 3 times today in this area...nothing logged each time.

Maybe I could use a crash kernel to capture something (this is a hang with a blinking caps lock - would the kernel respond to a SysRQ-c in such a state?)

(In reply to comment #15)
> Again I've got an R61 - I can reliably reproduce it in a certain area (not sure
> that it's this exact oops - the R61 has no serial port to capture the output).
> It happened to me 3 times today in this area...nothing logged each time.

Maybe if you attach the output of "iwlist wlan0 scanning" from the area where it happens, someone might be able to find something in common with the one that I attached. (Or maybe someone knowledgeable in this area could ask for whatever would help most in finding the problem.)

>
> Maybe I could use a crash kernel to capture something (this is a hang with a
> blinking caps lock - would the kernel respond to a SysRQ-c in such a state?)

My caps lock blinks when this happens, too.

I'm getting pretty regular panics with an iwl4965 myself, and took a photo of the on-screen traceback:
http://wilsonet.com/jarod/junk/IMG_0035.JPG
http://wilsonet.com/jarod/junk/IMG_0036.JPG
The panic there is with a 2.6.25.16 kernel of my own building, which is simply 2.6.25.14-108.fc9 with the 2.6.25.14 stable patch replaced with the 2.6.25.16 patch. However, I'm seeing identical behavior with the latest rawhide kernel as well (just haven't captured the trace there yet).

So far, this only happens at home, where I have a D-Link DIR-655 802.11n access point, set up for WPA2 encryption, and roughly 6 other wireless client devices that are on most of the time. Two or three other nearby base stations also pop into range pretty regularly. No problems using the wifi here in the office.

Perhaps of interest is that with the latest rawhide kernel, another machine of mine, which has an iwl5350 card in it, is rock-solid, where this device w/the iwl4965 goes south in a hurry.

Line 1163 of iwl-tx.c (and a few lines before it):

    /* If a Tx command is being handled and it isn't in the actual
     * command queue then there a command routing bug has been introduced
     * in the queue management code. */
    if (txq_id != IWL_CMD_QUEUE_NUM)
            IWL_ERROR("Error wrong command queue %d command id 0x%X\n",
                      txq_id, pkt->hdr.cmd);
    BUG_ON(txq_id != IWL_CMD_QUEUE_NUM);

I'm seeing this pretty regularly, let me know if there is any more information I could provide. I'm running on an x86 T61. With the wireless on I can get an oops pretty reliably; however, I don't know what specifically is triggering it. My suspicion is that it has something to do with certain network types, because on my home network I don't see this, but at work I do. Similar results for other networks, no recognizable pattern as of yet.

We are working to root-cause the bug at
http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Here is a workaround:
http://git.kernel.org/?p=linux/kernel/git/linville/wireless-testing.git;a=commitdiff;h=98fa49592bea0fe5749973ab9fc6b3f00ce2ea82
John, please make sure it is included in 2.6.27 if we cannot provide a final fix by the time 2.6.27 comes out.

I took the liberty of tacking this patch onto the rawhide kernel, and will double-check that I no longer panic when using my 802.11n base station at home this evening.

Honestly, I'm not too comfortable with that work-around -- I'm probably going to revert it in wireless-testing. The author has reported a number of DMA problems with that hardware. Apparently it (i.e. the driver/hardware combination) is scribbling all over memory. I think I would rather BUG than continue on and trash someone's filesystem.

I'll see what blows up w/my laptop, no problem if you want to revert it in rawhide too, I can wait for a proper fix.
Just when I thought it had gone away in the latest rawhide kernel (I hadn't seen it in a day), it happened again:
http://www.kerneloops.org/raw.php?rawid=77730&msgid=
(I see that there are 2 others as well:
http://www.kerneloops.org/guilty.php?guilty=iwl_tx_cmd_complete&version=2.6.27-rc&start=1769472&end=1802239&class=warn )

Myself and two other co-workers have this problem daily on our T61ps. We are using wired for now.

I'm also not sure if it helps, but (after reading the bug reports on both the Intel site and the associated Bugzillas) I have been unable to replicate the oops using WEP. WPA / WPA2 personal yes, but not straight WEP. The same usage patterns that would ordinarily cause the panic do not. Uptime is multiple hours and counting of very high network utilization and tests.
CMS

Interestingly enough, since updating to the kernel where I tacked on the work-around in comment #20, I've not even hit the WARN condition, and I'm using my iwl4965 w/WPA2 on an 802.11n base station here at home right now, been using it for hours at a time lately w/o a problem. With earlier kernels, I'd hit a panic within about five minutes of uptime.

Side note: the workaround no longer applies cleanly to the latest released F9 kernel as of this weekend: kernel-2.6.26.6-79.

Can you recreate this issue using a rawhide kernel?

The current rawhide kernel requires new mkinitrd, nash, plymouth, plus a bunch of other deps (whose dep tree I have not fully explored). Do you have an F9 version of the rawhide kernel that uses F9 deps? My F9 laptop is my main production laptop and (if possible) I'd rather not do the following:

# yum --enablerepo=rawhide update kernel
Dependencies Resolved
================================================================================
 Package                     Arch    Version                    Repository  Size
================================================================================
Installing:
 kernel                      x86_64  2.6.27.4-51.fc10           rawhide     21 M
 plymouth                    x86_64  0.6.0-0.2008.10.24.2.fc10  rawhide     44 k
     replacing  rhgb.x86_64 1:9.0.0-6.fc9
Updating:
 fedora-logos                noarch  10.0.0-2.fc10              rawhide    1.8 M
 initscripts                 x86_64  8.84-1                     rawhide    1.9 M
 libbdevid-python            x86_64  6.0.67-1.fc10              rawhide     64 k
 mkinitrd                    x86_64  6.0.67-1.fc10              rawhide    112 k
 nash                        x86_64  6.0.67-1.fc10              rawhide    163 k
Removing:
 kernel                      x86_64  2.6.25.14-108.fc9          installed   70 M
Installing for dependencies:
 kernel-firmware             noarch  2.6.27.4-51.fc10           rawhide    344 k
 plymouth-libs               x86_64  0.6.0-0.2008.10.24.2.fc10  rawhide     62 k
 plymouth-plugin-label       x86_64  0.6.0-0.2008.10.24.2.fc10  rawhide     15 k
 plymouth-plugin-spinfinity  x86_64  0.6.0-0.2008.10.24.2.fc10  rawhide     37 k
 plymouth-scripts            x86_64  0.6.0-0.2008.10.24.2.fc10  rawhide     13 k

No, sorry. In fact, my concern is that the wireless bits in F9 are out of sync with what is in rawhide. I certainly understand your concern about the yum stuff -- FWIW that doesn't look too dangerous to me (no glibc update)! :-) Maybe Jarod is willing to take the plunge?

I'll give it a whirl. Give me a bit...
CMS

(In reply to comment #31)
> Maybe Jarod is willing to take the plunge?

What plunge? I'm already running latest rawhide... :)
(Apologies for not making that clear, I've been running nothing but rawhide kernels since throwing in that patch referenced in comment #20).

Heh, wow, it's been so long since I wrote this original bug that I forgot to check its status... :) Updating to rawhide to see if that fixes the issue now that I've reverted back to (x86 arch). Just to continue on it, since it only appears to affect the iwl4965 card: I stopped using the internal iwl and only use a linksys usb. Of course, networkmanager keeps the same settings for any wireless AP, so I dropped in that linksys usb and have had no problems since. It appears that it is only a problem in Fedora, as Debian's dist of the same stuff works just fine.

I have been using rawhide kernel 2.6.27.4-51.fc10.i686 and I have had 1 panic of the same type in the last 24 hours. Same panic string (from what I can tell). Much lower frequency, but I did manage to reproduce it.
CMS

(In reply to comment #31)
> No, sorry. In fact, my concern is that the wireless bits in F9 are out of sync
> with what is in rawhide.
>

They're in sync as of today. 2.6.27.4 kernel for f9 is here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=68158

Do you continue to see this oops w/ the 2.6.27-based kernels?

Happened last night with latest f10 kernel (and all latest updates). Had to drop back to the linksys wireless usb card (didn't have time to fumble with what happened).

*** Bug 464559 has been marked as a duplicate of this bug. ***

Is this still happening with the latest available F10 kernel?

Yes, it's still happening in the latest kernel (as of now).

I'm NOT seeing the iwl_tx_cmd_complete oops on the -132 or -134 F10 kernels, nor apparently on any kernels all the way back to November 18 (as far back as my rotated messages files go). I am, however, seeing a different oops: rs_get_rate, which is being tracked over in bug #470225. Please double check that you are reporting on the correct oops here, or add comments to #470225 if appropriate. Thanks.
http://kerneloops.org/version.php?start=1802240&end=1835007&version=27-release
It's the second most reported WARN_ON on that page (scroll down), and yes, it's still happening with the latest kernel for me.

Please make sure you are using the 228.57.2.23 firmware:
http://intellinuxwireless.org/iwlwifi/downloads/iwlwifi-4965-ucode-228.57.2.23.tgz

$ rpm -qa iwl4965-firmware
iwl4965-firmware-228.57.2.21-3.noarch
Is there an RPM of the newer firmware somewhere?

iwl4965-firmware-228.57.2.23-1.fc8 has been submitted as an update for Fedora 8.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-1.fc8

iwl4965-firmware-228.57.2.23-1.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-1.fc9

iwl4965-firmware-228.57.2.23-2 has been submitted as an update for Fedora 10.
http://admin.fedoraproject.org/updates/iwl4965-firmware-228.57.2.23-2

iwl4965-firmware-228.57.2.23-1.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing-newkey update iwl4965-firmware'
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/F8/FEDORA-2008-11287

iwl4965-firmware-228.57.2.23-1.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing-newkey update iwl4965-firmware'
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11406

Adding detail to subject to keep all these different iwl bugs straight in my head...

iwl4965-firmware-228.57.2.23-1.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
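For anyone checking several machines against the firmware request above, the comparison can be scripted. This is a minimal sketch under stated assumptions: it only parses package names of the form shown in this report (as printed by "rpm -qa iwl4965-firmware"), it does not call rpm itself, and the helper names firmware_version and is_new_enough are hypothetical.

```python
import re

# Firmware version requested earlier in this report.
REQUIRED = (228, 57, 2, 23)


def firmware_version(package):
    """Extract the dotted version tuple from an iwl4965-firmware package name."""
    match = re.match(r"iwl4965-firmware-(\d+(?:\.\d+)*)-", package)
    if match is None:
        raise ValueError(f"not an iwl4965-firmware package name: {package!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_new_enough(package, required=REQUIRED):
    """True if the package's firmware version is at least the required one."""
    return firmware_version(package) >= required


if __name__ == "__main__":
    for name in ("iwl4965-firmware-228.57.2.21-3.noarch",
                 "iwl4965-firmware-228.57.2.23-1.fc9"):
        print(name, "ok" if is_new_enough(name) else "too old")
```

Comparing integer tuples rather than strings avoids mis-ordering versions whose components have different digit counts.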