Bug 852761
Summary: | kernel oops after rmmod rtl8192ce
---|---
Product: | [Fedora] Fedora
Component: | kernel
Version: | 20
Status: | CLOSED INSUFFICIENT_DATA
Severity: | unspecified
Priority: | unspecified
Reporter: | Jonathan Kamens <h1k6zn2m>
Assignee: | John Greene <jogreene>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | gansalmon, h1k6zn2m, itamar, jonathan, kernel-maint, larry.finger, linville, madhu.chinakonda, mvanross, nhorman
Flags: | jforbes: needinfo?
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Story Points: | ---
Last Closed: | 2014-06-23 14:49:08 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 761525
Description
Jonathan Kamens
2012-08-29 14:04:30 UTC
Created attachment 607948 [details]
Trial patch for oops on unload
This oops appears to be from some kind of race condition where the interrupts are disabled too late in some instances. This patch should fix the condition. Please test.
Well, I'm sorry that this has sat for so long... Test kernels with the patch from comment 1 are building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=4537582 When they finish, please try them and report the results back here...thanks!
Sorry it's taken so long for me to try this build. I went to the link in comment 2 and tried to download the RPMs to install, but I couldn't find any download links anywhere. Am I missing something?
Test builds expire after some passage of time. Please try to test these soon after the build completes: http://koji.fedoraproject.org/koji/taskinfo?taskID=4635944
I tried the test kernel and it still oopses. Stack trace looks exactly the same.
Are you still seeing this oops with the 3.6.10 or newer kernel updates?
Yes.
Sorry, but I am unable to duplicate this problem. Debugging will be difficult with Fedora needing to build the trials. That introduces such a long delay that it is difficult to keep my train of thought. The traceback looks as if there was an interrupt after the pci_device_remove. I'm really surprised that the trial patch did not work.
Jonathan, can you please post the full oops text you see with 3.6.10 or newer?
Created attachment 675195 [details]
3.6.10 Oops
Created attachment 675741 [details]
Second trial patch for oops on unload
This patch not only moves the interrupt disable as was done in the first one, but it also does a check for the dereference of a NULL pointer from the location where the oops actually happens.
Test kernels with the patch from comment 11 are building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=4852875
Created attachment 676454 [details]
3.6.11-5.bz852761.2.fc17.x86_64 oops
Still happening. Oops log attached.
Created attachment 676477 [details]
Third trial patch for oops on unload
Thanks for the quick testing of the 2nd patch.
It seems that I misread the line that caused the oops. This time, the traceback pointed at a useless debug message that is failing because the device is stopping. This patch removes the offending debug output.
Test kernels with the patch from comment 14 are building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=4860328
Created attachment 677691 [details]
Still Oops'ing
I see the same with F18. The reason that I did rmmod rtl8192ce is that I couldn't connect to the wireless (in the past this sometimes helped).
Feb 12 20:40:19 x220 NetworkManager[830]: <info> (wlan0): supplicant interface state: authenticating -> disconnected
Feb 12 20:40:19 x220 NetworkManager[830]: <info> (wlan0): supplicant interface state: disconnected -> scanning
uname: Linux x220 3.7.4-204.fc18.x86_64 #1 SMP Wed Jan 23 16:44:29 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
lspci: 03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8188CE 802.11b/g/n WiFi Adapter (rev 01)
(In reply to comment #17)
> The reason that I did rmmod rtl8192ce is that I couldn't connect to the
> wireless
> (In the past this sometimes helped).
Me too. See bug 852737, which I filed last August and has had absolutely no activity since then.
Anything new on this? Still an issue?
Will be engaged on this next few days.. Can someone try latest 3.8.11? Will check it out myself if I can get upstream working on my system.. till then. Where are you?
Still a problem with most recently released F18 kernel. Can't test anything newer than that right now.
Tag I'm it. Reproduced this on RHEL experimental driver yesterday, with:
modprobe -r rtl8192ce
Looking into it now. Nothing I've found upstream as yet, will be debugging it.
Ah, will start with Larry's fix in C14.. Missed that earlier.
John, I have never duplicated this oops using openSUSE/KDE/NetworkManager. I just ran about 50 loops of the following command:
while [ 1 ] ; do sudo modprobe -rv rtl8192ce ; sleep 10 ; sudo modprobe -v rtl8192ce ; sleep 10 ; done
In nearly every case, the wireless connection completed during the 10 second sleep after module loading, and it never generated any kernel fault messages.
Larry
Hmm. I have to finish something first; later today I will check to see if my crash has the same signature as above, but it is quite reproducible here. Gotta be able to see it first, so good first step..
Code from C14 in place, still produces an issue. I have duplicated this issue in my version of this driver, very repeatable. It contains patch C14, looking into it myself as well.
Sorry that I am unable to help you. Are you using kernel 3.5 as the OP did? I was testing with 3.10-rc1. Perhaps something changed at the mac80211 level in the interim, or there is a fundamental difference between Fedora and openSUSE 12.3 user code.
Larry: Just got new hardware to be able to test upstream stuff now. My work involves 3.5 version of mac80211, but 3.9+ driver (on RHEL but still very close). It may well be a difference in mac80211, lemme kick it around and see.
Jonathan:
> Still a problem with most recently released F18 kernel. Can't test anything
> newer than that right now.
Ok, can you at least give the output of uname -r of your system here? Thanks.
jik-thinkpad:~!999$ uname -a
Linux jik-thinkpad 3.9.3-201.fc18.x86_64 #1 SMP Tue May 21 17:02:24 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Thanks Jonathan, Hmm. Looks like it still may be an issue with 3.9.3 with uplevel mac80211. My system reproducing would seem to be validated (uplevel driver & v3.5 mac80211) a bit if so. Larry, I'll take a look at this on F18 soon and see if I can repro there. I get this exact signature..any other idea at this time?
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Still broken in F18. Updating version.
(In reply to Jonathan Kamens from comment #31)
> Still broken in F18. Updating version.
Jonathan, Thanks for updating to F18. I took a quick look today and see a few updates out there but nothing strikes me as applicable right out. Can you post output of uname -r of kernel you tested?
3.9 does have an update to the vendor driver, I see. I would like to know what you are testing in F18: did you test just the stock version? 3.9.x? This gives what I need:
uname -r
I'm on Fedora 19 now. kernel-3.9.9-302.fc19.x86_64. The problem is still there.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs. Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19. If you experience different issues, please open a new bug report for those.
Still happening in F19 with 3.11.4-201.fc19.x86_64.
I meanwhile got another wifi card off eBay, but anyway this bug might well be hardware specific, so please report your exact hardware when reporting.
I haven't the bandwidth at the moment. I believe this issue is as follows (been a while back):
driver exit is called
the interrupt is disabled and released, or so it seems
during the release/disable, the ISR gets called, and dies in the Rx processing.
It should be a straightforward fix to:
ensure pending ISRs are cleared and/or disabled,
the ISR code needs to check for exit in process and back away if it is called.
NACKing on capacity for a bit..Sorry for the delay here.
I think the analysis is correct. What I do not understand is why it happens on the OP's system and not mine. I ran a test of 2000 unload/load cycles on the module without a single failure. The shutdown routine disables interrupts as soon as it can, then does some other cleanups before freeing the irq. I suppose I could add in a delay in the middle, but that just seems like a band-aid.
What would cause a pending interrupt to be delayed longer on one system than another?
Created attachment 834953 [details]
Patch to fix problem
The problem in rtl92c_get_desc() is fixed by checking for a NULL pointer to the descriptor.
I still have no idea why this problem only happens with Fedora installations, and not for any others.
There may be similar patches needed for the other PCI adapters in the rtlwifi tree. Now that I have an f19 setup, I can test.
Created attachment 835515 [details]
A better patch
The previous patch sometimes failed. This one is more robust.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs. Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.
Still broken in 3.12.6 in F20.
This bug is fixed by commit 9278db6279e28d4d433bc8a848e10b4ece8793ed in the wireless-testing tree. It has been pushed to the linux-net tree and should be in mainline before 3.13 is released. Once there, it will be added to all stable kernels. The patch is the same as the one listed in the attachments.
The patch Larry mentions is in 3.13-rc7, so once DaveM bundles things up it should hit stable shortly. We'll put this in POST for now, and hopefully it makes 3.12.7. If not, we'll apply it ourselves. (Thanks Larry!)
3.12.7 is in updates-testing now.
Kernel in updates-testing appears to fix the issue.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs. Fedora 20 has now been rebased to 3.13.4-200.fc20. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
Still a problem.
Does your kernel contain the patch mentioned in comment 43?
I have no idea. I am using the stock Fedora 20 kernel that I just got with "yum update" this morning, 3.13.3-201.fc20.x86_64.
Patch in comment 43 appears in kernel 3.12 upstream, appears to be in the kernel in C50, at least at a quick look.
The patch was merged between 3.13-rc4 and 3.13-rc5. It has to be in Jonathan's 3.13.3-201.fc20.x86_64. I need to see the kernel splat that is output.
Created attachment 867126 [details]
3.13.3-201.fc20.x86_64 kernel splat
What are the exact steps you are doing? The reason I ask is that I installed F20 and 3.13.3-201.fc20.x86_64, and configured a wireless network on an RTL8188CE using rtl8192ce. When I used 'sudo modprobe -rv rtl8192ce', it unloaded just the way I would expect - no kernel oops.
"rmmod rtl8192ce". That's it.
Whenever a module has dependent modules, "modprobe -r" will remove everything just as modprobe without the -r will load everything. As rtl8192ce has rtl8192c-common, rtlwifi, and rtl_pci as dependent modules, modprobe is definitely preferable. I don't have the device that uses rtl8192ce in a computer right now so I cannot tell if rmmod will error here.
I thought rmmod would refuse to remove modules with other modules dependent on them? I thought I'd seen errors in the past where I tried to remove a module and it refused to let me because other modules were dependent on it. I don't get the Oops when I use modprobe -r, but regardless of whether that's the case, is it really ok for rmmod of a module to cause an Oops?
No it should not, and the interrupt should have been disabled *before* rtl8192ce was removed. No matter which command you use, the dependent module cannot be removed until the one using it is removed. In other words, 'modprobe -r rtl8192ce' is the same as:
rmmod rtl8192ce.ko
rmmod rtl8192c-common.ko
rmmod rtlwifi.ko
rmmod rtl_pci.ko
They do have to be removed in that order.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs. Fedora 20 has now been rebased to 3.14.4-200.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.