Bug 852761

Summary: kernel oops after rmmod rtl8192ce
Product: [Fedora] Fedora
Component: kernel
Version: 20
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Jonathan Kamens <jik>
Assignee: John Greene <jogreene>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: gansalmon, itamar, jik, jonathan, kernel-maint, larry.finger, linville, madhu.chinakonda, mvanross, nhorman
Flags: jforbes: needinfo?
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-06-23 14:49:08 UTC
Bug Blocks: 761525
Attachments (flags: none on all):
  Trial patch for oops on unload
  3.6.10 Oops
  Second trial patch for oops on unload
  3.6.11-5.bz852761.2.fc17.x86_64 oops
  Third trial patch for oops on unload
  Still Oops'ing
  Patch to fix problem
  A better patch
  3.13.3-201.fc20.x86_64 kernel splat

Description Jonathan Kamens 2012-08-29 14:04:30 UTC
BUG: unable to handle kernel NULL pointer dereference at 00000000000002c0
IP: [<ffffffffa031aae2>] rtl92ce_get_desc+0x12/0x50 [rtl8192ce]
PGD 1e4943067 PUD 1ebdbd067 PMD 0 
Oops: 0000 [#1] SMP 
CPU 1 
Modules linked in: nls_utf8 udf crc_itu_t fuse lockd sunrpc rfcomm bnep tpm_bios ip6t_REJECT nf_conntrack_ipv6 nf_conntrack_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_conexant arc4 coretemp kvm_intel kvm microcode snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device i2c_i801 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb bluetooth media rtl8192ce(-) rtlwifi rtl8192c_common mac80211 lpc_ich snd_hda_intel mfd_core snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer cfg80211 e1000e mei thinkpad_acpi snd soundcore rfkill uinput crc32c_intel ghash_clmulni_intel sdhci_pci sdhci mmc_core wmi i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
Pid: 5659, comm: rmmod Not tainted 3.5.2-3.fc17.x86_64 #1 LENOVO 4177CTO/4177CTO
RIP: 0010:[<ffffffffa031aae2>]  [<ffffffffa031aae2>] rtl92ce_get_desc+0x12/0x50 [rtl8192ce]
RSP: 0018:ffff880126105b78  EFLAGS: 00010046
RAX: ffffffffa031c2a0 RBX: 00000000000002c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000002c0
RBP: ffff880126105b78 R08: 0000000000000040 R09: ffff880215400000
R10: 000000000db55f01 R11: 0000000000000008 R12: ffff88021111bc00
R13: 0000000000000016 R14: ffff88020e9c9f20 R15: 0000000000000016
FS:  00007f0bc6a85740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000002c0 CR3: 00000001efc3a000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmmod (pid: 5659, threadinfo ffff880126104000, task ffff88020e79c530)
Stack:
 ffff880126105ca8 ffffffffa0300bbd ffff880126105ca8 ffff88020e9ca200
 ffff88020e9ccdd8 ffff88020db553c0 ffff88020e9c8560 000000400836d540
 ffff880215802300 ffff880126105c20 ffff880126105c18 0000000000000000
Call Trace:
 [<ffffffffa0300bbd>] _rtl_pci_rx_interrupt+0x19d/0x640 [rtlwifi]
 [<ffffffffa0301c12>] _rtl_pci_interrupt+0x2d2/0x2f0 [rtlwifi]
 [<ffffffff810e3e09>] __free_irq+0x189/0x220
 [<ffffffff810e3ef4>] free_irq+0x54/0xc0
 [<ffffffffa0301f86>] rtl_pci_disconnect+0x196/0x1c0 [rtlwifi]
 [<ffffffff812f7c1f>] pci_device_remove+0x3f/0x110
 [<ffffffff813b510c>] __device_release_driver+0x7c/0xe0
 [<ffffffff813b59d8>] driver_detach+0xb8/0xc0
 [<ffffffff813b4c32>] bus_remove_driver+0x92/0x110
 [<ffffffff813b5ed2>] driver_unregister+0x62/0xa0
 [<ffffffff812f73b4>] pci_unregister_driver+0x44/0xa0
 [<ffffffffa031ab8c>] rtl92ce_driver_exit+0x10/0x484 [rtl8192ce]
 [<ffffffff810b8c6e>] sys_delete_module+0x16e/0x2d0
 [<ffffffff81185d56>] ? filp_close+0x66/0xa0
 [<ffffffff81614969>] system_call_fastpath+0x16/0x1b
Code: 3f 00 00 81 e2 00 c0 ff ff 09 d0 89 07 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 40 84 f6 74 12 84 d2 75 1e <8b> 07 5d c1 e8 1f c3 0f 1f 80 00 00 00 00 84 d2 74 ee 80 fa 05 
RIP  [<ffffffffa031aae2>] rtl92ce_get_desc+0x12/0x50 [rtl8192ce]
 RSP <ffff880126105b78>
CR2: 00000000000002c0
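
Editorial sketch for reading the oops above. It pulls the faulting symbol and the fault address out of the text and compares the address with RDI. The helper names (`read_register`, `summarize`) and the regular expressions are assumptions made for illustration, not part of any kernel tooling. In this trace CR2 equals RDI (0x2c0), and the marked code bytes `<8b> 07` are a 32-bit load through RDI, which suggests that `rtl92ce_get_desc` was handed a pointer that is a small offset from NULL.

```python
import re

# Matches the "IP: [<addr>] symbol+0xOFF/0xSIZE [module]" line of an x86-64 oops.
IP_LINE = re.compile(
    r"^IP: \[<[0-9a-f]+>\] (\w+)\+0x([0-9a-f]+)/0x[0-9a-f]+ \[(\w+)\]",
    re.MULTILINE,
)
# Template for a 64-bit register dump entry such as "RDI: 00000000000002c0".
REGISTER = r"\b{name}: ([0-9a-f]{{16}})"


def read_register(oops_text, name):
    """Return the first dumped value of register `name` as an int, or None."""
    match = re.search(REGISTER.format(name=name), oops_text)
    return int(match.group(1), 16) if match else None


def summarize(oops_text):
    """Collect the faulting symbol, its offset, the module, CR2 and RDI."""
    ip = IP_LINE.search(oops_text)
    if ip is None:
        raise ValueError("no 'IP:' line found in oops text")
    return {
        "symbol": ip.group(1),
        "offset": int(ip.group(2), 16),
        "module": ip.group(3),
        "fault_addr": read_register(oops_text, "CR2"),
        "rdi": read_register(oops_text, "RDI"),
    }


# Three lines copied from the oops in this description.
SAMPLE = """\
IP: [<ffffffffa031aae2>] rtl92ce_get_desc+0x12/0x50 [rtl8192ce]
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000002c0
CR2: 00000000000002c0 CR3: 00000001efc3a000 CR4: 00000000000407e0
"""

info = summarize(SAMPLE)
print(info["symbol"], hex(info["offset"]), hex(info["fault_addr"]))
```

The comparison `info["fault_addr"] == info["rdi"]` is only a heuristic; it is meaningful here because the fault occurs 0x12 bytes into the function, before RDI has been overwritten.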

Comment 1 Larry Finger 2012-08-29 15:52:54 UTC
Created attachment 607948 [details]
Trial patch for oops on unload

This oops appears to be from some kind of race condition where the interrupts are disabled too late in some instances. This patch should fix the condition. Please test.

Comment 2 John W. Linville 2012-09-28 15:32:03 UTC
Well, I'm sorry that this has sat for so long...

Test kernels with the patch from comment 1 are building here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=4537582

When they finish, please try them and report the results back here...thanks!

Comment 3 Jonathan Kamens 2012-10-23 01:21:13 UTC
Sorry it's taken so long for me to try this build. I went to the link in comment 2 and tried to download the RPMs to install, but I couldn't find any download links anywhere. Am I missing something?

Comment 4 John W. Linville 2012-10-29 16:50:12 UTC
Test builds expire after some passage of time.  Please try to test these soon after the build completes:

http://koji.fedoraproject.org/koji/taskinfo?taskID=4635944

Comment 5 Jonathan Kamens 2012-10-29 21:52:19 UTC
I tried the test kernel and it still oopses. Stack trace looks exactly the same.

Comment 6 Josh Boyer 2013-01-08 13:34:11 UTC
Are you still seeing this oops with the 3.6.10 or newer kernel updates?

Comment 7 Jonathan Kamens 2013-01-08 17:26:32 UTC
Yes.

Comment 8 Larry Finger 2013-01-08 19:05:29 UTC
Sorry, but I am unable to duplicate this problem. Debugging will be difficult with Fedora needing to build the trials. That introduces such a long delay that it is difficult to keep my train of thought.

The traceback looks as if there was an interrupt after the pci_device_remove. I'm really surprised that the trial patch did not work.

Comment 9 Josh Boyer 2013-01-08 20:38:04 UTC
Jonathan, can you please post the full oops text you see with 3.6.10 or newer?

Comment 10 Jonathan Kamens 2013-01-09 03:26:20 UTC
Created attachment 675195 [details]
3.6.10 Oops

Comment 11 Larry Finger 2013-01-09 17:18:00 UTC
Created attachment 675741 [details]
Second trial patch for oops on unload

This patch not only moves the interrupt disable as was done in the first one, but it also does a check for the dereference of a NULL pointer from the location where the oops actually happens.

Comment 12 John W. Linville 2013-01-09 20:37:38 UTC
Test kernels with the patch from comment 11 are building here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=4852875

Comment 13 Jonathan Kamens 2013-01-10 17:57:58 UTC
Created attachment 676454 [details]
3.6.11-5.bz852761.2.fc17.x86_64 oops

Still happening. Oops log attached.

Comment 14 Larry Finger 2013-01-10 18:58:56 UTC
Created attachment 676477 [details]
Third trial patch for oops on unload

Thanks for the quick testing of the 2nd patch.

It seems that I misread the line that caused the oops. This time, the traceback pointed at a useless debug message that is failing because the device is stopping. This patch removes the offending debug output.

Comment 15 John W. Linville 2013-01-11 19:14:30 UTC
Test kernels with the patch from comment 14 are building here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=4860328

Comment 16 Jonathan Kamens 2013-01-13 12:39:17 UTC
Created attachment 677691 [details]
Still Oops'ing

Comment 17 Mark van Rossum 2013-02-12 20:58:24 UTC
I see the same with F18.

The reason that I did rmmod rtl8192ce is that I couldn't connect to the wireless
(in the past this sometimes helped).

Feb 12 20:40:19 x220 NetworkManager[830]: <info> (wlan0): supplicant interface state: authenticating -> disconnected
Feb 12 20:40:19 x220 NetworkManager[830]: <info> (wlan0): supplicant interface state: disconnected -> scanning


uname:
Linux x220 3.7.4-204.fc18.x86_64 #1 SMP Wed Jan 23 16:44:29 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

lspci:
03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8188CE 802.11b/g/n WiFi Adapter (rev 01)

Comment 18 Jonathan Kamens 2013-02-13 17:11:55 UTC
(In reply to comment #17)
> The reason that I did rmmod rtl8192ce is that I couldn't connect to the
> wireless
> (in the past this sometimes helped).

Me too. See bug 852737, which I filed last August and has had absolutely no activity since then.

Comment 19 John Greene 2013-05-10 13:44:50 UTC
Anything new on this?  Is it still an issue?  I will be engaged on this over the next few days.
Can someone try the latest 3.8.11?  I will check it out myself once I can get upstream working on my system. Until then, where do things stand for you?

Comment 20 Jonathan Kamens 2013-05-16 04:54:17 UTC
Still a problem with most recently released F18 kernel. Can't test anything newer than that right now.

Comment 21 John Greene 2013-05-16 13:41:39 UTC
Tag, I'm it.  I reproduced this on a RHEL experimental driver yesterday, with:
 modprobe -r rtl8192ce
Looking into it now.  Nothing I've found upstream as yet, will be debugging it.

Comment 22 John Greene 2013-05-16 15:34:51 UTC
Ah, I will start with Larry's fix in comment 14. I missed that earlier.

Comment 23 Larry Finger 2013-05-16 20:06:12 UTC
John,

I have never duplicated this oops using openSUSE/KDE/NetworkManager.

I just ran about 50 loops of the following command:

while [ 1 ] ; do sudo modprobe -rv rtl8192ce ; sleep 10 ; sudo modprobe -v rtl8192ce ; sleep 10 ; done

In nearly every case, the wireless connection completed during the 10 second sleep after module loading, and it never generated any kernel fault messages.

Larry
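
Editorial sketch: a loop like the one above only reveals a failure if the kernel log is checked after each cycle. The following is one way that check might look. The function names and the marker strings are assumptions (the markers are taken from the oops in the description), and `reload_cycle` needs root and is not called here.

```python
import subprocess
import time

# Substrings that appear in the oops quoted in the description of this bug.
FAULT_MARKERS = ("BUG: unable to handle kernel", "Oops:")


def find_fault_lines(log_text):
    """Return the log lines that contain any of the fault markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAULT_MARKERS)
    ]


def reload_cycle(module="rtl8192ce", settle_seconds=10):
    """Unload and reload `module`, then scan dmesg output. Requires root."""
    subprocess.run(["modprobe", "-rv", module], check=True)
    time.sleep(settle_seconds)
    subprocess.run(["modprobe", "-v", module], check=True)
    time.sleep(settle_seconds)
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    # dmesg is cumulative, so a fault from an earlier cycle is reported again.
    return find_fault_lines(log)
```

Calling `reload_cycle()` in a loop and stopping at the first non-empty result would approximate the test described in this comment, under the assumption that the fault leaves one of the marker strings in the kernel log.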

Comment 24 John Greene 2013-05-17 13:49:19 UTC
Hmm.  I have to finish something first; later today I will check whether my crash has the same signature as above, but it is quite reproducible here. Being able to see it is a good first step.

With the code from comment 14 in place, the issue still occurs.

Comment 25 John Greene 2013-05-17 19:29:31 UTC
I have duplicated this issue in my version of this driver, and it is very repeatable.  My version contains the patch from comment 14. I am looking into it myself as well.

Comment 26 Larry Finger 2013-05-17 19:52:38 UTC
Sorry that I am unable to help you.

Are you using kernel 3.5 as the OP did? I was testing with 3.10-rc1. Perhaps something changed at the mac80211 level in the interim, or there is a fundamental difference between Fedora and openSUSE 12.3 user code.

Comment 27 John Greene 2013-05-22 13:24:13 UTC
Larry: Just got new hardware to be able to test upstream stuff now.  My work involves 3.5 version of mac80211, but 3.9+ driver (on RHEL but still very close).  It may well be a difference in mac80211, lemme kick it around and see.


Jonathan:
> Still a problem with most recently released F18 kernel. Can't test anything
> newer than that right now.
Ok, can you at least give the output of uname -r of your system here?  Thanks.

Comment 28 Jonathan Kamens 2013-05-27 18:48:52 UTC
jik-thinkpad:~!999$ uname -a
Linux jik-thinkpad 3.9.3-201.fc18.x86_64 #1 SMP Tue May 21 17:02:24 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Comment 29 John Greene 2013-05-28 18:17:15 UTC
Thanks Jonathan,

Hmm.  It looks like this may still be an issue on 3.9.3 with the uplevel mac80211.  If so, that lends some support to my reproducing system (uplevel driver and v3.5 mac80211).

Larry, I'll take a look at this on F18 soon and see if I can reproduce it there.  I get this exact signature. Any other ideas at this time?

Comment 30 Fedora End Of Life 2013-07-03 23:04:11 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 31 Jonathan Kamens 2013-07-08 01:40:11 UTC
Still broken in F18. Updating version.

Comment 32 John Greene 2013-07-23 16:19:32 UTC
(In reply to Jonathan Kamens from comment #31)
> Still broken in F18. Updating version.

Jonathan,

Thanks for updating to F18.  I took a quick look today and see a few updates out there but nothing strikes me as applicable right out.

Can you post the output of uname -r for the kernel you tested?  I see that 3.9 has an update to the vendor driver, so I would like to know what you are testing on F18: did you test just the stock version, or 3.9.x?

This gives what I need.
uname -r

Comment 33 Jonathan Kamens 2013-07-25 12:27:53 UTC
I'm on Fedora 19 now. kernel-3.9.9-302.fc19.x86_64. The problem is still there.

Comment 34 Justin M. Forbes 2013-10-18 20:57:39 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.

Comment 35 Jonathan Kamens 2013-10-18 21:13:20 UTC
Still happening in F19 with 3.11.4-201.fc19.x86_64.

Comment 36 Mark van Rossum 2013-11-29 21:57:23 UTC
I meanwhile got another wifi card off eBay, but anyway
this bug might well be hardware specific, so please report your exact hardware when reporting.

Comment 37 John Greene 2013-12-04 20:58:45 UTC
I haven't the bandwidth at the moment.  I believe the issue is as follows (it has been a while since I looked):
1. Driver exit is called.
2. The interrupt is disabled and released, or so it seems.
3. During the release/disable, the ISR gets called and dies in the Rx processing.

It should be a straightforward fix:
1. Ensure pending ISRs are cleared and/or disabled.
2. Have the ISR check whether exit is in progress and back away if it is called.

I have no capacity for this for a bit. Sorry for the delay here.

Comment 38 Larry Finger 2013-12-04 22:09:37 UTC
I think the analysis is correct. What I do not understand is why it happens on the OP's system and not mine. I ran a test of 2000 unload/load cycles on the module without a single failure.

The shutdown routine disables interrupts as soon as it can, then does some other cleanups before freeing the irq. I suppose I could add in a delay in the middle, but that just seems like a band-aid.

What would cause a pending interrupt to be delayed longer on one system than another?

Comment 39 Larry Finger 2013-12-10 21:38:02 UTC
Created attachment 834953 [details]
Patch to fix problem

The problem in rtl92c_get_desc() is fixed by checking for a NULL pointer to the descriptor.

I still have no idea why this problem only happens with Fedora installations, and not for any others.

There may be similar patches needed for the other PCI adapters in the rtlwifi tree. Now that I have an f19 setup, I can test.

Comment 40 Larry Finger 2013-12-11 23:04:30 UTC
Created attachment 835515 [details]
A better patch

The previous patch sometimes failed. This one is more robust.

Comment 41 Justin M. Forbes 2014-01-03 22:03:48 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 42 Jonathan Kamens 2014-01-05 03:02:06 UTC
Still broken in 3.12.6 in F20.

Comment 43 Larry Finger 2014-01-05 04:44:40 UTC
This bug is fixed by commit 9278db6279e28d4d433bc8a848e10b4ece8793ed in the wireless-testing tree. It has been pushed to the linux-net tree and should be in mainline before 3.13 is released. Once there, it will be added to all stable kernels.

The patch is the same as the one listed in the attachments.

Comment 44 Josh Boyer 2014-01-06 19:00:58 UTC
The patch Larry mentions is in 3.13-rc7, so once DaveM bundles things up it should hit stable shortly.  We'll put this in POST for now, and hopefully it makes 3.12.7.  If not, we'll apply ourselves.

(Thanks Larry!)

Comment 45 Josh Boyer 2014-01-12 14:19:47 UTC
3.12.7 is in updates-testing now.

Comment 46 Jonathan Kamens 2014-01-15 21:59:37 UTC
Kernel in updates-testing appears to fix the issue.

Comment 47 Justin M. Forbes 2014-02-24 13:51:34 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 48 Jonathan Kamens 2014-02-24 16:40:05 UTC
Still a problem.

Comment 49 Larry Finger 2014-02-24 16:55:18 UTC
Does your kernel contain the patch mentioned in comment 43?

Comment 50 Jonathan Kamens 2014-02-24 17:01:10 UTC
I have no idea. I am using the stock Fedora 20 kernel that I just got with "yum update" this morning, 3.13.3-201.fc20.x86_64.

Comment 51 John Greene 2014-02-24 19:22:05 UTC
The patch in comment 43 appears in kernel 3.12 upstream and, at least at a quick look, appears to be in the kernel from comment 50.

Comment 52 Larry Finger 2014-02-24 19:56:21 UTC
The patch was merged between 3.13-rc4 and 3.13-rc5. It has to be in Jonathan's 3.13.3-201.fc20.x86_64.

I need to see the kernel splat that is output.
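
Editorial sketch of the version reasoning above: a release string from `uname -r` can be reduced to a numeric tuple and compared against the first mainline release that carries the patch, taken here to be 3.13 as stated in this comment. The function names are hypothetical. The comparison cannot see stable or distribution backports, and it does not distinguish -rc pre-releases from the final release.

```python
import re


def release_tuple(release):
    """Parse the leading 'major.minor[.patch]' of a uname -r string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. "3.13-rc7") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def at_least(release, first_fixed=(3, 13, 0)):
    """True if `release` is numerically at or after `first_fixed`."""
    return release_tuple(release) >= first_fixed


print(at_least("3.13.3-201.fc20.x86_64"))  # kernel from comment 50
print(at_least("3.11.4-201.fc19.x86_64"))  # kernel from comment 35
```

A numeric match only says the patch should be present; confirming it still requires looking at the source of the kernel actually installed.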

Comment 53 Jonathan Kamens 2014-02-24 20:35:06 UTC
Created attachment 867126 [details]
3.13.3-201.fc20.x86_64 kernel splat

Comment 54 Larry Finger 2014-02-25 06:56:03 UTC
What are the exact steps you are doing? The reason I ask is that I installed F20 and 3.13.3-201.fc20.x86_64, and configured a wireless network on an RTL8188CE using rtl8192ce. When I used 'sudo modprobe -rv rtl8192ce', it unloaded just the way I would expect - no kernel oops.

Comment 55 Jonathan Kamens 2014-02-25 20:47:37 UTC
"rmmod rtl8192ce". That's it.

Comment 56 Larry Finger 2014-02-25 21:39:45 UTC
Whenever a module has dependent modules, "modprobe -r" will remove everything just as modprobe without the -r will load everything. As rtl8192ce has rtl8192c-common, rtlwifi, and pci as dependent modules, modprobe is definitely preferable. I don't have the device that uses rtl8192ce in a computer right now so I cannot tell if rmmod will error here.

Comment 57 Jonathan Kamens 2014-02-25 21:44:37 UTC
I thought rmmod would refuse to remove modules with other modules dependent on them? I thought I'd seen errors in the past where I tried to remove a module and it refused to let me because other modules were dependent on it.

I don't get the Oops when I use modprobe -r, but regardless of whether that's the case, is it really ok for rmmod of a module to cause an Oops?

Comment 58 Larry Finger 2014-02-25 22:32:16 UTC
No it should not, and the interrupt should have been disabled *before* rtl8192ce was removed.

No matter which command you use, the dependent module cannot be removed until the one using it is removed. In other words, 'modprobe -r rtl8192ce' is the same as:

rmmod rtl8192ce.ko
rmmod rtl8192c-common.ko
rmmod rtlwifi.ko
rmmod rtl_pci.ko

They do have to be removed in that order.
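
Editorial sketch of the ordering rule above: a module cannot be removed while another loaded module still uses it. The checker below validates a proposed removal order against a dependency map. The function name is hypothetical, and the map contains only the relationship stated in comment 56 (rtl8192ce uses the other three); any dependencies among the three helper modules are not stated in this thread and are left out.

```python
def first_violation(order, depends_on):
    """Return (module, user) for the first module that would be removed
    while a still-loaded module depends on it, or None if the order is safe."""
    loaded = set(order)
    for module in order:
        users = [
            other
            for other in order
            if other in loaded
            and other != module
            and module in depends_on.get(other, ())
        ]
        if users:
            return module, users[0]
        loaded.discard(module)
    return None


# Only the relationship given in comment 56; other edges are unknown here.
DEPENDS_ON = {"rtl8192ce": ["rtl8192c-common", "rtlwifi", "rtl_pci"]}

order_from_comment = ["rtl8192ce", "rtl8192c-common", "rtlwifi", "rtl_pci"]
print(first_violation(order_from_comment, DEPENDS_ON))
print(first_violation(["rtlwifi", "rtl8192ce"], DEPENDS_ON))
```

With this map the order listed in the comment passes, while trying to remove rtlwifi before rtl8192ce is flagged, which matches the behaviour Jonathan recalls in comment 57 where rmmod refuses to remove a module that is still in use.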

Comment 59 Justin M. Forbes 2014-05-21 19:37:05 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 60 Justin M. Forbes 2014-06-23 14:49:08 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.