Bug 561762

Summary:	[abrt] crash in kernel (actually a WARNING)
Product:	Red Hat Enterprise Linux 6	Reporter:	Jay Turner <jturner>
Component:	kernel	Assignee:	John W. Linville <linville>
Status:	CLOSED DUPLICATE	QA Contact:	desktop-bugs <desktop-bugs>
Severity:	medium	Docs Contact:
Priority:	low
Version:	6.0	CC:	arozansk, benl, cmeadors, emcnabb, h.stilmack, louisjohnread, mishu, reinette.chatre, srevivo, tburke, woodard
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:	abrt_hash:489771501
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-08-13 10:16:39 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	534148

Description Jay Turner 2010-02-04 08:50:43 UTC

abrt 1.0.6 detected a crash.

architecture: x86_64
cmdline: not_applicable
component: kernel
executable: kernel
kernel: 2.6.32-13.el6.x86_64
package: kernel
release: Red Hat Enterprise Linux release 6.0 Beta (Santiago)

kerneloops
-----
------------[ cut here ]------------
WARNING: at net/wireless/core.c:614 wdev_cleanup_work+0x4d/0xb3 [cfg80211]() (Not tainted)
Hardware name: VGN-AW230J
Modules linked in: tun(U) nfs(U) fscache(U) fuse(U) rfcomm(U) sco(U) bridge(U) stp(U) llc(U) bnep(U) l2cap(U) nfsd(U) lockd(U) nfs_acl(U) auth_rpcgss(U) exportfs(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) xt_physdev(U) ip6t_REJECT(U) nf_conntrack_ipv6(U) ip6table_filter(U) ip6_tables(U) ipv6(U) dm_mirror(U) dm_region_hash(U) dm_log(U) sha256_generic(U) cryptd(U) aes_x86_64(U) aes_generic(U) cbc(U) dm_crypt(U) kvm(U) uinput(U) snd_hda_codec_nvhdmi(U) snd_hda_codec_realtek(U) arc4(U) ecb(U) snd_hda_intel(U) snd_hda_codec(U) snd_hwdep(U) snd_seq(U) snd_seq_device(U) iwlagn(U) snd_pcm(U) iwlcore(U) snd_timer(U) uvcvideo(U) sdhci_pci(U) snd(U) mac80211(U) sdhci(U) iTCO_wdt(U) sony_laptop(U) soundcore(U) cfg80211(U) btusb(U) mmc_core(U) sky2(U) iTCO_vendor_support(U) bluetooth(U) videodev(U) sg(U) sr_mod(U) v4l1_compat(U) v4l2_compat_ioctl32(U) joydev(U) snd_page_alloc(U) rfkill(U) cdrom(U) i2c_i801(U) serio_raw(U) raid1(U) dm_multipath(U) sd_mod(U) crc_t10dif(U) ata_generic(U) pata_acpi(U) firewire_ohci(U) ahci(U) pata_jmicron(U) firewire_core(U) crc_itu_t(U) nouveau(U) ttm(U) drm_kms_helper(U) drm(U) i2c_algo_bit(U) i2c_core(U) dm_mod(U) [last unloaded: scsi_wait_scan]
Pid: 12911, comm: events/1 Not tainted 2.6.32-13.el6.x86_64 #1
Call Trace:
[<ffffffff8105240e>] ? warn_slowpath_common+0x7e/0x97
[<ffffffffa01ae4dd>] ? wdev_cleanup_work+0x4d/0xb3 [cfg80211]
[<ffffffff8106b6fa>] ? worker_thread+0x19b/0x227
[<ffffffff8106f67b>] ? autoremove_wake_function+0x0/0x2a
[<ffffffff8106b55f>] ? worker_thread+0x0/0x227
[<ffffffff8106f3c0>] ? kthread+0x75/0x7d
[<ffffffff81011b4a>] ? child_rip+0xa/0x20
[<ffffffff8106f34b>] ? kthread+0x0/0x7d
[<ffffffff81011b40>] ? child_rip+0x0/0x20

How to reproduce
-----
1. I was attempting to suspend my laptop.  Have only seen this once, but wanted to get it on-record in case others hit a similar issue.

Comment 1 RHEL Program Management 2010-02-04 08:57:23 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 John W. Linville 2010-03-16 19:57:21 UTC

*** Bug 572371 has been marked as a duplicate of this bug. ***

Comment 3 John W. Linville 2010-03-16 20:02:48 UTC

Reinette, we are hitting this some with our (2.6.32-based) iwlagn drivers in RHEL6.  I think this relates to bringing the device down while a scan is pending?  Do you have any suggestions for avoiding this WARNING?

Comment 5 reinette chatre 2010-03-16 23:37:29 UTC

Please try:

commit 2ef6e4440926668cfa9eac4b79e63528ebcbe0c1
Author: Johannes Berg <johannes>
Date:   Tue Oct 20 15:08:12 2009 +0900

    mac80211: keep auth state when assoc fails
    
    When association fails, we should stay authenticated,
    which in mac80211 is represented by the existence of
    the mlme work struct, so we cannot free that, instead
    we need to just set it to idle.
    
    (Brought to you by the hacking session at Kernel Summit 2009 in Tokyo,
    Japan. -- JWL)
    
    Signed-off-by: Johannes Berg <johannes>
    Signed-off-by: John W. Linville <linville>

commit 7400f42e9d765fa0656b432f3ab1245f9710f190
Author: Johannes Berg <johannes>
Date:   Sat Oct 31 07:40:37 2009 +0100

    cfg80211: fix NULL ptr deref
    
    commit 211a4d12abf86fe0df4cd68fc6327cbb58f56f81
      Author: Johannes Berg <johannes>
      Date:   Tue Oct 20 15:08:53 2009 +0900
    
          cfg80211: sme: deauthenticate on assoc failure
    
    introduced a potential NULL pointer dereference that
    some people have been hitting for some reason -- the
    params.bssid pointer is not guaranteed to be non-NULL
    for what seems to be a race between various ways of
    reaching the same thing.
    
    While I'm trying to analyse the problem more let's
    first fix the crash. I think the real fix may be to
    avoid doing _anything_ if it ended up being NULL, but
    right now I'm not sure yet.
    
    I think
    http://bugzilla.kernel.org/show_bug.cgi?id=14342
    might also be this issue.
    
    Reported-by: Parag Warudkar <parag.lkml>
    Tested-by: Parag Warudkar <parag.lkml>
    Signed-off-by: Johannes Berg <johannes>
    Signed-off-by: John W. Linville <linville>

Comment 6 reinette chatre 2010-03-17 14:49:51 UTC

Do you only get this single warning? Please see http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2134 where this same warning appeared together with a warning in ieee80211_scan_completed. The fix for this was:

commit 6d3560d4fc9c5b9fe1a07a63926ea70512c69c32
Author: Johannes Berg <johannes>
Date:   Sat Oct 31 07:44:08 2009 +0100

    mac80211: fix scan abort sanity checks
    
    Since sometimes mac80211 queues up a scan request
    to only act on it later, it must be allowed to
    (internally) cancel a not-yet-running scan, e.g.
    when the interface is taken down. This condition
    was missing since we always checked only the
    local->scanning variable which isn't yet set in
    that situation.
    
    Reported-by: Luis R. Rodriguez <mcgrof>
    Signed-off-by: Johannes Berg <johannes>
    Signed-off-by: John W. Linville <linville>

Comment 7 John W. Linville 2010-03-17 15:02:52 UTC

Reinette, thanks for the suggestions but we already have those commits in the
RHEL6 kernels.

Jay and Tim, can you reliably trigger the WARNING in the kernel logs?

Comment 8 Jay Turner 2010-03-17 17:04:26 UTC

Sadly, I can't.  Just happened to get it that once while attempting to suspend.

I'll play around with suspend/resume a bit today and see if I can come up with a reliable reproducer.

Comment 9 John Read 2010-03-19 11:53:28 UTC

I have just started seeing this warning upon resuming from suspend. I can reliably reproduce it, if that is of any assistance.

Comment 10 John W. Linville 2010-03-19 13:57:46 UTC

Thanks, John -- it might be.  Reinette, I would welcome any further suggestions you might have to help pinpoint the issue.

Comment 11 John W. Linville 2010-03-22 19:09:16 UTC

*** Bug 575486 has been marked as a duplicate of this bug. ***

Comment 12 reinette chatre 2010-03-23 20:25:29 UTC

(In reply to comment #10)
> Thanks, John -- it might be.  Reinette, I would welcome any further suggestions
> you might have to help pinpoint the issue.    

I do not have any other ideas to try out. Is it possible to gather more information? Since you can see this when you resume from suspend, could you please do the following:

- please add a "dump_stack()" to the beginning of ieee80211_scan_completed() so that we can know exactly who is calling it.
- run iwlwifi with debugging of 0x43fff, which includes scanning debugging. Since you only see this when you resume from suspend you can enable the debugging before you suspend like so:
# echo 0x43fff > /sys/class/net/wlanX/device/debug_level
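
The second step can also be scripted. What follows is only a rough sketch in Python 3, assuming the driver was built with debugging so that the debug_level file exists. The helper name set_iwl_debug_level and its sysfs_root parameter are made up for illustration; only the sysfs path layout and the 0x43fff mask come from the command above.

```python
from pathlib import Path


def set_iwl_debug_level(iface, level=0x43fff, sysfs_root="/sys/class/net"):
    """Write the iwlwifi debug mask for `iface` and return the path written.

    `sysfs_root` is a hypothetical parameter so the helper can be pointed
    at a scratch directory instead of the real sysfs tree.
    """
    path = Path(sysfs_root) / iface / "device" / "debug_level"
    if not path.exists():
        # Typically means the wrong interface name or a driver without debugging.
        raise FileNotFoundError(f"{path} not found")
    # Same effect as: echo 0x43fff > /sys/class/net/wlanX/device/debug_level
    path.write_text(f"{level:#x}\n")
    return path
```

It needs to run as root before the suspend, e.g. set_iwl_debug_level("wlan0"), where "wlan0" stands in for whatever wlanX is on the machine.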

Comment 14 John W. Linville 2010-03-24 20:07:58 UTC

Reinette, I've built a kernel for Jay and asked him to test and provide the feedback you requested -- thanks!

Comment 18 Eric Sandeen 2010-03-26 18:16:23 UTC

*** Bug 577199 has been marked as a duplicate of this bug. ***

Comment 19 John W. Linville 2010-03-26 19:31:23 UTC

John Read, can you try the test kernels available here?

http://people.redhat.com/linville/kernels/rhel6/

Bonus points if you try-out the yum repo for the installation... :-)  In any case, do those kernels address the issue?

Comment 20 John Read 2010-03-27 13:37:27 UTC

I can perhaps try Sunday night or early next week. However, I am somewhat new to this, so are there instructions somewhere?

Thanks, John

Comment 21 John W. Linville 2010-03-29 13:50:04 UTC

Mostly just click on the link for the jwltest-release rpm to install it, then issue the command provided right below it. :-)

Comment 22 John Read 2010-04-05 10:08:56 UTC

Sorry I have not yet tested this... I was uncertain of the utility of doing so, as I can no longer reproduce the crash reliably. Let me know if you still want me to try, though it may be inconclusive since I cannot reproduce the error.

Comment 23 John W. Linville 2010-04-05 14:14:31 UTC

Any testing would be welcome. :-)

Comment 24 John W. Linville 2010-04-06 17:47:46 UTC

John Read, I'm terribly sorry but due to an internal policy decision I've been required to remove my test kernels from people.redhat.com.  If you have not already installed the test kernel referenced above you will not be able to test it.

I sincerely apologize for the confusion and for whatever inconvenience this might cause.  I hope that we will find a way to address the issue you are experiencing but at this point I'm not sure how that will happen. :-(

Comment 25 John W. Linville 2010-04-09 19:11:38 UTC

OK, the situation in comment 24 has been resolved.  I again have test kernels available at the location from comment 19.  The ones there now are equivalent to the -19.el6 kernels but w/ the Intel wireless drivers back-ported from 2.6.33.  If anyone can reliably recreate the problem reported here then please give those kernels a try and report the results below -- thanks!

Comment 26 John Read 2010-04-10 02:35:28 UTC

John Linville,

I will install the kernel this weekend. However, it may be some time before I can determine if the warning issue is resolved as I only experience it occasionally.

One point of clarification -- do I install jwltest-release-6-2.noarch.rpm and then run the yum command as outlined on the link in comment 19?

Regards,

John

Comment 27 John W. Linville 2010-04-12 13:32:16 UTC

Yes, precisely -- thanks!

Comment 28 John W. Linville 2010-04-13 14:25:13 UTC

So as I read the code that creates the warning as shown in the original report here, this happens when a device is going down while a scan is pending.  After the warning, the scan is aborted.  My impression from the code is that life should go-on after that with only the log SPAM as a consequence.  In other words, the device should be able to continue operation afterwards.  Jay/Tim/John, is this not the case?
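
To make that reading concrete, here is a toy userspace model in Python 3. It is not the cfg80211 code: the class names, fields and the cleanup_work function are hypothetical stand-ins, and the only behaviour modelled is what is described above (warn if a scan is still pending on the interface going down, mark it aborted, complete it, carry on).

```python
import warnings


class ScanRequest:
    """Hypothetical stand-in for a pending scan request."""

    def __init__(self, dev):
        self.dev = dev          # interface the scan was requested on
        self.aborted = False


class RegisteredDevice:
    """Toy per-device state; not the kernel structure."""

    def __init__(self):
        self.scan_req = None    # at most one pending scan
        self.completed = []     # (dev, aborted) records of finished scans

    def scan_done(self):
        # Clear the pending request and record how it ended.
        req, self.scan_req = self.scan_req, None
        self.completed.append((req.dev, req.aborted))


def cleanup_work(rdev, dev):
    """Model of the cleanup run when `dev` goes down.

    Returns True if a pending scan had to be aborted (the WARNING case).
    """
    req = rdev.scan_req
    if req is not None and req.dev == dev:
        warnings.warn(f"scan still pending on {dev} during cleanup")
        req.aborted = True
        rdev.scan_done()
        return True
    return False
```

In this model the device state is clean after the warning (scan_req is None again), which is why I would expect only the log noise and no loss of function.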

Comment 29 John W. Linville 2010-04-19 19:34:31 UTC

jwltest.8 has "mac80211: fix deferred hardware scan requests", which I think specifically addresses the issue causing this warning:

http://people.redhat.com/linville/kernels/rhel6/

Please give it a try, especially anyone that can reliably recreate this issue!

Comment 31 John W. Linville 2010-04-22 19:10:44 UTC

*** Bug 582594 has been marked as a duplicate of this bug. ***

Comment 32 John W. Linville 2010-04-30 13:58:24 UTC

*** Bug 585983 has been marked as a duplicate of this bug. ***

Comment 33 Eric Sandeen 2010-05-06 18:59:46 UTC

*** Bug 589615 has been marked as a duplicate of this bug. ***

Comment 34 Ben Woodard 2010-05-06 20:12:08 UTC

*** Bug 589752 has been marked as a duplicate of this bug. ***

Comment 35 Ben Woodard 2010-05-06 20:24:21 UTC

Seeing this with 2.6.32-23.el6.x86_64. Shouldn't the patch that you posted on 4/21 be in that kernel?
I don't remember seeing this with kernel-2.6.32-20.el6.sg11y_revert_drm.x86_64,
but kernel-2.6.32-22.el6.x86_64 didn't work with my WiFi AP. 23 was the one that was supposed to fix the problems that I saw with 20.

Comment 36 John W. Linville 2010-05-06 20:45:40 UTC

Define "should" -- that patch does not appear to be in -23 or -24 either.
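
For anyone wanting to check a given build themselves, a rough Python 3 sketch follows. It assumes the package changelog cites either the bug number or the patch subject, which may not hold for every entry; the function names are made up for illustration, and the only external command used is "rpm -q --changelog".

```python
import subprocess


def changelog_mentions(changelog_text, needle):
    """Return the changelog lines that contain `needle` (e.g. a bug number)."""
    return [line.strip() for line in changelog_text.splitlines() if needle in line]


def rpm_changelog(package):
    """Fetch the changelog of an installed package via `rpm -q --changelog`."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Usage would be along the lines of changelog_mentions(rpm_changelog("kernel-2.6.32-23.el6"), "561762"); an empty list suggests the patch is not in that build, subject to the assumption above.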

Comment 37 Aristeu Rozanski 2010-05-06 20:52:29 UTC

Patch 20100421180543.GB5557 isn't committed. No ACKs so far. And unless
it gets rhel-6.0.0+, it won't be in beta2.
http://patchwork.usersys.redhat.com/patch/24285

Comment 40 John W. Linville 2010-05-07 14:27:46 UTC

QA ping?

Comment 41 John W. Linville 2010-05-07 14:58:32 UTC

OK, now what does it take to get rhel-6.0.0 set?

Comment 42 Aristeu Rozanski 2010-05-11 19:30:59 UTC

Patch(es) available on kernel-2.6.32-25.el6

Comment 45 Eric Sandeen 2010-05-13 17:17:15 UTC

*** Bug 589494 has been marked as a duplicate of this bug. ***

Comment 46 Eric Sandeen 2010-05-13 17:17:54 UTC

*** Bug 591678 has been marked as a duplicate of this bug. ***

Comment 47 Jay Turner 2010-05-24 20:57:40 UTC

*** Bug 595516 has been marked as a duplicate of this bug. ***

Comment 48 Jay Turner 2010-05-24 20:58:35 UTC

Just reproduced with 2.6.32-28.el6.

Comment 49 John W. Linville 2010-05-25 12:58:48 UTC

Was that during a suspend/resume?  Or some other activity?

Comment 50 Jay Turner 2010-05-25 13:07:01 UTC

Every time that I have seen this was during a suspend/resume cycle.  I tried reproducing this morning here at the office and the machine survived a series of 5 cycles.  I will continue poking around, but still cannot nail it down to a specific reproducer.
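
One way to automate the poking around, as a sketch only (Python 3, run as root): it assumes rtcwake from util-linux is installed and that the machine can suspend to RAM with "rtcwake -m mem". The function names and defaults are hypothetical; the pattern matched is the WARNING line from the original report.

```python
import re
import subprocess

# Matches e.g. "WARNING: at net/wireless/core.c:614 wdev_cleanup_work+0x4d/0xb3"
WARNING_RE = re.compile(r"WARNING: at net/wireless/core\.c:\d+ wdev_cleanup_work")


def count_wdev_warnings(log_text):
    """Count occurrences of the wdev_cleanup_work WARNING in kernel log text."""
    return len(WARNING_RE.findall(log_text))


def suspend_cycles(cycles=5, sleep_seconds=30):
    """Suspend to RAM `cycles` times and return the warning count after each resume.

    Counts are cumulative over the dmesg buffer, so a rise between two
    entries means the warning fired during that cycle.
    """
    counts = []
    for _ in range(cycles):
        # Suspend to RAM and let the RTC wake the machine after sleep_seconds.
        subprocess.run(["rtcwake", "-m", "mem", "-s", str(sleep_seconds)], check=True)
        log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
        counts.append(count_wdev_warnings(log))
    return counts
```

This only cycles suspend/resume blindly; it does nothing to force a scan to be pending at suspend time, so it may still take many cycles to hit.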