667459 – Hard Freeze While Using iwlagn

Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	14
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	urgent
Assignee:	Stanislaw Gruszka
QA Contact:	Fedora Extras Quality Assurance
Duplicates (1):	676196

Reported:	2011-01-05 16:18 UTC by Mathieu Chouquet-Stringer
Modified:	2011-02-13 20:53 UTC
CC List:	12 users
Fixed In Version:	kernel-2.6.35.10-77.fc14
Last Closed:	2011-02-11 02:46:35 UTC

Attachments
0001-mac80211-fix-addba_resp_timer-hard-lockup.patch (1.16 KB, text/plain) 2011-01-06 14:01 UTC, Stanislaw Gruszka

Description Mathieu Chouquet-Stringer 2011-01-05 16:18:26 UTC

Description of problem:
Using iwlagn on my Thinkpad T61p I can reliably freeze the OS while using wifi.

I never got anything in /var/log/messages, even after multiple crashes, so I set up kdump and was able to capture multiple traces; they all look similar.

Version-Release number of selected component (if applicable):
2.6.35.10-74.fc14.x86_64


Steps to Reproduce:
1. Use wifi
2. Wait some random amount of time (usually less than an hour, sometimes it doesn't even take 5 minutes)
3. Profit, I mean crash
  
Actual results:
The kernel is frozen solid; the only "fix" is to reboot.

Additional info:

Here's the first trace (this is dmesg saved by kdump):
<4>[ 1635.418366] iwlagn 0000:03:00.0: iwlagn_tx_agg_start on ra = c0:3f:0e:7a:90:34 tid = 0
<4>[ 1663.439932] iwlagn 0000:03:00.0: iwlagn_tx_agg_start on ra = c0:3f:0e:7a:90:34 tid = 0
<6>[ 1678.690234] SysRq : Trigger a crash
<1>[ 1678.690274] BUG: unable to handle kernel NULL pointer dereference at (null)
<1>[ 1678.690283] IP: [<ffffffff812ba89b>] sysrq_handle_crash+0x16/0x20
<4>[ 1678.690300] PGD 0 
<0>[ 1678.690307] Oops: 0002 [#1] SMP 
<0>[ 1678.690314] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0/statistics/collisions
<4>[ 1678.690323] CPU 0 
<4>[ 1678.690327] Modules linked in: tcp_lp nfs lockd fscache nfs_acl auth_rpcgss fuse rfcomm sco bnep l2cap cryptd aes_x86_64 aes_generic sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ipt_MASQUERADE iptable_nat nf_nat ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm uinput arc4 ecb snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep iwlagn iwlcore mac80211 cfg80211 thinkpad_acpi snd_seq snd_seq_device btusb snd_pcm snd_timer e1000e snd i2c_i801 r852 sm_common nand nand_ids nand_ecc bluetooth mtd soundcore iTCO_wdt iTCO_vendor_support snd_page_alloc rfkill wmi microcode btrfs zlib_deflate libcrc32c sdhci_pci sdhci mmc_core firewire_ohci yenta_socket firewire_core crc_itu_t nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core [last unloaded: scsi_wait_scan]
<4>[ 1678.690487] 
<4>[ 1678.690494] Pid: 0, comm: swapper Not tainted 2.6.35.10-74.fc14.x86_64 #1 6458V5C/6458V5C
<4>[ 1678.690501] RIP: 0010:[<ffffffff812ba89b>]  [<ffffffff812ba89b>] sysrq_handle_crash+0x16/0x20
<4>[ 1678.690515] RSP: 0018:ffff88000a203950  EFLAGS: 00010082
<4>[ 1678.690521] RAX: 0000000000000010 RBX: 0000000000000063 RCX: 0000000000003f7b
<4>[ 1678.690528] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
<4>[ 1678.690534] RBP: ffff88000a203950 R08: 0000000000000001 R09: ffffffffffffffff
<4>[ 1678.690541] R10: ffff88000a203850 R11: 0000000000000000 R12: 0000000000000000
<4>[ 1678.690547] R13: ffffffff81a8b880 R14: 0000000000000007 R15: 0000000000000086
<4>[ 1678.690555] FS:  0000000000000000(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
<4>[ 1678.690563] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 1678.690569] CR2: 0000000000000000 CR3: 0000000001a42000 CR4: 00000000000006f0
<4>[ 1678.690576] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1678.690583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 1678.690591] Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a4a020)
<0>[ 1678.690596] Stack:
<4>[ 1678.690600]  ffff88000a2039a0 ffffffff812bae18 ffff88000a203970 ffffffff00000001
<4>[ 1678.690610] <0> ffff88000a2039c0 ffff8800378a2a80 0000000000000000 000000000000002e
<4>[ 1678.690621] <0> 0000000000000001 0000000000000001 ffff88000a2039b0 ffffffff812bafa5
<0>[ 1678.690634] Call Trace:
<0>[ 1678.690638]  <IRQ> 
<4>[ 1678.690650]  [<ffffffff812bae18>] __handle_sysrq+0xab/0x14a
<4>[ 1678.690660]  [<ffffffff812bafa5>] sysrq_filter+0x94/0x9c
<4>[ 1678.690670]  [<ffffffff8135b7a8>] input_pass_event+0x8a/0xbd
<4>[ 1678.690680]  [<ffffffff8135d763>] input_handle_event+0x3c6/0x3d5
<4>[ 1678.690689]  [<ffffffff8135d864>] input_event+0x69/0x87
<4>[ 1678.690700]  [<ffffffff81363c2b>] atkbd_interrupt+0x543/0x645
<4>[ 1678.690712]  [<ffffffff8101057c>] ? native_sched_clock+0x35/0x37
<4>[ 1678.690723]  [<ffffffff8106847b>] ? run_posix_cpu_timers+0x2a/0x5bb
<4>[ 1678.690734]  [<ffffffff81052b08>] ? __raw_local_irq_save+0x1b/0x21
<4>[ 1678.690745]  [<ffffffff81358f02>] serio_interrupt+0x45/0x7f
<4>[ 1678.690754]  [<ffffffff81359c1d>] i8042_interrupt+0x288/0x29a
<4>[ 1678.690764]  [<ffffffff8106dc85>] ? timekeeping_get_ns+0x1b/0x3d
<4>[ 1678.690774]  [<ffffffff810a5ac9>] handle_IRQ_event+0x5a/0x11f
<4>[ 1678.690785]  [<ffffffff810235d8>] ? ack_APIC_irq+0x15/0x17
<4>[ 1678.690794]  [<ffffffff810a7d2b>] handle_edge_irq+0xe2/0x12a
<4>[ 1678.690803]  [<ffffffff8100c2ea>] handle_irq+0x88/0x90
<4>[ 1678.690813]  [<ffffffff8146fb44>] do_IRQ+0x5c/0xb4
<4>[ 1678.690823]  [<ffffffff8146a093>] ret_from_intr+0x0/0x11
<4>[ 1678.690833]  [<ffffffff8107802b>] ? raw_local_irq_restore+0xb/0x12
<4>[ 1678.690843]  [<ffffffff81469c5f>] ? _raw_spin_unlock_irqrestore+0x17/0x19
<4>[ 1678.690854]  [<ffffffff81059fcc>] ? try_to_del_timer_sync+0x77/0x85
<4>[ 1678.690863]  [<ffffffff81059ff3>] ? del_timer_sync+0x19/0x26
<4>[ 1678.690892]  [<ffffffffa0475e9b>] ? ___ieee80211_stop_tx_ba_session+0x3f/0xc9 [mac80211]
<4>[ 1678.690918]  [<ffffffffa0475f76>] ? sta_addba_resp_timer_expired+0x51/0x62 [mac80211]
<4>[ 1678.690929]  [<ffffffff81059e28>] ? run_timer_softirq+0x1d6/0x2a3
<4>[ 1678.690938]  [<ffffffff81071690>] ? clockevents_program_event+0x8e/0x90
<4>[ 1678.690963]  [<ffffffffa0475f25>] ? sta_addba_resp_timer_expired+0x0/0x62 [mac80211]
<4>[ 1678.690975]  [<ffffffff81053a39>] ? __do_softirq+0xdd/0x199
<4>[ 1678.690984]  [<ffffffff8100ca3a>] ? timer_interrupt+0x1e/0x25
<4>[ 1678.690994]  [<ffffffff8100abdc>] ? call_softirq+0x1c/0x30
<4>[ 1678.691002]  [<ffffffff8100c338>] ? do_softirq+0x46/0x82
<4>[ 1678.691006]  [<ffffffff81053b99>] ? irq_exit+0x3b/0x7d
<4>[ 1678.691006]  [<ffffffff8146fb85>] ? do_IRQ+0x9d/0xb4
<4>[ 1678.691006]  [<ffffffff8146a093>] ? ret_from_intr+0x0/0x11
<0>[ 1678.691006]  <EOI> 
<4>[ 1678.691006]  [<ffffffff8128f900>] ? raw_local_irq_enable+0x10/0x12
<4>[ 1678.691006]  [<ffffffff8106b5d8>] ? sched_clock_idle_wakeup_event+0x17/0x1b
<4>[ 1678.691006]  [<ffffffff8129076c>] ? acpi_idle_enter_bm+0x228/0x260
<4>[ 1678.691006]  [<ffffffff81394201>] ? cpuidle_idle_call+0x8b/0xe9
<4>[ 1678.691006]  [<ffffffff81008325>] ? cpu_idle+0xaa/0xcc
<4>[ 1678.691006]  [<ffffffff81451906>] ? rest_init+0x8a/0x8c
<4>[ 1678.691006]  [<ffffffff81ba1c49>] ? start_kernel+0x40b/0x416
<4>[ 1678.691006]  [<ffffffff81ba12c6>] ? x86_64_start_reservations+0xb1/0xb5
<4>[ 1678.691006]  [<ffffffff81ba13c2>] ? x86_64_start_kernel+0xf8/0x107
<0>[ 1678.691006] Code: e0 81 83 e2 03 8a 41 03 c1 e2 04 83 e0 cf 09 d0 88 41 03 c9 c3 55 48 89 e5 0f 1f 44 00 00 c7 05 34 0f a3 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 55 48 89 e5 0f 1f 44 00 00 8d 47 
<1>[ 1678.691006] RIP  [<ffffffff812ba89b>] sysrq_handle_crash+0x16/0x20
<4>[ 1678.691006]  RSP <ffff88000a203950>
<0>[ 1678.691006] CR2: 0000000000000000
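
The frames of interest in the trace above are easier to read once they are pulled out of the log. The following is a minimal Python sketch, not part of the original report: `FRAME_RE`, `parse_frames`, and `SAMPLE` are names made up for illustration, and the regular expression assumes only the frame layout visible in this dmesg output. Frames marked `?` are ones the kernel could not verify while unwinding, so the resulting chain is suggestive rather than definitive.

```python
import re

# One call-trace frame as printed in the dmesg above, for example:
#   [<ffffffffa0475e9b>] ? ___ieee80211_stop_tx_ba_session+0x3f/0xc9 [mac80211]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"          # return address
    r"(?P<unreliable>\?\s+)?"                   # '?' marks an unverified frame
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"              # module name, absent for built-in code
)


def parse_frames(text):
    """Return one dict per call-trace frame found in `text`."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "addr": int(m.group("addr"), 16),
                "symbol": m.group("symbol"),
                "offset": int(m.group("offset"), 16),
                "module": m.group("module"),
                "reliable": m.group("unreliable") is None,
            })
    return frames


# Five frames copied from the first trace in this report.
SAMPLE = """\
<4>[ 1678.690854]  [<ffffffff81059fcc>] ? try_to_del_timer_sync+0x77/0x85
<4>[ 1678.690863]  [<ffffffff81059ff3>] ? del_timer_sync+0x19/0x26
<4>[ 1678.690892]  [<ffffffffa0475e9b>] ? ___ieee80211_stop_tx_ba_session+0x3f/0xc9 [mac80211]
<4>[ 1678.690918]  [<ffffffffa0475f76>] ? sta_addba_resp_timer_expired+0x51/0x62 [mac80211]
<4>[ 1678.690929]  [<ffffffff81059e28>] ? run_timer_softirq+0x1d6/0x2a3
"""

frames = parse_frames(SAMPLE)
# The trace lists the innermost frame first; reverse it to read in call order.
call_chain = [f["symbol"] for f in reversed(frames)]
print(" -> ".join(call_chain))
```

Read in call order, the sample shows the timer softirq running `sta_addba_resp_timer_expired`, which reaches `del_timer_sync` through `___ieee80211_stop_tx_ba_session`.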

Here's a diff of the call trace from my second dump (symbols and offsets are the same; only the mac80211 addresses differ):
- [<ffffffffa0475e9b>] ? ___ieee80211_stop_tx_ba_session+0x3f/0xc9 [mac80211]
- [<ffffffffa0475f76>] ? sta_addba_resp_timer_expired+0x51/0x62 [mac80211]
+ [<ffffffffa039ce9b>] ? ___ieee80211_stop_tx_ba_session+0x3f/0xc9 [mac80211]
+ [<ffffffffa039cf76>] ? sta_addba_resp_timer_expired+0x51/0x62 [mac80211]
  [<ffffffff81059e28>] ? run_timer_softirq+0x1d6/0x2a3
  [<ffffffff81071690>] ? clockevents_program_event+0x8e/0x90
- [<ffffffffa0475f25>] ? sta_addba_resp_timer_expired+0x0/0x62 [mac80211]
+ [<ffffffffa039cf25>] ? sta_addba_resp_timer_expired+0x0/0x62 [mac80211]
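
A quick way to confirm that the two dumps show the same code path is to check that every differing address moved by the same amount. The sketch below is an illustration added for that purpose; the `PAIRS` list simply restates the addresses from the diff above, and the variable names are made up.

```python
# (first dump, second dump) return addresses for the three mac80211 frames
# that differ between the two call traces.
PAIRS = [
    (0xffffffffa0475e9b, 0xffffffffa039ce9b),  # ___ieee80211_stop_tx_ba_session+0x3f
    (0xffffffffa0475f76, 0xffffffffa039cf76),  # sta_addba_resp_timer_expired+0x51
    (0xffffffffa0475f25, 0xffffffffa039cf25),  # sta_addba_resp_timer_expired+0x0
]

# If the module was merely loaded at a different base address, every frame
# shifts by one common, page-aligned amount.
deltas = {first - second for first, second in PAIRS}
same_code_path = len(deltas) == 1
page_aligned = all(d % 0x1000 == 0 for d in deltas)

print([hex(d) for d in deltas], same_code_path, page_aligned)
# prints: ['0xd9000'] True True
```

A single page-aligned delta is what one would expect if mac80211 was loaded at a different address on the second boot, with the crash happening at the same instructions.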


Looking at past kernel bugs, I guess this could be it:

commit 44271488b91c9eecf249e075a1805dd887e222d2
Author: Johannes Berg <johannes.berg>
Date:   Tue Oct 5 21:40:33 2010 +0200

    mac80211: delete AddBA response timer
    
    We never delete the addBA response timer, which
    is typically fine, but if the station it belongs
    to is deleted very quickly after starting the BA
    session, before the peer had a chance to reply,
    the timer may fire after the station struct has
    been freed already. Therefore, we need to delete
    the timer in a suitable spot -- best when the
    session is being stopped (which will happen even
    then) in which case the delete will be a no-op
    most of the time.
    
    I've reproduced the scenario and tested the fix.
    
    This fixes the crash reported at
    http://mid.gmane.org/4CAB6F96.6090701@candelatech.com
    
    Cc: stable
    Reported-by: Ben Greear <greearb>
    Signed-off-by: Johannes Berg <johannes.berg>
    Signed-off-by: John W. Linville <linville>
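
The trace suggests that the timer handler itself ends up in `del_timer_sync()`, which waits for a running handler to finish. The following toy model in Python illustrates why that can never complete. It is a sketch under that assumption, not kernel code: `ToyTimer`, `max_spins`, and the two handler functions are hypothetical stand-ins, and the spin limit exists only so the example terminates instead of hanging.

```python
class ToyTimer:
    """Toy stand-in for a kernel timer (hypothetical, not the kernel API)."""

    def __init__(self, handler):
        self.handler = handler
        self.handler_running = False

    def fire(self):
        # Models the timer softirq invoking the handler.
        self.handler_running = True
        try:
            self.handler(self)
        finally:
            self.handler_running = False

    def del_timer_sync(self, max_spins=10_000):
        # Models "wait until the handler is no longer running".
        # The real kernel would spin forever; we give up after max_spins.
        spins = 0
        while self.handler_running:
            spins += 1
            if spins >= max_spins:
                raise RuntimeError("hard lockup: handler waits for itself")
        return spins


def stop_tx_ba_session(timer):
    # Stand-in for ___ieee80211_stop_tx_ba_session deleting the timer.
    timer.del_timer_sync()


def addba_resp_timer_expired(timer):
    # Stand-in for sta_addba_resp_timer_expired stopping the session.
    stop_tx_ba_session(timer)


timer = ToyTimer(addba_resp_timer_expired)

# Called from outside the handler, the wait returns immediately.
spins_outside = timer.del_timer_sync()

# Called from inside the handler, the wait can never be satisfied.
try:
    timer.fire()
    outcome = "completed"
except RuntimeError as exc:
    outcome = str(exc)

print(spins_outside, outcome)
```

In this model, the same call is harmless from any other context and fatal only when it is reached from the timer's own handler, which matches a freeze that leaves nothing in /var/log/messages.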

Comment 1 Stanislaw Gruszka 2011-01-06 14:01:25 UTC

Created attachment 472059
0001-mac80211-fix-addba_resp_timer-hard-lockup.patch

Thank you for the good bug report. Here is a proposed patch; let me know if it fixes the problem. A kernel build with the patch is here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2704610

Comment 2 Stanislaw Gruszka 2011-01-07 09:47:36 UTC

Any news on the above?

Comment 3 Mathieu Chouquet-Stringer 2011-01-07 09:51:30 UTC

Hello,

I've downloaded the kernel and installed it, but because I'm not in my usual wifi-crowded environment I can't say for sure that it works (though I think it will)...

I should be able to make sure on Sunday night.

Comment 4 Mathieu Chouquet-Stringer 2011-01-10 11:19:39 UTC

Ok,

It looks like we have a case of "it works for me": my laptop doesn't hang anymore.

Thanks for the quick turnaround!

Comment 5 Stanislaw Gruszka 2011-01-10 12:39:50 UTC

(In reply to comment #4)
> It looks like we have a case of "it works for me": my laptop doesn't hang
> anymore.

Since you are the bug reporter, that means the patch fixes the bug :-)

Comment 6 Stanislaw Gruszka 2011-01-12 14:07:39 UTC

Applied in the Fedora kernel:
http://koji.fedoraproject.org/koji/buildinfo?buildID=213595

Comment 7 Lebenskuenstler 2011-01-19 10:10:58 UTC

I also had many hard freezes on a Dell E4300 with the iwlagn driver for the Intel 5300 WLAN card.

I did a fresh install, but after one day the freeze recurred.

Now testing the new -77 kernel; looks good so far, will give further feedback.

Comment 8 Fedora Update System 2011-02-07 13:35:53 UTC

kernel-2.6.35.11-83.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.11-83.fc14

Comment 9 Fedora Update System 2011-02-10 21:26:30 UTC

kernel-2.6.35.11-83.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
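
To check whether a given Fedora 14 kernel already contains the fix, its version can be compared against the "Fixed In Version" value of this bug, kernel-2.6.35.10-77.fc14. The Python sketch below is a simplified illustration: `kernel_key` and `has_fix` are made-up helper names, and the comparison handles only plain numeric versions within one Fedora release, not rpm's full version-ordering rules.

```python
import re

# "Fixed In Version" as recorded in this bug report.
FIXED_IN = "kernel-2.6.35.10-77.fc14"


def kernel_key(nvr):
    """Turn e.g. '2.6.35.10-74.fc14.x86_64' into a sortable tuple.

    Simplified: assumes purely numeric version and release fields.
    """
    m = re.match(r"(?:kernel-)?([\d.]+)-(\d+)\.fc\d+", nvr)
    if m is None:
        raise ValueError(f"unrecognised kernel version string: {nvr!r}")
    version = tuple(int(part) for part in m.group(1).split("."))
    release = int(m.group(2))
    return (version, release)


def has_fix(nvr, fixed_in=FIXED_IN):
    """True if `nvr` is at least as new as the fixed build."""
    return kernel_key(nvr) >= kernel_key(fixed_in)


for candidate in ("2.6.35.10-74.fc14.x86_64",   # version from the original report
                  "2.6.35.10-77.fc14",          # first fixed build
                  "2.6.35.11-83.fc14"):         # build pushed to stable
    print(candidate, has_fix(candidate))
```

On a running system the string to test would typically come from `uname -r`.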

Comment 10 Stanislaw Gruszka 2011-02-13 20:53:52 UTC

*** Bug 676196 has been marked as a duplicate of this bug. ***
