Bug 811142

Summary:

[abrt] kernel: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u:6:91]

Product:

[Fedora] Fedora

Reporter:

Rolf Offermanns <rolf.offermanns>

Component:

kernel

Assignee:

John W. Linville <linville>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Priority:

unspecified

CC:

emilis, gansalmon, hongfengwbw, itamar, jonathan, kernel-maint, madhu.chinakonda, shafi.wireless

Hardware:

x86_64

OS:

Unspecified

Whiteboard:

abrt_hash:75be5411b084a8d6a5755a53b0282cf9f8d50f22

Doc Type:

Bug Fix

Last Closed:

2012-09-04 17:50:18 UTC

Attachments:

Description	Flags
possible suspicious busy loop and fixing that	none
lspci -vvvvxxxx output	none
softlockup fix/debug patch with WARN_ON	none
v2 patch for debugging softlockup	none
/var/log/messages	none
do chip reset if PLL4 measurement done is not set for long time	none
upstreamed fix for softlockup	none
backported fix for this issue	none

Description Rolf Offermanns 2012-04-10 09:20:23 UTC

libreport version: 2.0.8
abrt_version:   2.0.7
cmdline:        BOOT_IMAGE=/vmlinuz-3.3.1-3.fc16.x86_64 root=/dev/mapper/vg_roflap-lv_root ro rd.md=0 rd.lvm.lv=vg_roflap/lv_root KEYTABLE=de-latin1 quiet SYSFONT=latarcyrheb-sun16 rhgb rd.luks=0 rd.lvm.lv=vg_roflap/lv_swap LANG=en_US.UTF-8 rd.dm=0 2
comment:        happens on every boot with new (3.3.1) kernel.
kernel:         3.3.1-3.fc16.x86_64
reason:         BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u:6:91]
time:           Tue 10 Apr 2012 11:02:21 AM CEST

backtrace:
:BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u:6:91]
:Modules linked in: rfcomm bnep be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek uvcvideo videobuf2_core videodev snd_hda_intel media snd_hda_codec v4l2_compat_ioctl32 videobuf2_vmalloc videobuf2_memops btusb arc4 ath9k snd_hwdep bluetooth snd_seq snd_seq_device snd_pcm mac80211 ath9k_common ath9k_hw snd_timer ath snd sony_laptop r8169 joydev cfg80211 iTCO_wdt soundcore iTCO_vendor_support mii i2c_i801 microcode rfkill snd_page_alloc vhost_net macvtap macvlan tun virtio_net nfsd lockd nfs_acl kvm_intel kvm auth_rpcgss uinput sunrpc sdhci_pci sdhci mmc_core firewire_ohci firewire_core crc_itu_t nouveau ttm drm_kms_helper drm i2c_core mxm_wmi video wmi [last unloaded: scsi_wait_scan]
:CPU 5 
:Modules linked in: rfcomm bnep be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek uvcvideo videobuf2_core videodev snd_hda_intel media snd_hda_codec v4l2_compat_ioctl32 videobuf2_vmalloc videobuf2_memops btusb arc4 ath9k snd_hwdep bluetooth snd_seq snd_seq_device snd_pcm mac80211 ath9k_common ath9k_hw snd_timer ath snd sony_laptop r8169 joydev cfg80211 iTCO_wdt soundcore iTCO_vendor_support mii i2c_i801 microcode rfkill snd_page_alloc vhost_net macvtap macvlan tun virtio_net nfsd lockd nfs_acl kvm_intel kvm auth_rpcgss uinput sunrpc sdhci_pci sdhci mmc_core firewire_ohci firewire_core crc_itu_t nouveau ttm drm_kms_helper drm i2c_core mxm_wmi video wmi [last unloaded: scsi_wait_scan]
:Pid: 91, comm: kworker/u:6 Not tainted 3.3.1-3.fc16.x86_64 #1 Sony Corporation VPCF23A9E/VAIO
:RIP: 0010:[<ffffffff812cd28a>]  [<ffffffff812cd28a>] delay_tsc+0x3a/0x80
:RSP: 0018:ffff8801a1aa7d80  EFLAGS: 00000293
:RAX: 00000026faf493e5 RBX: 0000000000000000 RCX: 00000000faf493e5
:RDX: 000000000001cd7c RSI: 000000000001618c RDI: 000000000003596f
:RBP: ffff8801a1aa7da0 R08: 00000000000186a0 R09: 0000000000000007
:R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000007000
:R13: ffff8801a1aa7da0 R14: ffff8801a2b08000 R15: ffff8801a2b08000
:FS:  0000000000000000(0000) GS:ffff8801af4a0000(0000) knlGS:0000000000000000
:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
:CR2: 00000035d592df30 CR3: 0000000001c05000 CR4: 00000000000406e0
:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
:Process kworker/u:6 (pid: 91, threadinfo ffff8801a1aa6000, task ffff8801a1ab1730)
:Stack:
: ffff8801a2b08000 ffff8801a1cb1d60 ffff8801a34bd400 ffffffffa0517490
: ffff8801a1aa7db0 ffffffff812cd1e8 ffff8801a1aa7dd0 ffffffffa03bae7a
: ffff8801a34bd400 ffff8801a1cb5828 ffff8801a1aa7e00 ffffffffa05174eb
:Call Trace:
: [<ffffffffa0517490>] ? ath_hw_check+0xe0/0xe0 [ath9k]
: [<ffffffff812cd1e8>] __const_udelay+0x28/0x30
: [<ffffffffa03bae7a>] ar9003_get_pll_sqsum_dvc+0x4a/0x80 [ath9k_hw]
: [<ffffffffa05174eb>] ath_hw_pll_work+0x5b/0xe0 [ath9k]
: [<ffffffff810744fe>] process_one_work+0x11e/0x470
: [<ffffffff8107530f>] worker_thread+0x15f/0x360
: [<ffffffff810751b0>] ? manage_workers+0x230/0x230
: [<ffffffff81079af3>] kthread+0x93/0xa0
: [<ffffffff815fd3a4>] kernel_thread_helper+0x4/0x10
: [<ffffffff81079a60>] ? kthread_freezable_should_stop+0x70/0x70
: [<ffffffff815fd3a0>] ? gs_change+0x13/0x13
:Code: 90 65 44 8b 2c 25 38 dc 00 00 41 89 fe 66 66 90 0f ae e8 e8 f9 f0 d4 ff 66 90 41 89 c4 eb 11 66 90 f3 90 65 8b 1c 25 38 dc 00 00 <41> 39 dd 75 20 66 66 90 0f ae e8 e8 d6 f0 d4 ff 66 90 89 c2 44 

smolt_data:
:
:
:General
:=================================
:UUID: a5a82fca-8064-4d98-a1f6-f052e4f88e76
:OS: Fedora release 16 (Verne)
:Default run level: Unknown
:Language: en_US.UTF-8
:Platform: x86_64
:BogoMIPS: 4389.77
:CPU Vendor: GenuineIntel
:CPU Model: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz
:CPU Stepping: 7
:CPU Family: 6
:CPU Model Num: 42
:Number of CPUs: 8
:CPU Speed: 2201
:System Memory: 5951
:System Swap: 7999
:Vendor: Sony Corporation
:System: VPCF23A9E C609SURK
:Form factor: Notebook
:Kernel: 3.3.0-8.fc16.x86_64
:SELinux Enabled: 0
:SELinux Policy: targeted
:SELinux Enforce: Unknown
:MythTV Remote: Unknown
:MythTV Role: Unknown
:MythTV Theme: Unknown
:MythTV Plugin: 
:MythTV Tuner: -1
:
:
:Devices
:=================================
:(4480:59442:4173:37001) pci, firewire_ohci, FIREWIRE, FireWire Host Controller
:(4480:57906:4173:37001) pci, None, BASE, N/A
:(32902:7241:4173:37001) pci, None, PCI/ISA, HM65 Express Chipset Family LPC Controller
:(4147:404:4173:37001) pci, xhci_hcd, USB, uPD720200 USB 3.0 Host Controller
:(32902:7188:4173:37001) pci, pcieport, PCI/PCI, 6 Series/C200 Series Chipset Family PCI Express Root Port 3
:(32902:7190:4173:37001) pci, pcieport, PCI/PCI, 6 Series/C200 Series Chipset Family PCI Express Root Port 4
:(32902:7184:4173:37001) pci, pcieport, PCI/PCI, 6 Series/C200 Series Chipset Family PCI Express Root Port 1
:(32902:7186:4173:37001) pci, pcieport, PCI/PCI, 6 Series/C200 Series Chipset Family PCI Express Root Port 2
:(4318:3572:4173:37001) pci, nouveau, VIDEO, GF106 [GeForce GT 555M SDDR3]
:(32902:7200:4173:37001) pci, snd_hda_intel, MULTIMEDIA, 6 Series/C200 Series Chipset Family High Definition Audio Controller
:(32902:7202:4173:37001) pci, None, SERIAL, 6 Series/C200 Series Chipset Family SMBus Controller
:(4480:59427:4173:37001) pci, sdhci-pci, BASE, N/A
:(5772:50:4187:57412) pci, ath9k, NETWORK, AR9485 Wireless Network Adapter
:(4332:33128:4173:37001) pci, r8169, ETHERNET, RTL8111/8168B PCI Express Gigabit Ethernet controller
:(32902:7171:4173:37001) pci, ahci, STORAGE, 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller
:(32902:7213:4173:37001) pci, ehci_hcd, USB, 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2
:(32902:7206:4173:37001) pci, ehci_hcd, USB, 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1
:(32902:260:4173:37001) pci, None, HOST/PCI, 2nd Generation Core Processor Family DRAM Controller
:(32902:7226:4173:37001) pci, None, SIMPLE, 6 Series/C200 Series Chipset Family MEI Controller #1
:(4318:3050:4173:37001) pci, snd_hda_intel, MULTIMEDIA, GF108 High Definition Audio Controller
:(32902:257:4173:37001) pci, pcieport, PCI/PCI, Xeon E3-1200/2nd Generation Core Processor Family PCI Express Root Port
:
:
:Filesystem Information
:=================================
:device mtpt type bsize frsize blocks bfree bavail file ffree favail
:-------------------------------------------------------------------
:/dev/mapper/vg_roflap-lv_root / ext4 4096 4096 13092922 8825587 8694553 3276800 3020746 3020746
:/dev/sda5 /boot ext4 1024 1024 508745 275382 249782 128016 127733 127733
:/dev/mapper/vg_roflap-lv_home /home ext4 4096 4096 112730615 87371638 81728989 28213248 26453305 26453305
:

Comment 1 Mohammed Shafi 2012-04-10 14:50:59 UTC

Please provide your lspci -vvvvxxxx output.

Also, can you please try the attached quick fix to help identify the issue?

Comment 2 Mohammed Shafi 2012-04-10 14:51:48 UTC

Created attachment 576487 [details]
possible suspicious busy loop and fixing that

Comment 3 Rolf Offermanns 2012-04-10 17:48:47 UTC

Created attachment 576533 [details]
lspci -vvvvxxxx output

Comment 4 Rolf Offermanns 2012-04-10 19:12:48 UTC

The patch works for me. The system boots up and wifi is working.
Thanks!

Comment 5 Mohammed Shafi 2012-04-11 07:53:30 UTC

(In reply to comment #4)
> The patch works for me. The system boots up and wifi is working.
> Thanks!

Thanks a lot for verifying this, so that's the issue. We still have to find the root cause. Otherwise we can go with a fix like this one, which also works on my system. We just need to make sure the chip does not go into some unstable state, because the function itself is a sort of workaround for an rx hang.

Comment 6 Rolf Offermanns 2012-04-11 09:08:48 UTC

You are welcome. I will observe the system during the day (next 8 hours) while I am using it.

Unfortunately the ath9k driver was not very usable for me on this machine up until now. I think I am hit by Bug#736435, e.g. with kernel 3.3.0 wifi would stop working after some time (minutes to hours, maybe depending on the traffic) with messages like this in my syslog:

[ 2752.626166] ath: Failed to stop TX DMA, queues=0x10f!
[ 2752.637237] ath: DMA failed to stop in 10 ms AR_CR=0xffffffff AR_DIAG_SW=0xffffffff DMADBG_7=0xffffffff
[ 2752.637244] ath: Could not stop RX, we could be confusing the DMA engine when we start RX up

Only a reboot helped. But this is only for you as a background note. I will report back tonight if the system became unstable with your patch.

Comment 7 Rolf Offermanns 2012-04-11 09:49:52 UTC

My wifi connection just stopped working. Please find the log below. As mentioned before, this happened in previous kernels, too, although the logs looked different. I cannot say, if this is related to your patch, but I don't think so.

Let me know, if there is something else I can do. I am using a USB wifi adapter for now. :(


[ 3078.673382] wlan0: moving STA 00:04:0e:0a:39:b9 to state 2
[ 3078.673390] wlan0: moving STA 00:04:0e:0a:39:b9 to state 1
[ 3078.673395] wlan0: moving STA 00:04:0e:0a:39:b9 to state 0
[ 3078.687330] cfg80211: Calling CRDA to update world regulatory domain
[ 3078.696615] cfg80211: World regulatory domain updated:
[ 3078.696622] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 3078.696630] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696636] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 3078.696642] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 3078.696647] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696653] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696695] cfg80211: Calling CRDA for country: DE
[ 3078.699937] cfg80211: Regulatory domain changed to country: DE
[ 3078.699939] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 3078.699941] cfg80211:   (2400000 KHz - 2483500 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699943] cfg80211:   (5150000 KHz - 5250000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699944] cfg80211:   (5250000 KHz - 5350000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699946] cfg80211:   (5470000 KHz - 5725000 KHz @ 40000 KHz), (N/A, 2698 mBm)
[ 3079.716756] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3079.916121] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3080.115824] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3080.315600] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3086.660968] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3086.860406] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3087.060191] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3087.259956] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3093.606301] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3093.805770] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3094.005506] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3094.205219] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3094.647170] ath: Failed to stop TX DMA, queues=0x001!
[ 3095.575105] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3095.774630] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3095.974371] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3096.174194] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3102.517312] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3102.716869] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3102.916762] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3103.116545] wlan0: authentication with 00:04:0e:0a:39:b9 timed out

Comment 8 Mohammed Shafi 2012-04-11 10:11:03 UTC

Please try the revert patch from Sujith:
http://comments.gmane.org/gmane.linux.kernel.wireless.general/88543

Comment 9 Mohammed Shafi 2012-04-12 06:15:24 UTC

Hi Rolf,

Please provide as much information as you can about the testing in which you got the soft lockup.
We need to find out why PLL4_MEAS_DONE is not set for some time, which caused the soft lockup. Breaking out of the while loop without PLL4_MEAS_DONE being set may suggest the chip is in an unstable state, and we might observe tx/rx hangs that lead to stress failures. I will also ask internally whether we can get a reliable maximum time limit (something like the one in the patch) for polling the PLL4_MEAS_DONE bit. Thanks.
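
Editorial sketch: the bounded polling idea described in this comment could look roughly like the following user-space C simulation. It is not the attached patch. The bit value, the retry count, the step size, and the `fake_pll` register model are all assumptions made up for illustration; the real driver reads a hardware register and delays between reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for this sketch only; not taken from the driver. */
#define SKETCH_MEAS_DONE   (1u << 3)
#define SKETCH_MAX_TRIES   100
#define SKETCH_STEP_US     100

/* Simulated PLL4 register: reports "done" once more than `ready_after`
 * reads have happened; a negative `ready_after` means it never does. */
struct fake_pll {
    int reads;
    int ready_after;
};

static uint32_t fake_read_pll4(struct fake_pll *p)
{
    p->reads++;
    if (p->ready_after >= 0 && p->reads > p->ready_after)
        return SKETCH_MEAS_DONE;
    return 0;
}

/* Poll for the done bit with an upper bound on the total wait.
 * Returns true if the bit was seen, false if the budget ran out.
 * `waited_us` receives the simulated time spent waiting. */
static bool poll_meas_done(struct fake_pll *p, unsigned *waited_us)
{
    assert(p != NULL && waited_us != NULL);
    *waited_us = 0;
    for (int i = 0; i < SKETCH_MAX_TRIES; i++) {
        if (fake_read_pll4(p) & SKETCH_MEAS_DONE)
            return true;
        /* A real driver would delay SKETCH_STEP_US here. */
        *waited_us += SKETCH_STEP_US;
    }
    return false; /* caller decides what to do about the timeout */
}
```

The point of the bound is that a register that never reports completion costs at most a fixed amount of time (10 ms with these assumed values) instead of spinning until the soft lockup detector fires.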

Comment 10 Mohammed Shafi 2012-04-13 05:31:50 UTC

Created attachment 577225 [details]
softlockup fix/debug patch with WARN_ON

Hi Rolf,

Can you please test with the attached patch and see how frequently you are able to trigger the WARN_ON I introduced in the code? Thanks.

Comment 11 Rolf Offermanns 2012-04-13 07:56:21 UTC

Hi Mohammed,
regarding comment#9. There is not much to tell. The soft lock up happens at  boot time. The system will not go into graphical mode, it will just hang.

I will try the patch in comment#10 and report back. I have applied the patch in comment#8 and it seemed to improve the situation. I did not have any timeouts or disconnections in the few hours I used it so far.

Comment 12 Mohammed Shafi 2012-04-13 08:57:44 UTC

(In reply to comment #11)
> Hi Mohammed,
> regarding comment#9. There is not much to tell. The soft lock up happens at 
> boot time. The system will not go into graphical mode, it will just hang.
> 
> I will try the patch in comment#10 and report back. I have applied the patch in
> comment#8 and it seemed to improve the situation. I did not have any timeouts
> or disconnections in the few hours I used it so far.

Thanks for testing the patch from comment#8. Yes, that is a separate issue. I just ran a bidirectional TCP iperf for around 15 hours and did not see any issue with AR9485, and the bit was cleared within 200 us (observed by adding printks). I have asked internally what PLL4_MEAS_DONE represents. We will wait for an answer; otherwise we will send the patch from comment#10 upstream, provided the WARN_ONs you see during testing are few. Thank you.

Comment 13 Rolf Offermanns 2012-04-13 10:38:52 UTC

Note: There was a 
struct ath_common *common = ath9k_hw_common(ah);

missing in your patch.

When does the ar9003_get_pll_sqsum_dvc() function get called? Only at module initialization? Or during normal usage, too?

Comment 14 Mohammed Shafi 2012-04-13 14:43:25 UTC

(In reply to comment #13)
> Note: There was a 
> struct ath_common *common = ath9k_hw_common(ah);
> 
> missing in your patch.
> 
> When does the ar9003_get_pll_sqsum_dvc() function get called? Only at module
> initialization? Or during normal usage, too?

Oops, sorry. No, it is called periodically, at an interval of HZ/5.
It prevents an rx hang when running stress tests, etc.
I have also attached a v2 patch that compiles ;)
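
Editorial sketch: assuming the HZ/5 interval mentioned here is expressed in jiffies (the usual kernel convention, where HZ jiffies make up one second), the period works out to about 200 ms whatever the tick rate. The helper names below are made up for this illustration and are not kernel APIs.

```c
#include <assert.h>

/* Convert a delay in jiffies to milliseconds for a given tick rate. */
static unsigned sketch_jiffies_to_ms(unsigned jiffies, unsigned hz)
{
    assert(hz > 0);
    return jiffies * 1000u / hz;
}

/* Period of the PLL check, taking the interval to be HZ/5 jiffies. */
static unsigned sketch_pll_period_ms(unsigned hz)
{
    return sketch_jiffies_to_ms(hz / 5, hz);
}
```

With common tick rates of 100, 250, 300, or 1000, `sketch_pll_period_ms` gives 200 in each case, so the function under discussion would run roughly five times per second.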

Comment 15 Mohammed Shafi 2012-04-13 14:44:11 UTC

Created attachment 577370 [details]
v2 patch for debugging softlockup

Comment 16 Rolf Offermanns 2012-04-15 11:51:52 UTC

Hi again. I was not able to trigger the problem at the office, but I am plugged there, most of the time. I have been on wifi for max. 2h in a row. 

However at home today it appeared quite fast. I will attach my /var/log/messages. Check around 13:07. I unloaded and loaded the ath9k module somewhere around 13:45 and my wifi was working again.

Comment 17 Rolf Offermanns 2012-04-15 11:57:41 UTC

Created attachment 577526 [details]
/var/log/messages

Comment 18 Mohammed Shafi 2012-04-17 05:45:19 UTC

(In reply to comment #17)
> Created attachment 577526 [details]
> /var/log/messages

That is quite a large number of WARNINGs triggered. I am thinking of a patch that does a chip reset if this issue occurs; I also need to read a few docs regarding this. Can you please describe your environment and AP configuration, and anything notable about when the issue occurs? I am not able to trigger this issue even after a stress test.

Comment 19 Rolf Offermanns 2012-04-17 08:53:48 UTC

I am not doing anything special when the WARNING is triggered. As I said, I wasn't able to trigger this at my workplace. I will try again on Thursday. As to my home environment: my wifi is protected with WPA2/Personal, 2.4 GHz, channel 6. The wifi setup at the office is the same, maybe on another channel, but security-wise the settings are the same.

One difference is the number of other wifi networks around. At home (WARNING triggered) I have around 10 APs, all crowding the 2.4 GHz space; at work I see only 2. I don't know if this matters.

Comment 20 Mohammed Shafi 2012-04-17 09:16:49 UTC

Hi Rolf,

Thanks a lot for the information. I did run my stress test in a congested environment, but I just found something wrong in the hardware PLL code.
I will attach a proper patch for this, which you can test to see if it helps.
We will keep the chip reset option in case we cannot resolve it any other way.

Comment 21 Mohammed Shafi 2012-04-18 14:56:30 UTC

Unfortunately the h/w code seems to be fine. I need to dig somewhere else and check whether a chip reset helps when this condition occurs.

Comment 22 Mohammed Shafi 2012-04-19 05:41:12 UTC

Hi Rolf,

When these WARNINGs (introduced in my patch) trigger, how does it affect you? Are you disconnected, and does the traffic stall, as the chip may get into some hang state?
Could you also produce logs with debugging enabled, using sudo modprobe ath9k debug=0xffffffff?
http://linuxwireless.org/en/users/Drivers/ath9k/debug
Let me immediately spin a patch that does a chip reset if MEAS_DONE is not set for even 100 * 100 us.

Comment 23 Rolf Offermanns 2012-04-20 06:16:12 UTC

Hi Mohammed,
yes, I am disconnected when the WARNINGS happen. I will try to get debug logs. The day before yesterday I worked the whole day on wifi (>8h) without a single warning and yesterday it happened again many times. I really don't see a pattern here.

Comment 24 Mohammed Shafi 2012-04-23 16:14:15 UTC

(In reply to comment #23)
> Hi Mohammed,
> yes, I am disconnected when the WARNINGS happen. I will try to get debug logs.
> The day before yesterday I worked the whole day on wifi (>8h) without a single
> warning and yesterday it happened again many times. I really don't see a
> pattern here.

Hi Rolf,

Maybe you can check whether you are able to recreate the issue easily, with logs,
using something like this:

while true
do
sudo modprobe -v ath9k debug=0xffffffff
sleep 3
sudo ifconfig wlanX up
sleep 3
sudo iw dev wlanX connect my-ap
sleep 30
sudo modprobe -r ath9k
sleep 3
done

I was out of station for some time. I need to take a closer look at this, along with other QCA developers. I could not find anything in the h/w doc initially.

Comment 25 Mohammed Shafi 2012-04-24 04:48:28 UTC

Rolf,

In addition to all this, could you please try the attached patch, which in any case stops the warnings? We do a chip reset once we hit the state where PLL4 MEAS_DONE is never set.

Comment 26 Mohammed Shafi 2012-04-24 04:50:05 UTC

Created attachment 579744 [details]
do chip reset if PLL4 measurement done is not set for long time

check if chip reset recovers the chip from a PLL measurement being never set
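
Editorial sketch: the recovery path this attachment describes (give up on the measurement after a bounded wait, then reset the chip) could be modelled as below. The 100-try budget follows the "100 * 100 us" figure from comment 22; everything else, including the `fake_chip` structure and the assumption that a reset clears the stuck state, is hypothetical and only serves the illustration. It is not the content of the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; not the driver's real structures. */
struct fake_chip {
    bool meas_done_stuck;   /* true while the measurement never completes */
    unsigned resets;        /* number of chip resets performed so far */
};

enum pll_result { PLL_OK, PLL_TIMEOUT };

/* Poll budget from comment 22: 100 tries, nominally 100 us apart. */
#define MEAS_TRIES 100

static enum pll_result measure_pll(const struct fake_chip *chip)
{
    for (int i = 0; i < MEAS_TRIES; i++) {
        if (!chip->meas_done_stuck)
            return PLL_OK;
        /* Real code would delay about 100 us between reads. */
    }
    return PLL_TIMEOUT;
}

/* Stand-in for a chip reset. Assumption: a reset clears the stuck state. */
static void reset_chip(struct fake_chip *chip)
{
    chip->meas_done_stuck = false;
    chip->resets++;
}

/* One run of the periodic work item: measure, and reset on timeout.
 * Returns true if a reset was needed. */
static bool pll_work_once(struct fake_chip *chip)
{
    assert(chip != NULL);
    if (measure_pll(chip) == PLL_TIMEOUT) {
        reset_chip(chip);
        return true;
    }
    return false;
}
```

Whether a reset really recovers the hardware is exactly what this attachment asks the reporter to check; the model simply assumes that it does.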

Comment 27 Rolf Offermanns 2012-05-02 08:04:41 UTC

Hi Mohammed,

sorry for not answering earlier. I was not able to produce the warning with debugging enabled. However my connection was getting lost anyway. I am not sure if the kernel log contains anything helpful for you. Shall I attach it?

I will try your new chip reset patch now.

Comment 28 Mohammed Shafi 2012-05-03 14:29:30 UTC

(In reply to comment #27)
> Hi Mohammed,
> 
> sorry for not answering earlier. I was not able to produce the warning with
> debugging enabled. However my connection was getting lost anyway. I am not sure
> if the kernel log contains anything helpful for you. Shall I attach it?
> 
> I will try you new chip reset patch now.

Yes, thanks.

Comment 29 Mohammed Shafi 2012-05-03 14:31:31 UTC

I assume the disconnect happens just after we hit the condition/WARNING that avoids the soft lockup. Please see if the chip reset patch recovers the chip reliably without any issues. Thanks!

Comment 30 Mohammed Shafi 2012-05-08 08:09:59 UTC

By accident I found a way to recreate it easily.
Stop your supplicant and network manager;
then sudo ifconfig wlanX up will do.
Please give me some time, as I am a little busy with other work. We will fix this properly.

Comment 31 Mohammed Shafi 2012-05-17 08:49:40 UTC

PLL4 seems to be zero until association; I need to figure out why this is happening.
The PLL4 polling seems to be kicked off when we bring the interface up (ath_set_channel), and that causes the lockup.
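
Editorial sketch: given the observation above, one conceivable guard is to skip the PLL polling until the station is associated. The model below is an assumption for illustration only; the thread does not say how the attached fix handles this, and the `fake_sta` structure and function names are invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface state; field names are invented for this sketch. */
struct fake_sta {
    bool interface_up;
    bool associated;
    unsigned pll_polls;     /* how many times the PLL poll actually ran */
};

/* Only poll once the register can be expected to hold a measurement. */
static bool should_run_pll_check(const struct fake_sta *sta)
{
    return sta->interface_up && sta->associated;
}

/* Periodic work item: a no-op until the guard condition holds. */
static void periodic_pll_work(struct fake_sta *sta)
{
    assert(sta != NULL);
    if (!should_run_pll_check(sta))
        return;
    sta->pll_polls++;       /* stands in for the real PLL4 measurement */
}
```

Under this model, bringing the interface up without associating (the reproduction recipe from comment 30) never reaches the polling code at all.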

Comment 32 Mohammed Shafi 2012-06-12 15:51:31 UTC

The attached patch fixes the issue against the latest wireless-testing tree.
It may not apply to slightly older trees, as some changes went into wireless-testing.

Comment 33 Mohammed Shafi 2012-06-12 15:52:35 UTC

Created attachment 591228 [details]
upstreamed fix for softlockup

Comment 34 John W. Linville 2012-06-12 19:49:30 UTC

Do you have a version that applies to earlier kernels?

Comment 35 Mohammed Shafi 2012-06-13 16:00:56 UTC

Created attachment 591554 [details]
backported fix for this issue

Backported fix so that it can apply to kernels like 3.4.

Comment 36 Mohammed Shafi 2012-06-13 16:02:14 UTC

(In reply to comment #34)
> Do you have a version that applies to earlier kernels?

Hi John,

Attached! Please let me know if it does not help.

Comment 37 John W. Linville 2012-06-13 18:18:01 UTC

Fedora 16 test kernels w/ the above patch are available here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=4159658

When they finish building, please give them a try and post the results here...thanks!

Comment 38 Josh Boyer 2012-09-04 17:50:18 UTC

Nobody ever tried John's test kernels and koji has pruned them by now.  Closing this out as fixed since Mohammed was already working from backports and we've rebased to 3.4.  If it still triggers in 3.4/3.5, please reopen.

Comment 39 Josh Boyer 2012-09-18 15:30:24 UTC

*** Bug 813888 has been marked as a duplicate of this bug. ***

Comment 40 Josh Boyer 2012-09-18 15:41:53 UTC

*** Bug 814482 has been marked as a duplicate of this bug. ***