Bug 811142
Summary: | [abrt] kernel: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u:6:91] | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Rolf Offermanns <rolf.offermanns> |
Component: | kernel | Assignee: | John W. Linville <linville> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 16 | CC: | emilis, gansalmon, hongfengwbw, itamar, jonathan, kernel-maint, madhu.chinakonda, shafi.wireless |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Unspecified | ||
Whiteboard: | abrt_hash:75be5411b084a8d6a5755a53b0282cf9f8d50f22 | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2012-09-04 17:50:18 UTC | Type: | --- |
Description
Rolf Offermanns
2012-04-10 09:20:23 UTC
Please provide your lspci -vvvvxxxx output. Can you also try the attached quick fix to help identify the issue?
Created attachment 576487 [details]
possible suspicious busy loop and fixing that
Created attachment 576533 [details]
lspci -vvvvxxxx output
The patch works for me. The system boots up and wifi is working. Thanks!

(In reply to comment #4)
> The patch works for me. The system boots up and wifi is working.
> Thanks!

Thanks a lot for verifying this, so that's the issue. We have to find the root cause. Otherwise we can have a fix like this, which works with my system. We just need to make sure the chip does not go into some unstable state, because the function itself is some sort of workaround for an rx hang.

You are welcome. I will observe the system during the day (next 8 hours) while I am using it. Unfortunately the ath9k driver was not very usable for me on this machine up until now. I think I am hit by Bug#736435, e.g. with kernel 3.3.0 wifi would stop working after some time (minutes to hours, maybe depending on the traffic) with messages like this in my syslog:

[ 2752.626166] ath: Failed to stop TX DMA, queues=0x10f!
[ 2752.637237] ath: DMA failed to stop in 10 ms AR_CR=0xffffffff AR_DIAG_SW=0xffffffff DMADBG_7=0xffffffff
[ 2752.637244] ath: Could not stop RX, we could be confusing the DMA engine when we start RX up

Only a reboot helped. But this is only a background note for you. I will report back tonight if the system became unstable with your patch.

My wifi connection just stopped working. Please find the log below. As mentioned before, this happened in previous kernels, too, although the logs looked different. I cannot say if this is related to your patch, but I don't think so. Let me know if there is something else I can do. I am using a USB wifi adapter for now. :(

[ 3078.673382] wlan0: moving STA 00:04:0e:0a:39:b9 to state 2
[ 3078.673390] wlan0: moving STA 00:04:0e:0a:39:b9 to state 1
[ 3078.673395] wlan0: moving STA 00:04:0e:0a:39:b9 to state 0
[ 3078.687330] cfg80211: Calling CRDA to update world regulatory domain
[ 3078.696615] cfg80211: World regulatory domain updated:
[ 3078.696622] cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 3078.696630] cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696636] cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 3078.696642] cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 3078.696647] cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696653] cfg80211: (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 3078.696695] cfg80211: Calling CRDA for country: DE
[ 3078.699937] cfg80211: Regulatory domain changed to country: DE
[ 3078.699939] cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 3078.699941] cfg80211: (2400000 KHz - 2483500 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699943] cfg80211: (5150000 KHz - 5250000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699944] cfg80211: (5250000 KHz - 5350000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[ 3078.699946] cfg80211: (5470000 KHz - 5725000 KHz @ 40000 KHz), (N/A, 2698 mBm)
[ 3079.716756] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3079.916121] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3080.115824] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3080.315600] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3086.660968] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3086.860406] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3087.060191] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3087.259956] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3093.606301] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3093.805770] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3094.005506] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3094.205219] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3094.647170] ath: Failed to stop TX DMA, queues=0x001!
[ 3095.575105] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3095.774630] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3095.974371] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3096.174194] wlan0: authentication with 00:04:0e:0a:39:b9 timed out
[ 3102.517312] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 1)
[ 3102.716869] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 2)
[ 3102.916762] wlan0: authenticate with 00:04:0e:0a:39:b9 (try 3)
[ 3103.116545] wlan0: authentication with 00:04:0e:0a:39:b9 timed out

Please try the revert patch from Sujith: http://comments.gmane.org/gmane.linux.kernel.wireless.general/88543

Hi Rolf, please provide as much information as you can about the testing in which you got your soft lockup. We need to find out why PLL4_MEAS_DONE is not set for some time, which caused the soft lockup. Breaking out of the while loop without PLL4_MEAS_DONE being set may mean the chip is in an unstable state, and we might observe tx/rx hangs that lead to stress failures. I will also ask internally whether we can get a reliable maximum time limit (something like the one in the patch) for polling the PLL4_MEAS_DONE bit. Thanks.

Created attachment 577225 [details]
softlockup fix/debug patch with WARN_ON
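The attached patch itself is not reproduced in this report. As a rough illustration of the idea discussed above (give the PLL4_MEAS_DONE polling loop an upper bound and warn when it is hit), here is a minimal user-space sketch in C. Everything in it is an assumption made for illustration: `fake_reg`, `fake_read()`, `poll_meas_done()` and the bit position `MEAS_DONE_BIT` are hypothetical stand-ins, not ath9k code, and the `fprintf` only mimics what a WARN_ON would report in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MEAS_DONE_BIT (1u << 3) /* hypothetical bit position, for illustration only */

/* Hypothetical stand-in for a hardware register: the done bit reads as set
 * only after `ready_after` reads have already happened. */
struct fake_reg {
    unsigned reads;
    unsigned ready_after;
};

static uint32_t fake_read(struct fake_reg *r)
{
    r->reads++;
    return (r->reads > r->ready_after) ? MEAS_DONE_BIT : 0;
}

/* Poll at most max_polls times instead of looping forever.
 * Returns true if the done bit was seen, false if the limit was reached. */
static bool poll_meas_done(struct fake_reg *r, unsigned max_polls)
{
    assert(max_polls > 0);

    for (unsigned i = 0; i < max_polls; i++) {
        if (fake_read(r) & MEAS_DONE_BIT)
            return true;
        /* A driver would delay here between reads, e.g. udelay(100). */
    }

    /* User-space analogue of a warning: report the timeout, do not hang. */
    fprintf(stderr, "warning: MEAS_DONE not set after %u polls\n", max_polls);
    return false;
}
```

The point of the bound is only that the CPU is released after a fixed number of reads; as noted in the discussion, leaving the loop without the bit being set says nothing about whether the chip is in a sane state.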
Hi Rolf,
can you please test with the attached patch and see how frequently you are able to trigger the WARN_ON I introduced in the code? Thanks.
Hi Mohammed,
regarding comment#9: there is not much to tell. The soft lockup happens at boot time. The system will not go into graphical mode, it will just hang.

I will try the patch in comment#10 and report back. I have applied the patch in comment#8 and it seemed to improve the situation. I did not have any timeouts or disconnections in the few hours I used it so far.

(In reply to comment #11)
> Hi Mohammed,
> regarding comment#9. There is not much to tell. The soft lock up happens at
> boot time. The system will not go into graphical mode, it will just hang.
>
> I will try the patch in comment#10 and report back. I have applied the patch in
> comment#8 and it seemed to improve the situation. I did not have any timeouts
> or disconnections in the few hours I used it so far.

Thanks for testing it. Regarding comment#8, yes, it is a separate issue. I just ran a bidirectional TCP iperf for around 15 hours and did not see any issue with AR9485; the bit was cleared within 200 us (observed by adding printks). I have asked internally what PLL4_MEAS_DONE represents. We will wait for that; otherwise we would send the patch in comment#10 upstream if the WARN_ONs are quite infrequent for you during testing. Thank you.

Note: There was a
struct ath_common *common = ath9k_hw_common(ah);
missing in your patch.

When does the ar9003_get_pll_sqsum_dvc() function get called? Only at module initialization? Or during normal usage, too?

(In reply to comment #13)
> Note: There was a
> struct ath_common *common = ath9k_hw_common(ah);
>
> missing in your patch.
>
> When does the ar9003_get_pll_sqsum_dvc() function get called? Only at module
> initialization? Or during normal usage, too?

Oops, sorry. No, it is called periodically, at an interval of HZ/5 (about 200 ms). It prevents rx hangs when running stress tests etc. I have also attached the v2 patch, which compiles ;)

Created attachment 577370 [details]
v2 patch for debugging softlockup
Hi again. I was not able to trigger the problem at the office, but I am plugged in there most of the time. I have been on wifi for max. 2h in a row. However, at home today it appeared quite fast. I will attach my /var/log/messages. Check around 13:07. I unloaded and reloaded the ath9k module somewhere around 13:45 and my wifi was working again.
Created attachment 577526 [details]
/var/log/messages
(In reply to comment #17)
> Created attachment 577526 [details]
> /var/log/messages

That seems to be quite a large number of WARNINGs triggered. I am thinking of a patch that does a chip reset if this issue occurs; I also need to read a few docs regarding this. Can you please describe your environment and AP configuration, and anything notable about when the issue occurs? I am not able to trigger this issue even after a stress test.

I am not doing anything special when the WARNING is triggered. As I said, I wasn't able to trigger this at my workplace. I will try again on Thursday. As to my home environment: my wifi is protected with WPA2/Personal, 2.4 GHz channel 6. The wifi setup at the office is the same, maybe another channel, but security-wise, same settings. One difference is the number of other wifi networks around. At home (WARNING triggered) I have around 10 APs, all crowding the 2.4 GHz space; at work I see only 2. I don't know if this matters.

Hi Rolf, thanks a lot for your information. I did run my stress test in a congested environment, but I just found something wrong in the hardware PLL code. I will attach a proper patch for this which you can test to see if it helps. We will keep the chip reset option in case we cannot resolve it any other way.

Unfortunately the h/w code seems to be fine. I need to dig somewhere else and check whether a chip reset helps if this condition occurs.

Hi Rolf, when these WARNINGs (introduced in my patch) trigger, how does it affect you? Are you disconnected, and does the traffic stall, as the chip may get into some hang state? Could you also produce logs with debug enabled:
sudo modprobe ath9k debug=0xffffffff
http://linuxwireless.org/en/users/Drivers/ath9k/debug
Let me immediately spin a patch doing a chip reset if MEAS_DONE is not set even after 100 * 100 us.

Hi Mohammed, yes, I am disconnected when the WARNINGs happen. I will try to get debug logs. The day before yesterday I worked the whole day on wifi (>8h) without a single warning and yesterday it happened again many times. I really don't see a pattern here.

(In reply to comment #23)
> Hi Mohammed,
> yes, I am disconnected when the WARNINGS happen. I will try to get debug logs.
> The day before yesterday I worked the whole day on wifi (>8h) without a single
> warning and yesterday it happened again many times. I really don't see a
> pattern here.

Hi Rolf, maybe you can see whether you are able to recreate the issue more easily, with logs, using something like this:

while true
do
    sudo modprobe -v ath9k debug=0xffffffff
    sleep 3
    sudo ifconfig wlanX up
    sleep 3
    sudo iw dev wlanX connect my-ap
    sleep 30
    sudo modprobe -r ath9k
    sleep 3
done

I was out of station for some time. I need to take a closer look at this, together with other QCA developers. I could not find anything initially in the h/w doc.

Rolf, in addition to all this, could you please try the attached patch, which in any case stops the warnings? We do a chip reset once we hit the state where PLL4 MEAS_DONE is never set.

Created attachment 579744 [details]
do chip reset if PLL4 measurement done is not set for long time
check if chip reset recovers the chip from a PLL measurement being never set
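The patch contents are not shown in this report, so the following is only a sketch of the behaviour described in the comments: poll for the done bit up to 100 times with a 100 us delay (the "100 * 100 us" mentioned above, 10 ms in total) and request a chip reset if it never appears. The `fake_chip` structure, `meas_done()`, `chip_reset()` and `pll_check()` are hypothetical names for a user-space model, and the delay is only accounted for in a counter rather than actually waited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    POLL_LIMIT = 100,    /* number of polls, per the "100 * 100 us" in the thread */
    POLL_DELAY_US = 100  /* delay between polls */
};

/* Hypothetical chip model for illustration. */
struct fake_chip {
    bool meas_done_works; /* false models a chip whose done bit never sets */
    unsigned resets;      /* how many resets were requested */
    unsigned waited_us;   /* total time spent polling */
};

static bool meas_done(const struct fake_chip *c)
{
    return c->meas_done_works;
}

/* In this model a reset is only counted; whether a real reset recovers
 * the chip is exactly what the patch was meant to find out. */
static void chip_reset(struct fake_chip *c)
{
    c->resets++;
}

/* Returns true if the measurement completed, false if a reset was requested. */
static bool pll_check(struct fake_chip *c)
{
    assert(c != NULL);

    for (int i = 0; i < POLL_LIMIT; i++) {
        if (meas_done(c))
            return true;
        c->waited_us += POLL_DELAY_US; /* stands in for udelay(POLL_DELAY_US) */
    }

    chip_reset(c);
    return false;
}
```

With a chip whose done bit never sets, `pll_check()` gives up after 10000 us of modelled polling and records one reset, instead of spinning until the soft lockup detector fires.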
Hi Mohammed,

sorry for not answering earlier. I was not able to produce the warning with debugging enabled. However, my connection was getting lost anyway. I am not sure if the kernel log contains anything helpful for you. Shall I attach it?

I will try your new chip reset patch now.

(In reply to comment #27)
> Hi Mohammed,
>
> sorry for not answering earlier. I was not able to produce the warning with
> debugging enabled. However my connection was getting lost anyway. I am not sure
> if the kernel log contains anything helpful for you. Shall I attach it?
>
> I will try you new chip reset patch now.

Yes, thanks. I assume the disconnect happens just after we hit the condition/WARNING that avoids the soft lockup. Please see if the chip reset patch recovers the chip reliably without any issues. Thanks!

By accident I found a way to recreate it easily: stop your supplicant and network manager, then
sudo ifconfig wlanX up
will do. Please give me some time, as I am a little busy with other work. We will fix this properly.

PLL4 seems to be zero until association; I need to figure out why this is happening.

The PLL4 polling seems to be kicked off when we bring the interface up (ath_set_channel), causing the lockup. The attached patch fixes the issue against the latest wireless-testing tree. It may not apply to slightly older trees, as some changes went into wireless-testing.

Created attachment 591228 [details]
upstreamed fix for softlockup
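The upstreamed patch is not quoted in this report, so its exact change is not known from the thread. One way to act on the observation above (the done bit stays zero until association, yet polling starts as soon as the interface is brought up) would be to skip the periodic check while the device is not associated. The sketch below models only that idea in user-space C; `fake_dev`, `pll_result` and `periodic_pll_check()` are hypothetical names and should not be read as the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model for illustration. */
struct fake_dev {
    bool associated; /* has the station associated with an AP? */
    bool meas_done;  /* would the done bit read as set? */
    unsigned polls;  /* number of register polls performed */
};

enum pll_result { PLL_SKIPPED, PLL_OK, PLL_TIMEOUT };

/* Periodic check: do nothing before association, otherwise poll with a bound. */
static enum pll_result periodic_pll_check(struct fake_dev *d, unsigned max_polls)
{
    assert(d != NULL);

    /* Before association the done bit is reported to stay zero,
     * so polling for it here could only time out (or spin). */
    if (!d->associated)
        return PLL_SKIPPED;

    for (unsigned i = 0; i < max_polls; i++) {
        d->polls++;
        if (d->meas_done)
            return PLL_OK;
    }
    return PLL_TIMEOUT;
}
```

Keeping the poll bounded even in the associated case means a misbehaving chip still cannot stall the worker indefinitely.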
Do you have a version that applies to earlier kernels?
Created attachment 591554 [details]
backported fix for this issue
back ported fix so that it can apply in kernels like 3.4
(In reply to comment #34)
> Do you have a version that applies to earlier kernels?

Hi John, attached! Please let me know if it does not help.

Fedora 16 test kernels w/ the above patch are available here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=4159658
When they finish building, please give them a try and post the results here...thanks!

Nobody ever tried John's test kernels and koji has pruned them by now. Closing this out as fixed since Mohammed was already working from backports and we've rebased to 3.4. If it still triggers in 3.4/3.5, please reopen.

*** Bug 813888 has been marked as a duplicate of this bug. ***
*** Bug 814482 has been marked as a duplicate of this bug. ***