| Summary: | [abrt] kernel: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:3639] |
|---|---|
| Product: | [Fedora] Fedora |
| Reporter: | Terry Wallwork <terrywallwork> |
| Component: | kernel |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED NOTABUG |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified |
| Priority: | unspecified |
| Version: | 16 |
| CC: | chapelhilllaptopshop, gansalmon, itamar, jcp, jonathan, kernel-maint, larry.finger, madhu.chinakonda, pavelmialko, sgruszka, vjain02 |
| Hardware: | x86_64 |
| OS: | Unspecified |
| Whiteboard: | abrt_hash:6a4bd8d33a0cff02739ed7b1040293ac92def243 |
| Doc Type: | Bug Fix |
| Last Closed: | 2012-02-29 20:34:48 UTC |
Description
Terry Wallwork
2011-11-15 11:11:02 UTC
*** Bug 756140 has been marked as a duplicate of this bug. ***

Please try the following (as root):

modprobe -rv rtl8192ce
modprobe -v rtl8192ce ips=0

The above will disable power save in the interface, which seems to be indicated. Does it help?

(In reply to comment #2)
> Please try the following (as root):
>
> modprobe -rv rtl8192ce
> modprobe -v rtl8192ce ips=0
>
> The above will disable power save in the interface, which seems to be
> indicated. Does it help?

Hi Larry,

Many thanks for the help. The first line killed the wifi, and the second line just froze without any result. Yesterday I tried:

iw dev wlan0 set power_save off

The problem does not seem to have appeared again, but who knows...

I just updated a desktop/server system to Fedora 16 over the weekend and I am seeing this exact (and very annoying) problem. It uses wired ethernet for the network connection. In contrast to the above user, my system has absolutely NO wifi hardware of any sort in it.

In one case, I was just editing a file (from vim running in an xterm), and it just froze solid as a rock, no response of any sort. What was odd was that I could open another xterm without any problems, and started "top" to see a kworker process taking 95-100% of one CPU. It took perhaps 2-3 minutes before the editor session became responsive again.

Just this morning, I sat down in front of the system, which had been up overnight but not doing anything of particular significance. I tried typing a command in an xterm window that was already open from the previous evening and it was already totally unresponsive. Once again, I was able to open another xterm without difficulty and found a kworker process running at close to 100% CPU utilization.

I saw the problem with the kernel that Anaconda installed from the full DVD installer, and with the latest kernel that was installed after doing a "yum update". More specifically:

kernel-3.1.0-7.fc16.x86_64
kernel-3.1.1-2.fc16.x86_64

The system is a nettop-style system I built myself with an Intel motherboard and dual-core Atom processor (330, if memory serves me correctly). Let me know what sort of detailed information would be helpful.

Just an FYI. What little searching I have done suggests this problem was observed with the 2.6.x kernels, mostly from the Ubuntu community. I found a discussion of kworker threads "going bonkers" on the kernel mailing list:

https://lkml.org/lkml/2011/3/30/836

I was able to collect statistics with perf as recommended in the post above. I'm not that familiar with the kernel internals, so I am not sure what to make of it, but I see lots of time spent in ACPI stuff. A trimmed version of the output from "perf report --stdio" is attached to avoid spamming the comments here.

As I mentioned above, there is no wifi networking hardware in this (nettop) box. There was mention of this being associated with Intel video, which this system does have. Trimmed output from lspci:

00:02.0 VGA compatible controller: Intel Corporation 82945G/GZ Integrated Graphics Controller (rev 02)

Mind you, I'm not ranting here, but this bug is a true show stopper for me. I did a fresh reboot before turning in last night. The system was doing nothing of particular significance. When I checked it this morning, it had been up just shy of 14 hours, while one kworker thread had already consumed over 6 hours of CPU time!

% uptime
10:16:57 up 13:55, 4 users, load average: 2.74, 1.72, 1.33
% ps aux | grep kworker
root     5  0.0  0.0  0  0 ?  S  Nov25    0:00 [kworker/u:0]
root    11  0.0  0.0  0  0 ?  S  Nov25    0:00 [kworker/0:1]
root    40  0.0  0.0  0  0 ?  S  Nov25    0:00 [kworker/u:2]
root    44 44.2  0.0  0  0 ?  R  Nov25  369:39 [kworker/0:2]
root  2309  0.0  0.0  0  0 ?  S  02:52    0:00 [kworker/0:0]
root  2466  0.4  0.0  0  0 ?  S  08:08    0:31 [kworker/1:1]
root  2522  0.0  0.0  0  0 ?  S  10:06    0:00 [kworker/1:0]
root  2529  0.0  0.0  0  0 ?  S  10:14    0:00 [kworker/1:2]

Created attachment 536902 [details]
Output of "perf report --stdio"
Waited for the kworker thread to go "cuckoo for cocoa puffs" then from another xterm;
perf record -ag sleep 180
perf report --stdio
John, did you try kernel-debug? Perhaps it will print more detailed information about where the problem is.

To track Terry's problem with the rtl8192ce driver I'm now using bug 755154, and I am reassigning this one back to kernel-maint since John's problem is not wireless related.

*** Bug 756916 has been marked as a duplicate of this bug. ***

Is there significance in the 22 seconds? I have 82 soft lockups in my logs this morning; they're almost all 22 seconds, with a few that are 23. The events are spaced exactly 28 seconds apart...and I do mean exactly.

Dec 9 09:07:23 studio kernel: [21852.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:07:51 studio kernel: [21880.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:08:19 studio kernel: [21908.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:08:47 studio kernel: [21936.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:09:15 studio kernel: [21964.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:09:43 studio kernel: [21992.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]

(3.2.0-0.rc4.git5.1.fc17.x86_64)

(In reply to comment #9)
> Is there significance in the 22 seconds? I have 82 soft lockups in my logs this
> morning, they're almost all 22 seconds with a few that are 23. The events are
> spaced exactly 28 seconds apart...and I do mean exactly.

There is a timer ticking in the kernel to detect soft lockups. It ticks at (watchdog_thresh * 2) seconds. I believe the default value of watchdog_thresh is 10 now. So every 20 seconds or so a high priority thread is supposed to write to a variable that indicates the watchdog is able to run stuff on this CPU. If it doesn't get scheduled, then you get this warning. Being highly regular is expected in the case where something is hogging the CPU.

Created attachment 545442 [details]
Output of perf top on kswapd thread while in a soft-lockup loop
I don't know if this is in any way useful..
currently running 3.2.0-0.rc4.git5.1.fc17.x86_64
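The regular spacing reported for the kswapd0 lockups can be checked directly from the kernel timestamps in the log lines. The following is only a minimal Python sketch, assuming the six lines quoted in this thread are pasted in as text; the function name `lockup_intervals` and the regular expression are made up for illustration and are not part of any kernel or Fedora tool.

```python
import re
from decimal import Decimal

# Log lines copied from the comment above (hostname "studio").
LOG = """\
Dec 9 09:07:23 studio kernel: [21852.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:07:51 studio kernel: [21880.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:08:19 studio kernel: [21908.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:08:47 studio kernel: [21936.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:09:15 studio kernel: [21964.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
Dec 9 09:09:43 studio kernel: [21992.192997] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:27]
"""

# Hypothetical pattern: captures the seconds-since-boot timestamp in brackets
# that precedes each soft lockup message.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\]\s+BUG: soft lockup")


def lockup_intervals(text):
    """Return the gaps (in seconds) between consecutive soft lockup reports."""
    # Decimal avoids floating-point noise when subtracting the timestamps.
    stamps = [Decimal(m.group(1)) for m in PATTERN.finditer(text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


intervals = lockup_intervals(LOG)
for gap in intervals:
    print(f"{gap} s between reports")
```

For the six lines quoted in this thread, every interval comes out as 28 seconds, which matches the reporter's observation. The same function could be pointed at the contents of a full log file to see whether the other lockups are spaced as evenly.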
The patch is not in mainline until 3.2-rc5. If you still get lockups with that version, then please post again.

John, your issue seems to be related to ACPI. Derek, in your case this seems to be an i915 driver issue. Please install kernel-debug; it should print some more verbose information about where the problem is.

Sorry for not following up sooner. I think I have diagnosed my problem. I did not notice it right away, but there were some entries in /var/log/messages that seem to confirm my problem is related to ACPI. Entries like the ones below appear at the same time that kworker goes bonkers:

Nov 21 06:32:51 octagon kernel: [41779.816616] ACPI: While loop taking a really long time. loop_count=0xfff
[.... snip ....]
Nov 21 06:41:46 octagon kernel: [42315.419942] ACPI: While loop taking a really long time. loop_count=0xffff00
Nov 21 06:41:46 octagon kernel: [42315.428453] ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPC_.SMBR] (Node ffff88007bb44f28), AE_AML_INFINITE_LOOP (20110623/psparse-536)
Nov 21 06:41:46 octagon kernel: [42315.428487] ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPC_.INIT] (Node ffff88007bb44f00), AE_AML_INFINITE_LOOP (20110623/psparse-536)
Nov 21 06:41:46 octagon kernel: [42315.428509] ACPI Error: Method parse/execution failed [\_GPE._L00] (Node ffff88007bb402d0), AE_AML_INFINITE_LOOP (20110623/psparse-536)
Nov 21 06:41:46 octagon kernel: [42315.428538] ACPI Exception: AE_AML_INFINITE_LOOP, while evaluating GPE method [_L00] (20110623/evgpe-560)

In hunting around to see if others have this problem, I discovered that many of the people reporting it have the same system board as I have, the Intel 945GCLF2. To make a long story short, it appears this board has an unfixed ACPI bug. Upon learning of this, I downloaded and flashed the latest BIOS from Intel, but that unfortunately did NOT fix anything, same drill.

Although I'm not really 100% certain this is the issue, it seemed to be pretty likely. It was reported by Intel 945GCLF2 owners using other Linux distributions, and even by users of other operating systems such as FreeBSD, so that was good enough for me:

http://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg27069.html
https://bugzilla.novell.com/show_bug.cgi?id=689848
http://forums.freebsd.org/showpost.php?s=c7ec091918772edc6ccac2448feadc3d&p=62755&postcount=15

As I mentioned earlier, the problem really made the system unusable. So, I just installed an Intel D525MW board (the successor product, about $115 with memory) and everything works fine now.

I thought Intel had mostly eliminated problems like this with all the time and money they sunk into using Formal Methods for hardware and software verification after the infamous FDIV bug back in the 1990s. Apparently not.

Are all the issues associated with this bug satisfied by either getting a new mainboard, or by the fix to rtlwifi? If so, then the bug should be closed.

In retrospect, the problem I was having was completely different from that of the person who filed the original bug report (Terry Wallwork). As I noted above, my system had/has no wifi of any kind in it. I posted here for lack of a more precise place to report it. If I understood correctly (comment 7 above), Terry's problem is now being tracked by bug 755154.

Yes, that bug is fixed, at least in mainline. It is my understanding that a kernel update has been issued by Fedora.

This bug was a hardware issue, and the original reporter has their issue tracked by a different bug. Closing.