Bug 749909 - kworkers consume lots of CPU (ath5k)
Summary: kworkers consume lots of CPU (ath5k)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 15
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-10-28 20:27 UTC by David Nečas
Modified: 2012-07-11 17:53 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-11 17:53:43 UTC
Type: ---


Attachments (Terms of Use)

Description David Nečas 2011-10-28 20:27:36 UTC
Description of problem:
With recent kernels, various kworker threads consume lots of CPU, causing various short hangs, audio skips, keyboard input lags, etc.

They do not consume CPU continuously, but in several seconds (typically) long bursts during which one or more kworkers essentially monopolize the CPU.

This happens immediately after a fresh boot and logging in, no suspend, service restarts, etc. involved.


Version-Release number of selected component (if applicable):
kernel-2.6.40.4-5.fc15.x86_64
kernel-2.6.40.6-0.fc15.x86_64
possibly others


How reproducible:
Always.


Steps to Reproduce:
1. Boot Fedora 15 with the specified kernels on a MSI PR210 notebook.
  

Actual results:
It runs smoothly.


Expected results:
It hangs for a short while occasionally with kworker/* eating the CPU.


Additional info:
I have read Bug 638912, Bug 683156 and Bug 729047 that seem to be related but not identical and tried the suggested debugging methods.  Printing /proc/THE_OFFENDING_KWORKER/stack does not show anything interesting, it always looks like

[<ffffffff8104fbd6>] schedule+0x5a/0x5c
[<ffffffff8106cb66>] worker_thread+0x158/0x15d
[<ffffffff8106ff5b>] kthread+0x84/0x8c
[<ffffffff8148fe24>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

However, the printing is also delayed so cat probably never actually had any chance to run during the hangs (on a two-core CPU).

Doing

echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe >out.txt

reveals nothing interesting either.

During the hangs perf top shows typically something like

   PerfTop:      32 irqs/sec  kernel:62.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ___________________

             2469.00 68.1% delay_tsc                      [kernel.kallsyms]  
              372.00 10.3% native_read_tsc                [kernel.kallsyms]  
              200.00  5.5% paravirt_read_tsc              [kernel.kallsyms]  
              163.00  4.5% ktime_get                      [kernel.kallsyms]  
               30.00  0.8% sched_clock_tick               [kernel.kallsyms]  
               28.00  0.8% clockevents_program_event      [kernel.kallsyms]  
               24.00  0.7% arch_local_irq_restore         [kernel.kallsyms]  
               23.00  0.6% update_ts_time_stats           [kernel.kallsyms]  
               22.00  0.6% tick_broadcast_oneshot_control [kernel.kallsyms]  
               21.00  0.6% tg_load_down                   [kernel.kallsyms]  

The problem appears to be related to wireless because it is correlated to large number of messages that have started appearing in the logs with the recent kernels:

Oct 28 22:13:43 smut kernel: [28859.103257] ath5k phy0: gain calibration timeout (2484MHz)
Oct 28 22:13:43 smut kernel: [28859.564545] ath5k phy0: gain calibration timeout (2412MHz)
Oct 28 22:14:37 smut kernel: [28913.594614] ath5k phy0: gain calibration timeout (2417MHz)
Oct 28 22:14:38 smut kernel: [28914.056594] ath5k phy0: gain calibration timeout (2422MHz)
Oct 28 22:14:38 smut kernel: [28914.517244] ath5k phy0: gain calibration timeout (2427MHz)
Oct 28 22:14:39 smut kernel: [28914.978610] ath5k phy0: gain calibration timeout (2432MHz)
Oct 28 22:14:39 smut kernel: [28915.438136] ath5k phy0: gain calibration timeout (2437MHz)
Oct 28 22:14:40 smut kernel: [28915.896755] ath5k phy0: gain calibration timeout (2442MHz)
Oct 28 22:14:40 smut kernel: [28916.356033] ath5k phy0: gain calibration timeout (2447MHz)
Oct 28 22:14:40 smut kernel: [28916.817086] ath5k phy0: gain calibration timeout (2452MHz)
Oct 28 22:14:41 smut kernel: [28917.274720] ath5k phy0: gain calibration timeout (2457MHz)
Oct 28 22:14:41 smut kernel: [28917.733986] ath5k phy0: gain calibration timeout (2462MHz)

Also, switching off the wifi (with a notebook button) seems to eliminate the problem (with wired network still on).

lcpci output:
00:00.0 Host bridge: ATI Technologies Inc RS690 Host Bridge
00:01.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (Internal gfx)
00:04.0 PCI bridge: ATI Technologies Inc Device 7914
00:06.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 2)
00:07.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 3)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:05.0 VGA compatible controller: ATI Technologies Inc RS690M [Radeon X1200 Series]
01:05.2 Audio device: ATI Technologies Inc Radeon X1200 Series Audio Controller
02:00.0 Ethernet controller: Atheros Communications Inc. AR242x / AR542x Wireless Network Adapter (PCI-Express) (rev 01)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
06:04.0 FLASH memory: ENE Technology Inc ENE PCI Memory Stick Card Reader Controller
06:04.1 SD Host controller: ENE Technology Inc ENE PCI SmartMedia / xD Card Reader Controller
06:04.2 FLASH memory: ENE Technology Inc Memory Stick Card Reader Controller
06:04.3 FLASH memory: ENE Technology Inc ENE PCI Secure Digital / MMC Card Reader Controller

Please tell me what more information/testing is needed.

Comment 1 Bob Copeland 2011-11-07 20:55:01 UTC
> Oct 28 22:14:40 smut kernel: [28916.356033] ath5k phy0: gain calibration
timeout (2447MHz)

Calibration does a lot of busy loops..  I think we can change some mdelays into sleeps without problems now that reset is called from process context.

Comment 2 David Nečas 2011-11-07 21:22:47 UTC
Old logs do not contain the calibration timeout messages.  Has the ath5k calibration changed and this is a consequence?

I can try to find the exact kernel package where the problem appeared (taking older kernels pkgs from koji) if it helps.  The only method to test a kernel seems to be to boot it and wait a few minutes to see if anything bad happens so bisection will take time but it's possible.

Comment 3 Nick Kossifidis 2011-11-07 22:03:22 UTC
Can you try using fast channel switching ?

There is a module option for that fastchanswitch, set it to true when loading the module and see if you get better results.

As for the udelay, I think we should switch to usleep_range for most stuff we use and in general we don't need accuracy e.g. when waiting for analog parts to settle, just a rough estimate. I already have some patches for calibration pending so I'll do the switch and see how it goes...

Comment 4 David Nečas 2011-11-08 19:13:38 UTC
I added `options ath5k fastchanswitch', enabled wireless, rebooted (to get to a defined state) and it seemed fixed.  But then I removed the option again, rebooted and still did not observe any problems - they miraculously disappeared.  All with the same 2.6.40.6-0.fc15.x86_64 kernel.  So I don't know what's going on here.

Can the calibration be influenced by other wifi devices around?

If it is so it makes the problem non-detereministic (when I experienced it it persisted through several reboots).

Comment 5 Kieran Clancy 2012-02-04 14:07:15 UTC
I'm seeing this too. Related to bug 638943?

Comment 6 Josh Boyer 2012-06-07 15:07:23 UTC
Is this still an issue on 2.6.43/3.3?

Comment 7 Josh Boyer 2012-07-11 17:53:43 UTC
Fedora 15 has reached it's end of life as of June 26, 2012.  As a result, we will not be fixing any remaining bugs found in Fedora 15.

In the event that you have upgraded to a newer release and the bug you reported is still present, please reopen the bug and set the version field to the newest release you have encountered the issue with.  Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

Thank you for taking the time to file a report.  We hope newer versions of Fedora suit your needs.


Note You need to log in before you can comment on or make changes to this bug.