Bug 609156 - CPU usage of kipmi thread is too high..(95 to 98%)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: urgent
Target Milestone: rc
Target Release: 6.0
Assigned To: Peter Martuccelli
QA Contact: WANG Chao
Depends On:
Blocks: 594856
Reported: 2010-06-29 10:34 EDT by Shyam Iyer
Modified: 2015-04-28 00:18 EDT
Doc Type: Bug Fix
Last Closed: 2010-11-11 11:15:11 EST
Attachments:
Patch to add parameter kipmid_max_busy_us module parameter (3.47 KB, patch)
2010-07-12 22:55 EDT, Shyam Iyer
Description Shyam Iyer 2010-06-29 10:34:58 EDT
Description of problem:
CPU usage of the kipmi kernel thread is too high on Dell PowerEdge servers with IPMI.

Version-Release number of selected component (if applicable):
Latest kernel 2.6.32-28

How reproducible:
Often

Steps to Reproduce:
1. Run "ipmitool sel list" in a tight loop.
  
Actual results:
CPU usage shoots up to 95 to 98%

Expected results:
CPU usage of kipmi thread should be negligible
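
The reproduction step can be scripted. The following is a minimal sketch, assuming `ipmitool` is installed and the IPMI driver is loaded; the helper name `run_in_loop` and the iteration count are illustrative and do not come from this report.

```python
import subprocess


def run_in_loop(cmd, iterations):
    """Run cmd back to back `iterations` times; return how many runs exited 0."""
    succeeded = 0
    for _ in range(iterations):
        # Output is discarded: only the load generated on the IPMI
        # interface matters here, not the SEL contents.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            succeeded += 1
    return succeeded


# Example call (assumption: ipmitool is on PATH):
#   run_in_loop(["ipmitool", "sel", "list"], 1000)
```

While the loop runs, the kipmi thread's CPU usage can be watched in `top` from a second terminal.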

Additional info:
BZ 584106 attempted to solve this, but that did not fix the issue.

Based on the test results in BZ 584106, the kipmid_max_busy_us module parameter is required to solve this issue.
Comment 2 RHEL Product and Program Management 2010-06-29 11:03:09 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Matthew Garrett 2010-06-29 11:29:02 EDT
Does this manifest itself in real use cases? The polling is used because the hardware doesn't provide interrupts. Reducing the polling interval will result in increased latency in handling IPMI commands. kipmi will only be running if commands are in flight, so I think things are working as designed here.
Comment 4 Shyam Iyer 2010-06-29 11:48:34 EDT
In real-world use, the higher CPU usage is observed when OpenManage, Dell's systems management software, is running.

Opening up to other Dell folks to provide the specific use cases.
Comment 5 Shyam Iyer 2010-06-29 12:00:08 EDT
(In reply to comment #3)
> Does this manifest itself in real use cases? The polling is used because the
> hardware doesn't provide interrupts. Reducing the polling interval will result
> in increased latency in handling IPMI commands. kipmi will only be running if
> commands are in flight, so I think things are working as designed here.    

Sure, kipmi runs only when commands are in flight, but the OpenManage systems management software checks the sensors periodically, which spikes CPU usage considerably and affects normal operations.

In the past we have asked folks to see if they can shut off the OpenManage software for their performance-critical tasks, but sysadmins are wary of losing sensor logs during that time.

See the thread below:
https://patchwork.kernel.org/patch/13068/
Comment 6 Matthew Garrett 2010-06-29 13:07:25 EDT
The net effect of this patch is that if kipmid runs for more than the configured number of microseconds it'll then sleep for a millisecond. This still requires manual configuration and may potentially significantly slow bulk IPMI transactions such as firmware updates. How frequently is the OpenManage code actually triggering IPMI queries, and how long does each of those queries take?
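
The trade-off described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: it uses simulated time, treats each poll as costing a fixed `poll_cost_us`, and sleeps for `sleep_us` (1000 us, matching the "sleep for a millisecond" description) once the busy budget is used up. The function name and all default values are hypothetical.

```python
def simulate_poll(op_us, max_busy_us, poll_cost_us=1, sleep_us=1000):
    """Toy model of budgeted busy-polling on a simulated clock.

    op_us:       time the hardware needs to finish the command
    max_busy_us: busy budget before yielding; 0 means spin without limit
    Returns (busy_us, elapsed_us).
    """
    now = 0      # simulated wall-clock time in microseconds
    busy = 0     # total time spent spinning
    window = 0   # time spent spinning since the last sleep
    while now < op_us:
        now += poll_cost_us
        busy += poll_cost_us
        window += poll_cost_us
        if max_busy_us > 0 and window >= max_busy_us and now < op_us:
            now += sleep_us   # yield the CPU; the command keeps progressing
            window = 0
    return busy, now


for budget in (0, 100, 500):
    busy, elapsed = simulate_poll(5000, budget)
    print(f"budget={budget:>3}us busy={busy}us elapsed={elapsed}us "
          f"cpu={100.0 * busy / elapsed:.1f}%")
```

In this model a budget of 0 spins for the whole operation, while a non-zero budget trades a lower CPU share for completion that can land up to one sleep interval late, which is the latency concern raised above.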
Comment 7 Shyam Iyer 2010-07-01 11:38:06 EDT
From the OpenManage team:

"OpenManage queries the IPMI driver every 20 seconds and queries will last for about a second.

Firmware update to iDRAC doesn't happen over KCS for DRAC5 and above. iDRAC key is mounted as partition on host OS and image is transferred there and then upgrade initiated"
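
A rough back-of-the-envelope reading of those figures, assuming (this is not stated in the report) that the kipmi thread stays at the reported 95 to 98% for the whole one-second query and is idle for the rest of the 20-second period:

```python
def average_cpu_percent(busy_percent, query_seconds, period_seconds):
    """Average CPU share over one full polling period."""
    return busy_percent * query_seconds / period_seconds


# Figures from the report: ~1 s of querying every 20 s, at 95% to 98% CPU.
low = average_cpu_percent(95.0, 1.0, 20.0)
high = average_cpu_percent(98.0, 1.0, 20.0)
print(f"average load: {low:.2f}% to {high:.2f}% of one CPU")
```

Averaged over the period the load looks small; the complaint in this bug concerns the roughly one-second spike on one CPU every 20 seconds.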
Comment 8 Charles Rose 2010-07-07 09:53:07 EDT
Setting Sev to Urgent since Bug 584106 results in unacceptable performance on Dell PowerEdge servers.
Comment 9 Matthew Garrett 2010-07-07 10:31:20 EDT
This isn't an elegant solution and still requires manual configuration (and may result in problems for some other use cases), but upstream carries this so I guess there's no real harm. In the long term I'd recommend that Dell ship hardware that supports interrupts.
Comment 10 Charles Rose 2010-07-08 01:21:19 EDT
(In reply to comment #9)
> This isn't an elegant solution and still requires manual configuration (and may
> result in problems for some other use cases), but upstream carries this so I
> guess there's no real harm. In the long term I'd recommend that Dell ship
> hardware that supports interrupts.    

This is definitely the plan and we are working on the implementation, but we will not have it implemented by RHEL6 GA.
Comment 11 Shyam Iyer 2010-07-12 22:55:46 EDT
Created attachment 431335 [details]
Patch to add parameter kipmid_max_busy_us module parameter

This patch factors in the recent upstream fix for the regression caused by the module parameter patch.
Please test.
Comment 12 Shyam Iyer 2010-07-13 15:34:34 EDT
Updates from testing by Srini at Dell with the patch:

Reboot 1:

kipmid_max_busy_us    CPU utilization range    Real time
0                     94% to 96%               0.145s
100                   8% to 10%                0.288s
200                   2% to 5%                 0.093s
300                   0.3% to 1%               0.154s
400                   3% to 6%                 0.205s
500                   5% to 12%                0.077s

Reboot 2:

kipmid_max_busy_us    CPU utilization range    Real time
0                     95% to 97%               0.162s
100                   0.3% to 1.8%             0.248s
200                   0.3% to 0.5%             0.260s
300                   0.3% to 0.7%             0.211s
400                   2% to 4%                 0.291s
500                   5% to 8%                 0.113s

Reboot 3:

kipmid_max_busy_us    CPU utilization range    Real time
0                     95% to 98%               0.311s
100                   0.3% to 1.8%             0.183s
200                   0.3% to 0.5%             0.240s
300                   0.2% to 0.4%             0.201s
400                   2% to 4%                 0.492s
500                   5% to 8%                 0.121s
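
One way to compare the settings across the three reboots is to take the worst (highest) upper bound of the CPU range seen for each value. The sketch below transcribes the CPU ranges from the tables in this comment; the helper name and the choice of "worst-case upper bound" as the metric are illustrative assumptions, not part of the original testing.

```python
# CPU utilization ranges (low %, high %) per kipmid_max_busy_us setting,
# transcribed from the three reboot tables in this comment.
RUNS = {
    "reboot 1": {0: (94.0, 96.0), 100: (8.0, 10.0), 200: (2.0, 5.0),
                 300: (0.3, 1.0), 400: (3.0, 6.0), 500: (5.0, 12.0)},
    "reboot 2": {0: (95.0, 97.0), 100: (0.3, 1.8), 200: (0.3, 0.5),
                 300: (0.3, 0.7), 400: (2.0, 4.0), 500: (5.0, 8.0)},
    "reboot 3": {0: (95.0, 98.0), 100: (0.3, 1.8), 200: (0.3, 0.5),
                 300: (0.2, 0.4), 400: (2.0, 4.0), 500: (5.0, 8.0)},
}


def worst_case_upper(runs):
    """Map each setting to the highest upper bound seen in any run."""
    worst = {}
    for table in runs.values():
        for setting, (_low, high) in table.items():
            worst[setting] = max(worst.get(setting, 0.0), high)
    return worst


worst = worst_case_upper(RUNS)
best = min(worst, key=worst.get)
for setting in sorted(worst):
    print(f"{setting:>3} us: worst-case upper bound {worst[setting]}%")
print(f"lowest worst case: kipmid_max_busy_us={best}")
```

By this measure the values 200 and 300 stay at or below 5% in every run, with 300 lowest; the sample is small and the real-time column varies enough that this is indicative only.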
Comment 14 Charles Rose 2010-07-26 01:23:43 EDT
We did not find this fix in 2.6.32-44.1. What is the target kernel for this fix?
Comment 15 Aristeu Rozanski 2010-07-26 11:16:07 EDT
Patch(es) available on kernel-2.6.32-52.el6
Comment 18 Raghavendra Biligiri 2010-08-02 08:23:10 EDT
Verified that the patch is present in RHEL6-Snapshot8 (kernel 2.6.32-52).

Below is the CPU utilization we observed in RHEL6-Snapshot8:

kipmid_max_busy_us    CPU utilization range    Real time
0                     80% to 95%               0m37.45s
100                   0.6% to 1.0%             0m39.104s
200                   1%                       0m29.78s
300                   1.0% to 3.0%             0m28.974s
400                   2.0% to 10.0%            0m36.65s
500                   3% to 22%                0m30.93s
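
To check which value is currently in effect on a running system, the parameter can be read back from sysfs. This is a minimal sketch that assumes the parameter is exposed by the `ipmi_si` module, as in the upstream driver; the module name and path are not stated in this report, and the helper name is hypothetical.

```python
from pathlib import Path

# Assumption: the parameter belongs to the ipmi_si module, as upstream.
PARAM_PATH = Path("/sys/module/ipmi_si/parameters/kipmid_max_busy_us")


def read_max_busy_us(path=PARAM_PATH):
    """Return the configured values as a list of ints.

    Returns None when the file does not exist (module not loaded, or a
    kernel without the patch). The value is parsed as a comma-separated
    list because the parameter may hold one entry per IPMI interface.
    """
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    return [int(field) for field in text.split(",") if field.strip()]


print(read_max_busy_us())
```

Under the same assumption, a persistent setting would normally go in a modprobe configuration line such as `options ipmi_si kipmid_max_busy_us=300`.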
Comment 21 releng-rhel@redhat.com 2010-11-11 11:15:11 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
