609156 – CPU usage of kipmi thread is too high (95 to 98%)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	rc
Target Release:	6.0
Assignee:	Peter Martuccelli
QA Contact:	WANG Chao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	594856

Reported:	2010-06-29 14:34 UTC by Shyam Iyer
Modified:	2018-10-27 12:40 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-11 16:15:11 UTC
Target Upstream Version:
Embargoed:

Attachments
Patch to add parameter kipmid_max_busy_us module parameter (3.47 KB, patch) 2010-07-13 02:55 UTC, Shyam Iyer

Description Shyam Iyer 2010-06-29 14:34:58 UTC

Description of problem:
CPU usage of the kernel's kipmi thread is too high on Dell PowerEdge servers with IPMI.

Version-Release number of selected component (if applicable):
Latest kernel 2.6.32-28

How reproducible:
Often

Steps to Reproduce:
1. Run "ipmitool sel list" in a tight loop
2.
3.
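
The reproduction step above could be scripted roughly as follows. This is only a sketch: the helper name run_in_tight_loop and the iteration count are made up for illustration, and it assumes ipmitool is installed and the script runs with enough privileges to reach the IPMI device. CPU usage of the kipmi thread would be watched separately (for example in top) while the loop runs.

```python
import subprocess
import time


def run_in_tight_loop(command, iterations):
    """Run `command` back to back; return each run's wall-clock time in seconds.

    `command` is an argument list such as ["ipmitool", "sel", "list"].
    Output is discarded because only the load on the IPMI driver matters here.
    """
    durations = []
    for _ in range(iterations):
        start = time.monotonic()
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.monotonic() - start)
    return durations


# Hypothetical usage on an affected machine:
#   timings = run_in_tight_loop(["ipmitool", "sel", "list"], 1000)
#   print(sum(timings) / len(timings))
```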
  
Actual results:
CPU usage of the kipmi thread rises to 95 to 98%

Expected results:
CPU usage of kipmi thread should be negligible

Additional info:
BZ 584106 attempted to solve this but did not fix the issue.

Based on the test results in BZ 584106, the kipmid_max_busy_us kernel module parameter is required to solve it.

Comment 2 RHEL Program Management 2010-06-29 15:03:09 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Matthew Garrett 2010-06-29 15:29:02 UTC

Does this manifest itself in real use cases? The polling is used because the hardware doesn't provide interrupts. Reducing the polling interval will result in increased latency in handling IPMI commands. kipmi will only be running if commands are in flight, so I think things are working as designed here.

Comment 4 Shyam Iyer 2010-06-29 15:48:34 UTC

In real-world scenarios, the higher CPU usage is observed when Dell's systems management software, OpenManage, is running.

Opening this up to other Dell folks to provide the specific use cases.

Comment 5 Shyam Iyer 2010-06-29 16:00:08 UTC

(In reply to comment #3)
> Does this manifest itself in real use cases? The polling is used because the
> hardware doesn't provide interrupts. Reducing the polling interval will result
> in increased latency in handling IPMI commands. kipmi will only be running if
> commands are in flight, so I think things are working as designed here.    

Sure, kipmi runs only when commands are in flight, but the OpenManage software checks the sensors periodically, and that spikes CPU usage considerably and affects normal operations.

In the past we have asked folks whether they can shut off the OpenManage software during their performance-critical tasks, but sysadmins are wary of losing sensor logs during that time.

See the thread below:
https://patchwork.kernel.org/patch/13068/

Comment 6 Matthew Garrett 2010-06-29 17:07:25 UTC

The net effect of this patch is that if kipmid runs for more than the configured number of microseconds it'll then sleep for a millisecond. This still requires manual configuration and may potentially significantly slow bulk IPMI transactions such as firmware updates. How frequently is the OpenManage code actually triggering IPMI queries, and how long does each of those queries take?
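
The behaviour described above can be illustrated with a toy model. This is not the kernel code: it is a deterministic simulation, and the function name simulate_poll, the fixed device latency, and the exact 1000 us sleep are assumptions made for illustration. It treats a budget of 0 as "spin until the device is ready", which matches the 0 rows in the measurements later in this report.

```python
def simulate_poll(device_ready_us, max_busy_us, sleep_us=1000):
    """Model a polling thread that waits for a device to become ready.

    Returns (busy_us, elapsed_us): time spent spinning on the CPU and the
    simulated time at which the thread notices the device is ready.
    max_busy_us == 0 means no budget, i.e. spin the whole time.
    """
    now = 0
    busy = 0
    while now < device_ready_us:
        remaining = device_ready_us - now
        if max_busy_us == 0:
            spin = remaining                    # unlimited busy-wait
        else:
            spin = min(max_busy_us, remaining)  # spin only up to the budget
        now += spin
        busy += spin
        if now < device_ready_us:
            now += sleep_us                     # budget used up: sleep, then retry
    return busy, now


for budget in (0, 100, 500):
    busy, elapsed = simulate_poll(10_000, budget)
    print(budget, busy, elapsed, f"{100 * busy / elapsed:.0f}% busy")
```

In this model a 10 ms device latency with a 500 us budget gives 3500 us of spinning over 10500 us of elapsed time, so the CPU share drops but the completion is noticed later. That is the latency trade-off raised in this comment. The absolute numbers are not meant to match the measured ones.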

Comment 7 Shyam Iyer 2010-07-01 15:38:06 UTC

From the OpenManage team:

"OpenManage queries the IPMI driver every 20 seconds and queries will last for about a second.

Firmware update to iDRAC doesn't happen over KCS for DRAC5 and above. iDRAC key is mounted as partition on host OS and image is transferred there and then upgrade initiated"

Comment 8 Charles Rose 2010-07-07 13:53:07 UTC

Setting Sev to Urgent since Bug 584106 results in unacceptable performance on Dell PowerEdge servers.

Comment 9 Matthew Garrett 2010-07-07 14:31:20 UTC

This isn't an elegant solution and still requires manual configuration (and may result in problems for some other use cases), but upstream carries this so I guess there's no real harm. In the long term I'd recommend that Dell ship hardware that supports interrupts.

Comment 10 Charles Rose 2010-07-08 05:21:19 UTC

(In reply to comment #9)
> This isn't an elegant solution and still requires manual configuration (and may
> result in problems for some other use cases), but upstream carries this so I
> guess there's no real harm. In the long term I'd recommend that Dell ship
> hardware that supports interrupts.    

This is definitely the plan and we are working on the implementation, but we will not have it implemented by RHEL6 GA.

Comment 11 Shyam Iyer 2010-07-13 02:55:46 UTC

Created attachment 431335 [details]
Patch to add parameter kipmid_max_busy_us module parameter

This patch factors in the recent upstream fix for the regression caused by the module parameter patch.
Please test.
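
For anyone testing, here is a small sketch of how the parameter might be set. The module name ipmi_si and the sysfs layout are taken from the upstream driver rather than from this bug, so treat them as assumptions for the RHEL 6 build. Whether the value can be changed at runtime depends on the permissions the patch gives the parameter. The helper names are hypothetical.

```python
from pathlib import Path

MODULE = "ipmi_si"            # assumption: upstream IPMI system interface driver
PARAM = "kipmid_max_busy_us"  # parameter added by the attached patch


def modprobe_options_line(max_busy_us):
    """Build a line for a file under /etc/modprobe.d/ (file name is up to you)."""
    if max_busy_us < 0:
        raise ValueError("max_busy_us must be non-negative")
    return f"options {MODULE} {PARAM}={max_busy_us}"


def sysfs_param_path(sysfs_root="/sys"):
    """Where the current value would appear if the module exposes it in sysfs."""
    return Path(sysfs_root) / "module" / MODULE / "parameters" / PARAM


print(modprobe_options_line(300))
print(sysfs_param_path())
```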

Comment 12 Shyam Iyer 2010-07-13 19:34:34 UTC

Updates from testing by Srini at Dell with the patch:

Reboot 1:

kipmid_max_busy_us    CPU utilization range    Real time
0                     94% to 96%               0.145s
100                   8% to 10%                0.288s
200                   2% to 5%                 0.093s
300                   0.3% to 1%               0.154s
400                   3% to 6%                 0.205s
500                   5% to 12%                0.077s

Reboot 2:

kipmid_max_busy_us    CPU utilization range    Real time
0                     95% to 97%               0.162s
100                   0.3% to 1.8%             0.248s
200                   0.3% to 0.5%             0.260s
300                   0.3% to 0.7%             0.211s
400                   2% to 4%                 0.291s
500                   5% to 8%                 0.113s

Reboot 3:

kipmid_max_busy_us    CPU utilization range    Real time
0                     95% to 98%               0.311s
100                   0.3% to 1.8%             0.183s
200                   0.3% to 0.5%             0.240s
300                   0.2% to 0.4%             0.201s
400                   2% to 4%                 0.492s
500                   5% to 8%                 0.121s
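
One way to read the three reboot tables above is to compare the worst-case (upper bound) CPU utilization per setting. The sketch below copies the CPU ranges from those tables. The helper names and the choice of "lowest worst case" as the criterion are illustrative assumptions, not something the testers stated.

```python
# setting (us) -> [(low %, high %) for reboot 1, 2, 3], copied from the tables
CPU_RANGES = {
    0:   [(94, 96), (95, 97), (95, 98)],
    100: [(8, 10), (0.3, 1.8), (0.3, 1.8)],
    200: [(2, 5), (0.3, 0.5), (0.3, 0.5)],
    300: [(0.3, 1), (0.3, 0.7), (0.2, 0.4)],
    400: [(3, 6), (2, 4), (2, 4)],
    500: [(5, 12), (5, 8), (5, 8)],
}


def worst_case_cpu(ranges):
    """Highest upper bound seen for one setting across all reboots."""
    return max(high for _low, high in ranges)


def best_setting(table):
    """Setting whose worst-case CPU utilization is lowest."""
    return min(table, key=lambda setting: worst_case_cpu(table[setting]))


for setting, ranges in CPU_RANGES.items():
    print(setting, worst_case_cpu(ranges))
print("lowest worst case:", best_setting(CPU_RANGES))
```

On this data the lowest worst case belongs to 300, but that reflects one machine and three runs. The real-time column is not considered here.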

Comment 14 Charles Rose 2010-07-26 05:23:43 UTC

We did not find this fix in 2.6.32-44.1. What is the target kernel for this fix?

Comment 15 Aristeu Rozanski 2010-07-26 15:16:07 UTC

Patch(es) available on kernel-2.6.32-52.el6

Comment 18 Raghavendra Biligiri 2010-08-02 12:23:10 UTC

Verified that the patch is present in RHEL6-Snapshot8 (kernel 2.6.32-52).

Below is the CPU utilization we observed in RHEL6-Snapshot8:

kipmid_max_busy_us    CPU utilization range    Real time
0                     80% to 95%               0m37.45s
100                   0.6% to 1.0%             0m39.104s
200                   1%                       0m29.78s
300                   1.0% to 3.0%             0m28.974s
400                   2.0% to 10.0%            0m36.65s
500                   3% to 22%                0m30.93s

Comment 21 releng-rhel@redhat.com 2010-11-11 16:15:11 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.