Bug 1090619

Summary: kipmi0 process hangs with ipmitool
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.6
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Reporter: Tim Skirvin <tskirvin>
Assignee: Tony Camuso <tcamuso>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: abezella, burt, dgibson, john.mora, kingsumos, martin.josefsson, mavit, pasteur, psklenar, riehecky, sauchter, sia, tbowling, tcamuso, yehs1
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-06-16 14:06:47 UTC

Attachments:
    getInfoIPMI.c reproducer
    doit reproducer
    doit2 reproducer

Description Tim Skirvin 2014-04-23 19:02:49 UTC
Description of problem:

We're seeing an intermittent but irritating bug with IPMI on our RHEL6 Linux hosts.  Specifically, the kipmi0 process occasionally hangs, driving its CPU usage to ~100% (not actually a problem) and causing commands like 'ipmitool chassis power status' to hang indefinitely (a bigger problem).  'service ipmi status' says that all is well:

    [root@cmswn1535 ~]# service ipmi status
    ipmi_msghandler module in kernel.
    ipmi_si module in kernel.
    ipmi_devintf module loaded.
    /dev/ipmi0 exists.

A reboot seems to take care of the problem, and early experiments seem to indicate that a full power cycle is *not* necessary.

Version-Release number of selected component (if applicable):

We've seen this on 6.4 and 6.5, but I suspect it's a problem on other releases.

How reproducible:

Difficult.  

Steps to Reproduce:
1.  Run ipmitool commands repeatedly (a rough loop sketch follows below), e.g.:
    ipmitool chassis bootdev 'Force PXE' options=persistent
2.  Eventually the interface hangs.
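A minimal, illustrative loop of the sort described in step 1 (the specific commands here are just examples; any frequent ipmitool traffic against the local BMC is the point):

    #!/bin/sh
    # Illustrative stress loop -- hit the local BMC with frequent ipmitool calls.
    while true; do
        ipmitool chassis bootdev 'Force PXE' options=persistent
        ipmitool chassis power status
        sleep 1
    done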

Actual results:

kipmi0 CPU usage goes to 100%; all further calls to ipmitool hang indefinitely.

Expected results:

We get a response of some sort.

Comment 2 Terry Bowling 2014-05-07 12:48:52 UTC
I have another customer who has reported a possibly similar experience.  I have requested that they try to reproduce and force a vmcore/kdump at the time this is occurring for deeper analysis.

Notes from customer (scrubbed for private data):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
on rhel6.5, /dev/ipmi0 hangs and kipmi0 is using 100% CPU

This looks similar to bugzilla:

    https://bugzilla.redhat.com/show_bug.cgi?id=1090619

We have an application running on rhel6.5, and its team is seeing the kipmi0
process go to 100% CPU and all IPMI commands start failing.  The application
was previously running on rhel6.3, where they did not see the issue.

The only way to recover is to reboot the blade.

RHEL6.3 is kernel:  2.6.32-279.19.1.el6.x86_64
RHEL6.5 is kernel:  2.6.32-431.1.2.el6.x86_64

The OS is running on [sic. custom, RH certified] hardware on a shelf with a 
management card.   The provider of the hardware has board support scripts and 
one of them loops every two seconds running the following ipmitool commands:

    /usr/bin/ipmitool raw 0x2e 0x00 0x0A 0x40 0x00
    /usr/bin/ipmitool raw 0x2C 0x00 0

We have found that we can usually recreate the lockup by running lots of
IPMI commands.  We have a tool that can query the physical location of
the blade, getInfoIPMI.  If we run 20 of those in a loop, and then run
a script that executes the above two ipmitool commands in a two-second loop,
we can usually trigger the issue.  During the looping, the
getInfoIPMI command sometimes fails to run its IPMI commands.
By running lots of IPMI commands in the loop, we force the kipmi0
CPU usage to 97-100%.  As mentioned, it might take 30 minutes or 2 hours,
but the lockup is mostly reproducible.  We sometimes adjust the number of
getInfoIPMI commands we run or the delay in the ipmitool loop, and we
are usually successful at re-creating it.
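For reference, a rough sketch of that stress pattern (this is only an approximation of the setup described above, not the actual reproducer scripts; getInfoIPMI is our own tool and is assumed here to sit in the current directory):

    #!/bin/sh
    # Sketch of the stress pattern described above.
    # Start 20 background loops running getInfoIPMI against the local BMC.
    for i in $(seq 1 20); do
        ( while true; do ./getInfoIPMI >/dev/null 2>&1; done ) &
    done

    # In parallel, replay the board-support script's two raw commands every 2 s.
    while true; do
        /usr/bin/ipmitool raw 0x2e 0x00 0x0A 0x40 0x00
        /usr/bin/ipmitool raw 0x2C 0x00 0
        sleep 2
    done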

When we try to recreate the issue on the rhel6.3 kernel, we
see the getInfoIPMI errors but the device never locks up.  We have also
tried a couple of rhel6.4 kernels (2.6.32-358.23.2.el6.x86_64 and
2.6.32-358.11.1.el6.x86_64) and haven't been able to reproduce the lockup
with those kernels either.

We've been running rhel 5 and rhel 6 on this hardware for several years.
The IPMI firmware and BIOS have not recently changed.

Comment 3 Terry Bowling 2014-05-08 20:40:42 UTC
My other customer performed the following tests (note, on their own custom hardware):

1.  Copy the output of this command to the case to verify what the ipmi kernel parameters are set to:
	# sysctl -a | grep -i ipmi

RESULT:
dev.ipmi.poweroff_powercycle = 0


2.  Execute this command when ipmi0 hangs to see if it is able to restore ipmi functionality:
	# ipmitool mc reset cold

RESULT:
No success.  ipmi0 was still hung.


3.  Does the "Workaround for RHEL6" show at https://access.redhat.com/site/solutions/759683 restore functionality?  (I realise that KCS is for IBM modules specifically, but the workaround may still be aplicable).

RESULT:
Yes, the workaround in 759683 works.  On two of the blades I ran it while 30 doit scripts were still banging on ipmi.  The workaround was run on all 3 blades that hung, and IPMI was working again after it ran.
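(The article's exact steps are not reproduced here; judging by the hotmod messages in the dmesg output under item 4 below, the workaround re-drives the interface through the ipmi_si hotmod parameter, roughly along these lines -- the 0xca2 KCS i/o address is taken from that dmesg and may differ on other platforms:)

    # Sketch only: remove and re-add the KCS interface via ipmi_si's hotmod
    # parameter.  0xca2 is the address reported in the dmesg below.
    echo "remove,kcs,i/o,0xca2" > /sys/module/ipmi_si/parameters/hotmod
    echo "add,kcs,i/o,0xca2" > /sys/module/ipmi_si/parameters/hotmod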


4.  Try capturing the following debug output while ipmi0 is hung; upload dmesg data to the case after:

	# echo 7 > /sys/module/ipmi_si/parameters/kcs_debug

	Since this is a bit verbose, wait for a short while - several seconds should be enough - and 
	turn debugging off again:

	# echo 0 > /sys/module/ipmi_si/parameters/kcs_debug

	dmesg then shows something like this:

RESULT:
kcs_debug -- on two of the blades the output was:

     KCS: State = 9, 4

on one of the blades the output was:

     KCS: State = 9, 0

dmesg output from kcs_debug:

May  8 13:28:44 at072-s00c03h0 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May  8 13:28:44 at072-s00c03h0 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May  8 13:28:44 at072-s00c03h0 kernel: ADDRCONF(NETDEV_CHANGE): eth0.800: link becomes ready
May  8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
...
# lots of repeated KCS messages
...
May  8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May  8 14:46:44 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:47:53 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:48:39 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:49:10 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:50:53 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:52:55 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May  8 14:54:20 at072-s00c03h0 kernel: ipmi_si: Adding hotmod-specified kcs state machine
May  8 14:54:20 at072-s00c03h0 kernel: ipmi_si: Trying hotmod-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
May  8 14:54:20 at072-s00c03h0 kernel: ipmi_si ipmi_si.1: Found new BMC (man_id: 0x001956, prod_id: 0x0000, dev_id: 0x15)
May  8 14:54:20 at072-s00c03h0 kernel: IPMI poweroff: ATCA Detect mfg 0x1956 prod 0x0
May  8 14:54:20 at072-s00c03h0 kernel: IPMI poweroff: Found a ATCA style poweroff function
May  8 14:54:20 at072-s00c03h0 kernel: ipmi_si ipmi_si.1: IPMI kcs interface initialized
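For convenience, the capture in item 4 can be wrapped in a small script.  This is just a sketch of the same steps shown above (same kcs_debug parameter path, ~10-second capture window); attach the full dmesg to the case as before:

    #!/bin/sh
    # Sketch: grab a short burst of KCS debug output while ipmi0 is hung.
    echo 7 > /sys/module/ipmi_si/parameters/kcs_debug
    sleep 10                                # a few seconds of traffic is enough
    echo 0 > /sys/module/ipmi_si/parameters/kcs_debug
    dmesg | grep 'KCS:' | tail -n 20        # quick look at the last KCS states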

Comment 7 Terry Bowling 2014-05-23 13:41:21 UTC
Created attachment 898672 [details]
getInfoIPMI.c reproducer

Customer provided reproducer example from support case 01090131

Using attachments getInfoIPMI.c, doit, and doit2

Instructions: HostA runs 20 doit scripts and HostB runs 1 doit2.  After 45 minutes, I started 10 more doit scripts in the background; 20-30 minutes later the problem occurred.

 > On two of the blades I ran it while there were 30 doit scripts that were still banging on ipmi. 
 > Confirmed that the workaround in https://access.redhat.com/site/solutions/759683 does work but we need 
 > to understand the underlying problem.   This workaround was run on all 3 blades that hung and IPMI was 
 > working again after it ran.

 > This has been working on this hardware for years on both rhel 5 and rhel 6.   It just started to occur 
 > on one of our applications.   It's possible one of our applications changed something (or is running 
 > something) that is a variable/catalyst to this.   As mentioned previously, I could not recreate the hang 
 > in rhel6.3 or rhel6.4 but can recreate it on rhel6.5.

Comment 8 Terry Bowling 2014-05-23 13:42:39 UTC
Created attachment 898674 [details]
doit  reproducer

Comment 9 Terry Bowling 2014-05-23 13:43:27 UTC
Created attachment 898676 [details]
doit2  reproducer

Comment 10 Terry Bowling 2014-05-27 12:26:14 UTC
Customer tests of bisected kernels on case 01090131 / BZ 1090619 point to char-ipmi-simplify-locking.patch as possibly introducing the behavior the customer is seeing.

Dave, do you have any additional data or details to elaborate?

Comment 11 David Gibson 2014-05-28 06:42:27 UTC
Terry,

Not much.  Context is:
   * Customer can get ipmi to stop responding (with kipmi chewing CPU) on 2.6.32-370, but not on an earlier kernel.
   * I bisected down to show the regression appeared between 2.6.32-369 and -370.
   * -370 contains a number (~36) of new IPMI-related patches.
   * I bisected within -370 by building test kernels starting with 370 minus all the IPMI-related patches, then adding those patches back in gradually.
   * A kernel based on 370, but excluding char-ipmi-simplify-locking.patch and all later IPMI-related patches, could not reproduce the problem.
   * A kernel based on 370 including char-ipmi-simplify-locking.patch, but excluding all later IPMI-related patches, *can* reproduce the problem.

Comment 12 Terry Bowling 2014-06-14 11:44:31 UTC
My customer tested the patched kernel build from bz 1098324 with positive results.

Can we just close this as a dupe of bz 1098324?

Comment 13 Tony Camuso 2014-06-16 14:06:47 UTC

*** This bug has been marked as a duplicate of bug 1098324 ***

Comment 14 Burt Holzman 2014-06-16 15:03:30 UTC
Please reopen this or open access to bz 1098324 -- 1098324 isn't visible to the public.

Comment 15 Martin Josefsson 2014-07-09 13:00:36 UTC
(In reply to Terry Bowling from comment #12)
> my customer tested patched kernel build on bz 1098324 with positive results.
> 
> Can we just close this as a dupe of bz 1098324?

Please open up bz 1098324, mere mortals aren't allowed to access it :(
Or at least share the resolution / patches / etc here as well.

Comment 16 Terry Bowling 2014-07-12 16:58:09 UTC
Hello,

Bugzilla 1098324 was opened by a hardware partner and we cannot open that up without permission from the originator, which I have requested.

I can say that the issue for this BZ is being addressed by BZ 1098324 and a patch is being tested and evaluated for an errata release in 6.5 on BZ 1108360.  It is not yet fully evaluated and there is no guarantee of when it will be released, but I personally am watching it closely for the customer I am working with that originated this BZ.

Additionally, another BZ, 1077493, may be of interest; it is making good progress for both rhel 6.6 and the 6.5 errata (BZ 1109712).  Again, there are no guarantees of if/when the releases will be available, but I will watch them closely.

I will try to keep this bz updated to assist with communication.

Best regards,
Terry Bowling

Comment 17 Asd123 2014-08-14 16:45:50 UTC
Hi,

What is the current status of BZ 1098324?

Kind regards,
Evandro

Comment 18 Tony Camuso 2014-08-14 19:01:27 UTC
It is ON_QA.

The patch has been incorporated into the kernel and is being subjected to QA testing.

Comment 19 Huaisheng 2016-06-29 08:23:14 UTC
We found a very similar issue to this one on RHEL6.7.
It leads the kipmi0 thread to consume 100% of one processor core, while ipmitool hangs in select() waiting for the IMM to return data.  The ipmi_si driver module repeatedly submits KCS transactions such as 'start_kcs_transaction - 18 31', and most of the KCS state events are reported as 'KCS: State = 5, 44'.



top - 00:35:57 up  3:05,  2 users,  load average: 1.00, 1.00, 0.93
Tasks: 1015 total,   2 running, 1013 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32158976k total,  2289596k used, 29869380k free,        0k buffers
Swap:        0k total,        0k used,        0k free,  1730440k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND
 2028 root      39  19     0    0    0 R 100.0  0.0 165:47.39  2 kipmi0
 5360 root      20   0  440m 137m  18m S  1.0  0.4   3:03.35  0 uxspi102_cli.rh
  196 root      20   0     0    0    0 S  0.3  0.0   0:00.18 33 events/33
    1 root      20   0 19232 1544 1260 S  0.0  0.0   0:04.19  1 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01  1 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.20  0 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.08  0 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00  0 stopper/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00  0 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.00  1 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00  1 stopper/1