Bug 1090619
Summary: kipmi0 process hangs with ipmitool

Field | Value
---|---
Product | Red Hat Enterprise Linux 6
Component | kernel
Version | 6.6
Hardware | x86_64
OS | Linux
Status | CLOSED DUPLICATE
Severity | high
Priority | high
Reporter | Tim Skirvin <tskirvin>
Assignee | Tony Camuso <tcamuso>
QA Contact | Red Hat Kernel QE team <kernel-qe>
CC | abezella, burt, dgibson, john.mora, kingsumos, martin.josefsson, mavit, pasteur, psklenar, riehecky, sauchter, sia, tbowling, tcamuso, yehs1
Target Milestone | rc
Target Release | ---
Doc Type | Bug Fix
Type | Bug
Regression | ---
Last Closed | 2014-06-16 14:06:47 UTC
Description (Tim Skirvin, 2014-04-23 19:02:49 UTC)
I have another customer who has reported a possibly similar experience. I have requested that they try to reproduce and force a vmcore/kdump at the time this is occurring, for deeper analysis.

Notes from customer (scrubbed for private data):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On RHEL 6.5, /dev/ipmi0 hangs and kipmi0 is using 100% CPU. This looks similar to bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1090619

We have an application running on RHEL 6.5 where the kipmi0 process goes to 100% CPU and all IPMI commands start failing. This application was previously running on RHEL 6.3, where we did not see the issue. The only way to recover is to reboot the blade.

RHEL 6.3 is kernel 2.6.32-279.19.1.el6.x86_64; RHEL 6.5 is kernel 2.6.32-431.1.2.el6.x86_64.

The OS is running on [sic. custom, RH certified] hardware on a shelf with a management card. The provider of the hardware has board support scripts, and one of them loops every two seconds running the following ipmitool commands:

/usr/bin/ipmitool raw 0x2e 0x00 0x0A 0x40 0x00
/usr/bin/ipmitool raw 0x2C 0x00 0

We have found that we can usually recreate the lockup by running lots of IPMI commands. We have a tool, getInfoIPMI, that can query the physical location of the blade. If we run 20 of those in a loop, and then run a script that executes the above two ipmitool commands in a two-second loop, we can usually create the issue. During the looping, the getInfoIPMI command sometimes fails to run the IPMI commands. By running lots of IPMI commands in the loop, we force the kipmi0 CPU usage to 97-100%. It might take 30 minutes or 2 hours, but the lockup is mostly reproducible. We sometimes adjust the number of getInfoIPMI commands we run, or adjust the delay in the ipmitool loop, and are usually successful at re-creating it.

When we try to recreate the issue on the RHEL 6.3 kernel, we see the getInfoIPMI errors but the device never locks up. We have also tried a couple of RHEL 6.4 kernels (2.6.32-358.23.2.el6.x86_64 and 2.6.32-358.11.1.el6.x86_64) and haven't been able to reproduce the lockup with those kernels either. We've been running RHEL 5 and RHEL 6 on this hardware for several years; the IPMI firmware and BIOS have not recently changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My other customer performed the following tests (note: on their own custom hardware):

1. Copy to the case the output of this command, to verify what the IPMI kernel parameters are set to:

   # sysctl -a | grep -i ipmi

   RESULT: dev.ipmi.poweroff_powercycle = 0

2. Execute this command when ipmi0 hangs, to see if it is able to restore IPMI functionality:

   # ipmitool mc reset cold

   RESULT: No success. ipmi0 was still hung.

3. Does the "Workaround for RHEL6" shown at https://access.redhat.com/site/solutions/759683 restore functionality? (I realise that KCS article is for IBM modules specifically, but the workaround may still be applicable.)

   RESULT: Yes, the workaround in 759683 works. On two of the blades I ran it while there were 30 doit scripts still banging on IPMI. This workaround was run on all 3 blades that hung, and IPMI was working again after it ran.
4. Try capturing the following debug output while ipmi0 is hung, and upload the dmesg data to the case afterwards:

   # echo 7 > /sys/module/ipmi_si/parameters/kcs_debug

   Since this is a bit verbose, wait for a short while (several seconds should be enough) and turn debugging off again:

   # echo 0 > /sys/module/ipmi_si/parameters/kcs_debug

   RESULT: with kcs_debug enabled, on two of the blades the output was "KCS: State = 9, 4"; on one of the blades the output was "KCS: State = 9, 0".

dmesg output from kcs_debug:

```
May 8 13:28:44 at072-s00c03h0 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 8 13:28:44 at072-s00c03h0 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 8 13:28:44 at072-s00c03h0 kernel: ADDRCONF(NETDEV_CHANGE): eth0.800: link becomes ready
May 8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:40:53 at072-s00c03h0 kernel: KCS: State = 9, 4
... # lots of repeated KCS messages ...
May 8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:40:55 at072-s00c03h0 kernel: KCS: State = 9, 4
May 8 14:46:44 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:47:53 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:48:39 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:49:10 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:50:53 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:52:55 at072-s00c03h0 kernel: Warning: dev (pts0) tty->count(4) != #fd's(3) in tty_release_dev
May 8 14:54:20 at072-s00c03h0 kernel: ipmi_si: Adding hotmod-specified kcs state machine
May 8 14:54:20 at072-s00c03h0 kernel: ipmi_si: Trying hotmod-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
May 8 14:54:20 at072-s00c03h0 kernel: ipmi_si ipmi_si.1: Found new BMC (man_id: 0x001956, prod_id: 0x0000, dev_id: 0x15)
May 8 14:54:20 at072-s00c03h0 kernel: IPMI poweroff: ATCA Detect mfg 0x1956 prod 0x0
May 8 14:54:20 at072-s00c03h0 kernel: IPMI poweroff: Found a ATCA style poweroff function
May 8 14:54:20 at072-s00c03h0 kernel: ipmi_si ipmi_si.1: IPMI kcs interface initialized
```

Created attachment 898672 [details]: getInfoIPMI.c reproducer

Customer provided a reproducer example from support case 01090131, using attachments getInfoIPMI.c, doit, and doit2. Instructions: HostA runs 20 doit scripts and HostB runs 1 doit2. After 45 minutes, I started 10 more doit scripts in the background; 20-30 minutes later the problem occurred. (A rough sketch of such a loop script appears after the attachment list below.)

> On two of the blades I ran it while there were 30 doit scripts that were still banging on ipmi.
> Confirmed that the workaround in https://access.redhat.com/site/solutions/759683 does work, but we need
> to understand the underlying problem. This workaround was run on all 3 blades that hung, and IPMI was
> working again after it ran.
> This has been working on this hardware for years on both rhel 5 and rhel 6. It just started to occur
> on one of our applications. It's possible one of our applications changed something (or is running
> something) that is a variable/catalyst for this. As mentioned previously, I could not recreate the hang
> on rhel6.3 or rhel6.4, but can recreate it on rhel6.5.
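The KCS article behind solution 759683 isn't reproduced in this bug, but the dmesg lines above ("Adding hotmod-specified kcs state machine ... at i/o address 0xca2") indicate the workaround removes and re-adds the wedged KCS interface through the ipmi_si driver's hotmod parameter. A minimal sketch, assuming the 0xca2 I/O address shown in that log (adjust for your BMC):

```sh
# Sketch of the hotmod-style recovery visible in the dmesg above; the
# 0xca2 address is taken from the "Trying hotmod-specified kcs state
# machine at i/o address 0xca2" line and may differ on other hardware.

# Tear down the hung KCS interface...
echo "remove,kcs,i/o,0xca2" > /sys/module/ipmi_si/parameters/hotmod

# ...then register it again, which re-initializes the interface.
echo "add,kcs,i/o,0xca2" > /sys/module/ipmi_si/parameters/hotmod
```

This avoids a full reboot of the blade, which the customer notes were otherwise forced to use.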
Created attachment 898674 [details]: doit reproducer

Created attachment 898676 [details]: doit2 reproducer
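The attachments themselves aren't quoted in this report, so as a rough illustration only: a stress loop of the kind described might look like the sketch below. The two raw commands and the two-second cadence come from the customer notes; the script shape is an assumption, not the actual doit attachment.

```sh
#!/bin/sh
# Illustrative stand-in for a "doit"-style stress script; the real
# attachments (doit, doit2, getInfoIPMI.c) may differ. The two raw
# commands and the ~2 s delay are taken from the customer notes above.
while :; do
    /usr/bin/ipmitool raw 0x2e 0x00 0x0A 0x40 0x00
    /usr/bin/ipmitool raw 0x2C 0x00 0
    sleep 2
done
```

Per the notes, running 20-30 of these in parallel alongside the getInfoIPMI queries reproduced the hang within roughly 30 minutes to 2 hours.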
Customer tests of bisected kernels on case 01090131 / BZ 1090619 point to char-ipmi-simplify-locking.patch as possibly introducing the behavior the customer is seeing. Dave, do you have any additional data or details to elaborate?

Terry,

Not much. Context is:

* The customer can get IPMI to stop responding (with kipmi chewing CPU) on 2.6.32-370, but not on an earlier kernel.
* I bisected down to show the regression appeared between 2.6.32-369 and -370.
* -370 contains a number (~36) of new IPMI-related patches.
* I bisected within -370 by building test kernels, starting with 370 minus all the IPMI-related patches and adding those patches back in gradually.
* A kernel based on 370, but excluding char-ipmi-simplify-locking.patch and all later IPMI-related patches, could not reproduce the problem.
* A kernel based on 370 including char-ipmi-simplify-locking.patch, but excluding all later IPMI-related patches, *can* reproduce the problem.

My customer tested the patched kernel build on bz 1098324 with positive results. Can we just close this as a dupe of bz 1098324?

*** This bug has been marked as a duplicate of bug 1098324 ***

Please reopen this or open access to bz 1098324 -- 1098324 isn't visible to the public.

(In reply to Terry Bowling from comment #12)
> my customer tested patched kernel build on bz 1098324 with positive results.
>
> Can we just close this as a dupe of bz 1098324?

Please open up bz 1098324, mere mortals aren't allowed to access it :( Or at least share the resolution / patches / etc. here as well.

Hello,

Bugzilla 1098324 was opened by a hardware partner, and we cannot open it up without permission from the originator, which I have requested. I can say that the issue in this BZ is being addressed by BZ 1098324, and a patch is being tested and evaluated for an errata release in 6.5 on BZ 1108360. It is not yet fully evaluated and there is no guarantee of when it will be released, but I personally am watching it closely for the customer I am working with who originated this BZ.

Additionally, there is another BZ 1077493 that may be of interest; it is making good progress for both RHEL 6.6 and the 6.5 errata BZ 1109712. Again, there are no guarantees of if/when the releases will be available, but I will watch them closely and try to update this BZ to assist with communication.

Best regards,
Terry Bowling

Hi,

What is the current status of BZ 1098324?

Kind regards,
Evandro

It is ON_QA. The patch has been incorporated into the kernel and is being subjected to QA testing.

We found a very similar issue to this one on RHEL 6.7. It causes the kipmi0 thread to chew one processor core at 100%, and ipmitool hangs in select() waiting for the IMM to return data. The ipmi_si driver module repeatedly submits KCS transactions ("start_kcs_transaction - 18 31"), and most of the KCS state events show up as "KCS: State = 5, 44".
```
top - 00:35:57 up 3:05, 2 users, load average: 1.00, 1.00, 0.93
Tasks: 1015 total, 2 running, 1013 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 2.5%sy, 0.0%ni, 97.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32158976k total, 2289596k used, 29869380k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 1730440k cached

  PID USER PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  P COMMAND
 2028 root 39  19     0    0    0 R 100.0  0.0 165:47.39  2 kipmi0
 5360 root 20   0  440m 137m  18m S   1.0  0.4   3:03.35  0 uxspi102_cli.rh
  196 root 20   0     0    0    0 S   0.3  0.0   0:00.18 33 events/33
    1 root 20   0 19232 1544 1260 S   0.0  0.0   0:04.19  1 init
    2 root 20   0     0    0    0 S   0.0  0.0   0:00.01  1 kthreadd
    3 root RT   0     0    0    0 S   0.0  0.0   0:00.20  0 migration/0
    4 root 20   0     0    0    0 S   0.0  0.0   0:00.08  0 ksoftirqd/0
    5 root RT   0     0    0    0 S   0.0  0.0   0:00.00  0 stopper/0
    6 root RT   0     0    0    0 S   0.0  0.0   0:00.00  0 watchdog/0
    7 root RT   0     0    0    0 S   0.0  0.0   0:00.00  1 migration/1
    8 root RT   0     0    0    0 S   0.0  0.0   0:00.00  1 stopper/1
```
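For anyone hitting the same symptom, the kcs_debug capture described in test 4 above can be automated. A hypothetical watcher sketch, not part of this bug report: it only reuses the standard ipmi_si sysfs parameter already shown above, and the CPU threshold, timings, and file names are illustrative.

```sh
#!/bin/sh
# Hypothetical helper, not from this bug report: poll kipmi0's CPU
# usage and, once it pegs a core, capture a short KCS debug trace
# using the same kcs_debug toggle shown in test 4 above.
while :; do
    # %CPU of the kipmi0 kernel thread (treated as 0 if not found)
    cpu=$(ps -C kipmi0 -o pcpu= | awk '{print int($1)}')
    if [ "${cpu:-0}" -ge 95 ]; then
        echo 7 > /sys/module/ipmi_si/parameters/kcs_debug
        sleep 5    # a few seconds is plenty; the debug log is verbose
        echo 0 > /sys/module/ipmi_si/parameters/kcs_debug
        dmesg | tail -200 > /tmp/kcs-debug.$(date +%s).log
        break
    fi
    sleep 10
done
```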