Bug 754611

Summary: Kernel can't recover upon receiving IOCK NMI.
Product: Red Hat Enterprise Linux 5
Reporter: Vitaly <v.mayatskih>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED WONTFIX
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 5.7
CC: jfeeney, prarit, tcamuso
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-08-10 23:05:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments (description, flags):
"lspci -vvv" before NMI on RHEL-6, flags: none
"lspci -vvv" after NMI on RHEL-6, flags: none
dmesg after IOCK NMI on RHEL-6, flags: none
IOCK NMI message on RHEL-5, flags: none
dmesg/el5, flags: none

Description Vitaly 2011-11-17 01:18:24 UTC
Description of problem:

On an HP ProLiant DL380 G7, an IOCK NMI is issued when an error is detected on the PCI/PCI-e bus. The kernel in RHEL-5.7 can't recover from that kind of NMI and loops infinitely.

Version-Release number of selected component (if applicable):

2.6.18-274.7.1.el5

How reproducible:

Always.

Steps to Reproduce:
1. Plug a PCI-e device into a PCI-e slot in an HP DL380 G7 (G5 and G6 are sane). In our case it's an IO-expansion card by OneStop connected via PCIe-over-cable to a PCI-e device inside an FPGA.
2. Boot the OS.
3. Remove the PCI-e device from the slot or unplug the cable from the IO-expansion card.

The PCI-e port associated with the device is not hotpluggable; it sends an AER report regarding an uncorrectable error, and an IOCK NMI is issued. The system does not recover from the NMI and loops forever, printing this backtrace to the console:

NMI: IOCK error (debug interrupt?)
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom sg i7core_edac shpchp edac_mc bnx2 serio_raw tpm_tis pcspkr tpm tpm_bios hpilo dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G     ---- 2.6.18-274.7.1.el5 #1
RIP: 0010:[<ffffffff8006b9bf>]  [<ffffffff8006b9bf>] mwait_idle_with_hints+0x66/0x67
RSP: 0018:ffffffff8045df40  EFLAGS: 00000046
RAX: 0000000000000020 RBX: ffff8101268da178 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000020
RBP: 12372ffd828a73e0 R08: 0000000000040bac R09: 0000000000000038
R10: ffff810127e94038 R11: 0000000000000202 R12: 0000000000000000
R13: ffff8101268da000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8042c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaacd5000 CR3: 0000000117a2c000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffffffff8045c000, task ffffffff80315b60)
Stack:  ffffffff801a2afd 0000000000000000 00000000000000b3 0000000000000000
 00000000000000b3 0000000000000000 ffffffff801a29a1 0000000000090000
 0000000000000000 0000000000000000 ffffffff80048fe2 0000000000200800
Call Trace:
 [<ffffffff801a2afd>] acpi_processor_idle_simple+0x15c/0x31c
 [<ffffffff801a29a1>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff80048fe2>] cpu_idle+0x95/0xb8
 [<ffffffff80467809>] start_kernel+0x220/0x225
 [<ffffffff8046722f>] _sinittext+0x22f/0x236

Code: c3 41 57 41 56 49 89 f6 41 55 49 89 fd 41 54 4c 8d a7 e0 02

Actual results:

System dies.

Expected results:

Correct recovery from the IOCK NMI.

Additional info:

RHEL-6 is sane in that respect: it reports IOCK NMI once and recovers.

Comment 1 Don Zickus 2011-11-22 19:48:24 UTC
Hi Vitaly,

By default RHEL-5 doesn't do anything with AER, which means the OS probably isn't clearing the NMI, thus causing an endless loop of NMI IOCK errors.  In RHEL-6 we handle all this correctly, which is probably why you only see this once.

Try booting with the kernel option 'aer'.  This enables the support in RHEL-5 and will hopefully handle this the way you intended.

Cheers,
Don

Comment 2 Vitaly 2011-11-22 21:06:42 UTC
No, aer didn't help.

In fact, any failure in a PCI-Express device will bring a DL380 G7 + RHEL-5 system down. That's not good.

Comment 3 Vitaly 2011-11-22 21:07:29 UTC
Created attachment 535191 [details]
"lspci -vvv" before NMI on RHEL-6

Comment 4 Vitaly 2011-11-22 21:07:59 UTC
Created attachment 535192 [details]
"lspci -vvv" after NMI on RHEL-6

Comment 5 Don Zickus 2012-06-19 16:49:28 UTC
Hi Vitaly,

Can you send me the whole 'dmesg' output, so I can see the AER output too?
The IOCK NMI is most likely coming from the HP iLO.  We have a bz opened to support that properly for RHEL-6.  All the fix really does is record and reboot the machine when it detects an IOCK NMI.

I can probably hack up something similar for RHEL-5.  Though I am not entirely sure it will be accepted this late in the RHEL-5 cycle.

Cheers,
Don

Comment 6 Don Zickus 2012-08-03 18:23:12 UTC
Hi Tony,

Can you give me your thoughts on this bz?  Does HP support this?  Is this the iLO acting up again?

Cheers,
Don

Comment 7 Vitaly 2012-08-03 20:18:12 UTC
Created attachment 602182 [details]
dmesg after IOCK NMI on RHEL-6

I can't get dmesg on RHEL-5, because it dies in an endless loop. Here's the dmesg captured on RHEL-6. I don't see any sign of AER.

Comment 8 Tony Camuso 2012-08-03 20:27:51 UTC
(In reply to comment #6)
> Hi Tony,
> 
> Can you give me your thoughts on this bz?  Does HP support this?  Is this
> the iLO acting up again?
> 
> Cheers,
> Don

We need a screen shot. 

Vitaly, try ssh from a terminal window with a deep screen buffer to the iLO ...

ssh Administrator.whatever

... which connects to the Virtual Serial Console.

You will need to edit grub to send output to the serial port, so boot normally first. 

When you have the Virtual Serial Port working, then try your experiment. You should be able to capture all messages in the terminal window's scroll buffer.

Comment 9 Vitaly 2012-08-03 21:17:24 UTC
Created attachment 602189 [details]
IOCK NMI message on RHEL-5

I see this message printed in a loop.

Comment 10 Tony Camuso 2012-08-03 21:22:50 UTC
Vitaly, we must see all the information leading up to that point. Seeing the stack trace does not give us enough information.

Please follow the instructions I listed above and give us all the screen output from the beginning of boot until you get the NMI. 

You will need a deep terminal buffer, say, 10000 lines.

Comment 11 Vitaly 2012-08-03 21:25:05 UTC
The last file is a copy-paste from the iLO/VSP. There's nothing interesting prior to the IOCK NMI. I can attach a boot log or dmesg if you want.

Comment 12 Don Zickus 2012-08-07 14:06:43 UTC
Vitaly,

The IOCK NMI is most likely coming from the iLO.  The question is why.  Providing the boot log or dmesg might be able to give us a clue.  Are you suggesting the iLO logs do not provide that information?

I understand you can't get it from the console because of the never ending stream of IOCK NMIs, but Tony was hoping the iLO would capture the serial stream.  This is the output we would like to see.

Cheers,
Don

Comment 13 Vitaly 2012-08-07 14:39:26 UTC
Created attachment 602771 [details]
dmesg/el5

dmesg attached.

We have seen the G5's iLO trigger NMIs, but that is not the case with the G7. At least the first interrupt is triggered by the PCI Express root port (to which our device is attached), and when we block error reporting (DevCtl register), no more NMIs occur.
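
For readers wondering what "blocking error reporting" in DevCtl can look like: the comment does not say which bits were changed, so the following is only a sketch under the assumption that all four error-reporting enable bits of the PCI Express Device Control register (bits 0-3: correctable, non-fatal, fatal, unsupported request) are cleared. The helper name mask_devctl and the ROOT_PORT placeholder are hypothetical, and the commented setpci usage assumes a pciutils version that understands the CAP_EXP capability name.

```shell
#!/bin/sh
# Sketch only: clear the four error-reporting enable bits (bits 0-3) of a
# PCIe Device Control (DevCtl) value. Input and output are 16-bit hex
# values without a 0x prefix, which is how setpci prints register contents.
# mask_devctl is a hypothetical helper name, not part of any tool.
mask_devctl() {
    printf '%04x\n' $(( 0x$1 & 0xfff0 ))
}

# Hypothetical usage on real hardware (as root). ROOT_PORT is a placeholder
# for the root port's bus:device.function address as listed by lspci;
# DevCtl sits at offset 8 inside the PCI Express capability:
#   old=$(setpci -s "$ROOT_PORT" CAP_EXP+8.w)
#   setpci -s "$ROOT_PORT" CAP_EXP+8.w="$(mask_devctl "$old")"
```

Note that this only stops the port from reporting errors; it hides the symptom and does not make surprise removal safe.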

Comment 14 Don Zickus 2012-08-07 16:11:38 UTC
Hi Vitaly,

I am just trying to make sure I understand the scenario here:

You boot the system, hot-unplug a pcie cable and then you get flooded with NMI IOCK messages, correct?

You are expecting only one NMI in that situation, right?

Cheers,
Don

Comment 15 Vitaly 2012-08-07 19:34:25 UTC
That's right.

Comment 16 Don Zickus 2012-08-08 15:41:09 UTC
Hi Vitaly,

Before you hot unplug your device, can you run

modprobe acpiphp

This is supposed to attach to the root bridge and handle
hotplug events.  Hopefully, that will detect the pcie errors
and limit them to one.

Otherwise, folks here say RHEL-5 and pci hotplug are shaky at best.

Cheers,
Don

Comment 17 Vitaly 2012-08-08 16:33:54 UTC
There are no hotplug slots in our G7:

# modprobe acpiphp
FATAL: Error inserting acpiphp (/lib/modules/2.6.18-238.19.1.el5/kernel/drivers/pci/hotplug/acpiphp.ko): No such device

Comment 18 Don Zickus 2012-08-08 18:38:29 UTC
Hi Vitaly,

Sorry about that.  That is odd.  Can you reboot with 'debug' on the kernel command line and use the following command:

modprobe acpiphp debug=1

Hopefully that will stick debug messages in the dmesg output that will tell us why that driver is failing.

Cheers,
Don

Comment 19 Vitaly 2012-08-08 19:13:58 UTC
Because, as I previously said, there are no hotplug slots in this machine :)

acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
acpiphp_glue: Total 0 slots

Comment 20 Don Zickus 2012-08-08 20:34:44 UTC
Hi Vitaly,

Hmm.  You don't have a similar RHEL-6 box handy, do you?  It would be nice to see what that box is saying.  The current theory from talking with folks is that the acpi table parsing in RHEL-5 can't handle a G7 correctly.  A RHEL-6 box might give us a clue by telling us what it found in the acpi tables.

Then we could go look at the code and see what changes to bring back to RHEL-5.

Cheers,
Don

Comment 21 Vitaly 2012-08-08 20:59:18 UTC
I do have el6 on same box. lspci and dmesg are already attached.

Comment 22 Don Zickus 2012-08-10 14:48:07 UTC
Hi Vitaly,

Our hotplug developer just came back from vacation and basically said, this is not something we support in RHEL-5 (surprise hotplug).  He said if you unload the driver before removing the cable it might work.

We could, on our end, dig up an HP G7 box, duplicate the problem and figure out what patches are needed.  But those patches would not be accepted in RHEL-5 because we do not support this feature.

A few days ago I was under the impression that some of the hotplug drivers were blacklisted as to not imply we support them.  However, it seems your system does not use those drivers.  So it looks like something actually needs to be fixed.

I am sorry to say that I will have to close this bug out as WONTFIX.

Cheers,
Don

Comment 23 Tony Camuso 2012-08-10 23:05:03 UTC
Closed as WONTFIX