| Summary: | Kernel can't recover upon receiving IOCK NMI. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Vitaly <v.mayatskih> |
| Component: | kernel | Assignee: | Don Zickus <dzickus> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.7 | CC: | jfeeney, prarit, tcamuso |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-08-10 23:05:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Vitaly
2011-11-17 01:18:24 UTC
Hi Vitaly,

By default RHEL-5 doesn't do anything with AER, which means the OS probably isn't clearing the NMI, thus causing an endless loop of NMI IOCK errors. In RHEL-6 we handle all this correctly, which is probably why you only see this once.

Try booting with the kernel option 'aer'. This enables the support in RHEL-5 and will hopefully handle this the way you intended.

Cheers,
Don

No, aer didn't help. In fact, any failure in a PCI-Express device will bring a DL380 G7 + RHEL-5 system down. That's not good.

Created attachment 535191 [details]
"lspci -vvv" before NMI on RHEL-6
Created attachment 535192 [details]
"lspci -vvv" after NMI on RHEL-6
Hi Vitaly,

Can you send me the whole 'dmesg' output, so I can see the AER output too?

The IOCK NMI is most likely coming from the HP iLO. We have a bz opened to support that properly for RHEL-6. All the fix really does is record and reboot the machine when it detects an IOCK NMI.

I can probably hack up something similar for RHEL-5, though I am not entirely sure it will be accepted this late in the RHEL-5 cycle.

Cheers,
Don

Hi Tony,

Can you give me your thoughts on this bz? Does HP support this? Is this the iLO acting up again?

Cheers,
Don

Created attachment 602182 [details]
dmesg after IOCK NMI on RHEL-6
I can't get dmesg on RHEL-5, because it dies in an endless loop. Here's the dmesg captured on RHEL-6. I don't see any sign of AER.
(In reply to comment #6)
> Hi Tony,
>
> Can you give me your thoughts on this bz? Does HP support this? Is this
> the iLO acting up again?
>
> Cheers,
> Don

We need a screen shot.

Vitaly, try ssh from a terminal window with a deep screen buffer to the iLO ...

ssh Administrator.whatever

... which connects to the Virtual Serial Console. You will need to edit grub to send output to the serial port, so boot normally first. When you have the Virtual Serial Port working, then try your experiment. You should be able to capture all messages in the terminal window's scroll buffer.

Created attachment 602189 [details]
IOCK NMI message on RHEL-5
I see this message printed in a loop.
Vitaly, we must see all the information leading up to that point. Seeing the stack trace does not give us enough information. Please follow the instructions I listed above and give us all the screen output from the beginning of boot until you get the NMI. You will need a deep terminal buffer, say, 10000 lines.

The last file is a copy-paste from iLO/VSP. There's nothing interesting prior to the IOCK NMI. I can attach the boot log or dmesg if you want.

Vitaly,

The IOCK NMI is most likely coming from the iLO. The question is why. Providing the boot log or dmesg might be able to give us a clue.

Are you suggesting the iLO logs do not provide that information? I understand you can't get it from the console because of the never-ending stream of IOCK NMIs, but Tony was hoping the iLO would capture the serial stream. This is the output we would like to see.

Cheers,
Don

Created attachment 602771 [details]
dmesg/el5
dmesg attached.
We have seen that the G5's iLO was triggering NMIs, but that is not the case with the G7. At least the first interrupt is triggered by the PCI Express root port (to which our device is attached), and when we block error reporting (DevCtl register), no more NMIs occur.
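For readers unfamiliar with the DevCtl workaround mentioned above, here is a minimal sketch of what "blocking error reporting" amounts to at the register level. It assumes the standard PCI Express layout, where Device Control sits at offset 0x08 of the PCIe capability and its low four bits enable correctable, non-fatal, fatal, and unsupported-request reporting. The thread does not say which device was reprogrammed or with what tool, so the function names and the bus address used below are hypothetical, and the `setpci` line is only one possible way to apply the mask.

```python
# Error-reporting enable bits in the PCI Express Device Control register
# (offset 0x08 within the PCIe capability), per the PCIe base layout.
PCI_EXP_DEVCTL_CERE = 0x0001   # Correctable Error Reporting Enable
PCI_EXP_DEVCTL_NFERE = 0x0002  # Non-Fatal Error Reporting Enable
PCI_EXP_DEVCTL_FERE = 0x0004   # Fatal Error Reporting Enable
PCI_EXP_DEVCTL_URRE = 0x0008   # Unsupported Request Reporting Enable

ERROR_REPORTING_MASK = (
    PCI_EXP_DEVCTL_CERE
    | PCI_EXP_DEVCTL_NFERE
    | PCI_EXP_DEVCTL_FERE
    | PCI_EXP_DEVCTL_URRE
)


def block_error_reporting(devctl: int) -> int:
    """Return the 16-bit DevCtl value with all four reporting bits cleared.

    Hypothetical helper: every other field (payload size, etc.) is kept.
    """
    return devctl & ~ERROR_REPORTING_MASK & 0xFFFF


def setpci_command(bdf: str) -> str:
    """Build a setpci call that clears only the four enable bits.

    Uses setpci's value:mask form; 'bdf' is whatever bus:dev.fn the
    affected port or device has on the machine (an assumption here).
    """
    return f"setpci -s {bdf} CAP_EXP+8.w=0000:{ERROR_REPORTING_MASK:04x}"


if __name__ == "__main__":
    print(hex(block_error_reporting(0x2037)))
    print(setpci_command("00:07.0"))  # hypothetical address
```

Clearing these bits stops the device from sending error messages upstream, which matches the observation that no further NMIs occur, but it also hides genuine errors, so it is a diagnostic aid rather than a fix.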
Hi Vitaly,

I am just trying to make sure I understand the scenario here: You boot the system, hot-unplug a pcie cable and then you get flooded with NMI IOCK messages, correct? You are expecting only one NMI in that situation, right?

Cheers,
Don

That's right.

Hi Vitaly,

Before you hot unplug your device, can you run

modprobe acpiphp

This is supposed to attach to the root bridge and handle hotplug events. Hopefully, that will detect the pcie errors and limit them to one. Otherwise, folks here say RHEL-5 and pci hotplug are shaky at best.

Cheers,
Don

There are no hotplug slots in our G7:

# modprobe acpiphp
FATAL: Error inserting acpiphp (/lib/modules/2.6.18-238.19.1.el5/kernel/drivers/pci/hotplug/acpiphp.ko): No such device

Hi Vitaly,

Sorry about that. That is odd. Can you reboot with 'debug' on the kernel command line and use the following command?

modprobe acpiphp debug=1

Hopefully that will stick debug messages in the dmesg output that will tell us why that driver is failing.

Cheers,
Don

Because, as I previously said, there are no hotplug slots in this machine :)

acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
acpiphp_glue: Total 0 slots

Hi Vitaly,

Hmm. You don't have a similar RHEL-6 box handy, do you? It would be nice to see what that box is saying. The current theory from talking with folks is that the acpi table parsing in RHEL-5 can't handle a G7 correctly.

A RHEL-6 box might give us a clue by telling us what it found in the acpi tables. Then we could go look at the code and see what changes to bring back to RHEL-5.

Cheers,
Don

I do have el6 on the same box. lspci and dmesg are already attached.

Hi Vitaly,

Our hotplug developer just came back from vacation and basically said this is not something we support in RHEL-5 (surprise hotplug). He said if you unload the driver before removing the cable it might work.

We could on our end dig up an HP G7 box, duplicate the problem and figure out what patches are needed. But those patches would not be accepted in RHEL-5 because we do not support this feature.

A few days ago I was under the impression that some of the hotplug drivers were blacklisted so as not to imply we support them. However, it seems your system does not use those drivers. So it looks like something actually needs to be fixed.

I am sorry to say that I will have to close this bug out as WONTFIX.

Cheers,
Don

Closed as WONTFIX