Bug 504968
Summary: | testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (177->177)! | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Qian Cai <qcai>
Component: | kernel | Assignee: | Prarit Bhargava <prarit>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe>
Severity: | medium | Docs Contact: |
Priority: | low | |
Version: | 5.3 | CC: | dzickus, jfeeney, syeghiay, yugzhang
Target Milestone: | rc | Keywords: | Reopened
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2010-10-06 13:48:10 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 533192 | |
Attachments: | | |
Description
Qian Cai
2009-06-10 08:55:41 UTC
Not a regression, pushing to 5.5.

P.

Created attachment 347987
acpidump from system

I did notice this bit of craziness in the boot log:

ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1])
acpi_table_parse_entries: here

acpi_id of 256? That doesn't seem right.

P.

According to the spec:

"A value of 0xFF signifies that this applies to all processors in the machine."

So this seems like a red herring.

Prarit, this is an x86_64 system, so I suppose it should autodetect the working NMI setting?

I forgot to enter this data in.

The reason the NMI is firing is because of code 0x20 or 0x30. This system uses the ICH7 chipset. According to the specification the above code means:

Timer Counter 2 OUT Status (TMR2_OUT_STS) - RO. This bit reflects the current state of the 8254 counter 2 output. Counter 2 must be programmed following any PCI reset for this bit to have a determinate value. When writing to port 61h, this bit must be a 0.

i.e., Timer Counter 2 has fired. But, AFAICT, the Timer Counter has been disabled (bit 0 in the code above). I have tried to write the register with bit 0 = 0 but this has no impact.

P.

(In reply to comment #8)
> Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> NMI setting?

Uh, I'm not sure what you mean? Are you asking if there is a way to default back to the IOAPIC if the LAPIC NMI fails?

P.

Every 36 seconds the system receives a burst of NMIs (reason code 0x20) on CPU 0. So timer 2 is running and being reset.

P.

(In reply to comment #10)
> (In reply to comment #8)
> > Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> > NMI setting?
>
> Uh, I'm not sure what you mean? Are you asking if there is a way to default
> back to the IOAPIC if the LAPIC NMI fails?
>
> P.

Yes

Cai, x86_64 autodetects which NMI to use, normally LAPIC, but if the system is old enough, it will fall back to IOAPIC.

Prarit, my experience is the reason codes NMI gives you are misleading and useless as NMIs can happen for a variety of reasons, some which do _not_ come from the timer (IPIs for example). This is why it is important that the drivers that register on the DIE chain be smart enough to figure out if they are causing the NMI or not.

closing because the machine in question doesn't exist in its entirety any more. Therefore it will be difficult to troubleshoot.

I met with a similar problem when testing rhel-4.8.z.

https://beaker.engineering.redhat.com/recipes/41385

"
...
Total of 16 processors activated (90740.10 BogoMIPS).
..MP-BIOS bug: 8254 timer not connected to IO-APIC
failed.
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
works.
Using local APIC timer interrupts.
Detected 10.425 MHz APIC timer.
checking TSC synchronization across 16 CPUs: passed.
Brought up 16 CPUs
time.c: Using PIT/HPET based timekeeping.
testing NMI watchdog ... CPU#0: NMI appears to be stuck (0)!
checking if image is initramfs... it is
NET: Registered protocol family 16
PCI: Using configuration type 1
...
"

This is a RHEL-5 bz. Please clone it to RHEL-4 if you are able to duplicate the problem there.

Thanks,
Don
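For anyone reading the reason codes quoted earlier in this bug (0x20 and 0x30, read from port 61h), here is a minimal sketch of how the byte splits into bits. Only TMR2_OUT_STS and the "bit 0 disables the timer counter" remark come from the comments above; the other bit names are an assumption based on the ICH7 datasheet's NMI Status and Control register description, and `decode_nmi_reason` is a hypothetical helper, not kernel code. As noted above, the reason byte alone may not identify the real NMI source.

```python
# Assumed bit layout of the NMI Status and Control register (I/O port 61h).
# Only TMR2_OUT_STS (bit 5) is quoted in this bug; the remaining names are an
# assumption taken from the ICH7 datasheet and should be checked against it.
NMI_SC_BITS = {
    7: "SERR_NMI_STS",
    6: "IOCHK_NMI_STS",
    5: "TMR2_OUT_STS",
    4: "REF_TOGGLE",
    3: "IOCHK_NMI_EN",
    2: "PCI_SERR_EN",
    1: "SPKR_DAT_EN",
    0: "TIM_CNT2_EN",
}


def decode_nmi_reason(reason):
    """Return the names of the bits set in a reason byte, highest bit first."""
    if not 0 <= reason <= 0xFF:
        raise ValueError("reason must fit in one byte")
    return [
        name
        for bit, name in sorted(NMI_SC_BITS.items(), reverse=True)
        if reason & (1 << bit)
    ]


if __name__ == "__main__":
    # The two reason codes reported in this bug.
    for code in (0x20, 0x30):
        print(f"{code:#04x}: {', '.join(decode_nmi_reason(code))}")
```

Under this layout both codes have bit 5 set and bit 0 clear, which matches the observation that counter 2's output is asserted even though the counter enable bit reads as 0.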