Bug 504968

Summary: testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (177->177)!
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: low
Target Milestone: rc
Keywords: Reopened
Reporter: Qian Cai <qcai>
Assignee: Prarit Bhargava <prarit>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: dzickus, jfeeney, syeghiay, yugzhang
Doc Type: Bug Fix
Last Closed: 2010-10-06 13:48:10 UTC
Bug Blocks: 533192
Attachments: acpidump from system (flags: none)

Description Qian Cai 2009-06-10 08:55:41 UTC
Description of problem:
Neither the -128.el5 nor the -152.el5 kernel appears to have a functional NMI watchdog on dell380-2.rhts.bos.redhat.com:

...
CPU1: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
SMP alternatives: switching to SMP code
Booting processor 2/4 APIC 0x1
Initializing CPU#2
Uhhuh. NMI received for unknown reason 20.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Calibrating delay using timer specific routine.. 6384.06 BogoMIPS (lpj=3192034)
CPU: Trace cache: 12K uops
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU2: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
SMP alternatives: switching to SMP code
Booting processor 3/4 APIC 0x3
Initializing CPU#3
Calibrating delay using timer specific routine.. 6384.06 BogoMIPS (lpj=3192031)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU3: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
Brought up 4 CPUs
testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (177->177)!
...

Version-Release number of selected component (if applicable):
kernel-2.6.18-152.el5
kernel-2.6.18-128.el5

How reproducible:
always

Steps to Reproduce:
1. Boot the machine.

Actual results:
NMI appears to be stuck

Expected results:
NMI is working.

Additional info:
Full boot log of the -152.el5 kernel:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8480656
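
For readers unfamiliar with the "(177->177)" notation in the warning: the boot-time test samples each CPU's NMI count, waits, samples again, and complains if the count has not advanced. The following is a minimal sketch of that comparison, not the kernel code itself. The function name, the min_delta threshold of 5, and the CPU 1 numbers are assumptions made for illustration; only the 177->177 pair for CPU 0 comes from the log above.

```python
def stuck_cpus(before, after, min_delta=5):
    """Return (cpu, before, after) for each CPU whose NMI count barely moved.

    min_delta is an assumed threshold: a CPU is reported as stuck when its
    count grew by min_delta or less between the two samples.
    """
    return [
        (cpu, start, after[cpu])
        for cpu, start in sorted(before.items())
        if after[cpu] - start <= min_delta
    ]


# CPU 0 values are taken from the boot log; CPU 1 values are hypothetical.
before = {0: 177, 1: 12}
after = {0: 177, 1: 30}

for cpu, start, end in stuck_cpus(before, after):
    print(f"CPU#{cpu}: NMI appears to be stuck ({start}->{end})!")
```

With these inputs only CPU 0 is reported, because its count did not change at all during the wait.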

Comment 1 Prarit Bhargava 2009-06-10 13:37:52 UTC
Not a regression, pushing to 5.5.

P.

Comment 2 Prarit Bhargava 2009-06-15 19:06:06 UTC
Created attachment 347987 [details]
acpidump from system

Comment 3 Prarit Bhargava 2009-06-15 19:06:55 UTC
I did notice this bit of craziness in the boot log:

ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1])
acpi_table_parse_entries: here


acpi_id of 255 (0xff)?  That doesn't seem right.

P.

Comment 4 Matthew Garrett 2009-06-15 20:01:00 UTC
According to the spec:

"A value of 0xFF signifies that this applies to all processors
in the machine."

So this seems like a red herring.
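
To make the 0xFF case concrete, here is a small sketch that decodes a MADT Local APIC NMI entry (type 4, length 6) the way the boot message summarizes it. The six raw bytes are not taken from the attached acpidump; they are a hypothetical entry constructed to match the "acpi_id[0xff] high level lint[0x1]" line, and the function name is illustrative.

```python
import struct

# Two-bit MPS INTI flag encodings for polarity and trigger mode.
POLARITY = ("dfl", "high", "res", "low")
TRIGGER = ("dfl", "edge", "res", "level")


def parse_lapic_nmi(entry: bytes) -> dict:
    """Decode one MADT Local APIC NMI structure (type 4, length 6)."""
    etype, length, acpi_id, flags, lint = struct.unpack("<BBBHB", entry)
    if etype != 4 or length != 6:
        raise ValueError("not a Local APIC NMI entry")
    return {
        "acpi_id": acpi_id,
        "all_processors": acpi_id == 0xFF,  # 0xFF means "every processor"
        "polarity": POLARITY[flags & 0x3],
        "trigger": TRIGGER[(flags >> 2) & 0x3],
        "lint": lint,
    }


# Hypothetical bytes: type=4, len=6, id=0xff, flags=0x000d, LINT#=1.
info = parse_lapic_nmi(bytes([0x04, 0x06, 0xFF, 0x0D, 0x00, 0x01]))
print(info)
```

An acpi_id of 0xFF is therefore a wildcard, not a processor numbered 255.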

Comment 8 Qian Cai 2009-06-16 21:17:21 UTC
Prarit, this is an x86_64 system, so I suppose it should autodetect the working NMI setting?

Comment 9 Prarit Bhargava 2009-06-17 12:25:32 UTC
I forgot to enter this data in.

The NMI is being reported with reason code 0x20 or 0x30.

This system uses the ICH7 chipset.  According to the specification, bit 5 (0x20) of that code means:

  Timer Counter 2 OUT Status (TMR2_OUT_STS) — RO. This bit reflects the current state of the 8254 counter 2 output. Counter 2 must be programmed following any PCI reset for this bit to have a determinate value. When writing to port 61h, this bit must be a 0.

i.e., Timer Counter 2 has fired.

But, AFAICT, Timer Counter 2 has been disabled (bit 0 of the code above is 0).

I have tried writing the register with bit 0 = 0, but this has no effect.

P.
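
As a worked example of the reading above, the sketch below splits a reason byte into named bits of port 61h. The bit names follow my recollection of the ICH7 datasheet's NMI Status and Control register and should be checked against it. The 0xC0 mask reflects my understanding that the handler of this era treats only bits 7 and 6 as recognized NMI sources, which would explain the "unknown reason 20" message. The function name is hypothetical.

```python
# Port 61h bit names (assumed from the ICH7 datasheet, verify before relying on them).
PORT61_BITS = {
    7: "SERR#_NMI_STS",
    6: "IOCHK#_NMI_STS",
    5: "TMR2_OUT_STS",
    4: "REF_TOGGLE",
    3: "IOCHK_NMI_EN",
    2: "PCI_SERR_EN",
    1: "SPKR_DAT_EN",
    0: "TIM_CNT2_EN",
}


def decode_port61(reason: int) -> dict:
    """Name the bits set in a reason byte, highest bit first."""
    set_bits = [
        name
        for bit, name in sorted(PORT61_BITS.items(), reverse=True)
        if reason & (1 << bit)
    ]
    return {
        "set_bits": set_bits,
        # Assumption: only SERR# (0x80) and IOCHK# (0x40) count as known sources.
        "known_nmi_source": bool(reason & 0xC0),
    }


for reason in (0x20, 0x30):
    print(hex(reason), decode_port61(reason))
```

Under these assumptions 0x20 is the counter 2 output bit alone and 0x30 adds the refresh toggle bit; in both values bit 0 (counter 2 enable) is clear and neither recognized NMI source bit is set.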

Comment 10 Prarit Bhargava 2009-06-17 12:33:04 UTC
(In reply to comment #8)
> Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> NMI setting?  

Uh, I'm not sure what you mean?  Are you asking if there is a way to default back to the IOAPIC if the LAPIC NMI fails?

P.

Comment 11 Prarit Bhargava 2009-06-17 13:06:01 UTC
Every 36 seconds the system receives a burst of NMIs (reason code 0x20) on CPU 0.

So timer 2 is running and being reset.

P.

Comment 12 Don Zickus 2009-06-17 14:04:46 UTC
(In reply to comment #10)
> (In reply to comment #8)
> > Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> > NMI setting?  
> 
> Uh, I'm not sure what you mean?  Are you asking if there is a way to default
> back to the IOAPIC if the LAPIC NMI fails?
> 
> P.  

Yes Cai, x86_64 autodetects which NMI to use, normally LAPIC, but if the system is old enough, it will fall back to IOAPIC.

Prarit, my experience is that the reason codes an NMI gives you are misleading and useless, as NMIs can happen for a variety of reasons, some of which do _not_ come from the timer (IPIs, for example).  This is why it is important that the drivers that register on the DIE chain be smart enough to figure out whether they are causing the NMI or not.
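
The point about handlers on the DIE chain can be illustrated with a toy model. This is a sketch of the idea only, not the kernel's notifier API: the Notify enum, the handler names, and the "perfctr_overflow" event field are all hypothetical. Each handler inspects its own hardware state and claims the NMI only if it caused it; an NMI that nobody claims falls through to the "unknown reason" path.

```python
from enum import Enum


class Notify(Enum):
    DONE = "not mine, keep walking the chain"
    STOP = "mine, handled"


def dispatch_nmi(handlers, event):
    """Walk the chain in order; return the name of the first handler that claims the NMI."""
    for handler in handlers:
        if handler(event) is Notify.STOP:
            return handler.__name__
    return None  # nobody claimed it: the "unknown reason" case


def watchdog_handler(event):
    # Well-behaved: claims the NMI only when its own counter overflowed.
    return Notify.STOP if event.get("perfctr_overflow") else Notify.DONE


def greedy_handler(event):
    # Badly behaved: claims every NMI and hides the real source.
    return Notify.STOP


print(dispatch_nmi([watchdog_handler], {"perfctr_overflow": True}))
print(dispatch_nmi([watchdog_handler], {}))
print(dispatch_nmi([greedy_handler, watchdog_handler], {}))
```

The last call shows why a handler that does not check its own state is harmful: it swallows NMIs that belong to someone else.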

Comment 18 Don Zickus 2010-04-22 15:55:11 UTC
Closing because the machine in question no longer exists in its entirety, so troubleshooting would be difficult.

Comment 19 Igor Zhang 2010-10-06 05:28:16 UTC
I hit a similar problem when testing rhel-4.8.z:
https://beaker.engineering.redhat.com/recipes/41385

"
...
Total of 16 processors activated (90740.10 BogoMIPS).
..MP-BIOS bug: 8254 timer not connected to IO-APIC
 failed.
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
 works.
Using local APIC timer interrupts.
Detected 10.425 MHz APIC timer.
checking TSC synchronization across 16 CPUs: passed.
Brought up 16 CPUs
time.c: Using PIT/HPET based timekeeping.
testing NMI watchdog ... CPU#0: NMI appears to be stuck (0)!
checking if image is initramfs... it is
NET: Registered protocol family 16
PCI: Using configuration type 1
...
"

Comment 20 Don Zickus 2010-10-06 13:48:10 UTC
This is a RHEL-5 bz.  Please clone it to RHEL-4 if you are able to duplicate the problem there.

Thanks,
Don