Bug 504968 - testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (177->177)!
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
Keywords: Reopened
Blocks: 533192
Reported: 2009-06-10 04:55 EDT by CAI Qian
Modified: 2013-01-10 03:00 EST
Doc Type: Bug Fix
Last Closed: 2010-10-06 09:48:10 EDT
Attachments
acpidump from system (60.22 KB, text/plain)
2009-06-15 15:06 EDT, Prarit Bhargava
Description CAI Qian 2009-06-10 04:55:41 EDT
Description of problem:
Both the -128.el5 and -152.el5 kernels seem to have no functional NMI on dell380-2.rhts.bos.redhat.com

...
CPU1: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
SMP alternatives: switching to SMP code
Booting processor 2/4 APIC 0x1
Initializing CPU#2
Uhhuh. NMI received for unknown reason 20.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Calibrating delay using timer specific routine.. 6384.06 BogoMIPS (lpj=3192034)
CPU: Trace cache: 12K uops
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU2: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
SMP alternatives: switching to SMP code
Booting processor 3/4 APIC 0x3
Initializing CPU#3
Calibrating delay using timer specific routine.. 6384.06 BogoMIPS (lpj=3192031)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU3: Thermal monitoring enabled (TM1)
              Intel(R) Pentium(R) D CPU 3.20GHz stepping 04
Brought up 4 CPUs
testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (177->177)!
...

Version-Release number of selected component (if applicable):
kernel-2.6.18-152.el5
kernel-2.6.18-128.el5

How reproducible:
always

Steps to Reproduce:
1. Boot the machine.

Actual results:
NMI appears to be stuck

Expected results:
NMI is working.

Additional info:
Full boot log of the -152.el5 kernel:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8480656
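
For readers unfamiliar with the "appears to be stuck (177->177)" message: the boot-time self-test samples a per-CPU NMI counter, waits, and samples it again; a CPU whose counter barely moved is reported as stuck. The sketch below illustrates that comparison only. The function name, the threshold of 5, and the counts for CPUs 1 to 3 are assumptions made for illustration; only the CPU#0 values (177 and 177) come from the log in this report.

```python
# Illustrative model of the boot-time NMI watchdog self-test.
# STUCK_THRESHOLD is an assumption, not a value taken from this report.
STUCK_THRESHOLD = 5


def check_nmi_watchdog(before, after, threshold=STUCK_THRESHOLD):
    """Return a warning line for each CPU whose NMI count advanced by <= threshold."""
    warnings = []
    for cpu, (prev, now) in enumerate(zip(before, after)):
        if now - prev <= threshold:
            warnings.append(
                "CPU#%d: NMI appears to be stuck (%d->%d)!" % (cpu, prev, now)
            )
    return warnings


# CPU 0 uses the values from the report; the other counts are made up.
before = [177, 3, 4, 2]
after = [177, 40, 41, 39]
for line in check_nmi_watchdog(before, after):
    print(line)
```

Under these assumed inputs only CPU#0 is flagged, since its counter did not advance at all between the two samples.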
Comment 1 Prarit Bhargava 2009-06-10 09:37:52 EDT
Not a regression, pushing to 5.5.

P.
Comment 2 Prarit Bhargava 2009-06-15 15:06:06 EDT
Created attachment 347987
acpidump from system
Comment 3 Prarit Bhargava 2009-06-15 15:06:55 EDT
I did notice this bit of craziness in the boot log:

ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1])
acpi_table_parse_entries: here


acpi_id of 255?  That doesn't seem right.

P.
Comment 4 Matthew Garrett 2009-06-15 16:01:00 EDT
According to the spec:

"A value of 0xFF signifies that this applies to all processors
in the machine."

So this seems like a red herring.
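
To make the 0xFF case concrete, here is a minimal sketch that decodes a MADT Local APIC NMI entry, assuming the ACPI layout of type (1 byte), length (1 byte), ACPI processor ID (1 byte), flags (2 bytes, little-endian), and LINT# (1 byte). The sample bytes are hand-built to match the boot log line above; they are not taken from the attached acpidump, and `decode_lapic_nmi` is a hypothetical helper name.

```python
import struct

# MPS INTI flag encodings (2 bits each) as used in MADT entries.
POLARITY = {0: "conforming", 1: "high", 3: "low"}
TRIGGER = {0: "conforming", 1: "edge", 3: "level"}


def decode_lapic_nmi(entry):
    """Decode a 6-byte MADT Local APIC NMI structure (entry type 4)."""
    etype, length, acpi_id, flags, lint = struct.unpack("<BBBHB", entry)
    if etype != 4 or length != 6:
        raise ValueError("not a Local APIC NMI entry")
    return {
        "acpi_id": acpi_id,
        # 0xFF means the entry applies to every processor in the machine.
        "all_processors": acpi_id == 0xFF,
        "polarity": POLARITY.get(flags & 0x3, "reserved"),
        "trigger": TRIGGER.get((flags >> 2) & 0x3, "reserved"),
        "lint": lint,
    }


# Hand-built entry matching "acpi_id[0xff] high level lint[0x1]".
sample = bytes([4, 6, 0xFF, 0x0D, 0x00, 1])
print(decode_lapic_nmi(sample))
```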
Comment 8 CAI Qian 2009-06-16 17:17:21 EDT
Prarit, this is an x86_64 system, so I suppose it should autodetect the working NMI setting?
Comment 9 Prarit Bhargava 2009-06-17 08:25:32 EDT
I forgot to enter this data in.

The NMI is firing with reason code 0x20 or 0x30.

This system uses the ICH7 chipset.  According to the specification, bit 5 of that code (set in both 0x20 and 0x30) means:

  Timer Counter 2 OUT Status (TMR2_OUT_STS) — RO. This bit reflects the current state of the 8254 counter 2 output. Counter 2 must be programmed following any PCI reset for this bit to have a determinate value. When writing to port 61h, this bit must be a 0.

i.e., Timer Counter 2 has fired.

But, AFAICT, Timer Counter 2 has been disabled (bit 0 in the code above is clear).

I have tried writing the register with bit 0 = 0, but this had no effect.

P.
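
The bit arithmetic above can be checked with a small decoder for the value read from port 61h. This is a sketch, not chipset documentation: only TMR2_OUT_STS is named in this bug, and the remaining bit names are assumptions recalled from the ICH7 NMI status and control register description. The 0xC0 mask for "known" reasons (bits 7 and 6) is likewise an assumption about how the kernel decides to print "unknown reason".

```python
# Bit layout of the port 61h value. Only TMR2_OUT_STS is quoted in this
# bug; the other names are assumptions and should be checked against the
# chipset datasheet.
NMI_SC_BITS = [
    (7, "SERR#_NMI_STS"),
    (6, "IOCHK_NMI_STS"),
    (5, "TMR2_OUT_STS"),
    (4, "REF_TOGGLE"),
    (3, "IOCHK_NMI_EN"),
    (2, "PCI_SERR_EN"),
    (1, "SPKR_DAT_EN"),
    (0, "TIM_CNT2_EN"),
]


def decode_nmi_sc(value):
    """Return the names of the bits set in a port 61h reason code."""
    return [name for bit, name in NMI_SC_BITS if value & (1 << bit)]


def is_unknown_reason(value):
    """Assumed rule: neither bit 7 nor bit 6 set means no known NMI source."""
    return (value & 0xC0) == 0


for code in (0x20, 0x30):
    print(hex(code), decode_nmi_sc(code), "unknown:", is_unknown_reason(code))
```

Both codes have bit 5 set and bit 0 clear, which matches the observation that the counter 2 output is high even though the counter enable bit reads as 0.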
Comment 10 Prarit Bhargava 2009-06-17 08:33:04 EDT
(In reply to comment #8)
> Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> NMI setting?  

Uh, I'm not sure what you mean.  Are you asking if there is a way to fall back to the IOAPIC if the LAPIC NMI fails?

P.
Comment 11 Prarit Bhargava 2009-06-17 09:06:01 EDT
Every 36 seconds the system receives a burst of NMIs (reason code 0x20) on CPU 0.

So timer 2 is running and being reset.

P.
Comment 12 Don Zickus 2009-06-17 10:04:46 EDT
(In reply to comment #10)
> (In reply to comment #8)
> > Prarit, this is an x86_64 system, so I suppose it should autodetect the working
> > NMI setting?  
> 
> Uh, I'm not sure what you mean?  Are you asking if there is a way to default
> back to the IOAPIC if the LAPIC NMI fails?
> 
> P.  

Yes, Cai: x86_64 autodetects which NMI source to use.  It normally picks the LAPIC, but on a sufficiently old system it falls back to the IOAPIC.

Prarit, my experience is that the reason codes the NMI gives you are misleading and useless, as NMIs can happen for a variety of reasons, some of which do _not_ come from the timer (IPIs for example).  This is why it is important that the drivers that register on the DIE chain be smart enough to figure out whether they are causing the NMI or not.
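
A toy model of that last point, assuming a simple chain in which each handler either claims the NMI or passes it on. The handler names, the event fields, and the numeric return values are hypothetical and chosen for illustration; this is not the kernel's notifier API.

```python
# Toy model of an NMI handler chain. Names and values are illustrative only.
NOTIFY_DONE = 0  # "not mine, keep walking the chain"
NOTIFY_STOP = 1  # "I caused this NMI, stop here"


def watchdog_handler(event):
    """Well-behaved handler: claims the NMI only if its own source fired."""
    if event.get("perfctr_overflow"):
        return NOTIFY_STOP
    return NOTIFY_DONE


def blind_handler(event):
    """Badly behaved handler: claims every NMI without checking its hardware."""
    return NOTIFY_STOP


def dispatch_nmi(chain, event):
    """Walk the chain in order; return the name of the handler that claimed the NMI."""
    for handler in chain:
        if handler(event) == NOTIFY_STOP:
            return handler.__name__
    return "unknown"


print(dispatch_nmi([watchdog_handler], {"perfctr_overflow": True}))
print(dispatch_nmi([watchdog_handler], {}))
print(dispatch_nmi([blind_handler, watchdog_handler], {"perfctr_overflow": True}))
```

In the third call the blind handler swallows an NMI that belonged to the watchdog, which is the failure mode a handler avoids by checking its own hardware state before claiming the event.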
Comment 18 Don Zickus 2010-04-22 11:55:11 EDT
Closing because the machine in question no longer exists in its entirety, so it will be difficult to troubleshoot.
Comment 19 Igor Zhang 2010-10-06 01:28:16 EDT
I hit a similar problem when testing rhel-4.8.z:
https://beaker.engineering.redhat.com/recipes/41385

"
...
Total of 16 processors activated (90740.10 BogoMIPS).
..MP-BIOS bug: 8254 timer not connected to IO-APIC
 failed.
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
 works.
Using local APIC timer interrupts.
Detected 10.425 MHz APIC timer.
checking TSC synchronization across 16 CPUs: passed.
Brought up 16 CPUs
time.c: Using PIT/HPET based timekeeping.
testing NMI watchdog ... CPU#0: NMI appears to be stuck (0)!
checking if image is initramfs... it is
NET: Registered protocol family 16
PCI: Using configuration type 1
...
"
Comment 20 Don Zickus 2010-10-06 09:48:10 EDT
This is a RHEL-5 bz.  Please clone it to RHEL-4 if you are able to duplicate the problem there.

Thanks,
Don
