Bug 578905 - RHEL 5.3 on DL585 G6: testing NMI watchdog fails on bootup
Summary: RHEL 5.3 on DL585 G6: testing NMI watchdog fails on bootup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Don Zickus
QA Contact: Jan Tluka
URL:
Whiteboard:
Duplicates: 584547
Depends On:
Blocks: 593678 613667 640580 659816
 
Reported: 2010-04-01 17:43 UTC by Issue Tracker
Modified: 2018-11-14 20:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 613667 659816
Environment:
Last Closed: 2011-01-13 21:24:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Messages file from .2 debug kernel (316.74 KB, application/octet-stream)
2010-04-12 16:33 UTC, John Ruemker


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2011:0017 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update | 2011-01-13 10:37:42 UTC

Description Issue Tracker 2010-04-01 17:43:52 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2010-04-01 17:43:53 UTC
Event posted on 03-30-2010 02:32pm CDT by jruemker

Messages seen during bootup with OS Control power mode in the BIOS:

testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!

From engineering:

============================================
Ok, I am not entirely sure why those are happening, but it is the result of the cpu being slow somehow.  I see those sporadically.  It does _not_ indicate a hardware problem.  The 'testing NMI' was expecting the nmi to count at least 5 nmis before declaring it 'passing'.  In this case we only see it increment 2-3 (ie 1135->1138).  The cpus were supposed to be put into a tight loop to increment quickly but they are not for some reason.  As a result the nmi_watchdog disables itself and I think you get delayed nmi watchdog interrupts which causes the 'Dazed and confused' messages.  And 'no' I don't think this is related to the 'Hardware Error' we are seeing.  Just a coincidence.

So a bug can be filed against that problem if you want and I will look into it further.  As an ugly workaround, one might be able to boot with 'nmi_watchdog=0' and then enable the nmi watchdog from the console with 'echo 1 > /proc/sys/kernel/nmi_watchdog'.

I did notice one customer update his BIOS and the problem went away.  Not sure why or if the BIOS was doing something in the background which slows the cpu down.

So all bootup nmi watchdog warnings can be attributed to a software problem for now.  Once the system boots though, those nmis are for a different reason.
===================================
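
As a rough illustration of the check described in the quoted explanation (not the kernel's actual code): the boot-time test samples each CPU's NMI count, waits, samples again, and complains if too few NMIs arrived. The function names and the exact comparison below are assumptions for the sketch; the threshold of 5 comes from the explanation above.

```python
def nmi_appears_stuck(count_before: int, count_after: int, min_nmis: int = 5) -> bool:
    """Return True if fewer than min_nmis NMIs arrived during the test window.

    min_nmis=5 follows the "at least 5 nmis" wording in the explanation;
    the kernel's exact comparison may differ slightly.
    """
    return (count_after - count_before) < min_nmis


def stuck_warning(cpu: int, count_before: int, count_after: int) -> str:
    """Format a warning in the same shape as the boot message."""
    return f"WARNING: CPU#{cpu}: NMI appears to be stuck ({count_before}->{count_after})!"


if __name__ == "__main__":
    before, after = 1150, 1153  # counts taken from the reported boot message
    if nmi_appears_stuck(before, after):
        print(stuck_warning(0, before, after))
```

With the reported counts (1150->1153) only 3 NMIs arrived in the window, so the test declares the NMI stuck and the watchdog disables itself.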

Setting the power mode to 'Static - High' in the BIOS works around this issue. 
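
Besides the BIOS setting, the other workaround mentioned above (boot with 'nmi_watchdog=0', then enable the watchdog at runtime) could be scripted roughly as follows. This is a minimal sketch: the helper names are made up for illustration, it only wraps the same /proc/sys/kernel/nmi_watchdog write as the 'echo 1' command, and it needs root on a real system.

```python
from pathlib import Path

# Sysctl file named in the workaround; the helpers accept another path for testing.
NMI_WATCHDOG_SYSCTL = Path("/proc/sys/kernel/nmi_watchdog")


def read_nmi_watchdog(path=NMI_WATCHDOG_SYSCTL) -> int:
    """Return the current nmi_watchdog sysctl value as an integer."""
    return int(Path(path).read_text().strip())


def enable_nmi_watchdog(path=NMI_WATCHDOG_SYSCTL) -> None:
    """Equivalent of: echo 1 > /proc/sys/kernel/nmi_watchdog (requires root)."""
    Path(path).write_text("1\n")
```

Calling read_nmi_watchdog() before and after enable_nmi_watchdog() is a quick way to confirm that the write took effect.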

I will file a bug on this
This event sent from IssueTracker by streeter  [SEG - Kernel]
 issue 703233

Comment 2 Issue Tracker 2010-04-01 17:43:54 UTC
Event posted on 03-31-2010 08:54am CDT by jruemker

Problem Description
---------------------------------------------------
>> 1. Time and date of problem:

Ongoing

>> 2. System architecture(s):

x86_64 (HP DL 585 G6)

>> 3. Provide a clear and concise problem description as it is understood
at the time of escalation. 
>>   Observed behavior:

Occasionally when booting an HP DL 585 G6 on RHEL 5.3, they see a message
such as the following:

  kernel: testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be
stuck (1150->1153)!

In BZ 574083, Don Zickus provided the following explanation:

  Ok, I am not entirely sure why those are happening, but it is the result
of the cpu being slow somehow.  I see those sporadically.  It does _not_
indicate a hardware problem.  The 'testing NMI' was expecting the nmi to
count at least 5 nmis before declaring it 'passing'.  In this case we
only see it increment 2-3 (ie 1135->1138).  The cpus were supposed to be
put into a tight loop to increment quickly but they are not for some
reason.  As a result the nmi_watchdog disables itself and I think you get
delayed nmi watchdog interrupts which causes the 'Dazed and confused'
messages.  And 'no' I don't think this is related to the 'Hardware
Error' we are seeing.  Just a coincidence.

  So a bug can be filed against that problem if you want and I will look
into it further.  As an ugly workaround, one might be able to boot with
'nmi_watchdog=0' and then enable the nmi watchdog from the console with
'echo 1 > /proc/sys/kernel/nmi_watchdog'.

  I did notice one customer update his BIOS and the problem went away. 
Not sure why or if the BIOS was doing something in the background which
slows the cpu down.

  So all bootup nmi watchdog warnings can be attributed to a software
problem for now.  Once the system boots though, those nmis are for a
different reason.

The customer is able to work around the issue by setting the BIOS Power
Saving mode to "Static - High" (as opposed to OS Control), which
apparently prevents the CPU from throttling down.

>>   Desired behavior:

NMI Watchdog is successfully tested and remains enabled.

>> 4. Specific action requested of SEG:

Review information provided and determine root cause and permanent fix
that does not involve disabling power saving measures.

>> 5. Is a defect (bug) in the product suspected? yes/no

Possibly

>>   Bugzilla number (if one already exists):

None

>> 6. Does a proposed patch exist? yes/no

No

>> 7. What is the impact to the customer when they experience this
problem? 

NMI Watchdog is disabled

Supporting Information
------------------------------------------------------
>> 1. Other actions already taken in working the problem (tech-list
posting, google searches, fulltext search, consultation with another
engineer, etc.):

Talked to Don Zickus in another IT/BZ, found workaround 

>>   Relevant data found (if any):

"Static - High" mode works around issue

>> 2. Attach sosreport.

Sos attached (messages.2 shows issue)

>> 3. Attach other supporting data (if any).
>> 4. Provide issue reproduction information, including location and
access of reproducer machine, if available.
>>   Steps to reproduce the problem:

a. Enable "OS Control" mode in BIOS
b. Boot system
c. In *some* instances, NMI watchdog test fails

>> 5. Known hot-fix packages on the system:

None. 

>> 6. Customer applied changes from the last 30 days: 

None.



Issue escalated to Support Engineering Group by: jruemker.
Internal Status set to 'Waiting on SEG'


Comment 3 Issue Tracker 2010-04-01 17:43:56 UTC
Event posted on 04-01-2010 11:58am CDT by jruemker

I am raising the priority on this issue due to the impact mentioned on
today's call.  This is affecting the rollout of over 150 servers, as
changing the power saving mode to prevent CPU throttling is not an
acceptable long-term solution.

I'll keep you posted on any findings.

-John

Severity set to: High
Priority set to: 2


Comment 4 Issue Tracker 2010-04-01 17:45:03 UTC
Event posted on 04-01-2010 12:45pm CDT by streeter

Please note I created this as a public BZ, since it contains no
customer-specific data.



Comment 6 Don Zickus 2010-04-06 19:26:21 UTC
I created a fix for this issue.  You can download it from here.

http://people.redhat.com/dzickus/.bz578905/

Please let me know the results of any testing.

Cheers,
Don

Comment 20 John Ruemker 2010-04-12 16:33:09 UTC
Created attachment 406007
Messages file from .2 debug kernel

Here is their messages file from their tests with the kernel you provided Friday (.2).  Note that this morning we did discover they had installed and loaded the HP NMI watchdog and I suspected that was why we weren't getting any NMIs.  However they have since removed it and are still seeing the same problem (0->0).  I've confirmed in their latest sosreport (in the IT if you want it) that they are not loading it anymore.  nmi_watchdog is also set to 1 on the kernel command line.  

Also note that unfortunately they did not clearly label which boot sequence went with which Power mode.  They assure us that the order was Dynamic, OS Control, High Static - it's just that some of them were run multiple times, or had a boot with the standard kernel in between.  If you need these tests run again with a more clear correlation between each set of messages and the mode, let me know and I can see if they'll do the test again.
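
For anyone going through a messages file like the attached one, a small helper along these lines can pull out every failed NMI watchdog test and the counts involved. It is only a sketch: the regular expression is modelled on the warning quoted earlier in this bug, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the warning quoted in this bug, for example:
#   WARNING: CPU#0: NMI appears to be stuck (1150->1153)!
STUCK_RE = re.compile(r"CPU#(\d+): NMI appears to be stuck \((\d+)->(\d+)\)!")


def find_stuck_nmi_warnings(lines):
    """Return (cpu, before, after, delta) for each matching log line."""
    results = []
    for line in lines:
        match = STUCK_RE.search(line)
        if match:
            cpu, before, after = (int(group) for group in match.groups())
            results.append((cpu, before, after, after - before))
    return results
```

Passing an open file object (for example the messages file) as `lines` works, since the function only iterates over it. A delta of 0, as in the 0->0 case above, means no NMIs arrived at all during the test window.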

Thanks,
john

Comment 21 Don Zickus 2010-04-12 18:39:22 UTC
Actually could you remove the nmi_watchdog=1 and re-test.  nmi_watchdog=1 doesn't do what you think it does.  It sets the nmi_watchdog to use the deprecated IOAPIC interface.  nmi_watchdog=2 uses the default LAPIC interface.

Though to be honest, nmi_watchdog=1 should work correctly; it just takes a different code path.
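
To make the distinction concrete, here is a small sketch that reports which watchdog mode a given kernel command line selects. The mapping (0 = disabled, 1 = IO-APIC, 2 = local APIC) follows the description above; the function name is hypothetical, and what the kernel picks when the option is absent is left to the kernel's own default.

```python
# Mapping of nmi_watchdog= values as described in the comment above.
NMI_WATCHDOG_MODES = {0: "disabled", 1: "IO-APIC", 2: "local APIC"}


def nmi_watchdog_mode(cmdline: str):
    """Return the mode selected on the command line, or None if not specified.

    If the option appears more than once, the last occurrence is used.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
            try:
                mode = NMI_WATCHDOG_MODES.get(int(value), "unknown")
            except ValueError:
                mode = "unknown"
    return mode
```

On a running system the string to pass in would be the contents of /proc/cmdline.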

Cheers,
Don

Comment 22 Don Zickus 2010-04-20 17:51:56 UTC
ping?

Cheers,
Don

Comment 23 John Ruemker 2010-04-20 18:02:35 UTC
Sorry Don,
Not sure what happened there.  I had updated IT and thought I told it to send to BZ, but I guess not. 

The customer tested the kernel (.2) without nmi_watchdog and it did correct the issue.  The nmi_watchdog test completed successfully every time.  At this point I think that confirms your fix did what we had hoped.
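
To double-check on a booted system that NMIs keep arriving, one option is to sample the "NMI:" line of /proc/interrupts twice and compare the per-CPU counts. The helpers below are a sketch with made-up names. Note that with the local APIC watchdog an idle CPU may legitimately show little or no increase, so a small delta on its own is not proof of a problem.

```python
def parse_nmi_counts(interrupts_text: str):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # stop at a trailing description, if any
                counts.append(int(field))
            return counts
    return []


def nmi_deltas(before, after):
    """Return the per-CPU increase between two samples."""
    return [a - b for b, a in zip(before, after)]
```

A typical use would be to read /proc/interrupts, sleep for a few seconds, read it again, and pass both parsed samples to nmi_deltas().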

Let me know if you need anything else.  

Thanks!
John

Comment 24 Don Zickus 2010-04-20 18:43:16 UTC
Ok thanks.  I can't say I am entirely sure why removing nmi_watchdog=1 did the trick, but I am glad the problem is now gone.

I'll post something for 5.6

Cheers,
Don

Comment 25 Don Zickus 2010-05-10 18:46:24 UTC
*** Bug 584547 has been marked as a duplicate of this bug. ***

Comment 35 RHEL Program Management 2010-11-22 19:19:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 39 Don Zickus 2010-12-03 19:18:03 UTC
Ok, so HP's Magny-Cours boxes have issues with their performance counters.  Using perf counter 1 instead of perf counter 0 resolves the issue.  
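
For context, a sketch of what "counter 1 instead of counter 0" means at the register level, assuming the legacy AMD K7-style MSR layout (four counters, event-select registers starting at 0xC0010000 and counter registers starting at 0xC0010004). This only illustrates the addressing; it is not taken from the actual patch, and the function name is hypothetical.

```python
# Base MSR addresses of the legacy AMD performance counters (K7-style layout).
MSR_K7_EVNTSEL0 = 0xC0010000  # event-select register for counter 0
MSR_K7_PERFCTR0 = 0xC0010004  # count register for counter 0
NUM_COUNTERS = 4              # assumed number of legacy counters


def perfctr_msrs(index: int):
    """Return (event_select_msr, counter_msr) for the given counter index."""
    if not 0 <= index < NUM_COUNTERS:
        raise ValueError(f"counter index must be in 0..{NUM_COUNTERS - 1}")
    return MSR_K7_EVNTSEL0 + index, MSR_K7_PERFCTR0 + index


if __name__ == "__main__":
    for idx in (0, 1):
        evntsel, perfctr = perfctr_msrs(idx)
        print(f"counter {idx}: EVNTSEL={evntsel:#x} PERFCTR={perfctr:#x}")
```

Under that assumption, moving the watchdog from counter 0 to counter 1 just shifts both register addresses by one, which would sidestep anything else (such as the BIOS) that touches counter 0.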

I am working with HP to determine if this is a BIOS issue (BIOS using perf counters and forgetting to copy the registers back) or an AMD chip problem.

I am going to put this bz back to ON_QA and clone this bug to track the HP problem separately.

Can QE re-test this bz with different machines?

Cheers,
Don

Comment 41 errata-xmlrpc 2011-01-13 21:24:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

