Escalated to Bugzilla from IssueTracker
Event posted on 03-30-2010 02:32pm CDT by jruemker

Messages seen during bootup with OS Control power mode in the BIOS:

testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!

From engineering:
============================================
Ok, I am not entirely sure why those are happening, but it is the result of the cpu being slow somehow. I see those sporadically. It does _not_ indicate a hardware problem.

The 'testing NMI' was expecting the nmi to count at least 5 nmis before declaring it 'passing'. In this case we only see it increment 2-3 (ie 1135->1138). The cpus were supposed to be put into a tight loop to increment quickly but they are not for some reason. As a result the nmi_watchdog disables itself and I think you get delayed nmi watchdog interrupts which causes the 'Dazed and confused' messages.

And 'no' I don't think this is related to the 'Hardware Error' we are seeing. Just a coincidence. So a bug can be filed against that problem if you want and I will look into it further.

As an ugly workaround, one might be able to boot with 'nmi_watchdog=0' and then enable the nmi watchdog from the console with 'echo 1 > /proc/sys/kernel/nmi_watchdog'.

I did notice one customer update his BIOS and the problem went away. Not sure why or if the BIOS was doing something in the background which slows the cpu down.

So all bootup nmi watchdog warnings can be attributed to a software problem for now. Once the system boots though, those nmis are for a different reason.
===================================

Setting the power mode to 'Static - High' in the BIOS works around this issue. I will file a bug on this.

This event sent from IssueTracker by streeter [SEG - Kernel] issue 703233
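For readers following along, the boot-time self-test described in the engineering note can be modelled roughly as below. This is only a sketch in Python, not the kernel's code: the function name `check_nmi_counts` and the constant `MIN_NMIS` are made up for illustration, and treating the threshold of 5 as inclusive ("at least 5") is an assumption taken from the wording of the note. The real kernel comparison may differ by one.

```python
# Rough model of the boot-time NMI watchdog self-test described above.
# Names are hypothetical; the threshold comes from the engineering note.
MIN_NMIS = 5  # assumed inclusive: a CPU passes if it saw at least 5 NMIs


def check_nmi_counts(before, after, min_nmis=MIN_NMIS):
    """Compare per-CPU NMI counts sampled before and after the busy-wait.

    Returns one warning string per CPU whose counter advanced by fewer
    than min_nmis during the test window.
    """
    warnings = []
    for cpu, (start, end) in enumerate(zip(before, after)):
        if end - start < min_nmis:
            warnings.append(
                "WARNING: CPU#%d: NMI appears to be stuck (%d->%d)!"
                % (cpu, start, end)
            )
    return warnings


if __name__ == "__main__":
    # CPU 0 only advanced by 3 (as in the report); CPU 1 advanced by 10.
    for line in check_nmi_counts([1150, 900], [1153, 910]):
        print(line)
```

With the counts from the report (1150->1153), the counter only advanced by 3, so the check fails even though NMIs are being delivered. That matches the explanation that the CPU is running slowly rather than the NMI being broken.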
Event posted on 03-31-2010 08:54am CDT by jruemker

Problem Description
---------------------------------------------------
>> 1. Time and date of problem: Ongoing
>> 2. System architecture(s): x86_64 (HP DL 585 G6)
>> 3. Provide a clear and concise problem description as it is understood at the time of escalation.
>> Observed behavior:
Occasionally when booting an HP DL 585 G6 on RHEL 5.3, they see a message such as the following:

kernel: testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!

In BZ 574083, Don Zickus provided the following explanation:

Ok, I am not entirely sure why those are happening, but it is the result of the cpu being slow somehow. I see those sporadically. It does _not_ indicate a hardware problem.

The 'testing NMI' was expecting the nmi to count at least 5 nmis before declaring it 'passing'. In this case we only see it increment 2-3 (ie 1135->1138). The cpus were supposed to be put into a tight loop to increment quickly but they are not for some reason. As a result the nmi_watchdog disables itself and I think you get delayed nmi watchdog interrupts which causes the 'Dazed and confused' messages.

And 'no' I don't think this is related to the 'Hardware Error' we are seeing. Just a coincidence. So a bug can be filed against that problem if you want and I will look into it further.

As an ugly workaround, one might be able to boot with 'nmi_watchdog=0' and then enable the nmi watchdog from the console with 'echo 1 > /proc/sys/kernel/nmi_watchdog'.

I did notice one customer update his BIOS and the problem went away. Not sure why or if the BIOS was doing something in the background which slows the cpu down.

So all bootup nmi watchdog warnings can be attributed to a software problem for now. Once the system boots though, those nmis are for a different reason.
The customer is able to work around the issue by setting the BIOS Power Saving mode to "Static - High" (as opposed to OS Control), which apparently prevents the CPU from throttling down.

>> Desired behavior:
NMI Watchdog is successfully tested and remains enabled.
>> 4. Specific action requested of SEG:
Review information provided and determine root cause and permanent fix that does not involve disabling power saving measures.
>> 5. Is a defect (bug) in the product suspected? yes/no
Possibly
>> Bugzilla number (if one already exists): None
>> 6. Does a proposed patch exist? yes/no
No
>> 7. What is the impact to the customer when they experience this problem?
NMI Watchdog is disabled

Supporting Information
------------------------------------------------------
>> 1. Other actions already taken in working the problem (tech-list posting, google searches, fulltext search, consultation with another engineer, etc.):
Talked to Don Zickus in another IT/BZ, found workaround
>> Relevant data found (if any): "Static - High" mode works around issue
>> 2. Attach sosreport.
Sos attached (messages.2 shows issue)
>> 3. Attach other supporting data (if any).
>> 4. Provide issue reproduction information, including location and access of reproducer machine, if available.
>> Steps to reproduce the problem:
a. Enable "OS Control" mode in BIOS
b. Boot system
c. In *some* instances, NMI watchdog test fails
>> 5. Known hot-fix packages on the system: None.
>> 6. Customer applied changes from the last 30 days: None.

Issue escalated to Support Engineering Group by: jruemker.
Internal Status set to 'Waiting on SEG'
Event posted on 04-01-2010 11:58am CDT by jruemker I am raising the priority on this issue due to the impact mentioned on today's call. This is affecting the rollout of over 150 servers, as changing the power saving mode to prevent CPU throttling is not an acceptable long-term solution. I'll keep you posted on any findings. -John Severity set to: High Priority set to: 2
Event posted on 04-01-2010 12:45pm CDT by streeter Please note I created this as a public BZ, since it contains no customer-specific data.
I created a fix for this issue. You can download it from here. http://people.redhat.com/dzickus/.bz578905/ Please let me know the results of any testing. Cheers, Don
Created attachment 406007
Messages file from .2 debug kernel

Here is their messages file from their tests with the kernel you provided Friday (.2).

Note that this morning we did discover they had installed and loaded the HP NMI watchdog and I suspected that was why we weren't getting any NMIs. However they have since removed it and are still seeing the same problem (0->0). I've confirmed in their latest sosreport (in the IT if you want it) that they are not loading it anymore. nmi_watchdog is also set to 1 on the kernel command line.

Also note that unfortunately they did not clearly label which boot sequence went with which Power mode. They assure us that the order was Dynamic, OS Control, High Static - it's just that some of them were run multiple times, or had a boot with the standard kernel in between. If you need these tests run again with a more clear correlation between each set of messages and the mode, let me know and I can see if they'll do the test again.

Thanks,
john
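Since the attached messages file mixes several boots, a small script can help pull out every self-test warning and how far the counter advanced each time. The sketch below is in Python and assumes only the message format quoted in this bug ("CPU#N: NMI appears to be stuck (A->B)"). The function name `find_stuck_nmi` is hypothetical, and no assumption is made about the rest of the log layout.

```python
import re

# Matches the warning text quoted in this bug, e.g.
#   WARNING: CPU#0: NMI appears to be stuck (1150->1153)!
STUCK_RE = re.compile(r"CPU#(\d+): NMI appears to be stuck \((\d+)->(\d+)\)")


def find_stuck_nmi(lines):
    """Return a list of (cpu, start, end, delta) for each warning found."""
    hits = []
    for line in lines:
        match = STUCK_RE.search(line)
        if match:
            cpu, start, end = (int(group) for group in match.groups())
            hits.append((cpu, start, end, end - start))
    return hits


if __name__ == "__main__":
    sample = [
        "kernel: testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!",
        "kernel: testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!",
    ]
    for cpu, start, end, delta in find_stuck_nmi(sample):
        print("CPU %d: %d -> %d (advanced by %d)" % (cpu, start, end, delta))
```

A delta of 0 (the 0->0 case mentioned above) means no NMIs arrived at all during the test, which is a different symptom from the small non-zero deltas in the original report.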
Actually, could you remove the nmi_watchdog=1 and re-test? nmi_watchdog=1 doesn't do what you think it does. It sets the nmi_watchdog to use the deprecated IOAPIC interface. nmi_watchdog=2 uses the default LAPIC interface. Though to be honest, nmi_watchdog=1 should work correctly; it just takes a different code path. Cheers, Don
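To double-check which mode a given boot actually requested, the kernel command line can be inspected. The Python sketch below maps the value to the interface names given in the comment above. The helper names `nmi_watchdog_mode` and `describe` are hypothetical, the label for 0 is an assumption based on the 'nmi_watchdog=0' workaround mentioned earlier in this bug, and the sketch only handles a plain integer value.

```python
# Interface names follow the comment above; the entry for 0 is an assumption.
MODES = {0: "disabled", 1: "IOAPIC (deprecated)", 2: "LAPIC (default)"}


def nmi_watchdog_mode(cmdline):
    """Return the integer nmi_watchdog= value from a command line, or None."""
    mode = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
            if value.isdigit():
                # Assumption: if the option appears twice, the later one wins.
                mode = int(value)
    return mode


def describe(cmdline):
    """Return a one-line human-readable summary of the requested mode."""
    mode = nmi_watchdog_mode(cmdline)
    if mode is None:
        return "nmi_watchdog not set on the command line"
    return "nmi_watchdog=%d: %s" % (mode, MODES.get(mode, "unknown mode"))
```

On a running system this could be used as `describe(open("/proc/cmdline").read())`. An sosreport normally captures the same command line, so the check can also be done offline.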
ping? Cheers, Don
Sorry Don, Not sure what happened there. I had updated IT and thought I told it to send to BZ, but I guess not. The customer tested the kernel (.2) without nmi_watchdog and it did correct the issue. The nmi_watchdog test completed successfully every time. At this point I think that confirms your fix did what we had hoped. Let me know if you need anything else. Thanks! John
Ok thanks. I can't say I am entirely sure why removing nmi_watchdog=1 did the trick, but I am glad the problem is now gone. I'll post something for 5.6 Cheers, Don
*** Bug 584547 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Ok, so HP's Magny-Cours boxes have issues with their performance counters. Using perf counter 1 instead of perf counter 0 resolves the issue. I am working with HP to determine if this is a BIOS issue (BIOS using perf counters and forgetting to copy the registers back) or an AMD chip problem. I am going to put this bz back to ON_QA and clone this bug to track the HP problem separately. Can QE re-test this bz with different machines? Cheers, Don
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html