Bug 619426
| Field | Value |
|---|---|
| Summary | RHEL UV: kernel patch for kexec |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | George Beshers <gbeshers> |
| Component | kernel |
| Assignee | George Beshers <gbeshers> |
| Status | CLOSED ERRATA |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| Severity | urgent |
| Priority | high |
| Version | 6.0 |
| CC | dwa, dzickus, gbeshers, martinez, peterm, qcai, rja, tee |
| Target Milestone | rc |
| Target Release | 6.1 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | kernel-2.6.32-130.el6 |
| Doc Type | Bug Fix |
| Last Closed | 2011-05-19 12:17:01 UTC |
| Bug Depends On | 669808 |
| Bug Blocks | 580566, 607400, 645474 |
This issue has been proposed when we are only considering blocker issues in the current Red Hat Enterprise Linux release. ** If you would still like this issue considered for the current release, ask your support representative to file as a blocker on your behalf. Otherwise ask that it be considered for the next Red Hat Enterprise Linux release. **

------

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

------

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

------

Requesting an exception for 6.1. This depends on the nmi_watchdog changes made by Don Zickus, posted late 1/14 (BZ#669808).

Lifted these comments from Don's posting to rhkernel:

This patch will cause problems with the new nmi_watchdog without upstream patch 673a6092ce5f5bec45619b7a7f89cfcf8bcf3c41.

> +	nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */

nmi_watchdog doesn't exist in the new nmi_watchdog code, so this is essentially a no-op.

> +	return NOTIFY_OK;

Upstream I changed this to DIE_NMIUNKNOWN to make it sit _underneath_ the nmi watchdog.

I know this patch is upstream, but for some reason I am not comfortable with a lot of it. Does Russ still work at SGI, or is he gone? I wouldn't mind asking a few questions about this patch.

Cheers,
Don

------

Yes, I should have made the changes to DIE_NMIUNKNOWN; Don had warned me about that twice.

> > +	nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */
>
> nmi_watchdog doesn't exist in the new nmi_watchdog code, so this is essentially a no-op.

Looking at the community code, nmi_watchdog should be watchdog_enabled. The intent is to disable the watchdog functionality as if nmi_watchdog=0 was specified on the linux boot line.
------

(In reply to comment #8)

> > > +	nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */
> >
> > nmi_watchdog doesn't exist in the new nmi_watchdog code, so this is essentially a no-op.
>
> Looking at the community code, nmi_watchdog should be watchdog_enabled.
> The intent is to disable the watchdog functionality as if nmi_watchdog=0
> was specified on the linux boot line.

Hi Russ,

I am curious to know why you want to disable the nmi watchdog. With the old nmi watchdog, I would understand. But with the new one you shouldn't need to. And with the change to DIE_NMIUNKNOWN, you should be able to press the external nmi button all day long without any problems with the nmi watchdog running in parallel.

I also had a couple of questions regarding the addition of the handle_nmi stuff you added to the UV box. I was trying to understand whether adding an nmi entry to the x86 struct was the best way to go, but first I wanted to see what the upstream lkml discussion looked like. Did you have a pointer to that? I guess I am trying to correct any hacks associated with the old nmi watchdog behaviour. :-)

My second question is a more non-RH, educational one. Your nmi patch upstream enabled NMIs on all the cpus to allow them to dump on an external nmi button press. I was curious, from a hardware perspective, how you routed that. Does it just come in on LINT0, and is it just sent to all the cpus instead of just the BSP, as is normal? Also, what happens when you receive any other external NMIs (like, say, PCI SERR or IOCK, though you may be using MCEs with Nehalems)? It seems like any other external NMIs would get sent to all cpus and would accidentally dump cpu stacks (or in the case of IOCK, the first cpu would print an IOCK problem, then the rest of the cpus dump).

Thanks,
Don

------

> I am curious to know why you want to disable the nmi watchdog. With the old
> nmi watchdog, I would understand.

The NMI watchdog has always been a pain on large systems, so it has always been disabled. For example, RHEL6 cannot boot on UV with it enabled.

> But with the new one you shouldn't need to.

OK, when I see it I'll believe it. :-)

> I also had a couple of questions regarding the addition of the handle_nmi stuff
> you added to the UV box. I was trying to understand whether adding an nmi entry to
> the x86 struct was the best way to go, but first I wanted to see what the
> upstream lkml discussion looked like. Did you have a pointer to that?

Ingo asked for a generalized NMI interface, so I complied.

The upstream discussion: http://marc.info/?t=126642550600002&r=1&w=2

> I guess I am trying to correct any hacks associated with the old nmi
> watchdog behaviour. :-)

OK.

> My second question is a more non-RH, educational one. Your nmi patch upstream
> enabled NMIs on all the cpus to allow them to dump on an external nmi button
> press. I was curious, from a hardware perspective, how you routed that. Does it
> just come in on LINT0, and is it just sent to all the cpus instead of just the
> BSP, as is normal?

The hardware will send the NMI to all the cpus. The problem with the old code is that only the boot CPU had NMI enabled, so we would only get a backtrace from the boot cpu. Not very helpful on a large system.

> Also, what happens when you receive any other external NMIs (like, say,
> PCI SERR or IOCK, though you may be using MCEs with Nehalems)?

Not sure offhand. My PCI contact who would know the answer is out today.

> It seems like any other external NMIs would get sent to all cpus
> and would accidentally dump cpu stacks (or in the case of IOCK, the first cpu
> would print an IOCK problem, then the rest of the cpus dump).

A general problem with x86 is overloading the NMI vector; more specifically, not having an easily identifiable way of knowing which type of NMI occurred or who initiated it. ia64 had multiple NMI interrupt vectors specifically to deal with this issue. We have had similar issues with the perf code (which uses NMI to collect statistics). If everyone could have their own NMI vector, life would be much cleaner.

------

(In reply to comment #10)

> > I am curious to know why you want to disable the nmi watchdog. With the old
> > nmi watchdog, I would understand.
>
> The NMI watchdog has always been a pain on large systems, so it has
> always been disabled. For example, RHEL6 cannot boot on UV with it enabled.

Right, because of the conflicts with ftrace and the frequency of the nmis during boot.

> > But with the new one you shouldn't need to.
>
> OK, when I see it I'll believe it. :-)

Heh. The new nmi watchdog lowers the amount of nmi traffic, such that the problems with ftrace go away. Also, once the system is booted, as long as 'perf' works, the nmi watchdog should work fine. Getting it to work with SGI's 'nmi button' should be a straightforward exercise (and hopefully won't take too many reboots).

> Ingo asked for a generalized NMI interface, so I complied.
>
> The upstream discussion: http://marc.info/?t=126642550600002&r=1&w=2

Thanks. For some reason it seems a little dangerous to allow all the cpus to get an external NMI. But then again, we don't have a mechanism from an NMI handler that can send an IPI to all the other cpus to dump their stacks (we have trigger_all_cpu_backtrace, but that is for a non-nmi/irq context).

> The hardware will send the NMI to all the cpus. The problem with the old code
> is that only the boot CPU had NMI enabled, so we would only get a backtrace
> from the boot cpu. Not very helpful on a large system.

People have had similar complaints about the nmi watchdog only giving backtraces on the current cpu when it was most likely another cpu causing the deadlock. :-/

> > Also, what happens when you receive any other external NMIs (like, say,
> > PCI SERR or IOCK, though you may be using MCEs with Nehalems)?
>
> Not sure offhand. My PCI contact who would know the answer is out today.

> A general problem with x86 is overloading the NMI vector; more specifically,
> not having an easily identifiable way of knowing which type of NMI occurred
> or who initiated it. ia64 had multiple NMI interrupt vectors specifically
> to deal with this issue. We have had similar issues with the perf code (which
> uses NMI to collect statistics). If everyone could have their own NMI
> vector, life would be much cleaner.

I think that is why Intel is trying to move error reporting to MCEs, to stop overloading the NMI vector. But yes, perf abuses it, and we try hard to detect whether the NMI came from the perf counters or not, to help alleviate the pain, leaving other unverifiable sources of NMIs to fall into the unknown NMI handler. Code can then register its own unknown-nmi handlers, like SGI's nmi button handler, to deal with those scenarios (unless, of course, the NMI is not from a 'button press').

I would be curious to see what happens if George takes the latest RHEL-6.1 bits, adds his patches and the DIE_NMIUNKNOWN patch I referred him to, and boots the UV system with it.

Cheers,
Don

------

> The new nmi watchdog lowers the amount of nmi traffic, such that the
> problems with ftrace go away.
Why would there be any nmi watchdog traffic at all? That is a performance
overhead concern.
Data point for Don's case on NMI: uvxw with the 118 kernel boots without any problems without nmi_watchdog=0. There still might be some performance overhead. I ran out of energy last night, but will put the DIE_NMIUNKNOWN change to the test this afternoon. I have uv48 Thursday evening and will try again there.

NOTE: uvsw is 2 racks of 6-core sockets without HT, 4TB. uv48-sys is Westmere, 3 racks of 10-core sockets, 6TB IIRC. So an order of magnitude larger.

George

------

(In reply to comment #12)

> > The new nmi watchdog lowers the amount of nmi traffic, such that the
> > problems with ftrace go away.
>
> Why would there be any nmi watchdog traffic at all? That is a performance
> overhead concern.

The old nmi watchdog performed a self-test during boot. This flooded the system with nmis just to make sure the perf counters were working correctly. This caused issues with ftrace, which needed the system to be quiet in order to stop the machine and dynamically alter code. The new nmi_watchdog doesn't perform this self-test.

Also, the old nmi_watchdog went off every second. The new nmi_watchdog is set to trigger every 60 seconds or so (though the perf subsystem seems to break it up into ~10 second chunks). So the performance overhead should be minimal.

Cheers,
Don

------

> The new nmi_watchdog doesn't perform this self-test.

OK, that's good.

> Also, the old nmi_watchdog went off every second. The new nmi_watchdog is set
> to trigger every 60 seconds or so (though the perf subsystem seems to break it
> up into ~10 second chunks).

The one-second interval on a large system was too much overhead. 10 or 60 seconds will still be more than some customers want. The concern is that on big systems the overhead of rounding up all the cpus is significantly higher than on small systems. I think the amount is <single_stack_dump_time> * <cpu count>, since only one cpu is dumped at a time. (Correct me if I'm wrong.)

I understand using a watchdog for finding performance holdoffs. We do similar things for our internal systems. I can also understand turning on the watchdog at a customer site to troubleshoot a specific problem. I don't understand the benefit of the watchdog being on by default.

------

(In reply to comment #15)

> The one-second interval on a large system was too much overhead.
> 10 or 60 seconds will still be more than some customers want.

1 NMI every 10 seconds, or even 60 seconds (on newer perfs, I think), is too much overhead? Odd; I figured with a system like SGI's there would be thousands of interrupts/exceptions a second to handle all the backplane traffic.

> The concern is that on big systems the overhead of rounding up all
> the cpus is significantly higher than on small systems. I think
> the amount is <single_stack_dump_time> * <cpu count>, since only
> one cpu is dumped at a time. (Correct me if I'm wrong.)

I guess I am confused by this statement: is this calculation based on a running system, or during a dump from pressing the nmi button? I am trying to understand what overhead you mean by 'rounding up all the cpus'.

> I understand using a watchdog for finding performance holdoffs.
> We do similar things for our internal systems. I can also understand
> turning on the watchdog at a customer site to troubleshoot a specific
> problem. I don't understand the benefit of the watchdog being on by default.

Well, whatever; it's your system, feel free to continue using nmi_watchdog=0 on the command line. :-)

Cheers,
Don

------

> I guess I am confused by this statement: is this calculation based on a running
> system, or during a dump from pressing the nmi button? I am trying to
> understand what overhead you mean by 'rounding up all the cpus'.

Maybe I am confusing the two (button NMI & watchdog). With the watchdog timer, are all the cpus NMI'd, or just one? Watching the boot log, the NMI watchdog appears to be enabled on all cpus. However, I think there might be a third alternative, which is that all cpus are NMI'd once a second but not synchronized; that is, no global lock.

George

------

(In reply to comment #17)

> Maybe I am confusing the two (button NMI & watchdog). With the watchdog timer,
> are all the cpus NMI'd, or just one?

Sorry for not clarifying. Yes, all cpus are NMI'd, but the NMIs are fired locally on each individual CPU. The global impact should be minimal in the normal case, I believe. This would allow it to scale without performance issues. If not, I would be interested in hearing about the bottlenecks.

Cheers,
Don

------

Hi Don,

I think I've addressed all of your concerns.

Thanks,
George

------

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

------

George -- Did this make snap 3?

------

It was posted in time. Stanislaw removed his NACK and ACK'd the bitops patch. Don Zickus and Dean Nelson have ACK'd all the others. Prarit said he would, but I have not seen that. No modified message yet.

George

------

Patch(es) available on kernel-2.6.32-130.el6

------

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
------

Created attachment 435306 [details]
patch to fix kexec

Description of problem:

This patch fixes a problem with kexec/kdump where sometimes memory was getting scrambled. The patch has been tested inside SGI.

Version-Release number of selected component (if applicable):

How reproducible:

Only on very large systems.

Steps to Reproduce:
1.
2.
3.

Actual results:

Garbled kdump.

Expected results:

Additional info: