Bug 619426

Summary: RHEL UV: kernel patch for kexec
Product: Red Hat Enterprise Linux 6 Reporter: George Beshers <gbeshers>
Component: kernel Assignee: George Beshers <gbeshers>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: high    
Version: 6.0 CC: dwa, dzickus, gbeshers, martinez, peterm, qcai, rja, tee
Target Milestone: rc   
Target Release: 6.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-130.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 12:17:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 669808    
Bug Blocks: 580566, 607400, 645474    
Attachments:
Description Flags
patch to fix kexec none

Description George Beshers 2010-07-29 14:19:57 UTC
Created attachment 435306 [details]
patch to fix kexec

Description of problem:
  This patch fixes a problem with kexec/kdump where sometimes
  memory was getting scrambled.

  The patch has been tested inside SGI.


Version-Release number of selected component (if applicable):


How reproducible:
  Only on very large systems.


Steps to Reproduce:
1.
2.
3.
  
Actual results:
  Garbled kdump.


Expected results:


Additional info:

Comment 2 RHEL Program Management 2010-07-29 14:47:46 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 RHEL Program Management 2011-01-07 04:50:50 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 4 Suzanne Logcher 2011-01-07 16:11:54 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 5 George Beshers 2011-01-17 17:31:55 UTC
Requesting an exception for 6.1.

This depends on nmi_watchdog changes made by Don Zickus
and posted late 1/14; see BZ#669808.

Comment 7 George Beshers 2011-02-28 13:15:46 UTC
Lifted these comments from Don's posting to rhkernel.


This patch will cause problems with the new nmi_watchdog without upstream
patch 673a6092ce5f5bec45619b7a7f89cfcf8bcf3c41.


> +		nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */
The nmi_watchdog variable doesn't exist in the new nmi watchdog code, so this
is essentially a no-op.


> +		return NOTIFY_OK;
upstream I changed this to DIE_NMIUNKNOWN to make it sit _underneath_ the
nmi watchdog.


I know this patch is upstream, but for some reason I am not comfortable
with a lot of it.  Does Russ still work at SGI or is he gone?  I
wouldn't mind asking a few questions about this patch.

Cheers,
Don

------

Yes, I should have made the change to DIE_NMIUNKNOWN; Don had warned
me about that twice.
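
For illustration only, a minimal sketch (not the attached patch) of an NMI die
notifier keyed on DIE_NMIUNKNOWN so it sits underneath the nmi watchdog,
assuming the standard register_die_notifier() interface and that
DIE_NMIUNKNOWN is defined in this kernel; the example_* names are invented
for the sketch.

#include <linux/kdebug.h>
#include <linux/notifier.h>
#include <linux/kernel.h>
#include <linux/init.h>

/* Let the nmi watchdog and perf claim their NMIs first; only act on
 * NMIs that nobody else recognized. */
static int example_uv_handle_nmi(struct notifier_block *self,
				 unsigned long reason, void *data)
{
	if (reason != DIE_NMIUNKNOWN)
		return NOTIFY_OK;

	/* Treat the leftover NMI as an external "NMI button" press and
	 * dump this CPU's stack. */
	dump_stack();

	/* Claim the NMI so it is not reported as unknown. */
	return NOTIFY_STOP;
}

static struct notifier_block example_uv_nmi_nb = {
	.notifier_call = example_uv_handle_nmi,
};

static int __init example_uv_nmi_init(void)
{
	return register_die_notifier(&example_uv_nmi_nb);
}
early_initcall(example_uv_nmi_init);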

Comment 8 Russ Anderson 2011-02-28 16:42:32 UTC
> > +		nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */
> The nmi_watchdog variable doesn't exist in the new nmi watchdog code, so this is essentially a no-op.

Looking at the community code, nmi_watchdog should be watchdog_enabled.
The intent is to disable the watchdog functionality as if nmi_watchdog=0
was specified on the linux boot line.
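
For reference, a schematic sketch of the mapping Russ describes; the exact
symbol depends on the kernel in question, and normally the nmi_watchdog=0
boot option (or the nmi_watchdog sysctl) does this rather than arch code:

/* Schematic only, not the attached patch.  Which knob exists is
 * version-dependent. */
extern int nmi_watchdog;	/* old perfctr-based watchdog */
extern int watchdog_enabled;	/* new kernel/watchdog.c detector */

static void example_disable_nmi_watchdog(void)
{
	/* Old watchdog: what the patch hunk did. */
	nmi_watchdog = 0;

	/* New watchdog: the equivalent intent, per the comment above. */
	watchdog_enabled = 0;
}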

Comment 9 Don Zickus 2011-02-28 17:18:33 UTC
(In reply to comment #8)
> > > +		nmi_watchdog = 0;	/* No nmi_watchdog on SGI systems */
> > The nmi_watchdog variable doesn't exist in the new nmi watchdog code, so this is essentially a no-op.
> 
> Looking at the community code, nmi_watchdog should be watchdog_enabled.
> The intent is to disable the watchdog functionality as if nmi_watchdog=0
> was specified on the linux boot line.

Hi Russ,

I am curious to know why you want to disable the nmi watchdog.  With the old nmi watchdog, I would understand.  But with the new one you shouldn't need to.  And with the change to DIE_NMIUNKNOWN, you should be able to press the external nmi button all day long without any problems with the nmi watchdog running in parallel.

I also had a couple of questions regarding the handle_nmi stuff you added to the UV box.  I was trying to understand whether adding an nmi entry to the x86 struct was the best way to go, but first I wanted to see what the upstream lkml discussion looked like.  Did you have a pointer to that?

I guess I am trying to correct any hacks associated with old nmi watchdog behaviour.  :-)

My second question is a more non-RH, educational one.  Your nmi patch upstream enabled the NMIs on all the cpus to allow them to dump on an external nmi button press.  I was curious from a hardware perspective how you routed that.  Does it just come in on LINT0 and get sent to all the cpus instead of just the BSP, as it normally would?  Also what happens in the case when you receive any other external NMIs (like say pci serr or IOCK, though you guys may be using MCEs with Nehalems)?  It seems like any other external NMI would get sent to all cpus and would accidentally dump cpu stacks (or in the case of IOCK the first cpu would print an IOCK problem, then the rest of the cpus would dump).

Thanks,
Don

Comment 10 Russ Anderson 2011-02-28 20:46:25 UTC
> I am curious to know why you want to disable the nmi watchdog.  With the old
> nmi watchdog, I would understand. 

NMI watchdog has always been a pain on large systems, so it has
always been disabled.  For example RHEL6 cannot boot on UV with
it enabled.

> But with the new one you shouldn't need to.

OK, when I see it I'll believe it.  :-)

> I also had a couple of questions regarding the addition of the handle_nmi stuff
> you added to the UV box.  I was trying to understand whether adding an nmi entry to
> the x86 struct was the best way to go, but first I wanted to see what the
> upstream lkml discussion looked like.  Did you have a pointer to that?

Ingo asked for a generalized NMI interface so I complied.

The upstream discussion:
http://marc.info/?t=126642550600002&r=1&w=2


> I guess I am trying to correct any hacks associated with old nmi
> watchdog behaviour.  :-)

OK.

> My second question is more non-RH educational one.  Your nmi patch upstream
> enabled the NMIs on all the cpus to allow them dump on an external nmi button
> press.  I was curious from a hardware perspective how you routed that.  Does it
> just come in on LINT0 and that is just sent to all the cpus instead of just the
> BPS normally?

The hardware will send NMI to all the cpus.  The problem with the old code
is that only the boot CPU had NMI enabled, so we would only get a backtrace
from the boot cpu.  Not very helpful on a large system.

>   Also what happens in the case when you receive any other
> external NMIs (like say pci serr or IOCK, though you guys maybe using MCEs
> with Nehalems)?

Not sure off hand.  My PCI contact that would know the answer is out today.

> It seems like any other external NMIs would get sent to all cpus
> and would accidentally dump cpu stacks (or in the case of IOCK the first cpu
> would print an IOCK problem,  then the rest cpu dumps).

A general problem with x86 is overloading the NMI vector.  More specifically
not having an easily identifiable way of knowing which type of NMI or 
who initiated the NMI.  ia64 had multiple NMI interrupt vectors specifically 
to deal with this issue.  We have had similar issues with perf code (which
uses NMI to collect statistics).  If everyone could have their own NMI
vector life would be much cleaner.
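
As a rough illustration of the earlier point that only the boot cpu had NMI
enabled under the old code, the sketch below unmasks NMI delivery on a cpu's
local APIC LINT pin using the standard apic_read()/apic_write() accessors.
This is not the actual UV init code; which LINT pin the external NMI is wired
to is platform-specific, and LINT1 is used here only as an example.

#include <asm/apic.h>

/* Called on each cpu during setup so that every cpu, not just the boot
 * cpu, accepts an externally raised NMI. */
static void example_enable_local_nmi(void)
{
	unsigned int value;

	value = apic_read(APIC_LVT1);
	value |= APIC_DM_NMI;		/* deliver the pin as NMI */
	value &= ~APIC_LVT_MASKED;	/* unmask the pin */
	apic_write(APIC_LVT1, value);
}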

Comment 11 Don Zickus 2011-02-28 21:33:24 UTC
(In reply to comment #10)
> > I am curious to know why you want to disable the nmi watchdog.  With the old
> > nmi watchdog, I would understand. 
> 
> NMI watchdog has always been a pain on large systems, so it has
> always been disabled.  For example RHEL6 cannot boot on UV with
> it enabled.

Right, because of the conflicts with ftrace and the frequency of the nmis during boot.

> 
> > But with the new one you shouldn't need to.
> 
> OK, when I see it I'll believe it.  :-)

Heh.  The new nmi watchdog lowers the amount of nmi traffic, such that the problems with ftrace go away.

Also once the system is booted, as long as 'perf' works, the nmi watchdog should work fine.  Getting it to work with SGI's 'nmi button' should be a straightforward exercise (and hopefully won't take too many reboots).

> 
> > I also had a couple of questions regarding the addition of the handle_nmi stuff
> > you added to the UV box.  I was trying to understand whether adding an nmi entry to
> > the x86 struct was the best way to go, but first I wanted to see what the
> > upstream lkml discussion looked like.  Did you have a pointer to that?
> 
> Ingo asked for a generalized NMI interface so I complied.
> 
> The upstream discussion:
> http://marc.info/?t=126642550600002&r=1&w=2

Thanks.  For some reason it seems a little dangerous to allow all the cpus to get an external NMI.  But then again, we don't have a mechanism from an NMI handler that can send an IPI to all the other cpus to dump their stacks (we have trigger_all_cpu_backtrace, but that is for a non-nmi/irq context).

> 
> 
> > I guess I am trying to correct any hacks associated with old nmi
> > watchdog behaviour.  :-)
> 
> OK.
> 
> > My second question is more non-RH educational one.  Your nmi patch upstream
> > enabled the NMIs on all the cpus to allow them dump on an external nmi button
> > press.  I was curious from a hardware perspective how you routed that.  Does it
> > just come in on LINT0 and that is just sent to all the cpus instead of just the
> > BPS normally?
> 
> The hardware will send NMI to all the cpus.  The problem with the old code
> is that only the boot CPU had NMI enabled, so we would only get a backtrace
> from the boot cpu.  Not very helpful on a large system.

People have had similar complaints about the nmi watchdog only giving backtraces on the current cpu when it was most likely another cpu causing the deadlock. :-/

> 
> >   Also what happens in the case when you receive any other
> > external NMIs (like say pci serr or IOCK, though you guys maybe using MCEs
> > with Nehalems)?
> 
> Not sure off hand.  My PCI contact that would know the answer is out today.
> 
> > It seems like any other external NMIs would get sent to all cpus
> > and would accidentally dump cpu stacks (or in the case of IOCK the first cpu
> > would print an IOCK problem,  then the rest cpu dumps).
> 
> A general problem with x86 is overloading the NMI vector.  More specifically
> not having an easily identifiable way of knowing which type of NMI or 
> who initiated the NMI.  ia64 had multiple NMI interrupt vectors specifically 
> to deal with this issue.  We have had similar issues with perf code (which
> uses NMI to collect statistics).  If everyone could have their own NMI
> vector life would be much cleaner.

I think that is why Intel is trying to move error reporting to MCEs, to stop overloading the NMI vector.  But yes, perf abuses it, and we try hard to detect whether the NMI came from the perf counters or not to help alleviate the pain.  That leaves other unverifiable sources of NMIs to fall into the unknown NMI handler.  Code can then register its own unknown-NMI handlers, like SGI's nmi button handler, to deal with those scenarios (unless of course the NMI is not from a 'button press').

I would be curious to see what happens if George takes the latest RHEL-6.1 bits, applies his patches plus the DIE_NMIUNKNOWN patch I referred him to, and boots the UV system with it.

Cheers,
Don
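
On the trigger_all_cpu_backtrace() point above, a minimal sketch of its usual
use from ordinary (non-NMI) context, for example a sysrq-style handler;
example_dump_all_cpus is an invented name:

#include <linux/nmi.h>

/* On x86 this raises an NMI on the other cpus so that each one prints
 * its own backtrace; it is not meant to be called from NMI context. */
static void example_dump_all_cpus(void)
{
	trigger_all_cpu_backtrace();
}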

Comment 12 Russ Anderson 2011-02-28 23:17:16 UTC
> The new nmi watchdog lowers the amount of nmi traffic, such that the
> problems with ftrace go away.

Why would there be any nmi watchdog traffic at all?  That is a performance
overhead concern.

Comment 13 George Beshers 2011-03-01 13:14:06 UTC
A data point for Don's case on NMI: uvxw with the 118 kernel
boots without any problems even without nmi_watchdog=0.

There still might be some performance overhead.

I ran out of energy last night, but will put the DIE_NMIUNKNOWN change to
the test this afternoon.

I have uv48 Thursday evening and will try again there.


NOTE: 
   uvsw is 2 racks of 6-core sockets without HT; 4TB.
   uv48-sys is Westmere, 3 racks of 10-core sockets; 6TB IIRC.

   So an order of magnitude larger.


George

Comment 14 Don Zickus 2011-03-01 14:51:13 UTC
(In reply to comment #12)
> > The new nmi watchdog lowers the amount of nmi traffic, such that the
> > problems with ftrace go away.
> 
> Why would there be any nmi watchdog traffic at all?  That is a performance
> overhead concern.

The old nmi watchdog performed a self test during boot. This flooded the system with nmis just to make sure the perf counters were working correctly.  This caused issues with ftrace, which needed the system to be quiet in order to stop the machine and dynamically alter code.  The new nmi_watchdog doesn't perform this self-test.

Also the old nmi_watchdog went off every second.  The new nmi_watchdog is set to trigger every 60 seconds or so (though the perf subsystem seems to break it up into ~10 second chunks).

So the performance overhead should be minimal.

Cheers,
Don

Comment 15 Russ Anderson 2011-03-07 22:10:33 UTC
> The new nmi_watchdog doesn't perform this self-test.

OK, that's good.

> Also the old nmi_watchdog went off every second.  The new nmi_watchdog is set
> to trigger every 60 seconds or so (though the perf subsystem seems to break it
> up into ~10 second chunks).

The one second interval on a large system was too much overhead.  
10 or 60 seconds will still be more than some customers want.

The concern is on big systems the overhead of rounding up all
the cpus is significantly higher than on small systems.  I think
the amount is <single_stack_dump_time> * <cpu count> since only
one cpu is dumped at a time.  (correct me if I'm wrong.)

I understand using a watchdog for finding performance holdoffs.
We do similar things for our internal systems.  I can also understand
turning on watchdog at a customer site to troubleshoot a specific
problem.  I don't understand the benefit of watchdog being on by default.

Comment 16 Don Zickus 2011-03-07 22:52:37 UTC
(In reply to comment #15)
> > The new nmi_watchdog doesn't perform this self-test.
> 
> OK, that's good.
> 
> > Also the old nmi_watchdog went off every second.  The new nmi_watchdog is set
> > to trigger every 60 seconds or so (though the perf subsystem seems to break it
> > up into ~10 second chunks).
> 
> The one second interval on a large system was too much overhead.  
> 10 or 60 seconds will still be more than some customers want.

1 NMI every 10 seconds or even 60 seconds (on newer perfs I think) is too much overhead?  Odd, I figured with a system like SGI's there would be thousands of interrupts/exceptions a second to handle all the backplane traffic.

> 
> The concern is on big systems the overhead of rounding up all
> the cpus is significantly higher than on small systems.  I think
> the amount is <single_stack_dump_time> * <cpu count> since only
> one cpu is dumped at a time.  (correct me if I'm wrong.)

I guess I am confused on this statement, is this calculation based on a running system or during a dump from pressing the nmi button?  I am trying to understand what overhead you mean from 'rounding up all the cpus'.

> 
> I understand using a watchdog for finding performance holdoffs.
> We do similar things for our internal systems.  I can also understand
> turning on watchdog at a customer site to troubleshoot a specific
> problem.  I don't understand the benefit of watchdog being on by default.

Well whatever, it's your system, feel free to continue using nmi_watchdog=0 on the command line. :-)

Cheers,
Don

Comment 17 Russ Anderson 2011-03-07 23:07:28 UTC
> I guess I am confused on this statement, is this calculation based on a running
> system or during a dump from pressing the nmi button?  I am trying to
> understand what overhead you mean from 'rounding up all the cpus'.

Maybe I am confusing the two (button NMI & watchdog).  With watchdog timer 
are all the cpus NMIed or just one?

Comment 18 gbeshers 2011-03-08 11:00:18 UTC
Watching the boot log, the NMI watchdog appears to be enabled on
all cpus.  However, I think there might be a third alternative,
which is that all cpus are NMI'd once a second but not synchronized;
that is, with no global lock.

George

Comment 19 Don Zickus 2011-03-08 15:04:24 UTC
(In reply to comment #17)
> > I guess I am confused on this statement, is this calculation based on a running
> > system or during a dump from pressing the nmi button?  I am trying to
> > understand what overhead you mean from 'rounding up all the cpus'.
> 
> Maybe I am confusing the two (button NMI & watchdog).  With watchdog timer 
> are all the cpus NMIed or just one?

Sorry for not clarifying.  Yes, all cpus are NMI'd, but the NMIs are fired locally on each individual CPU.  The global impact should be minimal in the normal case, I believe.  This would allow it to scale without performance issues.  If not, I would be interested in hearing about the bottlenecks.

Cheers,
Don

Comment 20 George Beshers 2011-03-09 02:26:33 UTC
Hi Don,

I think I've addressed all of your concerns.

Thanks,
George

Comment 21 RHEL Program Management 2011-03-17 18:29:29 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 25 Marizol Martinez 2011-04-04 16:33:24 UTC
George -- Did this make snap 3?

Comment 26 gbeshers 2011-04-04 19:34:08 UTC
It was posted in time.

Stanislaw removed his NACK and ACK'd the bitops patch.
Don Zickus and Dean Nelson have ACK'd all the others.
Prarit said he would, but I have not seen that.

No modified message yet.

George

Comment 27 Aristeu Rozanski 2011-04-07 14:14:31 UTC
Patch(es) available on kernel-2.6.32-130.el6

Comment 30 errata-xmlrpc 2011-05-19 12:17:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html