Bug 603911 - RHEL UV: x86_64: NMI Watchdog detected LOCKUP
Summary: RHEL UV: x86_64: NMI Watchdog detected LOCKUP
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64
OS: Linux
high
urgent
Target Milestone: rc
: 6.0
Assignee: George Beshers
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 555548 619578
TreeView+ depends on / blocked
 
Reported: 2010-06-14 20:17 UTC by George Beshers
Modified: 2010-09-29 03:31 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Due to the way ftrace works when modifying the code during startup, the NMI watchdog causes too much noise and ftrace can not find a quiet period to instrument the code. Consequently, machines with more than 512 cpus will encounter issues with the NMI watchdog. Such issues will return error messages similar to "BUG: NMI Watchdog detected LOCKUP" and have either 'ftrace_modify_code' or 'ipi_handler' in the backtrace. To work around this issue, disable nmi_watchdog using the command: nmi_watchdog=0
Clone Of:
: 619578 (view as bug list)
Environment:
Last Closed: 2010-07-29 20:51:41 UTC


Attachments (Terms of Use)
console log (617.92 KB, text/plain)
2010-06-14 20:17 UTC, George Beshers
no flags Details
slow down nmis from every 1 second to 10 seconds (498 bytes, application/octet-stream)
2010-06-15 13:42 UTC, Don Zickus
no flags Details
console log with 'nmi_watchdog=0' (1.47 MB, application/octet-stream)
2010-06-16 15:27 UTC, George Beshers
no flags Details
console log with Don's nmi delay patch (3.75 MB, application/octet-stream)
2010-06-16 15:30 UTC, George Beshers
no flags Details
patch actually used. (4.39 KB, patch)
2010-06-16 17:51 UTC, George Beshers
no flags Details | Diff
Rerun with the NMI delay patch (2.52 MB, text/x-log)
2010-06-17 13:15 UTC, George Beshers
no flags Details
Patched used against -33 to produce previous log (3.14 KB, patch)
2010-06-17 13:17 UTC, George Beshers
no flags Details | Diff
pause nmis till all nmi's are done (860 bytes, patch)
2010-06-17 18:32 UTC, Steven Rostedt
no flags Details | Diff
Simply touch_nmi_watchdog() (533 bytes, patch)
2010-06-22 01:06 UTC, Steven Rostedt
no flags Details | Diff
console log (8.24 KB, application/octet-stream)
2010-06-22 18:46 UTC, George Beshers
no flags Details
messages file from 2048p/5tb system (9.83 MB, application/octet-stream)
2010-06-23 17:25 UTC, George Beshers
no flags Details
console log (466.57 KB, application/octet-stream)
2010-07-27 15:38 UTC, George Beshers
no flags Details

Description George Beshers 2010-06-14 20:17:49 UTC
Created attachment 423971 [details]
console log

Description of problem:

The following comes up on a 512p UV system, the issue is more
pronounced on a 2048p system.

The problem is that we are getting NMI Watchdog messages, but
only when CONFIG_DYNAMIC_FTRACE is turned on.  The following
portion of the console log.  The kernel is -33 with dzickus
patch plus 

commit 72680a191b934377430032f93af15ef50aafb3a8
Author: Hedi Berriche <hedi@sgi.com>
Date:   Wed May 26 14:44:06 2010 -0700

    pids: increase pid_max based on num_possible_cpus


This is backtrace is from uv35-sys.

[   72.214393] BUG: NMI Watchdog detected LOCKUP on CPU401, ip ffffffff810362ad, registers:                                                
[   72.214393] CPU 401                                                                                                                     
[   72.214393] Modules linked in: dm_mod(+)(U)                                                                                             
[   72.214393]                                                                                                                             
[   72.214393] Modules linked in: dm_mod(+)(U)                                                                                             
[   72.214393] Pid: 6244, comm: modprobe Not tainted 2.6.32-33.el6.DZ_Hedi.x86_64 #1 Stoutland Platform                                    
[   72.214393] RIP: 0010:[<ffffffff810362ad>]  [<ffffffff810362ad>] do_ftrace_mod_code+0x3d/0x100                                          
[   72.214393] RSP: 0018:ffff8901fe725e08  EFLAGS: 00000097                                                                                
[   72.214393] RAX: 000000000000000b RBX: ffff8901fe725e18 RCX: ffffffff80000000                                                           
[   72.214393] RDX: 0000000000000000 RSI: ffffffff81a655dc RDI: ffffffffa000a660                                                           
[   72.214393] RBP: ffff8901fe725e08 R08: 0000000000000000 R09: 0000000000000000                                                           
[   72.214393] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa000a660                                                           
[   72.214393] R13: ffffffff81a655b4 R14: ffffffff81a655dc R15: 0000001380aa269c                                                           
[   72.214393] FS:  00007fbcc326d700(0000) GS:ffff892007520000(0000) knlGS:0000000000000000                                                
[   72.214393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                           
[   72.214393] CR2: 00007fbcc322e00f CR3: 00000121fe52e000 CR4: 00000000000006e0                                                           
[   72.214393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000                                                           
[   72.214393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400                                                           
[   72.214393] Process modprobe (pid: 6244, threadinfo ffff8901fe724000, task ffff8901fe7c6b30)                                            
[   72.214393] Stack:                                                                                                                      
[   72.214393]  ffff8901fe725e48 ffffffff81036559 ffff89e10087dbe8 ffffffff810e6057                                                        
[   72.214393] <0> ffff8921fe523150 ffffffffa000a660 ffffffffa000d860 0000000000000246                                                     
[   72.214393] <0> ffff8901fe725e58 ffffffff81036633 ffff8901fe725e98 ffffffff810e431e                                                     
[   72.214393] Call Trace:                                                                                                                 
[   72.214393]  [<ffffffff81036559>] ftrace_modify_code+0x79/0x90                                                                          
[   72.214393]  [<ffffffff810e6057>] ? ring_buffer_time_stamp+0x7/0x10                                                                     
[   72.214393]  [<ffffffffa000a660>] ? dm_attr_show+0x10/0x68 [dm_mod]                                                                     
[   72.214393]  [<ffffffff81036633>] ftrace_make_nop+0x33/0x40                                                                             
[   72.214393]  [<ffffffff810e431e>] ftrace_convert_nops+0xfe/0x290                                                                        
[   72.214393]  [<ffffffff810e44f5>] ftrace_module_notify+0x45/0x50                                                                        
[   72.214393]  [<ffffffff814dd035>] notifier_call_chain+0x55/0x80                                                                         
[   72.214393]  [<ffffffff81095eea>] __blocking_notifier_call_chain+0x5a/0x80                                                              
[   72.214393]  [<ffffffff81095f26>] blocking_notifier_call_chain+0x16/0x20                                                                
[   72.214393]  [<ffffffff810aec58>] sys_init_module+0xa8/0x250                                                                            
[   72.214393]  [<ffffffff81013172>] system_call_fastpath+0x16/0x1b                                                                        
[   72.214393] Code: 0f ae f0 48 c7 c1 00 00 00 80 31 c0 f0 0f b1 0d 1a f3 a2 00 85 c0 74 1e 31 d2 0f 1f 40 00 f3                          90 48 89 d0 f0 0f b1 0d 03 f3 a2 00 <85> c0 75 ef 83 05 20 f3 a2 00 01 0f ae f0 48 8b 35 0e f3 a2 00                                       
[   72.214393] ---[ end trace 2f5ff6cb68e61c06 ]---
[   72.214393] Kernel panic - not syncing: Non maskable interrupt
[   72.214393] Pid: 6244, comm: modprobe Tainted: G      D    2.6.32-33.el6.DZ_Hedi.x86_64 #1
[   72.214393] Call Trace:
[   72.214393]  <NMI>  [<ffffffff814d72cd>] panic+0x78/0x137
[   72.214393]  [<ffffffff8106a3d3>] ? print_oops_end_marker+0x23/0x30
[   72.214393]  [<ffffffff814db5dc>] die_nmi+0xfc/0x100
[   72.214393]  [<ffffffff814dbb8a>] nmi_watchdog_tick+0x1aa/0x200
[   72.214393]  [<ffffffff814db103>] do_nmi+0x1a3/0x2d0
[   72.214393]  [<ffffffffa000a660>] ? dm_attr_show+0x10/0x68 [dm_mod]
[   72.214393]  [<ffffffff814da9e0>] nmi+0x20/0x30
[   72.214393]  [<ffffffffa000a660>] ? dm_attr_show+0x10/0x68 [dm_mod]
[   72.214393]  [<ffffffffa000a660>] ? dm_attr_show+0x10/0x68 [dm_mod]
[   72.214393]  [<ffffffff810362ad>] ? do_ftrace_mod_code+0x3d/0x100
[   72.214393]  <<EOE>>  [<ffffffff81036559>] ftrace_modify_code+0x79/0x90
[   72.214393]  [<ffffffff810e6057>] ? ring_buffer_time_stamp+0x7/0x10
[   72.214393]  [<ffffffffa000a660>] ? dm_attr_show+0x10/0x68 [dm_mod]
[   72.214393]  [<ffffffff81036633>] ftrace_make_nop+0x33/0x40
[   72.214393]  [<ffffffff810e431e>] ftrace_convert_nops+0xfe/0x290
[   72.214393]  [<ffffffff810e44f5>] ftrace_module_notify+0x45/0x50
[   72.214393]  [<ffffffff814dd035>] notifier_call_chain+0x55/0x80
[   72.214393]  [<ffffffff81095eea>] __blocking_notifier_call_chain+0x5a/0x80
[   72.214393]  [<ffffffff81095f26>] blocking_notifier_call_chain+0x16/0x20
[   72.214393]  [<ffffffff810aec58>] sys_init_module+0xa8/0x250
[   72.214393]  [<ffffffff81013172>] system_call_fastpath+0x16/0x1b




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Don Zickus 2010-06-14 20:42:18 UTC
Adding Steven to the cc list for his opinion.

I am not sure how long it is expected for ftrace to do its code modification.  Considering it seems to be syncing across 2048 cpus, I would think it would be appropriate for a 'touch_nmi_watchdog' inside say 'do_ftrace_mod_code()'?

Perhaps Steven could give us some insight here.

Cheers,
Don

Comment 3 RHEL Product and Program Management 2010-06-14 20:43:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 4 Steven Rostedt 2010-06-14 21:42:13 UTC
Ouch, 2048 CPUS!

But still, this is on module loading which does not even run stop_machine. 

Oh God, I think it may be this where we are stuck:

static void wait_for_nmi(void)
{
	if (!atomic_read(&nmi_running))
		return;

	do {
		cpu_relax();
	} while (atomic_read(&nmi_running));

	nmi_wait_count++;
}

If we have 2048 CPUs, and all of them running NMI's it is possible that there's always an NMI running on one of the CPUs.

To solve this, we would need to convert this into a RCU like loop. That is, this only needs to wait for NMIs that are currently running. We can set this up to continue when all currently running NMIs have finished. It may take a bit of work.

To see if this is the case, does this box boot up fine without NMIs. You could also check by making the interval of NMI triggering longer.

Comment 5 George Beshers 2010-06-14 22:02:58 UTC
I recently tried with a similar patch a TIMEOUT of 120 (rhel6 currently is 30).
I still got some LOCKUPs with that setting.

I have not tried TIMEOUT=0 with the latest patches; I have the machine
reserved tomorrow morning.

George

Comment 6 Steven Rostedt 2010-06-14 22:50:52 UTC
I'm not talking about making the timeout longer, because if an NMI is always firing on one CPU, it will still lockup at any timeout.

We need to lengthen the time between NMIs.

That would require modifying the code itself though.

Comment 7 Don Zickus 2010-06-15 13:41:29 UTC
George,

We can probably answer part of Steven's question by booting with nmi_watchdog=0.  That is probably the easiest for now.

Let me attach a patch that will lengthen the timeout from every second on an active cpu to every 10 seconds.  This will address Steven's other thought.

Cheers,
Don

Comment 8 Don Zickus 2010-06-15 13:42:12 UTC
Created attachment 424157 [details]
slow down nmis from every 1 second to 10 seconds

Comment 9 George Beshers 2010-06-16 15:27:24 UTC
Created attachment 424503 [details]
console log with 'nmi_watchdog=0'

Comment 10 George Beshers 2010-06-16 15:30:14 UTC
Created attachment 424504 [details]
console log with Don's nmi delay patch


Note that at the point were the log stops I left the
machine for about half an hour with no forward progress.

George

Comment 11 George Beshers 2010-06-16 15:33:53 UTC
Forgot to note that 'nmi_watchdog=0' did boot successfully.

Cheers,
George

Comment 12 Steven Rostedt 2010-06-16 15:49:24 UTC
Thanks George, I just had a BZ mid air collision asking that very question ;-)

I guess the answer is for me to come up with an RCU style wait_for_nmi() function.

-- Steve

Comment 13 Don Zickus 2010-06-16 16:04:08 UTC
Hi George,

Thanks for the console logs.

In regards to the logs using my delay patch, the message

LOCKUP related delay of 10 on XXXX

is repeated forever, but for the life of me I can't find in the kernel where that is displayed.  Did you put an extra patch in?

Cheers,
Don

Comment 14 Steven Rostedt 2010-06-16 16:05:27 UTC
Looking back at the code, a RCU style wait_or_nmi() can not work, since we must wait for a point that all CPUs are not running any NMI.

I'm thinking since all NMIs are accessing an atomic global variable when doing the ftrace modifications that the cacheline bouncing that it is causing with >2000 CPUs is making it to the point that it is locking up.

I can add to the loop that if it fails to clear the nmi_running variable after X amount of iterations that it bails and disables ftrace. It may just be that we can have ftrace or NMI watchdogs in a machine like that, but not both.

Comment 15 Steven Rostedt 2010-06-16 16:12:23 UTC
Another solution is to block the NMIs on exit. Have them spin until the last one does the work.

This might work. I'll write up a patch to try it out.

Comment 16 George Beshers 2010-06-16 17:23:32 UTC
Don,

Yes -- and I forgot about it :(.  Do you think it materially
altered the test?

Its in ./arch/x86/kernel/apic/nmi.c 

		if (t % 10 == 0) {
		  printk(KERN_WARNING "LOCKUP related delay of %ld on CPU%d\n", t, smp_processor_id());
		}
		if (local_read(&__get_cpu_var(alert_counter)) == CONFIG_DEBUG_NMI_TIMEOUT * nmi_hz)
			/*
			 * die_nmi will return ONLY if NOTIFY_STOP happens..
			 */
			die_nmi("BUG: NMI Watchdog detected LOCKUP",
				regs, panic_on_timeout);

George

Comment 17 Don Zickus 2010-06-16 17:38:41 UTC
(In reply to comment #16)
> Don,
> 
> Yes -- and I forgot about it :(.  Do you think it materially
> altered the test?

I can't tell until I see the _whole_ patch.  But yes, putting printks in there can slow down a processors ability to generate interrupts and thus mislead the machine into thinking it has stuck cpus (because interrupts are not firing in a periodic fashion).

Also note that during the check_nmi test, the nmi watchdog is configured to fire HZ (1000) times a second instead of the usual once a second (or 10 with my patch).  

But please in the future let me know all the patches you put on the kernel so I can accurately figure out what is going on.

Thanks,
Don

Comment 18 George Beshers 2010-06-16 17:51:53 UTC
Created attachment 424537 [details]
patch actually used.


I have the large machine reserved for early tomorrow morning,
will re-run with a clean set of patches.

Sorry for the confusion.

George

Comment 19 George Beshers 2010-06-17 13:15:59 UTC
Created attachment 424810 [details]
Rerun with the NMI delay patch


The system failed to boot, confirming the validity
of yesterday's run.

Comment 20 George Beshers 2010-06-17 13:17:29 UTC
Created attachment 424811 [details]
Patched used against -33 to produce previous log

Comment 21 Don Zickus 2010-06-17 17:22:29 UTC
(In reply to comment #19)
> Created an attachment (id=424810) [details]
> Rerun with the NMI delay patch
> 
> 
> The system failed to boot, confirming the validity
> of yesterday's run.    

Well not really.  It failed in the original spot at the beginning of this bz.  The timeout characteristics were probably changed such that you didn't see the LOCKUP yet.  But this hang is different than yesterday's hang.

What this test proves is that having the nmi trigger every 10 seconds instead of every 1, still doesn't resolve this issue.  So far only disabling nmi_watchdog entirely works (which is probably a good workaround for now).

Perhaps Steven can craft a patch that allows ftrace to do its magic without having to worry about all the NMIs.

Cheers,
Don

Comment 22 Steven Rostedt 2010-06-17 18:32:06 UTC
Created attachment 424905 [details]
pause nmis till all nmi's are done

George,

Can you try this patch. It's just a test patch. While the code modification is happening, it forces NMIs on other CPUs to wait till all other NMIs are done.

I can do even more, but I want to see if this fixes the issue first.

Comment 23 George Beshers 2010-06-21 21:48:23 UTC
Hi Steve,

This patch failed early on.  I have access to the machine this evening
and tomorrow morning.

George


Booting Node 126, Processors  #2032 #2033 #2034 #2035 #2036 #2037 #2038 #2039 Ok.
Booting Node 127, Processors  #2040 #2041 #2042 #2043 #2044 #2045 #2046 #2047 Ok.
Brought up 2048 CPUs
Total of 2048 processors activated (9289042.89 BogoMIPS).
Testing NMI watchdog ... OK.
BUG: NMI Watchdog detected LOCKUP on CPU0, ip ffffffff8102a0ca, registers:
CPU 0 
Modules linked in:

Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.32-33.el6.SRostedt.x86_64 #1 Altix UV
RIP: 0010:[<ffffffff8102a0ca>]  [<ffffffff8102a0ca>] set_mtrr+0xca/0x150
RSP: 0018:ffff8c2ff8e13e80  EFLAGS: 00000006
RAX: 00000000000007a5 RBX: 00000000ffffffff RCX: 3333333333333333
RDX: 0000000000000020 RSI: 0000000000000800 RDI: ffffffffffffffff
RBP: ffff8c2ff8e13ed0 R08: ffffffff818bb9b8 R09: 00000000000007c0
R10: 0000000000000020 R11: 0000000000000000 R12: 0000000000000246
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff8c2ff8e12000, task ffff8c27f9e12040)
Stack:
 00000001000007a5 0000000000000000 0000000000000000 00000000ffffffff
<0> ffff8c2ff8e13ee0 0000000114d5c5e5 ffffffff818bb6c0 0000000000016700
<0> 0000000000000800 0000000000090000 ffff8c2ff8e13ee0 ffffffff8102a1e0
Call Trace:
 [<ffffffff8102a1e0>] mtrr_aps_init+0x30/0x40
 [<ffffffff818ea04b>] native_smp_cpus_done+0x111/0x150
 [<ffffffff818dc737>] kernel_init+0x165/0x29c
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff818dc5d2>] ? kernel_init+0x0/0x29c
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: 45 b0 83 fb ff c7 45 b4 01 00 00 00 74 7b 48 8b 05 dc 22 a1 00 41 0f b6 cf 4c 89 f2 4c 89 ee 89 df ff 50 08 e
b 05 0f 1f 00 f3 90 <8b> 45 b0 85 c0 75 f7 8b 35 a9 78 88 00 48 8b 3d f2 1c a1 00 e8 
---[ end trace 8a80a59122ce32a3 ]---
Kernel panic - not syncing: Non maskable interrupt
Pid: 1, comm: swapper Tainted: G      D    2.6.32-33.el6.SRostedt.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff814d72cd>] panic+0x78/0x137
 [<ffffffff8106a3d3>] ? print_oops_end_marker+0x23/0x30
 [<ffffffff814db5dc>] die_nmi+0xfc/0x100
 [<ffffffff814dbb8a>] nmi_watchdog_tick+0x1aa/0x200
 [<ffffffff814db103>] do_nmi+0x1a3/0x2d0
 [<ffffffff814da9e0>] nmi+0x20/0x30
 [<ffffffff8102a0ca>] ? set_mtrr+0xca/0x150
 <<EOE>>  [<ffffffff8102a1e0>] mtrr_aps_init+0x30/0x40
 [<ffffffff818ea04b>] native_smp_cpus_done+0x111/0x150
 [<ffffffff818dc737>] kernel_init+0x165/0x29c
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff818dc5d2>] ? kernel_init+0x0/0x29c
 [<ffffffff810141c0>] ? child_rip+0x0/0x20

Comment 24 Steven Rostedt 2010-06-22 01:06:10 UTC
Created attachment 425787 [details]
Simply touch_nmi_watchdog()

George, 

Remove the other patch and try this one. This is the simplest approach, all it does is to do a touch_nmi_watchdog() after each conversion of a mcount nop.

Comment 25 George Beshers 2010-06-22 18:46:54 UTC
Created attachment 426047 [details]
console log


This is with the touch_nmi_watchdog() patch installed.

George

Comment 26 Steven Rostedt 2010-06-22 19:07:29 UTC
George,

Are you sure something else isn't broken? That crash does not seem to be related to the patch I have at all.

Did you remove the first patch?

Comment 27 George Beshers 2010-06-22 21:20:26 UTC
Steve,

I started over fresh, but I didn't include Don's patch.
I have the machine reserved again in a few hours and
will try both.

George

Comment 28 George Beshers 2010-06-23 17:23:55 UTC
Good news > Bad news.

First, I switched from -33 based kernel to -37 based kernel.

It booted -- Steve's touch_nmi_watchdog() patch seems to have
solved the timeout problem.

The bad news is that the mdelay in  arch/x86/kernel/apic/nmi.c 
is too short.  I tried changing it to 80 and you can see the
results in the attachment.

I will try 200 in the morning, but I am concerned about
the future.  We would like to achieve parity between distros
and that means running on 4 racks and 4096 cpus / 16Tb of
memory.

A 3 rack 3000cpu system is being assembled but I don't think
I can get on it until next week or (more likely) 2 weeks.

George

Comment 29 George Beshers 2010-06-23 17:25:36 UTC
Created attachment 426320 [details]
messages file from 2048p/5tb system


NOTE: this booted without the nmi_watchdog=0.

George

Comment 30 Don Zickus 2010-06-23 17:58:27 UTC
(In reply to comment #28)
> Good news > Bad news.
> 
> First, I switched from -33 based kernel to -37 based kernel.
> 
> It booted -- Steve's touch_nmi_watchdog() patch seems to have
> solved the timeout problem.
> 
> The bad news is that the mdelay in  arch/x86/kernel/apic/nmi.c 
> is too short.  I tried changing it to 80 and you can see the
> results in the attachment.
> 
> I will try 200 in the morning, but I am concerned about
> the future.  We would like to achieve parity between distros
> and that means running on 4 racks and 4096 cpus / 16Tb of
> memory.
> 
Well I am pretty sure that will fail too.

My guess is that uv_handle_nmi is grabbing all the nmi_watchdog NMIs and as a result the nmi_watchdog can not re-arm itself.  This leads to what appears as an NMI stuck message (because no NMIs are being generate on the system).

Of course, without the full patch you applied to the system I can not tell for sure as the uv_handle_nmi function does not seem to be in the -37.el6 kernel.

Or can you just test plain -37.el6 with just Steven's patch?

Comment 31 Don Zickus 2010-06-23 18:00:06 UTC
Oh, you will probably need to add my patch too.  I thought that made into -37.el6 but it doesn't appear to.

Cheers,
Don

Comment 33 Don Zickus 2010-07-14 17:47:08 UTC
Ok so I talked to George about this problem.  I didn't realize a patch was posted that put uv_handle_nmi in the RHEL-6 kernel.

After looking at that code, I realized that won't work with the nmi_watchdog.  What happens is that uv_handle_nmi grabs all the nmis before the nmi_watchdog (and possibly others).  As a result the nmi_watchdog never gets to re-arm and uv_handle_nmi prints all the stack traces from nmis that were supposed to be for the nmi_watchdog.

So the uv_handle_nmi code needs to be modified to become a shared nmi handler, otherwise (even with nmi_watchdog=0 set) it can starve other nmi handlers in the system depending on when it gets registered in the die_chain (because it just assumes the nmi is for uv_handle_nmi).

George mentioned that uv_handle_nmi was just a debug hook to print out backtraces on the system for all cpus.  The kernel has something like that already or if that doesn't work, SGI could implement a similar mechanism that at least sets a flag signaling to the uv_handle_nmi that the nmi is most likely intended for it.

In arch/x86/kernel/apic/nmi.c::arch_trigger_all_cpu_backtrace() look at how the variable 'backtrace_mask' is used.  Or just call arch_trigger_all_cpu_backtrace() if it meets SGIs needs.

Hope that helps.

Cheers,
Don

Comment 35 George Beshers 2010-07-27 15:38:11 UTC
Created attachment 434727 [details]
console log


Note that this has the rt_sched_patch (see BZ607587) and
a patch for kdump.

The uv_nmi stackdump has been removed.

Main thing to note is that ipi_handler is always the *first*
thing to cat the NMI Watchdog detected LOCKUP.

Comment 36 Don Zickus 2010-07-27 16:04:22 UTC
So one cpu has trouble flushing its cache, locks up the cpu, then another one wants to panic because of another NMI lockup but can't because the first cpu has interrupts disabled (and spinning on a spinlock!) which prevents the panic ipi from finishing.  

This results in the 2000+ cpu pile up.  Lovely.

The question is why couldn't cpu 769 flush its cache?

Cheers,
Don

Comment 37 Don Zickus 2010-07-29 17:40:38 UTC
pushing to 6.1.  If SGI feels this is a blocker for 6.0, feel free to talk with your PM and have it changed.

Cheers,
Don

Comment 38 Don Zickus 2010-07-29 20:51:41 UTC
I cloned this bz to track for 6.1.

For 6.0, it is to late to fix this properly.  It seems the workaround for this problem will be 'nmi_watchdog=0'.

I will close this bz and add a tech note for the workaround, please use bz619578 to continue the discussion.

Tech note:

Machines with larger than 512 cpus will have known issues with the NMI watchdog.  These issues will manifest itself as "BUG: NMI Watchdog detected LOCKUP" and have either 'ftrace_modify_code' or 'ipi_handler' in its backtrace.

Due to the way ftrace works when modifying the code during startup, the NMI watchdog causes too much noise and ftrace can not find a quiet period in which to instrument the code.  Using nmi_watchdog=0, resolves this issue for now.

Due to slowness when invalidating the cache, the cpu may take longer than normal to finish.  This causes the nmi watchdog to prematurely fire.  Using nmi_watchdog=0, resolves this issue for now.

Comment 39 Peter Martuccelli 2010-07-30 14:28:54 UTC
Tech note is for RHEL 6.0.

Comment 40 Peter Martuccelli 2010-07-30 14:28:54 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Machines with larger than 512 cpus will have known issues with the NMI
watchdog.  These issues will manifest itself as "BUG: NMI Watchdog detected
LOCKUP" and have either 'ftrace_modify_code' or 'ipi_handler' in its backtrace.

Due to the way ftrace works when modifying the code during startup, the NMI
watchdog causes too much noise and ftrace can not find a quiet period in which
to instrument the code.  Using nmi_watchdog=0, resolves this issue for now.

Due to slowness when invalidating the cache, the cpu may take longer than
normal to finish.  This causes the nmi watchdog to prematurely fire.  Using
nmi_watchdog=0, resolves this issue for now.

Comment 41 George Beshers 2010-07-30 14:33:37 UTC
Per our conversation yesterday.

I ran into a hardware problem so only limited testing;
I still ended up with messages with the slower clock-tick
with the count a 6 (60 seconds).

George

Comment 42 Don Zickus 2010-07-30 15:14:39 UTC
What did the stack trace look like, did it have the ipi_handler messages in it or the ftrace ones?

Regardless, I was hoping that would have worked because that would have been a nice generic time to use across everything.

I guess increasing 6 to 12, 18 or 24 would be the next logical step.  Of course providing a commandline knob will probably make sense too if this gets too large.

Cheers,
Don

Comment 43 George Beshers 2010-07-30 15:38:34 UTC
I lost the initial message -- scrollback wasn't set large enough.

However, previous testing it was almost always the ipi_handler.

George

Comment 45 Ryan Lerch 2010-09-29 03:31:25 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1,3 @@
-Machines with larger than 512 cpus will have known issues with the NMI
-watchdog.  These issues will manifest itself as "BUG: NMI Watchdog detected
-LOCKUP" and have either 'ftrace_modify_code' or 'ipi_handler' in its backtrace.
+Due to the way ftrace works when modifying the code during startup, the NMI watchdog causes too much noise and ftrace can not find a quiet period to instrument the code. Consequently, machines with more than 512 cpus will encounter issues with the NMI watchdog. Such issues will return error messages similar to "BUG: NMI Watchdog detected LOCKUP" and have either 'ftrace_modify_code' or 'ipi_handler' in the backtrace. To work around this issue, disable nmi_watchdog using the command:
 
-Due to the way ftrace works when modifying the code during startup, the NMI
+nmi_watchdog=0-watchdog causes too much noise and ftrace can not find a quiet period in which
-to instrument the code.  Using nmi_watchdog=0, resolves this issue for now.
-
-Due to slowness when invalidating the cache, the cpu may take longer than
-normal to finish.  This causes the nmi watchdog to prematurely fire.  Using
-nmi_watchdog=0, resolves this issue for now.


Note You need to log in before you can comment on or make changes to this bug.