Bug 610814 - [RHEL6 beta 1] ixgbe doesn't support 128 cpus
Status: CLOSED DUPLICATE of bug 604608
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.0
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 580574
Reported: 2010-07-02 09:54 EDT by Andy Gospodarek
Modified: 2014-06-29 19:02 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-13 15:02:09 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
irq-fix.patch (699 bytes, patch)
2010-07-02 15:57 EDT, Andy Gospodarek

Description Andy Gospodarek 2010-07-02 09:54:26 EDT
512g 8sockets G5:
    4%
    Packages completed: 274 of 2900
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810a2df2

1024g 8sockets G5:
    2%
    Packages completed: 122 of 2900
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810a2df2

192g 4 sockets G5:
    Installation Starting
    Starting installation process
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff8146cc70
Comment 1 Andy Gospodarek 2010-07-02 09:57:34 EDT
Gary, I'd also like to ask if they can take the ixgbe card out (or disable it in BIOS if it is a LOM) and see if the install still fails.
Comment 2 Andy Gospodarek 2010-07-02 14:04:31 EDT
Fixing the title to reflect the actual complaint from IT
Comment 3 Andy Gospodarek 2010-07-02 14:39:41 EDT
Looks like this is a problem with the latest kernel when booting on a system
with 128 CPUs.  Is this something your team has seen, John?
Comment 4 Andy Gospodarek 2010-07-02 14:57:12 EDT
My initial thoughts are that this patch:

commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date:   Wed Feb 3 13:13:10 2010 +0000

    ixgbe: Fix return of invalid txq

was on the right track, but didn't handle all the needed cases.
Comment 5 Issue Tracker 2010-07-02 15:45:16 EDT
Event posted on 07-02-2010 03:29pm EDT by Yinghai.Lu

Gary,

Can you ask your engineer to backport following patch from mainline?

Thanks


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=753649dbc49345a73a2454c770a3f2d54d11aec6

From 753649dbc49345a73a2454c770a3f2d54d11aec6 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 31 Mar 2010 13:30:19 +0200
Subject: [PATCH] genirq: Force MSI irq handlers to run with interrupts disabled

Network folks reported that directing all MSI-X vectors of their multi
queue NICs to a single core can cause interrupt stack overflows when
enough interrupts fire at the same time.

This is caused by the fact that we run interrupt handlers by default
with interrupts enabled unless the driver requests the interrupt with
the IRQF_DISABLED set. The NIC handlers do not set this flag, so
simultaneous interrupts can nest unlimited and cause the stack
overflow.

The only safe counter measure is to run the interrupt handlers with
interrupts disabled. We can't switch to this mode in general right
now, but it is safe to do so for MSI interrupts.

Force IRQF_DISABLED for MSI interrupt handlers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@osdl.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: stable@kernel.org
---
 kernel/irq/manage.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 398fda1..704e488 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -757,6 +757,16 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		if (new->flags & IRQF_ONESHOT)
 			desc->status |= IRQ_ONESHOT;
 
+		/*
+		 * Force MSI interrupts to run with interrupts
+		 * disabled. The multi vector cards can cause stack
+		 * overflows due to nested interrupts when enough of
+		 * them are directed to a core and fire at the same
+		 * time.
+		 */
+		if (desc->msi_desc)
+			new->flags |= IRQF_DISABLED;
+
 		if (!(desc->status & IRQ_NOAUTOEN)) {
 			desc->depth = 0;
 			desc->status &= ~IRQ_DISABLED;
-- 
1.7.1.1

This event sent from IssueTracker by gcase 
 issue 921383
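
The effect of the genirq patch quoted in comment 5 can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions, not kernel code: the `sim_*` names, the `is_msi` field (standing in for `desc->msi_desc` being non-NULL), and the flag value are all hypothetical, and the nesting estimate assumes every pending vector can preempt the handler before it whenever handlers run with interrupts enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; the flag value is illustrative only. */
#define SIM_IRQF_DISABLED 0x20u

struct sim_irqaction {
    unsigned int flags;   /* flags the driver passed when requesting the IRQ */
};

struct sim_irq_desc {
    bool is_msi;          /* stands in for desc->msi_desc != NULL */
};

/* Mirrors the check the patch adds: MSI handlers always get the flag. */
static void sim_setup_irq(const struct sim_irq_desc *desc,
                          struct sim_irqaction *act)
{
    assert(desc != NULL && act != NULL);
    if (desc->is_msi)
        act->flags |= SIM_IRQF_DISABLED;
}

/*
 * Worst-case handler nesting when `pending` vectors fire at once on one
 * core. With interrupts enabled inside the handler, each pending vector
 * may interrupt the previous one; with them disabled, handlers run back
 * to back and the depth stays at one.
 */
static unsigned int sim_max_nesting(const struct sim_irqaction *act,
                                    unsigned int pending)
{
    assert(act != NULL);
    if (pending == 0)
        return 0;
    return (act->flags & SIM_IRQF_DISABLED) ? 1u : pending;
}
```

In this model a driver that never sets the flag sees the depth grow with the number of vectors aimed at one core, which is the stack overflow scenario the commit message describes. Once `sim_setup_irq()` forces the flag for an MSI descriptor, the depth no longer depends on the vector count.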
Comment 6 Andy Gospodarek 2010-07-02 15:57:11 EDT
Created attachment 429155
irq-fix.patch

Here is a backported version of the patch.  Feel free to let us know if this resolves the issue as we still do not have the hardware available to reproduce.
Comment 7 Andy Gospodarek 2010-07-02 16:19:59 EDT
If the patch you posted is the correct fix, I would expect the system to work when booted with the extra kernel command-line option 'pci=nomsi'. That is not guaranteed, since other things might break when booting with 'pci=nomsi', but it would be a quick data point for you to gather for us.
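
When collecting the 'pci=nomsi' data point, it can help to confirm that the option actually reached the kernel. The helper below is a minimal sketch: `cmdline_has_option` is a hypothetical name, and it only matches the option as a whole whitespace-separated token, so a combined form such as a comma-separated `pci=` list would not be detected.

```c
#include <stdbool.h>
#include <string.h>

/* True if c is one of the separators expected in a kernel command line. */
static bool is_cmdline_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/*
 * Return true if `opt` appears in `cmdline` as a complete token, i.e.
 * bounded by whitespace or by the start/end of the string.
 */
static bool cmdline_has_option(const char *cmdline, const char *opt)
{
    size_t len = strlen(opt);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, opt)) != NULL) {
        bool starts = (p == cmdline) || is_cmdline_space(p[-1]);
        bool ends = (p[len] == '\0') || is_cmdline_space(p[len]);

        if (starts && ends)
            return true;
        p++;   /* keep scanning after a partial match */
    }
    return false;
}
```

A caller would typically read the contents of /proc/cmdline into a buffer on the booted system and pass that buffer together with "pci=nomsi".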
Comment 8 Andy Gospodarek 2010-07-02 17:19:14 EDT
(In reply to comment #4)
> My initial thoughts are that this patch:
> 
> commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
> Author: Krishna Kumar <krkumar2@in.ibm.com>
> Date:   Wed Feb 3 13:13:10 2010 +0000
> 
>     ixgbe: Fix return of invalid txq
> 
> was on the right track, but didn't handle all the needed cases.    

I took a look at the checks after this and, though they could use some additional checking to safeguard against invalid indexes, I do not think they are related to the problem we are addressing in this bug.
Comment 12 Andy Gospodarek 2010-07-12 12:36:22 EDT
Based on my initial testing it appears the patch in comment #5 will resolve this issue.  With the current kernel, if I load the ixgbe driver and bring an interface up (ifconfig ethX up), this is easily reproducible.  After this patch is applied, I was not able to reproduce it.  I would like to test it a few more times to be sure, but it looks good.
Comment 13 Andy Gospodarek 2010-07-12 14:34:14 EDT
Testing looks good.  I will post:

commit 753649dbc49345a73a2454c770a3f2d54d11aec6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Mar 31 13:30:19 2010 +0200

    genirq: Force MSI irq handlers to run with interrupts disabled

for consideration for RHEL6.0.
Comment 15 Andy Gospodarek 2010-07-13 15:02:09 EDT
This patch was already added as part of a pull of stable fixes from 2.6.32.12
and should be available in 2.6.32-47.

*** This bug has been marked as a duplicate of bug 604608 ***
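
For anyone checking whether an installed kernel is at or past the build named in comment 15, a release-string comparison along these lines may be enough. It is a rough sketch: `rhel6_build_at_least_47` is a hypothetical helper, it assumes the release string has the `2.6.32-<build>` shape (as printed by `uname -r`), and it returns false for anything outside the 2.6.32 series, which should be read as "unknown" rather than "unfixed".

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Return true if `release` looks like a 2.6.32-<build> kernel whose build
 * number is at least 47, the build comment 15 expects to carry the fix.
 */
static bool rhel6_build_at_least_47(const char *release)
{
    int major, minor, patch, build;

    /* Trailing text such as ".el6.x86_64" is ignored by sscanf. */
    if (sscanf(release, "%d.%d.%d-%d", &major, &minor, &patch, &build) != 4)
        return false;

    if (major != 2 || minor != 6 || patch != 32)
        return false;   /* only the 2.6.32 series is modelled here */

    return build >= 47;
}
```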
