Bug 610814 - [RHEL6 beta 1] ixgbe doesn't support 128 cpus
Summary: [RHEL6 beta 1] ixgbe doesn't support 128 cpus
Keywords:
Status: CLOSED DUPLICATE of bug 604608
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.0
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 580574
Reported: 2010-07-02 13:54 UTC by Andy Gospodarek
Modified: 2014-06-29 23:02 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-13 19:02:09 UTC
Target Upstream Version:
Embargoed:


Attachments
irq-fix.patch (699 bytes, patch)
2010-07-02 19:57 UTC, Andy Gospodarek

Description Andy Gospodarek 2010-07-02 13:54:26 UTC
512g 8sockets G5:
    4%
    Packages completed: 274 of 2900
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810a2df2 the cairo graphics library

1024g 8sockets G5:
    2%
    Packages completed: 122 of 2900
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810a2df2ce for GTK2 (a GUI library for X)

192g 4 sockets G5:
    Installation Starting
    Starting installation process
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff8146cc70

Comment 1 Andy Gospodarek 2010-07-02 13:57:34 UTC
Gary, I'd also like to ask if they can take the ixgbe card out (or disable it in BIOS if it is a LOM) and see if the install still fails.

Comment 2 Andy Gospodarek 2010-07-02 18:04:31 UTC
Fixing the title to reflect the actual complaint from IT

Comment 3 Andy Gospodarek 2010-07-02 18:39:41 UTC
Looks like this is a problem with the latest kernel when booting on a system
with 128 CPUs.  Is this something your team has seen, John?

Comment 4 Andy Gospodarek 2010-07-02 18:57:12 UTC
My initial thoughts are that this patch:

commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
Author: Krishna Kumar <krkumar2.com>
Date:   Wed Feb 3 13:13:10 2010 +0000

    ixgbe: Fix return of invalid txq

was on the right track, but didn't handle all the needed cases.

Comment 5 Issue Tracker 2010-07-02 19:45:16 UTC
Event posted on 07-02-2010 03:29pm EDT by Yinghai.Lu

Gary,

Can you ask your engineer to backport following patch from mainline?

Thanks


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=753649dbc49345a73a2454c770a3f2d54d11aec6

From 753649dbc49345a73a2454c770a3f2d54d11aec6 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx>
Date: Wed, 31 Mar 2010 13:30:19 +0200
Subject: [PATCH] genirq: Force MSI irq handlers to run with interrupts disabled

Network folks reported that directing all MSI-X vectors of their multi
queue NICs to a single core can cause interrupt stack overflows when
enough interrupts fire at the same time.

This is caused by the fact that we run interrupt handlers by default
with interrupts enabled unless the driver reuqests the interrupt with
the IRQF_DISABLED set. The NIC handlers do not set this flag, so
simultaneous interrupts can nest unlimited and cause the stack
overflow.

The only safe counter measure is to run the interrupt handlers with
interrupts disabled. We can't switch to this mode in general right
now, but it is safe to do so for MSI interrupts.

Force IRQF_DISABLED for MSI interrupt handlers.

Signed-off-by: Thomas Gleixner <tglx>
Cc: Andi Kleen <andi>
Cc: Linus Torvalds <torvalds>
Cc: Andrew Morton <akpm>
Cc: Ingo Molnar <mingo>
Cc: Peter Zijlstra <peterz>
Cc: Alan Cox <alan.org.uk>
Cc: David Miller <davem>
Cc: Greg Kroah-Hartman <gregkh>
Cc: Arnaldo Carvalho de Melo <acme>
Cc: stable
---
 kernel/irq/manage.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 398fda1..704e488 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -757,6 +757,16 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		if (new->flags & IRQF_ONESHOT)
 			desc->status |= IRQ_ONESHOT;
 
+		/*
+		 * Force MSI interrupts to run with interrupts
+		 * disabled. The multi vector cards can cause stack
+		 * overflows due to nested interrupts when enough of
+		 * them are directed to a core and fire at the same
+		 * time.
+		 */
+		if (desc->msi_desc)
+			new->flags |= IRQF_DISABLED;
+
 		if (!(desc->status & IRQ_NOAUTOEN)) {
 			desc->depth = 0;
 			desc->status &= ~IRQ_DISABLED;
-- 
1.7.1.1






This event sent from IssueTracker by gcase 
 issue 921383
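
The effect of the hunk above can be illustrated outside the kernel with a small userspace sketch. Everything here is a hypothetical stand-in: the `mock_` types, the flag value, and the nesting model are assumptions made for illustration and are not kernel code. The model only captures the argument from the commit message: without the flag, handlers for simultaneously firing vectors can nest, while with the flag forced on they run one at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel flag; the value is illustrative only. */
#define MOCK_IRQF_DISABLED 0x20u

/* Hypothetical, heavily reduced versions of the structures the patch touches. */
struct mock_irq_desc {
    const void *msi_desc;   /* non-NULL when the IRQ is an MSI/MSI-X vector */
};

struct mock_irqaction {
    unsigned int flags;     /* flags the driver passed when requesting the IRQ */
};

/* Mirrors the added hunk: force "run with interrupts disabled" for MSI. */
static void mock_setup_irq(const struct mock_irq_desc *desc,
                           struct mock_irqaction *new_action)
{
    assert(desc != NULL && new_action != NULL);
    if (desc->msi_desc)
        new_action->flags |= MOCK_IRQF_DISABLED;
}

/*
 * Toy worst-case model: `pending` vectors fire at the same time on one CPU.
 * If handlers run with interrupts enabled, each one can be interrupted by
 * the next, so the nesting depth can reach `pending`. If they run with
 * interrupts disabled, at most one handler is on the stack at a time.
 */
static unsigned int mock_max_nesting(unsigned int pending, unsigned int flags)
{
    if (pending == 0)
        return 0;
    return (flags & MOCK_IRQF_DISABLED) ? 1u : pending;
}
```

In this model, a multi-queue NIC with all of its vectors directed to one core is the `pending` argument: the depth grows with the vector count unless the flag is set, which is the stack overflow scenario the commit message describes.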

Comment 6 Andy Gospodarek 2010-07-02 19:57:11 UTC
Created attachment 429155
irq-fix.patch

Here is a backported version of the patch.  Please let us know whether it resolves the issue, as we still do not have hardware available to reproduce it.

Comment 7 Andy Gospodarek 2010-07-02 20:19:59 UTC
If the patch you have posted is the correct one, I imagine the system would function just fine when booted with the extra kernel command line option 'pci=nomsi'.  That is not guaranteed, since other things might break when booting with 'pci=nomsi', but it would be a quick data point for you to gather for us.
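
When gathering that data point, it can help to confirm that the option really reached the kernel. The following is a minimal sketch of a helper that checks a command line string for an exact option token. The function name `cmdline_has_option` is hypothetical; on a running Linux system the string would typically be read from /proc/cmdline, which this sketch leaves to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if `opt` appears in `cmdline` as a whole token, i.e. it is
 * delimited by the start or end of the string or by whitespace. A plain
 * substring match is not enough: "pci=nomsix" must not match "pci=nomsi".
 */
static bool cmdline_has_option(const char *cmdline, const char *opt)
{
    assert(cmdline != NULL && opt != NULL);

    size_t len = strlen(opt);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, opt)) != NULL) {
        bool starts_token = (p == cmdline) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        bool ends_token = end == '\0' || end == ' ' || end == '\t' || end == '\n';

        if (starts_token && ends_token)
            return true;
        p += len;   /* keep scanning after this partial match */
    }
    return false;
}
```

A caller would pass the contents of the command line and the string "pci=nomsi"; a false result after a reboot suggests the boot loader entry was not the one edited.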

Comment 8 Andy Gospodarek 2010-07-02 21:19:14 UTC
(In reply to comment #4)
> My initial thoughts are that this patch:
> 
> commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
> Author: Krishna Kumar <krkumar2.com>
> Date:   Wed Feb 3 13:13:10 2010 +0000
> 
>     ixgbe: Fix return of invalid txq
> 
> was on the right track, but didn't handle all the needed cases.    

I took a look at the checks after this and though they could use some additional checking to safeguard against invalid indexes, I do not think they are related to the problem we are addressing in this bug.

Comment 12 Andy Gospodarek 2010-07-12 16:36:22 UTC
Based on my initial testing, it appears the patch in comment #5 will resolve this issue.  With the current kernel, the problem is easily reproducible if I load the ixgbe driver and bring an interface up (ifconfig ethX up).  After this patch is applied, I was not able to reproduce it.  I would like to test it a few more times to be sure, but it looks good.

Comment 13 Andy Gospodarek 2010-07-12 18:34:14 UTC
Testing looks good.  I will post:

commit 753649dbc49345a73a2454c770a3f2d54d11aec6
Author: Thomas Gleixner <tglx>
Date:   Wed Mar 31 13:30:19 2010 +0200

    genirq: Force MSI irq handlers to run with interrupts disabled

for consideration for RHEL6.0.

Comment 15 Andy Gospodarek 2010-07-13 19:02:09 UTC
This patch was already added as part of a pull of stable fixes from 2.6.32.12
and should be available in 2.6.32-47.

*** This bug has been marked as a duplicate of bug 604608 ***

