Bug 610814
| Field | Value |
|---|---|
| Summary | [RHEL6 beta 1] ixgbe doesn't support 128 cpus |
| Product | Red Hat Enterprise Linux 6 |
| Component | kernel |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Version | 6.0 |
| Target Milestone | rc |
| Target Release | 6.0 |
| Hardware | All |
| OS | Linux |
| Reporter | Andy Gospodarek <agospoda> |
| Assignee | Andy Gospodarek <agospoda> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | gcase, jane.lv, jburke, john.ronciak, jvillalo, kzhang, luyu, peterm, rpacheco |
| Doc Type | Bug Fix |
| Last Closed | 2010-07-13 19:02:09 UTC |
| Bug Blocks | 580574 |

Description
Andy Gospodarek
2010-07-02 13:54:26 UTC
Gary, I'd also like to ask if they can take the ixgbe card out (or disable it in BIOS if it is a LOM) and see if the install still fails.

Fixing the title to reflect the actual complaint from IT.

Looks like this is a problem with the latest kernel when booting on a system with 128 CPUs. Is this something your team has seen, John?

My initial thoughts are that this patch:

    commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
    Author: Krishna Kumar <krkumar2.com>
    Date: Wed Feb 3 13:13:10 2010 +0000

        ixgbe: Fix return of invalid txq

was on the right track, but didn't handle all the needed cases.

Event posted on 07-02-2010 03:29pm EDT by Yinghai.Lu

Gary,

Can you ask your engineer to backport the following patch from mainline?

Thanks

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=753649dbc49345a73a2454c770a3f2d54d11aec6

    From 753649dbc49345a73a2454c770a3f2d54d11aec6 Mon Sep 17 00:00:00 2001
    From: Thomas Gleixner <tglx>
    Date: Wed, 31 Mar 2010 13:30:19 +0200
    Subject: [PATCH] genirq: Force MSI irq handlers to run with interrupts disabled

    Network folks reported that directing all MSI-X vectors of their multi
    queue NICs to a single core can cause interrupt stack overflows when
    enough interrupts fire at the same time.

    This is caused by the fact that we run interrupt handlers by default
    with interrupts enabled unless the driver requests the interrupt with
    the IRQF_DISABLED set. The NIC handlers do not set this flag, so
    simultaneous interrupts can nest unlimited and cause the stack
    overflow.

    The only safe counter measure is to run the interrupt handlers with
    interrupts disabled. We can't switch to this mode in general right
    now, but it is safe to do so for MSI interrupts.

    Force IRQF_DISABLED for MSI interrupt handlers.

    Signed-off-by: Thomas Gleixner <tglx>
    Cc: Andi Kleen <andi>
    Cc: Linus Torvalds <torvalds>
    Cc: Andrew Morton <akpm>
    Cc: Ingo Molnar <mingo>
    Cc: Peter Zijlstra <peterz>
    Cc: Alan Cox <alan.org.uk>
    Cc: David Miller <davem>
    Cc: Greg Kroah-Hartman <gregkh>
    Cc: Arnaldo Carvalho de Melo <acme>
    Cc: stable
    ---
     kernel/irq/manage.c | 10 ++++++++++
     1 files changed, 10 insertions(+), 0 deletions(-)

    diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
    index 398fda1..704e488 100644
    --- a/kernel/irq/manage.c
    +++ b/kernel/irq/manage.c
    @@ -757,6 +757,16 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
     		if (new->flags & IRQF_ONESHOT)
     			desc->status |= IRQ_ONESHOT;
     
    +		/*
    +		 * Force MSI interrupts to run with interrupts
    +		 * disabled. The multi vector cards can cause stack
    +		 * overflows due to nested interrupts when enough of
    +		 * them are directed to a core and fire at the same
    +		 * time.
    +		 */
    +		if (desc->msi_desc)
    +			new->flags |= IRQF_DISABLED;
    +
     		if (!(desc->status & IRQ_NOAUTOEN)) {
     			desc->depth = 0;
     			desc->status &= ~IRQ_DISABLED;
    -- 
    1.7.1.1

This event sent from IssueTracker by gcase
 issue 921383

Created attachment 429155 [details]
irq-fix.patch
Here is a backported version of the patch. Feel free to let us know if this resolves the issue as we still do not have the hardware available to reproduce.
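For anyone reading along without the kernel tree handy, the effect of the genirq patch quoted in this bug can be modeled in user space. This is only a sketch under stated assumptions: the `model_*` names and the `MODEL_IRQF_DISABLED` value are made up for illustration and are not the kernel's definitions. It shows the one behavior the patch adds: a handler registered on an MSI-backed interrupt gets the "run with interrupts disabled" flag whether or not the driver asked for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value for this model; not the kernel's IRQF_DISABLED. */
#define MODEL_IRQF_DISABLED 0x1u

/* Stand-in for struct irq_desc: only tracks whether an MSI descriptor exists. */
struct model_irq_desc {
    bool has_msi_desc;   /* models desc->msi_desc != NULL */
};

/* Stand-in for struct irqaction: only carries the handler flags. */
struct model_irqaction {
    unsigned int flags;
};

/*
 * Models the hunk added to __setup_irq(): if the interrupt is MSI-backed,
 * force the "interrupts disabled" flag on the new handler so that
 * handlers on the same CPU cannot nest.
 */
static void model_setup_irq(const struct model_irq_desc *desc,
                            struct model_irqaction *new_action)
{
    if (desc->has_msi_desc)
        new_action->flags |= MODEL_IRQF_DISABLED;
}

/* Convenience query used by callers of the model. */
static bool model_runs_with_irqs_disabled(const struct model_irqaction *act)
{
    return (act->flags & MODEL_IRQF_DISABLED) != 0;
}
```

Calling `model_setup_irq()` with `has_msi_desc` set to true leaves the flag set on the action; calling it for a non-MSI descriptor leaves the driver's flags untouched, which matches the patch's statement that the mode is only forced for MSI interrupts.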
I imagine that the system would function just fine when booted with the extra kernel command line option 'pci=nomsi' if the patch you have posted is the correct one. It is not a guarantee, since other things might break when booting with 'pci=nomsi', but it would be a quick data point for you to gather for us.

(In reply to comment #4)
> My initial thoughts are that this patch:
>
> commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
> Author: Krishna Kumar <krkumar2.com>
> Date: Wed Feb 3 13:13:10 2010 +0000
>
> ixgbe: Fix return of invalid txq
>
> was on the right track, but didn't handle all the needed cases.

I took a look at the checks after this and, though they could use some additional checking to safeguard against invalid indexes, I do not think they are related to the problem we are addressing in this bug.

Based on my initial testing it appears the patch in comment #5 will resolve this issue. With the current kernel, if I load the ixgbe driver and bring an interface up (ifconfig ethX up), this is easily reproducible. After this patch is applied, I was not able to reproduce it. I would like to test it a few more times to be sure, but it looks good.

Testing looks good. I will post:

    commit 753649dbc49345a73a2454c770a3f2d54d11aec6
    Author: Thomas Gleixner <tglx>
    Date: Wed Mar 31 13:30:19 2010 +0200

        genirq: Force MSI irq handlers to run with interrupts disabled

for consideration for RHEL6.0.

This patch was already added as part of a pull of stable fixes from 2.6.32.12 and should be available in 2.6.32-47.

*** This bug has been marked as a duplicate of bug 604608 ***
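As a side note on the "invalid indexes" concern raised in this thread: one generic way to safeguard a queue index is to fold it back into the valid range before use. The sketch below is an illustration only. `model_wrap_txq` is a hypothetical helper, not code from the ixgbe driver, and it assumes the index can exceed the number of queues (for example, if it were derived from a CPU number on a 128-CPU system with fewer queues).

```c
#include <assert.h>

/*
 * Hypothetical helper: fold an arbitrary transmit queue index into the
 * range [0, num_queues). Returns 0 when there are no queues so the
 * caller never sees an out-of-range value.
 */
static unsigned int model_wrap_txq(unsigned int txq, unsigned int num_queues)
{
    if (num_queues == 0)
        return 0;

    /* Repeated subtraction avoids a division on the hot path. */
    while (txq >= num_queues)
        txq -= num_queues;

    return txq;
}
```

With 64 queues, an index of 127 folds to 63 and an index of 5 is returned unchanged. As noted above, this kind of check is unrelated to the stack overflow addressed by the genirq patch.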