Bug 564138 - [Stratus 5.7 bug] MSI/MSI-X can cause improper re-allocation of interrupt vectors
Keywords:
Status: CLOSED DUPLICATE of bug 652799
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 5.7
Assignee: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 557597 694224
Reported: 2010-02-11 22:15 UTC by Dan Duval
Modified: 2012-12-27 07:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-20 13:49:42 UTC
Target Upstream Version:
Embargoed:



Description Dan Duval 2010-02-11 22:15:53 UTC
Description of problem: see below


Version-Release number of selected component: kernel 2.6.18-164.el5


How reproducible: depends on system configuration.  If a system
is configured such that the problem occurs, it occurs every time
on that system.


Steps to Reproduce: see description.  Involves hot-plugging and
hot-unplugging MSI/MSI-X-capable PCI hardware.


Actual results: kernel appears to hang solidly.  Eventually, on
systems so equipped, an external watchdog will NMI the system.


Expected results: hot-pluggable PCI hardware should be added to
and removed from a system without triggering a crash.


Additional info:

	Affected architecture: x86_64 (and possibly i386 -- not tested)


Description:

0. On x86_64 systems that use the PIT clock to generate ticks,
the timer interrupt is always allocated vector (IDT index) 0x31
(== FIRST_DEVICE_VECTOR).  This slot in the IDT is then set to
point to the IRQ0 interrupt stub IRQ0x00_interrupt.

This is done relatively early in the boot process.

1. The central vector allocator for the kernel is

        arch/x86_64/kernel/io_apic.c:assign_irq_vector()

This routine is given a range of vectors from FIRST_DEVICE_VECTOR
to FIRST_SYSTEM_VECTOR-1 that it can hand out.  It does so starting
with 0x39 (FIRST_DEVICE_VECTOR + 8), and marching in strides of 8 up
the range: 0x41, 0x49, ...  When it reaches the end, it wraps around
to the beginning of the range and begins allocating the vectors that
are congruent to 2 (mod 8), and so forth.  In this way, it will
eventually cover all the vectors in its range.
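
The allocation order described in item 1 can be sketched as a small
user-space model.  This is not the kernel source: the model_* names are
hypothetical, and the values of FIRST_SYSTEM_VECTOR (0xef) and
IA32_SYSCALL_VECTOR (0x80, skipped by the allocator) are assumptions
about this kernel.  Only FIRST_DEVICE_VECTOR == 0x31 comes from the
report.

```c
#include <assert.h>

/* FIRST_DEVICE_VECTOR is given in the report; the other two values are
 * assumptions made for this model. */
#define FIRST_DEVICE_VECTOR 0x31
#define FIRST_SYSTEM_VECTOR 0xef
#define IA32_SYSCALL_VECTOR 0x80

static int current_vector = FIRST_DEVICE_VECTOR;
static int offset;

/* Reset the model to its boot-time state. */
void model_reset(void)
{
    current_vector = FIRST_DEVICE_VECTOR;
    offset = 0;
}

/* Hand out the next vector: stride of 8, wrapping to the next residue. */
int model_assign_irq_vector(void)
{
    do {
        current_vector += 8;
    } while (current_vector == IA32_SYSCALL_VECTOR);

    if (current_vector >= FIRST_SYSTEM_VECTOR) {
        offset = (offset + 1) % 8;
        current_vector = FIRST_DEVICE_VECTOR + offset;
    }
    assert(current_vector >= FIRST_DEVICE_VECTOR &&
           current_vector < FIRST_SYSTEM_VECTOR);
    return current_vector;
}

/* Number of calls, from a fresh state, until the timer vector is returned. */
int model_calls_until_timer_vector(void)
{
    int calls = 1;

    model_reset();
    while (model_assign_irq_vector() != FIRST_DEVICE_VECTOR)
        calls++;
    return calls;
}
```

With these assumed constants the model returns 0x39, 0x41, 0x49 on the
first three calls, and nothing in it prevents vector 0x31 from being
handed out once the offset wraps back to 0 (the 189th call in this
model).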

2. Allocation of interrupts for MSI and MSI-X PCI devices is done by

        drivers/pci/msi.c:assign_msi_vector()

Each time it needs a new interrupt vector, this routine calls
assign_irq_vector() to get one.  This occurs until it obtains
vector LAST_DEVICE_VECTOR (which is FIRST_SYSTEM_VECTOR-1, i.e.,
the end of the range), at which time it switches to a different mode
where it no longer allocates new vectors, but instead attempts to
re-allocate vectors that have previously been freed.  The routine
remains in this mode for the rest of the system lifetime.
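
A minimal sketch of the two modes described in item 2, again as a
user-space model rather than the driver code.  The model_* names, the
freed[] bookkeeping and the constants other than FIRST_DEVICE_VECTOR
are assumptions; the -EBUSY return when nothing can be reused matches
the behaviour reported later in this bug.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FIRST_DEVICE_VECTOR 0x31
#define FIRST_SYSTEM_VECTOR 0xef                      /* assumption */
#define LAST_DEVICE_VECTOR  (FIRST_SYSTEM_VECTOR - 1)
#define IA32_SYSCALL_VECTOR 0x80                      /* assumption */

static int current_vector = FIRST_DEVICE_VECTOR, offset;
static bool new_vector_avail = true;     /* false means reuse-only mode */
static bool freed[FIRST_SYSTEM_VECTOR];  /* vectors released by hot-unplug */

/* Stand-in for the central allocator (stride of 8 with wraparound). */
static int model_assign_irq_vector(void)
{
    do {
        current_vector += 8;
    } while (current_vector == IA32_SYSCALL_VECTOR);
    if (current_vector >= FIRST_SYSTEM_VECTOR) {
        offset = (offset + 1) % 8;
        current_vector = FIRST_DEVICE_VECTOR + offset;
    }
    return current_vector;
}

/* MSI allocator: fresh vectors until LAST_DEVICE_VECTOR, then reuse only. */
int model_assign_msi_vector(void)
{
    int v;

    if (!new_vector_avail) {
        for (v = FIRST_DEVICE_VECTOR; v < FIRST_SYSTEM_VECTOR; v++) {
            if (freed[v]) {
                freed[v] = false;
                return v;
            }
        }
        return -EBUSY;              /* nothing has been freed */
    }
    v = model_assign_irq_vector();
    if (v == LAST_DEVICE_VECTOR)
        new_vector_avail = false;   /* one-way switch */
    return v;
}

/* Hot-unplug: mark a vector as available for reuse. */
void model_free_msi_vector(int v)
{
    assert(v >= FIRST_DEVICE_VECTOR && v < FIRST_SYSTEM_VECTOR);
    freed[v] = true;
}

bool model_in_reuse_mode(void)
{
    return !new_vector_avail;
}
```

The point of the sketch is that the switch to reuse-only mode is
triggered solely by this routine itself receiving LAST_DEVICE_VECTOR.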

3. The problem arises because assign_msi_vector() is not the only
consumer of vectors from assign_irq_vector().  In particular,
the ACPI code does its own independent allocation through

        arch/x86_64/kernel/io_apic.c:gsi_irq_sharing()

which also calls assign_irq_vector().

So, if things happen in just the wrong order, gsi_irq_sharing() can
consume vector LAST_DEVICE_VECTOR.  This prevents assign_msi_vector()
from ever seeing LAST_DEVICE_VECTOR, which means that it never
switches modes.  Instead, it continues to call assign_irq_vector()
for every allocation.
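
The ordering dependence in item 3 can be illustrated by replaying
allocations in a toy model where exactly one call comes from a non-MSI
consumer (standing in for gsi_irq_sharing()).  The function name
model_replay(), its parameter, and the constants other than
FIRST_DEVICE_VECTOR are assumptions made for illustration.

```c
#include <assert.h>

#define FIRST_DEVICE_VECTOR 0x31
#define FIRST_SYSTEM_VECTOR 0xef                      /* assumption */
#define LAST_DEVICE_VECTOR  (FIRST_SYSTEM_VECTOR - 1)
#define IA32_SYSCALL_VECTOR 0x80                      /* assumption */

static int current_vector, offset;

/* Stand-in for the central allocator (stride of 8 with wraparound). */
static int model_assign_irq_vector(void)
{
    do {
        current_vector += 8;
    } while (current_vector == IA32_SYSCALL_VECTOR);
    if (current_vector >= FIRST_SYSTEM_VECTOR) {
        offset = (offset + 1) % 8;
        current_vector = FIRST_DEVICE_VECTOR + offset;
    }
    return current_vector;
}

/*
 * Replay allocations from boot.  Call number gsi_call (1-based) is made
 * by a non-MSI consumer; every other call comes from the MSI allocator.
 * Returns LAST_DEVICE_VECTOR if the MSI allocator receives it (and so
 * would switch to reuse-only mode), or FIRST_DEVICE_VECTOR if the timer
 * vector is handed out first.
 */
int model_replay(int gsi_call)
{
    int call, v;

    assert(gsi_call >= 1);
    current_vector = FIRST_DEVICE_VECTOR;
    offset = 0;
    for (call = 1; ; call++) {
        v = model_assign_irq_vector();
        if (v == FIRST_DEVICE_VECTOR)
            return v;               /* timer vector given away */
        if (v == LAST_DEVICE_VECTOR && call != gsi_call)
            return v;               /* MSI allocator switches modes */
    }
}
```

In this model LAST_DEVICE_VECTOR comes up on the 143rd call, so
model_replay(143) ends with the timer vector being handed out, while
model_replay(1) ends with the normal mode switch.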

4. Because the timer vector 0x31 is within the range that
assign_irq_vector() can allocate, eventually it will do so.
This results in changing the corresponding IDT entry to point at a
different entry point (IRQ) among the interrupt stubs.  At this point,
the timer interrupt handler is "orphaned", and the system stops
processing timer ticks.  This tends to have an adverse effect on
system performance.
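
The "orphaning" in item 4 can be pictured with a toy IDT: a table of
stub pointers indexed by vector.  Everything here (the model_* names,
the stub functions, the counters) is hypothetical and only mimics the
effect described above; it is not how the kernel represents gates.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VECTORS   256
#define TIMER_VECTOR 0x31   /* FIRST_DEVICE_VECTOR in the report */

typedef void (*stub_fn)(void);

static stub_fn idt[NR_VECTORS];
static unsigned long timer_ticks, device_irqs;

static void irq0_stub(void)   { timer_ticks++; }  /* timer stub */
static void device_stub(void) { device_irqs++; }  /* some other IRQ stub */

/* Point an IDT slot at a stub, replacing whatever was there. */
static void model_set_gate(int vector, stub_fn stub)
{
    assert(vector >= 0 && vector < NR_VECTORS && stub != NULL);
    idt[vector] = stub;
}

/* Early boot: the timer gets vector 0x31. */
void model_boot(void)
{
    model_set_gate(TIMER_VECTOR, irq0_stub);
}

/* Later: the allocator hands 0x31 to a hot-plugged device. */
void model_give_timer_vector_to_device(void)
{
    model_set_gate(TIMER_VECTOR, device_stub);
}

/* Hardware raises an interrupt on the given vector. */
void model_deliver(int vector)
{
    assert(vector >= 0 && vector < NR_VECTORS);
    if (idt[vector] != NULL)
        idt[vector]();
}

unsigned long model_timer_ticks(void) { return timer_ticks; }
unsigned long model_device_irqs(void) { return device_irqs; }
```

After model_give_timer_vector_to_device(), interrupts arriving on 0x31
are counted as device interrupts and the tick counter stops advancing.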

5. This problem is particular to hot-pluggable PCI devices that
use MSI or MSI-X. In some cases, it can take only one hot-unplug and
-plug to trigger the condition.

Since occurrence depends on the exact configuration of system hardware,
and on the order in which it is scanned, the problem either happens
every time or doesn't happen at all.  It's conceivable that simply
swapping two PCI cards between slots could turn a "good" configuration
into a "bad" one.

6. The most recent upstream kernels, including the ones being
used for RHEL6, have introduced two mechanisms for preventing
this problem.

First, the range of vectors available for devices has been
restricted so that, now, the timer vector falls outside the range.
assign_irq_vector() therefore will never "commandeer" the timer
vector.

Note that this change alone doesn't address the underlying problem.

Second, a 256-bit bitmap "used_vectors" has been introduced.
assign_irq_vector() checks this bitmap when asked to allocate a new
vector, and will not reallocate a vector that is already marked as
"used".

If asked to guess, I would say that it would be reasonably
straightforward to backport the appropriate upstream changes to
this kernel.
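
A rough sketch of the second mechanism in item 6, the used_vectors
bitmap, as a user-space model.  It is not the upstream code: the
helper names, the -ENOSPC return for an exhausted range, and the value
of FIRST_SYSTEM_VECTOR are assumptions.  The model reuses the
stride-of-8 walk from item 1 but refuses to return any vector whose
bit is already set.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NR_VECTORS          256
#define FIRST_DEVICE_VECTOR 0x31
#define FIRST_SYSTEM_VECTOR 0xef   /* assumption */

/* 256-bit bitmap: one bit per vector, set when the vector is in use. */
static uint64_t used_vectors[NR_VECTORS / 64];

static int vector_is_used(int v)
{
    return (int)((used_vectors[v / 64] >> (v % 64)) & 1);
}

static void mark_vector_used(int v)
{
    used_vectors[v / 64] |= (uint64_t)1 << (v % 64);
}

/* Reserve a vector up front (for example the timer or syscall vector). */
void model_reserve_vector(int v)
{
    assert(v >= 0 && v < NR_VECTORS);
    mark_vector_used(v);
}

/* Stride-of-8 walk that skips every vector already marked as used. */
int model_assign_checked(void)
{
    static int current_vector = FIRST_DEVICE_VECTOR, offset;
    int tries;

    /* One full cycle of the walk visits each vector in the range once. */
    for (tries = 0; tries < FIRST_SYSTEM_VECTOR - FIRST_DEVICE_VECTOR; tries++) {
        current_vector += 8;
        if (current_vector >= FIRST_SYSTEM_VECTOR) {
            offset = (offset + 1) % 8;
            current_vector = FIRST_DEVICE_VECTOR + offset;
        }
        if (!vector_is_used(current_vector)) {
            mark_vector_used(current_vector);
            return current_vector;
        }
    }
    return -ENOSPC;   /* every vector in the range is taken */
}
```

If 0x31 is reserved before any device allocation, the model never
returns it, regardless of which consumer asks or in what order.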

Comment 1 Andrius Benokraitis 2010-02-15 19:18:02 UTC
Prioritizing for RHEL 5.6 inclusion based on timing for 5.5.

Comment 3 Andrius Benokraitis 2010-10-05 19:01:11 UTC
Stratus - are you still seeing this in RHEL 5.5?

Comment 6 Prarit Bhargava 2010-10-07 15:29:48 UTC
It should be feasible to backport FIRST_EXTERNAL_VECTOR and the used_vectors bitmap.

I am, however, concerned about system stability.  A change like this, at this late stage in RHEL5, has to be tested across a broad range of systems.

Stratus, can you confirm you are still seeing this issue with RHEL5.5?

Thanks,

P.

Comment 7 Dan Duval 2010-10-07 18:51:26 UTC
Stratus is working on re-assembling the failing configuration.  We'll
provide an answer to Prarit's question as soon as we can.

Comment 8 Prarit Bhargava 2010-10-08 17:40:56 UTC
Created attachment 452388 [details]
RHEL5 fix for this issue part 1

Comment 9 Prarit Bhargava 2010-10-08 17:41:27 UTC
Created attachment 452389 [details]
RHEL5 fix for this issue part 2

Comment 10 Andrius Benokraitis 2010-10-11 14:30:15 UTC
This needs a full Beta cycle for testing - deferring to 5.7 with OK from Stratus.

Comment 11 Dan Duval 2010-10-28 20:11:52 UTC
Stratus confirms that we are still seeing this problem under RHEL 5.5.

Comment 12 Prarit Bhargava 2011-01-20 18:27:09 UTC
Dan,

I've made two patches (comment #8 and comment #9) in this bug visible.  Can you please test the RHEL5 kernel with these two patches and tell me if they work for you?

Thanks,

P.

Comment 13 Dan Duval 2011-01-25 22:51:55 UTC
In testing the patches in comments #8 and #9, I'm discovering that, after
a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts
returning -EBUSY.  This bubbles up to the drivers, which fall back on "legacy"
(wired) interrupts.  This happens well before all of the available vectors
in the device range have been allocated.

Is this by design?  Some of the cards affected are high-performance 10Gb
NICs, which can run multiple receive queues in parallel on SMP machines
if MSI-X is available.

Comment 14 Prarit Bhargava 2011-01-26 15:43:23 UTC
(In reply to comment #13)
> In testing the patches in comments #8 and #9, I'm discovering that, after
> a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts
> returning -EBUSY.  This bubbles up to the drivers, which fall back on "legacy"
> (wired) interrupts.  This happens well before all of the available vectors
> in the device range have been allocated.
> 
> Is this by design?  Some of the cards affected are high-performance 10Gb
> NICs, which can run multiple receive queues in parallel on SMP machines
> if MSI-X is available.

Nope, that's a bug in the design :(

Dan, I'm going to have to get the Stratus system here in Westford up and running and try some debugging to make sure I get this right ...

P.

Comment 16 Prarit Bhargava 2011-03-03 14:05:43 UTC
I'm wondering if linux-2.6 commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f would help out here?

It seems that upstream has dropped this convoluted and confusing method of allocating vectors in favor of something much simpler.

P.

Comment 17 Dan Duval 2011-03-24 19:44:51 UTC
I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does
fix the problem, at least in our case.  The hang/crash goes away.

I still see assign_msi_vector() returning -EBUSY a lot, starting with
the first break (unplug/plug).  This doesn't happen every time, though;
-EBUSY returns are interspersed with "real" allocations.

In theory, this shouldn't be necessary, since it's the same hardware
being removed/added, over and over again, so it should be possible to
reuse exactly the same vectors each time.

Still, this may be the best that can be done without radical surgery
to the vector-allocation scheme.

Comment 18 Prarit Bhargava 2011-04-20 13:45:35 UTC
(In reply to comment #17)
> I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does
> fix the problem, at least in our case.  The hang/crash goes away.
> 
> I still see assign_msi_vector() returning -EBUSY a lot, starting with
> first break (unplug/plug).  This doesn't happen every time, though;
> -EBUSY returns are interspersed with "real" allocations.
> 
> In theory, this shouldn't be necessary, since it's the same hardware
> being removed/added, over and over again, so it should be possible to
> reuse exactly the same vectors each time.
> 
> Still, this may be the best that can be done without radical surgery
> to the vector-allocation scheme.

Hey Dan,

I agree -- I don't want to make a large change to the MSI allocation code, and since this commit fixes your problem, I'll go with that.

P.

Comment 19 Prarit Bhargava 2011-04-20 13:49:42 UTC

*** This bug has been marked as a duplicate of bug 652799 ***

