Description of problem:
see below

Version-Release number of selected component:
kernel 2.6.18-164.el5

How reproducible:
Dependent on system configurations. If a system is configured such that the problem occurs, it will occur every time on that system.

Steps to Reproduce:
See description. Involves hot-plugging and -unplugging MSI/MSI-X-capable PCI hardware.

Actual results:
Kernel appears to hang solidly. Eventually, on systems so equipped, an external watchdog will NMI the system.

Expected results:
Hot-pluggable PCI hardware should be added to and removed from a system without triggering a crash.

Additional info:
Affected architecture: x86_64 (and possibly i386 -- not tested)

Description:

0. On x86_64 systems that use the PIT clock to generate ticks, the timer interrupt is always allocated vector (IDT index) 0x31 (== FIRST_DEVICE_VECTOR). This slot in the IDT is then set to point to the IRQ0 interrupt stub IRQ0x00_interrupt. This is done relatively early in the boot process.

1. The central vector allocator for the kernel is

       arch/x86_64/kernel/io_apic.c:assign_irq_vector()

   This routine is given a range of vectors from FIRST_DEVICE_VECTOR to FIRST_SYSTEM_VECTOR-1 that it can hand out. It does so starting with 0x39, and marching in strides of 8 up the range: 0x41, 0x49, ... When it reaches the end, it wraps around to the beginning of the range and begins allocating the vectors that are equal to 2 (mod 8), and so forth. In this way, it will eventually cover all the vectors in its range.

2. Allocation of interrupts for MSI and MSI-X PCI devices is done by

       drivers/pci/msi.c:assign_msi_vector()

   Each time it needs a new interrupt vector, this routine calls assign_irq_vector() to get one. This occurs until it obtains vector LAST_DEVICE_VECTOR (which is FIRST_SYSTEM_VECTOR-1, i.e., the end of the range), at which time it switches to a different mode where it no longer allocates new vectors, but instead attempts to re-allocate vectors that have previously been freed. The routine remains in this mode for the rest of the system lifetime.

3. The problem arises because assign_msi_vector() is not the only consumer of vectors from assign_irq_vector(). In particular, the ACPI code does its own independent allocation through

       arch/x86_64/kernel/io_apic.c:gsi_irq_sharing()

   which also calls assign_irq_vector(). So, if things happen in just the wrong order, gsi_irq_sharing() can consume vector LAST_DEVICE_VECTOR. This prevents assign_msi_vector() from ever seeing LAST_DEVICE_VECTOR, which means that it never switches modes. Instead, it continues to call assign_irq_vector() for every allocation.

4. Because the timer vector 0x31 is within the range that assign_irq_vector() can allocate, eventually it will do so. This results in changing the corresponding IDT entry to point at a different entry point (IRQ) among the interrupt stubs. At this point, the timer interrupt handler is "orphaned", and the system stops processing timer ticks. This tends to have an adverse effect on system performance.

5. This problem is particular to hot-pluggable PCI devices that use MSI or MSI-X. In some cases, it can take only one hot-unplug and -plug to trigger the condition. Since occurrence depends on the exact configuration of system hardware, and on the order in which it is scanned, the problem either happens every time or doesn't happen at all. It's conceivable that simply swapping two PCI cards between slots could turn a "good" configuration into a "bad" one.

6. The most recent upstream kernels, including the ones being used for RHEL6, have introduced two mechanisms for preventing this problem.

   First, the range of vectors available for devices has been restricted so that, now, the timer vector falls outside the range. assign_irq_vector() therefore will never "commandeer" the timer vector. Note that this change alone doesn't address the underlying problem.

   Second, a 256-bit bitmap "used_vectors" has been introduced. assign_irq_vector() checks this bitmap when asked to allocate a new vector, and will not reallocate a vector that is already marked as "used".

If asked to guess, I would say that it would be reasonably straightforward to backport the appropriate upstream changes to this kernel.
Prioritizing for RHEL 5.6 inclusion based on timing for 5.5.
Stratus - are you still seeing this in RHEL 5.5?
It should be feasible to backport FIRST_EXTERNAL_VECTOR and the used_vectors bitmap. I am, however, concerned about system stability. This change at this late stage in RHEL5 has to be widely assessed across a broad series of systems. Stratus, can you confirm you are still seeing this issue with RHEL5.5? Thanks, P.
Stratus is working on re-assembling the failing configuration. We'll provide an answer to Prarit's question as soon as we can.
Created attachment 452388 RHEL5 fix for this issue part 1
Created attachment 452389 RHEL5 fix for this issue part 2
This needs a full Beta cycle for testing - deferring to 5.7 with OK from Stratus.
Stratus confirms that we are still seeing this problem under RHEL 5.5.
Dan, I've made two patches (comment #8 and comment #9) in this bug visible. Can you please test the RHEL5 kernel with these two patches and tell me if they work for you? Thanks, P.
In testing the patches in comments #8 and #9, I'm discovering that, after a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts returning -EBUSY. This bubbles up to the drivers, which fall back on "legacy" (wired) interrupts. This happens well before all of the available vectors in the device range have been allocated. Is this by design? Some of the cards affected are high-performance 10Gb NICs, which can run multiple receive queues in parallel on SMP machines if MSI-X is available.
(In reply to comment #13)
> In testing the patches in comments #8 and #9, I'm discovering that, after
> a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts
> returning -EBUSY. This bubbles up to the drivers, which fall back on "legacy"
> (wired) interrupts. This happens well before all of the available vectors
> in the device range have been allocated.
>
> Is this by design? Some of the cards affected are high-performance 10Gb
> NICs, which can run multiple receive queues in parallel on SMP machines
> if MSI-X is available.

Nope, that's a bug in the design :(

Dan, I'm going to have to get the Stratus system here in Westford up and running and try some debugging to make sure I get this right ...

P.
I'm wondering if linux-2.6 commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f would help out here? It seems that upstream has dropped this convoluted and confusing method of allocating for something much simpler. P.
I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does fix the problem, at least in our case. The hang/crash goes away. I still see assign_msi_vector() returning -EBUSY a lot, starting with first break (unplug/plug). This doesn't happen every time, though; -EBUSY returns are interspersed with "real" allocations. In theory, this shouldn't be necessary, since it's the same hardware being removed/added, over and over again, so it should be possible to reuse exactly the same vectors each time. Still, this may be the best that can be done without radical surgery to the vector-allocation scheme.
(In reply to comment #17)
> I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does
> fix the problem, at least in our case. The hang/crash goes away.
>
> I still see assign_msi_vector() returning -EBUSY a lot, starting with
> first break (unplug/plug). This doesn't happen every time, though;
> -EBUSY returns are interspersed with "real" allocations.
>
> In theory, this shouldn't be necessary, since it's the same hardware
> being removed/added, over and over again, so it should be possible to
> reuse exactly the same vectors each time.
>
> Still, this may be the best that can be done without radical surgery
> to the vector-allocation scheme.

Hey Dan, I agree -- I don't want to make a large change to the MSI allocation code, and since this commit fixes your problem, I'll go with that.

P.
*** This bug has been marked as a duplicate of bug 652799 ***