Bug 564138
Summary: | [Stratus 5.7 bug] MSI/MSI-X can cause improper re-allocation of interrupt vectors | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Dan Duval <dan.duval> |
Component: | kernel | Assignee: | Prarit Bhargava <prarit> |
Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Priority: | high |
Version: | 5.4 | Target Release: | 5.7 |
Target Milestone: | beta | Keywords: | OtherQA |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2011-04-20 13:49:42 UTC |
Bug Blocks: | 557597, 694224 | CC: | agospoda, chas.horvath, dbayly, dhoward, jan.public, jparadis, kevin.paetzold, prarit, qzhou, robert.evans, robert.manchek, rpacheco, ythuang |
Description
Dan Duval
2010-02-11 22:15:53 UTC
Prioritizing for RHEL 5.6 inclusion based on timing for 5.5. Stratus - are you still seeing this in RHEL 5.5?

It should be feasible to backport FIRST_EXTERNAL_VECTOR and the used_vectors bitmap. I am, however, concerned about system stability. This change at this late stage in RHEL5 has to be widely assessed across a broad series of systems. Stratus, can you confirm you are still seeing this issue with RHEL5.5?

Thanks,
P.

Stratus is working on re-assembling the failing configuration. We'll provide an answer to Prarit's question as soon as we can.

Created attachment 452388
RHEL5 fix for this issue part 1

Created attachment 452389
RHEL5 fix for this issue part 2
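For readers unfamiliar with the approach mentioned above (a used_vectors bitmap with a FIRST_EXTERNAL_VECTOR floor), here is a minimal user-space sketch of the idea. It is not the code in the attached patches. The `sketch_` names are hypothetical, and the values 256 and 0x20 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Assumed values for illustration only; the RHEL5 kernel may differ. */
#define SKETCH_NR_VECTORS            256u
#define SKETCH_FIRST_EXTERNAL_VECTOR 0x20u

#define SKETCH_BITS_PER_WORD ((unsigned)(sizeof(unsigned long) * CHAR_BIT))
#define SKETCH_WORDS \
    ((SKETCH_NR_VECTORS + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

/* One bit per vector: set means "in use". */
static unsigned long used_vectors[SKETCH_WORDS];

static int sketch_vector_in_use(unsigned v)
{
    assert(v < SKETCH_NR_VECTORS);
    return (int)((used_vectors[v / SKETCH_BITS_PER_WORD]
                  >> (v % SKETCH_BITS_PER_WORD)) & 1UL);
}

/* Hypothetical helper: hand out the lowest free vector at or above the
 * floor, or -EBUSY when every vector in the range is taken. */
int sketch_alloc_vector(void)
{
    for (unsigned v = SKETCH_FIRST_EXTERNAL_VECTOR; v < SKETCH_NR_VECTORS; v++) {
        if (!sketch_vector_in_use(v)) {
            used_vectors[v / SKETCH_BITS_PER_WORD] |=
                1UL << (v % SKETCH_BITS_PER_WORD);
            return (int)v;
        }
    }
    return -EBUSY;
}

/* Hypothetical helper: clear the bit so the vector can be reused.
 * Returns -EINVAL for out-of-range or already-free vectors. */
int sketch_free_vector(int vec)
{
    if (vec < (int)SKETCH_FIRST_EXTERNAL_VECTOR || vec >= (int)SKETCH_NR_VECTORS)
        return -EINVAL;

    unsigned v = (unsigned)vec;
    if (!sketch_vector_in_use(v))
        return -EINVAL;

    used_vectors[v / SKETCH_BITS_PER_WORD] &= ~(1UL << (v % SKETCH_BITS_PER_WORD));
    return 0;
}
```

With a scheme like this, freeing a vector on hot-unplug makes the same number available again on the next plug, which is the property at stake in this bug.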
This needs a full Beta cycle for testing - deferring to 5.7 with OK from Stratus.

Stratus confirms that we are still seeing this problem under RHEL 5.5.

Dan, I've made two patches (comment #8 and comment #9) in this bug visible. Can you please test the RHEL5 kernel with these two patches and tell me if they work for you?

Thanks,
P.

In testing the patches in comments #8 and #9, I'm discovering that, after a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts returning -EBUSY. This bubbles up to the drivers, which fall back on "legacy" (wired) interrupts. This happens well before all of the available vectors in the device range have been allocated.

Is this by design? Some of the cards affected are high-performance 10Gb NICs, which can run multiple receive queues in parallel on SMP machines if MSI-X is available.

(In reply to comment #13)
> In testing the patches in comments #8 and #9, I'm discovering that, after
> a relatively short series of hot-unplugs/-plugs, assign_msi_vector() starts
> returning -EBUSY. This bubbles up to the drivers, which fall back on "legacy"
> (wired) interrupts. This happens well before all of the available vectors
> in the device range have been allocated.
>
> Is this by design? Some of the cards affected are high-performance 10Gb
> NICs, which can run multiple receive queues in parallel on SMP machines
> if MSI-X is available.

Nope, that's a bug in the design :(

Dan, I'm going to have to get the Stratus system here in Westford up and running and try some debugging to make sure I get this right ...

P.

I'm wondering if linux-2.6 commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f would help out here? It seems that upstream has dropped this convoluted and confusing method of allocating for something much simpler.

P.

I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does fix the problem, at least in our case. The hang/crash goes away.
I still see assign_msi_vector() returning -EBUSY a lot, starting with first break (unplug/plug). This doesn't happen every time, though; -EBUSY returns are interspersed with "real" allocations.

In theory, this shouldn't be necessary, since it's the same hardware being removed/added, over and over again, so it should be possible to reuse exactly the same vectors each time.

Still, this may be the best that can be done without radical surgery to the vector-allocation scheme.

(In reply to comment #17)
> I can confirm that commit 92db6d10bc1bc43330a4c540fa5b64c83d9d865f does
> fix the problem, at least in our case. The hang/crash goes away.
>
> I still see assign_msi_vector() returning -EBUSY a lot, starting with
> first break (unplug/plug). This doesn't happen every time, though;
> -EBUSY returns are interspersed with "real" allocations.
>
> In theory, this shouldn't be necessary, since it's the same hardware
> being removed/added, over and over again, so it should be possible to
> reuse exactly the same vectors each time.
>
> Still, this may be the best that can be done without radical surgery
> to the vector-allocation scheme.

Hey Dan,

I agree -- I don't want to make a large change to the MSI allocation code, and since this commit fixes your problem, I'll go with that.

P.

*** This bug has been marked as a duplicate of bug 652799 ***
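The fallback behaviour discussed in this thread (a vector allocation returning -EBUSY, and the driver dropping back to wired interrupts) can be illustrated with a small user-space sketch. It is not kernel or driver code. Every `sketch_` name, the ops structure, and the three-vector stub pool are hypothetical stand-ins for the real allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_irq_mode { SKETCH_IRQ_MSIX, SKETCH_IRQ_LEGACY };

/* Hypothetical allocator interface standing in for the kernel's. */
struct sketch_vector_ops {
    int  (*alloc)(void);        /* vector number >= 0, or negative errno */
    void (*release)(int vec);
};

/* Try to obtain `wanted` vectors. On any failure, give back what was
 * taken and report that the device must use its legacy interrupt. */
enum sketch_irq_mode sketch_setup_irqs(const struct sketch_vector_ops *ops,
                                       int *vecs, size_t wanted)
{
    assert(ops && ops->alloc && ops->release && vecs);

    for (size_t i = 0; i < wanted; i++) {
        int v = ops->alloc();
        if (v < 0) {
            /* Roll back the partial allocation before falling back. */
            while (i > 0)
                ops->release(vecs[--i]);
            return SKETCH_IRQ_LEGACY;
        }
        vecs[i] = v;
    }
    return SKETCH_IRQ_MSIX;
}

/* Stub pool with only three vectors, to make the failure easy to hit.
 * It only counts free slots and does not recycle vector numbers. */
static int sketch_pool_free = 3;
static int sketch_pool_next = 0x20;

static int sketch_pool_alloc(void)
{
    if (sketch_pool_free == 0)
        return -EBUSY;
    sketch_pool_free--;
    return sketch_pool_next++;
}

static void sketch_pool_release(int vec)
{
    (void)vec;
    sketch_pool_free++;
}

const struct sketch_vector_ops sketch_pool_ops = {
    sketch_pool_alloc,
    sketch_pool_release,
};
```

In this sketch a multi-queue device that asks for more vectors than the pool can supply ends up in `SKETCH_IRQ_LEGACY` with nothing left allocated, which is why a spurious -EBUSY matters for the 10Gb NICs mentioned above.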