Bug 599065

Summary: PCI passthrough w/ shared IRQ broken
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.5
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: urgent
Priority: low
Target Milestone: rc
Reporter: Tamas Vincze <tom>
Assignee: Don Dutile (Red Hat) <ddutile>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: ddugger, ddutile, drjones, lersek, xen-maint
Doc Type: Bug Fix
Last Closed: 2011-07-29 10:35:50 UTC
Bug Blocks: 514490
Attachments: lspci -v (flags: none)

Description Tamas Vincze 2010-06-02 15:45:47 UTC
Created attachment 419079: lspci -v

I have a USB controller that I attached to a PV domU using PCI passthrough.
Unfortunately VT-d is not supported by the chipset.
The controller uses 3 IRQs, all of which are shared with dom0 devices.
After a few hours the interrupts get disabled in both dom0 and domU.

The passed through device:

04:00.0 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
	Subsystem: NEC Corporation Hama USB 2.0 CardBus
	Flags: bus master, medium devsel, latency 32, IRQ 16
	Memory at fc300000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] Power Management version 2

04:00.1 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
	Subsystem: NEC Corporation Hama USB 2.0 CardBus
	Flags: bus master, medium devsel, latency 32, IRQ 17
	Memory at fc301000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] Power Management version 2

04:00.2 USB Controller: NEC Corporation USB 2.0 (rev 04) (prog-if 20 [EHCI])
	Subsystem: NEC Corporation USB 2.0
	Flags: bus master, medium devsel, latency 132, IRQ 18
	Memory at fc302000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [40] Power Management version 2

IRQs 16, 17 and 18 are shared with other devices; see the attached lspci output.

=== dom0 dmesg ===

irq 17: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff802b3e43>] __report_bad_irq+0x30/0x7d
 [<ffffffff802b407a>] note_interrupt+0x1ea/0x22b
 [<ffffffff802b3572>] __do_IRQ+0xbd/0x103
 [<ffffffff8029043f>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026df48>] do_IRQ+0xe7/0xf5
 [<ffffffff803b3ae7>] evtchn_do_upcall+0x13b/0x1fb
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026ca80>] xen_idle+0x38/0x4a
 [<ffffffff8024b0aa>] cpu_idle+0x97/0xba
 [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
 [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb

handlers:
[<ffffffff803e7cb2>] (usb_hcd_irq+0x0/0x55)
[<ffffffff803e7cb2>] (usb_hcd_irq+0x0/0x55)
Disabling IRQ #17

=== domU dmesg ===

irq 18: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff802b3e43>] __report_bad_irq+0x30/0x7d
 [<ffffffff802b407a>] note_interrupt+0x1ea/0x22b
 [<ffffffff802b3572>] __do_IRQ+0xbd/0x103
 [<ffffffff8029043f>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026df48>] do_IRQ+0xe7/0xf5
 [<ffffffff803b3ae7>] evtchn_do_upcall+0x13b/0x1fb
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026ca80>] xen_idle+0x38/0x4a
 [<ffffffff8024b0aa>] cpu_idle+0x97/0xba
 [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
 [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb

handlers:
[<ffffffff803e7cb2>] (usb_hcd_irq+0x0/0x55)
Disabling IRQ #18

irq 16: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff802b3e43>] __report_bad_irq+0x30/0x7d
 [<ffffffff802b407a>] note_interrupt+0x1ea/0x22b
 [<ffffffff802b3572>] __do_IRQ+0xbd/0x103
 [<ffffffff8029043f>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026df48>] do_IRQ+0xe7/0xf5
 [<ffffffff803b3ae7>] evtchn_do_upcall+0x13b/0x1fb
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026f4eb>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026ca80>] xen_idle+0x38/0x4a
 [<ffffffff8024b0aa>] cpu_idle+0x97/0xba
 [<ffffffff8064cb0f>] start_kernel+0x21f/0x224
 [<ffffffff8064c1e5>] _sinittext+0x1e5/0x1eb

handlers:
[<ffffffff803e7cb2>] (usb_hcd_irq+0x0/0x55)
Disabling IRQ #16


Initial Xen IRQ info:
(XEN)     IRQ 16 Vec144: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0(----),3(----),
(XEN)     IRQ 17 Vec152: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0(----),3(----),
(XEN)     IRQ 18 Vec160: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0(----),3(----),

Afterwards:
(XEN)     IRQ 16 Vec144: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0(----),
(XEN)     IRQ 17 Vec152: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=3(----),
(XEN)     IRQ 18 Vec160: type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0(----),
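
Reading the two dumps side by side: each IRQ starts out bound to two domains (0 and 3) and ends up bound to one. Together with the dmesg output above, this suggests that a domain drops out of the list once its kernel disables the IRQ (dom0 for IRQ 17; domain 3, presumably the domU, for IRQs 16 and 18). The following is only a sketch for comparing such dumps mechanically. parse_domain_list is a hypothetical helper, not part of Xen or its tools, and it assumes the field layout is exactly as quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the domain IDs from the "domain-list=" field of one line of the
 * Xen IRQ dump, e.g.
 *   "(XEN)     IRQ 16 Vec144: ... domain-list=0(----),3(----),"
 * Stores up to max IDs in ids[] and returns how many were found,
 * or -1 if the line has no domain-list field.
 */
int parse_domain_list(const char *line, int *ids, int max)
{
    static const char key[] = "domain-list=";
    const char *p;
    int n = 0;

    assert(line != NULL && ids != NULL);

    p = strstr(line, key);
    if (p == NULL)
        return -1;
    p += sizeof(key) - 1;

    while (*p != '\0' && n < max) {
        char *end;
        long id = strtol(p, &end, 10);

        if (end == p)           /* no digits left: end of the list */
            break;
        ids[n++] = (int)id;

        /* Skip the "(----)" flags and the comma that follows them. */
        p = strchr(end, ',');
        if (p == NULL)
            break;
        p++;
    }
    return n;
}
```

A return value greater than 1 means the IRQ is still bound to more than one domain.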


Fortunately the whole system didn't crash this time, but on a previous occasion the LSI disk controller's IRQ got disabled in dom0, which required a hardware reset.

Comment 1 Tamas Vincze 2010-06-02 15:49:27 UTC
Possible solution?
http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00832.html

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index e138053..923de2e 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -25,7 +25,7 @@ static int xen_pcifront_enable_irq(struct pci_dev *dev)
        if (dev->irq < 0)
                return -EINVAL;
 
-       rc = xen_allocate_pirq(dev->irq, 0, "pcifront");
+       rc = xen_allocate_pirq(dev->irq, 1 /* share */, "pcifront");
        if (rc < 0) {
                dev_warn(&dev->dev, "Xen PCI IRQ: %d, failed to register:%d\n",
                         dev->irq, rc);

Comment 2 Don Dutile (Red Hat) 2010-06-16 19:03:37 UTC
More information needed:

(a) What is the guest kernel version?
    Please also provide details of dom0 (kernel version, xen/tools version).

(b) Which tree is the patch listed in comment 1 from?
    -- arch/x86/pci/xen.c is _not_ in the latest xen tree nor in the latest linux tree.
    -- The file appears to exist only in Jeremy's xen/master tree, so the patch
       would only be valid for rhel6, and only _if_ the whole file were
       backported into rhel6.

cc-ing Intel partner in case they can add more info as well.

Comment 3 Tamas Vincze 2010-06-16 19:55:33 UTC
a) Both dom0 and the guest are 2.6.18-194.3.1.el5xen
b) Haven't checked the patch further.

I added noirqdebug to both the dom0 and domU kernel command lines and that works around the problem: the interrupts no longer get disabled, although they probably still aren't handled properly.

Comment 4 Tamas Vincze 2010-06-16 19:58:04 UTC
dom0 has xen-3.0.3-105.el5_5.2

Comment 10 Laszlo Ersek 2011-07-29 14:36:02 UTC
Justification for the WONTFIX resolution:

Passing through a device that shares an interrupt with other dom0/host devices, or with devices assigned to other guests, is not supported for security reasons. Such configurations are therefore not subject to targeted testing either.

The proposed fix is based on upstream (2.6.3x), whose interrupt dispatching code differs significantly from that of RHEL-5.