Bug 688673
Summary:          PCI Virtual Function Passthrough - SR-IOV, Paravirt Guest fails to obtain IRQ after reboot
Product:          Red Hat Enterprise Linux 5
Version:          5.6
Component:        kernel-xen
Hardware:         x86_64
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         unspecified
Reporter:         Rob
Assignee:         Laszlo Ersek <lersek>
QA Contact:       Virtualization Bugs <virt-bugs>
CC:               drjones, jarod, leiwang, lersek, mrezanin, pasik, pbonzini, pcao, prarit, qwan, sassmann, xen-maint, yuzhang, yuzhou
Target Milestone: rc
Target Release:   5.8
Fixed In Version: kernel-2.6.18-283.el5
Doc Type:         Bug Fix
Bug Blocks:       514489
Last Closed:      2012-02-21 03:32:11 UTC
Description  Rob  2011-03-17 17:11:08 UTC
Hi Rob,

How certain are you about the patch you've pointed out; did you get the WARN() that it describes? Would you mind testing a kernel with the patch that I can build for you?

Thanks, Drew

Hi Drew,

I've just finished building pciback from the 247 kernel source with the patch. The patch applied cleanly and partially fixes the problem. Abrupt paravirt VM restart is now OK (e.g. "xm destroy testvm && xm create testvm" now works). A clean restart or clean shutdown of the VM followed by "xm create" of the VM still shows the problem. Additional information appears on the dom0 console when the failures occur. The VM has 2 Intel 82576 VF assignments from 2 different Intel 82576 LAN chips, and the extra info on the dom0 console looks like this:

    msix entry 0 for dev 05:10:0 are not freed before acquire again.
    msix entry 1 for dev 05:10:0 are not freed before acquire again.
    msix entry 2 for dev 05:10:0 are not freed before acquire again.
    msix entry 0 for dev 09:10:0 are not freed before acquire again.
    msix entry 1 for dev 09:10:0 are not freed before acquire again.
    msix entry 2 for dev 09:10:0 are not freed before acquire again.
    unmap irq ff failed
    unmap irq fe failed
    unmap irq fd failed
    unmap irq fc failed
    unmap irq fb failed
    unmap irq fa failed

What is responsible for unassigning the MSI-X IRQs when the VM is shut down or rebooted? I'm using the igbvf driver that is included with the 247 kernel.

Forgot to add: I don't know if I got the WARN() before the patch, because I'm not sure whether I needed to enable additional debugging, or even where this WARN() is meant to appear (domU log?).

Yeah, it's not clear which exact WARN() it would be, but you'd find it in the guest console output during boot or in the guest dmesg. We'll set up a test system here to try and reproduce the issue as well.

Hi Andrew,

Please let me know your results. All of the above is with the igb and igbvf drivers. The igb driver in dom0 behaves the same regardless of whether I use the included version or the latest Intel 2.4.3 version. I have had less luck with the X520-DA2 (ixgbe) card: it refuses to detect a link, although I can ping between the dom0s on it fine. Strangely enough, I can compile and install the Intel 1.0.7 ixgbevf driver on the same paravirt domU; it gives some more info in dmesg but behaves the same.

There's a pcifront patch here ("Fix Freeing of Device") that won't apply cleanly but sounds like it might be what's needed to sort out the reboot and clean shutdown issues:
http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/5dd203fba380

Hi Rob,

We got a test machine set up and were able to reproduce the problem. We got the same msix messages. After the second boot, when it wasn't working, I attempted a "modprobe -r igbvf" to reinit the module. This "fixed" it. I'm still poking at the logs from both the host and the guest of the whole process to try and sort out why we're not properly disabling msix in the first place. pcifront might be the right place to look in the end, but in light of the "modprobe -r" experiment, I'll start with the driver and work down.

Drew

Hi Drew,

Thanks for the update. For some reason I couldn't compile the updated igbvf driver from Intel's site; the updated ixgbevf driver and the dom0 drivers compile OK. Do you have an Intel 82599 card (uses the ixgbe/ixgbevf drivers) to test with?
Rob

Here's a wild guess:

(In reply to comment #3)
> Additional information appears on the dom0 console when the failures occur.
> The VM has 2 Intel 82576 VF assignments from 2 different Intel 82576 LAN
> chips, and the extra info on the dom0 console looks like this:
> msix entry 0 for dev 05:10:0 are not freed before acquire again.
> msix entry 1 for dev 05:10:0 are not freed before acquire again.
> msix entry 2 for dev 05:10:0 are not freed before acquire again.
> msix entry 0 for dev 09:10:0 are not freed before acquire again.
> msix entry 1 for dev 09:10:0 are not freed before acquire again.
> msix entry 2 for dev 09:10:0 are not freed before acquire again.

We have two loops here (one for each device), each running from 0 (inclusive) to 3 (exclusive). This closely resembles the igbvf driver's igbvf_set_interrupt_capability() function. Quote:

    /* we allocate 3 vectors, 1 for tx, 1 for rx, one for pf messages */
    adapter->msix_entries = kcalloc(3, sizeof(struct msix_entry), GFP_KERNEL);
    if (adapter->msix_entries) {
            for (i = 0; i < 3; i++)
                    adapter->msix_entries[i].entry = i;
            err = pci_enable_msix(adapter->pdev, adapter->msix_entries, 3);
    }

pci_enable_msix() is implemented in "drivers/pci/msi-xen.c", which is also the only place that logs messages of the above form. igbvf_reset_interrupt_capability() seems to undo this (consistently with the patch pointed to in comment 0):

    if (adapter->msix_entries) {
            pci_disable_msix(adapter->pdev);
            kfree(adapter->msix_entries);
            adapter->msix_entries = NULL;
    }

igbvf_reset_interrupt_capability() is called:
- on the igbvf_set_interrupt_capability() error/cleanup path,
- on the igbvf_probe() error/cleanup path,
- I think it is also called (indirectly) at suspend time, via igbvf_suspend() -> pci_disable_device() -> disable_msi_mode(),
- and most importantly by igbvf_remove(), which handles "rmmod" (see comment 8) or hot-unplugging the device.

Perhaps it is not invoked at shutdown.

Tangentially, the ixgbe driver provides an IntMode parameter, so that one can force MSI (=1) or legacy (=0) interrupts.

(In reply to comment #7)
> There's a pcifront patch here ("Fix Freeing of Device") that won't apply
> cleanly but sounds like it might be what's needed to sort out the reboot
> and clean shutdown issues:
> http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/5dd203fba380

I think that's unrelated. It makes sure that the dynamic IRQ and the underlying event channel are correctly disassociated, and that both are correctly released. OTOH, RHEL-5's pcifront is not even interrupt-driven; we only have an event channel that we manually trigger and poll in do_pci_op().

Side note:

(In reply to comment #0)
> Steps to Reproduce:
> 1. Boot 2.6.18-247 Dom0 Xen Kernel with iommu=pv and usual Redhat SR-IOV
> options

iommu=pv is not a RHEL-5 xen cmdline option. From -264 drivers/passthrough/iommu.c::parse_iommu_param(), the recognized values are off/no/false/0/disable, force/required, passthrough, snoop, no-snoop, no-intremap, amd-iommu-debug, amd-iommu-perdev-intremap.

    /*
     * The 'iommu' parameter enables the IOMMU. Optional comma separated
     * value may contain:
     *
     *   off|no|false|disable   Disable IOMMU (default)
     *   force|required         Don't boot unless IOMMU is enabled
     *   passthrough            Bypass VT-d translation for Dom0
     *   snoop                  Utilize the snoop control for IOMMU (default)
     *   no-snoop               Dont utilize the snoop control for IOMMU
     *   amd-iommu-debug        Turn on debug info for AMD IOMMU
     */

(In reply to comment #13)
> iommu=pv is not a RHEL-5 xen cmdline option ...
but it enables the iommu functionality just the same as iommu=on (which is also not a distinguished option argument).

Reproduced on 2.6.18-264.el5xen (both host and guest version).

xen cmdline: iommu=on
dom0 cmdline: pci_pt_e820_access=on
new /etc/modprobe.conf entry: options igb max_vfs=1

After module reinsert or host reboot, get a list of VFs provided by igb-type cards:

    lspci -nn -d 8086:10ca -D

Pick one, e.g. UNB=0000:01:10.0, and set it up for passthrough:

    echo "$UNB" > /sys/bus/pci/drivers/igbvf/unbind
    modprobe pciback hide="($UNB)"

alternatively:

    modprobe pciback
    echo "$UNB" > /sys/bus/pci/drivers/pciback/new_slot
    echo "$UNB" > /sys/bus/pci/drivers/pciback/bind

Check whether the device is indeed assignable:

    xm pci-list-assignable-devices

Add it to the guest while the guest is shut down:

    echo "pci = [ \"$UNB\" ]" >> /etc/xen/rhel56-64bit-pv

Boot the guest, then reboot it. Dom0 log:

    msix entry 0 for dev 01:10:0 are not freed before acquire again.
    msix entry 1 for dev 01:10:0 are not freed before acquire again.
    msix entry 2 for dev 01:10:0 are not freed before acquire again.

Now removing the igbvf module in the guest, dom0/Xen log:

    (XEN) irq.c:934: dom2: pirq 255 not mapped
    unmap irq ff failed
    (XEN) irq.c:934: dom2: pirq 254 not mapped
    unmap irq fe failed
    (XEN) irq.c:934: dom2: pirq 253 not mapped
    unmap irq fd failed

Tracing onwards from drivers/base/power/shutdown.c::device_shutdown() yields:

    ...
    igbvf 0000:00:00.0: shutdown via bus op ffffffff80352999
    igbvf_suspend: enter
    igbvf_suspend: right before pci_disable_device()
    pci_disable_device: enter
    pci_disable_device: exit
    igbvf_suspend: exit: 0
    pci0000:00: no shutdown method
    pcifront pci-0: shutdown via bus op ffffffff803be8ee
    ...

    $ addr2line -fie vmlinux-syms-2.6.18.test <<<ffffffff80352999
    pci_device_shutdown
    /root/src/rhel5/kernel/drivers/pci/pci-driver.c:388

drivers/pci/pci.c::pci_disable_device() looks like this:

    871 void
    872 pci_disable_device(struct pci_dev *dev)
    873 {
    877         printk(KERN_INFO "%s: enter\n", __FUNCTION__);
    883         if (dev->msi_enabled)
    884                 disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
    885                                  PCI_CAP_ID_MSI);
    886         if (dev->msix_enabled) {
    887                 printk(KERN_INFO "%s: disabling msix\n", __FUNCTION__);
    888                 disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
    889                                  PCI_CAP_ID_MSIX);
    890         }
    902         printk(KERN_INFO "%s: exit\n", __FUNCTION__);
    903 }

The shutdown functions are invoked correctly, but "dev->msix_enabled" is false for some reason, so disable_msi_mode() is not called. Conversely, as written in comment 11, when the module is removed:

    igbvf_remove()
      -> igbvf_reset_interrupt_capability()  -- checks "adapter->msix_entries"
        -> pci_disable_msix()                -- in drivers/pci/msi-xen.c

Now "msi-xen.c" also provides a disable_msi_mode() function, therefore I think once we can set "dev->msix_enabled" correctly, the code to disable MSI-X under Xen should already be there.

The problem is in RHEL-5's drivers/pci/msi-xen.c::pci_enable_msix(), which igbvf_set_interrupt_capability() calls. In RHEL-5 (as of -264), pci_enable_msix() does not set dev->msix_enabled to 1, whereas upstream (as of 1086:415a9b435fef) does.

Comparing the two flavors of pci_enable_msix() and blaming upstream for the lines the diff adds, the backport candidates are as follows:

c/s 680 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/583086d5cd26
pci-msi: Dynamically map into Linux irq space.
c/s 750 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/63a878f8851b
Fix buggy mask_base in saving/restoring MSI-X table during S3

c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
xen/PCI/MSI-X: Don't disable when it was never enabled

The last one adds "dev->msix_enabled = 1". The other two may be necessary as a basis.

I'd say c/s 680 does not apply, because we have none of evtchn_get_xen_pirq() / evtchn_map_pirq().

Projecting c/s 750 onto the diff between RHEL-5's and upstream's pci_enable_msix(), it takes the blame for a single line inside this diff:
http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/63a878f8851b#l1.148
c/s 750 modified the signature of attach_pirq_entry(), so it had to update the block that c/s 680 added to pci_enable_msix() as well, because that block calls attach_pirq_entry() too. Since 680 is N/A for us, and further because we already seem to have the other parts of c/s 750 (see commit a8614c21, bug 484227), I claim c/s 956 is the only thing we need to backport from the above three.

In addition, the same needs to be done for MSI; c/s 945 seems to describe and fix the exact problem captured in this BZ. In summary:

c/s 945 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7bae5a000157
xen: Dont call msi_unmap_pirq() if did not enable msi

c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
xen/PCI/MSI-X: Don't disable when it was never enabled

(In reply to comment #17)
> In summary:
>
> c/s 945 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7bae5a000157
> xen: Dont call msi_unmap_pirq() if did not enable msi
>
> c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
> xen/PCI/MSI-X: Don't disable when it was never enabled

(plus, from comment 0:)

c/s 1070 http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/2994d2997f9d
pciback: disable MSI/MSI-X when resetting device

Created attachment 501323 [details]
fix msi(x) handling
backport upstream c/s 945, 956, 1070
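To make the missing state tracking concrete, here is a minimal sketch of what c/s 956 effectively changes in msi-xen.c -- a paraphrase based on the analysis above ("The last one adds dev->msix_enabled = 1"), not the literal patch:

    /*
     * Sketch only -- a paraphrase of c/s 956, not the literal patch.
     * pci_enable_msix() must record success in dev->msix_enabled, and
     * pci_disable_msix() must refuse to tear down what was never set up.
     */
    int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
    {
            int status;

            /* ... existing validation and dom0/domU dispatch elided ... */
            status = msix_capability_init(dev, entries, nvec);
            if (!status)
                    dev->msix_enabled = 1;  /* the assignment RHEL-5 lacks */
            return status;
    }

    void pci_disable_msix(struct pci_dev *dev)
    {
            if (!dev->msix_enabled)         /* don't disable when never enabled */
                    return;
            /* ... existing teardown elided ... */
            dev->msix_enabled = 0;
    }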
Found xen-devel thread here:
http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01213.html
Gives me the idea to try "iommu=no-intremap" together with the patch.

(In reply to comment #20)
> Created attachment 501323 [details]
> fix msi(x) handling
>
> backport upstream c/s 945, 956, 1070

not good enough, problem persists

Comment on attachment 501323 [details]
fix msi(x) handling

After carefully studying Documentation/MSI-HOWTO.txt, the igbvf driver, the related parts of the PCI subsystem, and msi-xen.c, I have the following to say. Here's a distillation/interpretation of MSI-HOWTO.txt, wrt. MSI-X:

* Normally the PCI device starts up in PIN-IRQ assertion (legacy) mode. dev->irq holds the pre-assigned IOAPIC vector.

* The driver calls pci_enable_msix(). It is an all-or-nothing function. It tries to allocate the requested number of MSI vectors. The request specifies the number of needed MSI vectors, and their unique entry positions in the (device dependent) MSI-X table. The entry positions seem to be nothing more than unique serial numbers. The function allocates an MSI vector for each entry requested. The returned MSI vector values are arbitrary and can vary from pci_enable_msix() call to pci_enable_msix() call. The original dev->irq value is saved by the PCI subsystem. On return, it is not modified. Instead, the function returns the allocated MSI vectors in the request array, each associated with the corresponding, requested entry. The device is switched to MSI-X mode, and further use of dev->irq is undefined behavior. Since there is no storage allocated in *dev for the MSI vectors, the driver must save (copy out or just keep) the returned MSI vectors on its own.

* After a successful return from pci_enable_msix(), the driver must call request_irq() on each MSI vector value, i.e. assign a (possibly separate) interrupt handler to each.

* When the device driver is *unloading*, all interrupt handlers assigned to the MSI vectors must be dissociated with free_irq(). The driver then calls pci_disable_msix(). It marks the MSI vectors as unused, and restores PIN-IRQ assertion (legacy) mode. dev->irq becomes defined again.
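In code, the distilled contract corresponds to a driver skeleton like the following -- a hedged sketch with hypothetical names (my_adapter, my_handler), using the 2.6.18-era request_irq() signature; igbvf follows the same shape with its three vectors:

    /* Hypothetical skeleton illustrating the MSI-HOWTO.txt contract
     * distilled above; names are made up, this is not igbvf code. */
    #include <linux/pci.h>
    #include <linux/interrupt.h>

    #define MY_NVEC 3

    struct my_adapter {
            struct pci_dev *pdev;
            struct msix_entry entries[MY_NVEC]; /* driver keeps the vectors */
    };

    static irqreturn_t my_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
            return IRQ_HANDLED; /* placeholder */
    }

    static int my_enable_msix(struct my_adapter *a)
    {
            int i, err;

            for (i = 0; i < MY_NVEC; i++)
                    a->entries[i].entry = i; /* unique entry positions */

            /* all-or-nothing; on success, entries[i].vector holds the MSI
             * vectors, and dev->irq must no longer be used */
            err = pci_enable_msix(a->pdev, a->entries, MY_NVEC);
            if (err)
                    return err;

            for (i = 0; i < MY_NVEC; i++) {
                    err = request_irq(a->entries[i].vector, my_handler, 0,
                                      "my_dev", a);
                    if (err)
                            goto undo;
            }
            return 0;

    undo:
            while (--i >= 0)
                    free_irq(a->entries[i].vector, a);
            pci_disable_msix(a->pdev);
            return err;
    }

    static void my_disable_msix(struct my_adapter *a)
    {
            int i;

            /* dissociate the handlers first, then release the vectors;
             * pci_disable_msix() restores PIN-IRQ (legacy) mode */
            for (i = 0; i < MY_NVEC; i++)
                    free_irq(a->entries[i].vector, a);
            pci_disable_msix(a->pdev);
    }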
Enter the igbvf driver. The driver issues the following calls to pci_disable_msix():

    igbvf_probe == igbvf_driver.probe
      igbvf_sw_init
        igbvf_set_interrupt_capability (error path)
          igbvf_reset_interrupt_capability
            pci_disable_msix

    igbvf_probe (error path) == igbvf_driver.probe
      igbvf_reset_interrupt_capability
        pci_disable_msix

    igbvf_remove == igbvf_driver.remove
      igbvf_reset_interrupt_capability
        pci_disable_msix

If the igbvf module is removed in the guest before shutdown, pci_disable_msix() is called correctly. A subsequent restart will complete successfully; I just tested it, and ping worked on the igbvf interface of the rebooted guest.

The igbvf driver's shutdown() callback method (called from device_shutdown(), see comment 16) runs like this:

    igbvf_shutdown() == igbvf_driver.shutdown
      igbvf_suspend()
        pci_disable_device()

pci_disable_device() does the following (if dev->msix_enabled is true):

* First it searches the PCI device's (= igbvf's) capability list for a structure that controls Message Signalled Interrupts.

* Once the position of this MSI capability structure is known, pci_disable_device() calls disable_msi_mode(dev, position, PCI_CAP_ID_MSIX) on it.

* disable_msi_mode() reads a word from the control register under this position (i.e. the MSI control register), masks out the PCI_MSIX_FLAGS_ENABLE bit in the word (via msix_disable()), then writes it back to the register. disable_msi_mode(... PCI_CAP_ID_MSIX) also sets dev->msix_enabled to false.

* pci_disable_device() goes on to turn off bus-mastering (again, control register manipulation), and calls pcibios_disable_device() as the final step.

* On i386, pcibios_disable_device() turns off IO and memory access (pcibios_disable_resources(), control register manipulation), and, since we're under Xen (pcibios_disable_irq == NULL), does nothing else.

Accordingly, igbvf_shutdown() *never* calls pci_disable_msix() -- not even on bare metal -- therefore the following call chain can never run:

    pci_disable_msix()
      pci_frontend_disable_msix()
        do_pci_op()
          call the XEN_PCI_OP_disable_msix hypervisor operation

On bare metal this is no problem: once igbvf_shutdown() returns and device_shutdown() completes walking all the devices, the physical host is rebooted and the device is returned to PIN-IRQ assertion (legacy) mode. The hypervisor doesn't forget, though.

Looking at the upstream changeset 1070:2994d2997f9d, suggested in comment 0, I observe the following:

* First note that pci_disable_msix() is explicitly called before pci_disable_device(). This corroborates that pci_disable_device() does *not* clean up the MSI-X resources -- and igbvf_shutdown() only calls pci_disable_device().

* Second, this patch handles the abrupt guest termination case. Unfortunately, we can't apply it verbatim to a normally terminated guest: in c/s 1070, calling pci_disable_msix() depends on dev->msix_enabled. But -- referring back to pci_disable_device() / disable_msi_mode() -- dev->msix_enabled is already zero after the guest has shut down in a controlled manner. This seems to imply that dev->msix_enabled has mixed responsibilities: on one hand, c/s 1070 treats it as if it were keeping track of MSI-X interrupt *delivery* (setting up vectors, handlers and so on -- this is what pci_disable_msix() undoes), while pci_disable_device() / disable_msi_mode() associates it with MSI-X interrupt *generation* (i.e. whether the PCI device is allowed to emit such interrupts).

* Thus a PCI device that was told not to emit MSI-X interrupts can still have MSI-X vectors and handlers in place; and indeed igbvf has, after igbvf_shutdown() returns. In this sense, upstream changesets 945 and 956 (see comment 17) can even be detrimental; they stop pci_disable_msi() / pci_disable_msix() from working if the corresponding disable_msi_mode() was called on the device previously. I think this is an igbvf problem -- after all, if a physical device is passed through to a domU, the domU takes responsibility. (A VF is a physical device in this respect.)

* For the kernel-xen component, this could be a CANTFIX. (As the patches mentioned this far can't fix the most common case, that is, the normal reboot of the guest, I'm obsoleting attachment 501323 [details].)

* For igbvf, there might already be a bug submitted. We could close this BZ as a dup of that then.

* I recommend adding "rmmod igbvf" to one of the kill scripts in such a domU. If this becomes our official standpoint, then backporting changeset 1070 (see comment 0) may make sense. (The added cleanup still won't work if the guest crashes after it has called disable_msi_mode() and set dev->msix_enabled to false.)

Created attachment 501508 [details]
"configurable" igbvf_shutdown()

Here's my adventurous excursion to device driver land. This patch adds the domU_shutdown module parameter to igbvf. Since I'm not sure what order of operations is allowed, I tried to make it configurable. Since igbvf_suspend() calls pci_disable_device() (see comment 25), after which no operation on the device seems to be allowed, setting domU_shutdown to 1 appears very wrong. Setting it to 2 seems wrong again, because at that time free_irq() has not yet been called, and releasing the MSI-X vectors before dissociating their IRQ handlers is invalid. Anyway, I added these only for the case that domU_shutdown=3 had not worked -- I wanted to be able to try 1 and 2 without a recompile.

Luckily, domU_shutdown=3, i.e. calling igbvf_remove() instead of igbvf_suspend() from igbvf_shutdown(), worked for me twice in a row: I rebooted the guest twice, and each time the new domain brought up the igbvf interface (I verified it with ping).

I'm certain that calling igbvf_remove() from igbvf_shutdown(), i.e. device_shutdown() context, violates a bunch of pci-dev invariants unknown to me; furthermore, it's not really nice to litter the igbvf driver with Xen-specific exceptions. Nonetheless, the patch still demonstrates IMO what operation is missing from the shutdown sequence. Setting "review?" because I'd like to run the idea by device driver experts.
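The core of the domU_shutdown=3 variant amounts to something like this (a sketch of the attached patch's intent, not its literal text; the experimental =1/=2 orderings are omitted):

    /* Sketch of the domU_shutdown=3 idea from attachment 501508 -- intent
     * only, not the literal patch. igbvf_remove() / igbvf_suspend() are the
     * driver's existing callbacks. */
    static int domU_shutdown; /* new module parameter, default 0 */
    module_param(domU_shutdown, int, 0);
    MODULE_PARM_DESC(domU_shutdown,
                     "0=suspend on shutdown (default), 3=full remove");

    static void igbvf_shutdown(struct pci_dev *pdev)
    {
            if (domU_shutdown == 3) {
                    /* runs igbvf_reset_interrupt_capability() and thus
                     * pci_disable_msix(), so the MSI-X pirqs get unmapped */
                    igbvf_remove(pdev);
                    return;
            }
            igbvf_suspend(pdev, PMSG_SUSPEND); /* original behavior */
    }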
Rob, can you please build a PV guest kernel with the patch in comment 26 and re-run your tests? Please do it like this:

- start out with a "clean" host, i.e. one under which the problematic igbvf messages never showed up before,
- make sure that the VF is not assigned to the PV guest,
- boot the PV guest with its current kernel,
- install the patched kernel,
- add "options igbvf domU_shutdown=3" to /etc/modprobe.conf,
- shut down the guest,
- assign the VF to the guest,
- start the reboot loop.

I'm setting NEEDINFO. Thanks!

Hi Laszlo,

I appreciate the work you have done in identifying and hopefully fixing this, but I no longer have access to the system involved with this bug. The production system was on a deadline and we were forced to use Squeeze (which did not suffer from this bug or from broken Intel 82599 NIC passthrough).

Hi Rob,

(In reply to comment #28)
> we were forced to use Squeeze

did you install Squeeze on the host, or on the guest? Or even both? Thanks.

Hi Laszlo,

Both. Squeeze Dom0 and paravirt Squeeze guest. The config was VF passthrough to multiple guests for 2 ports on an 82576EB, Dom0 only for 2 ports on another 82576EB, and passthrough of both ports on an 82599 10Gbit card (traditional passthrough on the 10Gbit ports because jumbo frames were needed for these). I used a script to set the MACs on all the VFs in Dom0 on bootup, but otherwise nothing unusual was needed.

I tried to look at *how* things go wrong when they go wrong, in more detail. Now I also think this is a dom0 bug, not domU and/or igbvf. Please correct anything wrong below.

It's about how dom0 mediates MSI-X vector (de)assignment on domU's behalf. Upstream and RHEL-5 are quite different in this aspect, I think; see c/s 680 in comment 17. Building on comment 25 and comment 11:

    domU: igbvf_set_interrupt_capability()
            pci_enable_msix()
              pci_frontend_enable_msix()
                do_pci_op(pdev, XEN_PCI_OP_enable_msix)
    hv:   ...
    dom0: pci_enable_msix()
            msix_capability_init()

So, domU's MSI-X enablement request is forwarded via the hv to dom0. It's dom0 running msix_capability_init() that logs the messages in comment 0.

dom0 maintains a (static) list of MSI-related PCI devices -- msi_dev_head (drivers/pci/msi-xen.c). Per c/s 680, upstream seems to manipulate this list in domU as well, but in RHEL-5 the list is only ever accessed in dom0, I believe. This list can only grow. It collects those PCI devices dom0 has ever touched wrt. MSI(-X) interrupts.
get_msi_dev_pirq_list() looks up a PCI device in this list -- if the device is not yet in the list, the function adds it and returns a pointer to it. Any such list entry owns a sub-list of (MSI-X vector, entry number) assignments that are in effect for the PCI device. The "msi_dev_head" list implements a mapping from PCI device to the set of (MSI-X vector, entry number) pairs active for the device:

    PCI device --> { (MSI-X vector, entry number) }

msix_capability_init() does the following (again, this runs only in dom0):

- It grabs (after possibly inserting) the "msi_dev_list" entry for the PCI device being MSI-X-activated.
- It compares the *requested* entry numbers, one by one, against the set of *active* entry numbers.
- For any requested entry number that's already maintained in the list, it only prints the warning.
- Any requested entry that's not yet maintained is mapped (see below) and added to the set with attach_pirq_entry().

The mapping happens like this (following from above):

    domU: igbvf_set_interrupt_capability()
            pci_enable_msix()
              pci_frontend_enable_msix()
                do_pci_op(pdev, XEN_PCI_OP_enable_msix)
    hv:   ...
    dom0: pci_enable_msix()
            msix_capability_init()
              msi_map_vector()            <---- called from the loop
                msi_map_pirq_to_vector()
                  HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq)

That is, dom0 asks the HV on behalf of domU to do the "real" MSI-X allocation, based on:

- the domid of the domU owning the PCI device,
- the PCI device,
- the entry number.

The returned "pirq"s are batched up in msix_capability_init() (all or nothing), then passed back to the domU via the above stack. Those "pirq" MSI-X interrupt vectors will be generated for the domU; in this case, igbvf is the driver that installs interrupt handlers for them with request_irq().

The MSI-X vector teardown goes similarly:

    domU: igbvf_reset_interrupt_capability()
            pci_disable_msix()
              pci_frontend_disable_msix()
                do_pci_op(pdev, XEN_PCI_OP_disable_msix)
    hv:   ...
    dom0: pci_disable_msix()
            msi_remove_pci_irq_vectors()  -- loops:
              msi_unmap_pirq()
                HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq)
              list_del()

The unmapping also relies on the domid of the domU owning the PCI device. Acquiring this domid happens via pciback_get_owner(), indirectly.

So what happens when the domU simply goes away, without calling pci_disable_msix(), and without dom0 calling pci_disable_msix() for it? Two things:

(1) The hypervisor *must* obviously delete all PIRQ assignments that were made towards this domU (i.e. emulate the effects of a PHYSDEVOP_unmap_pirq loop). I have no doubt this actually occurs.

(2) The "msi_dev_head" entry, stored in dom0, that corresponds to the PCI device will keep the full set of (MSI-X vector, entry number) pairs. This set is now stale; it should be empty. When the next PV guest (any guest) tries to set up the MSI-X vectors for this PCI device, the above filtering in msix_capability_init() will run on stale data, and omit any PHYSDEVOP_map_pirq hypercalls; the sketch below shows the loop in question.
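Concretely, the filtering loop in msix_capability_init() behaves roughly like this (a hedged reconstruction from the behavior described above, not verbatim msi-xen.c source; surrounding declarations and error handling are elided). This is where the "are not freed before acquire again" warnings of comment 0 are printed:

    /* Fragment of msix_capability_init(), reconstructed -- not verbatim.
     * A (vector, entry) pair already present in the per-device list is
     * treated as "already mapped": only a warning is printed and the
     * PHYSDEVOP_map_pirq hypercall is skipped. With a stale list, every
     * entry is skipped and the new domU gets no pirqs at all. */
    for (i = 0; i < nvec; i++) {
            int mapped = 0;

            list_for_each_entry(pirq_entry, &msi_dev_entry->pirq_list_head, list) {
                    if (pirq_entry->entry_nr == entries[i].entry) {
                            printk(KERN_WARNING
                                   "msix entry %d for dev %02x:%02x:%01x are "
                                   "not freed before acquire again.\n",
                                   entries[i].entry, dev->bus->number,
                                   PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
                            mapped = 1;
                            break;
                    }
            }
            if (mapped)
                    continue;

            /* not in the list: do the real allocation via the hypervisor */
            pirq = msi_map_vector(dev, entries[i].entry, table_base);
            if (pirq < 0)
                    break;
            attach_pirq_entry(pirq, entries[i].entry, msi_dev_entry);
            entries[i].vector = pirq;
    }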
Two consequences:

- since the hv assigns no pirqs to the domU, igbvf has no chance to work,
- when we then try to detach the device or remove the igbvf module in the domU, the unmapping loop in msi_remove_pci_irq_vectors(), running on stale data, will see all PHYSDEVOP_unmap_pirq hypercalls fail, and produce these hv / dom0 logs (see comment 15):

    (XEN) irq.c:934: dom2: pirq 255 not mapped
    unmap irq ff failed
    (XEN) irq.c:934: dom2: pirq 254 not mapped
    unmap irq fe failed
    (XEN) irq.c:934: dom2: pirq 253 not mapped
    unmap irq fd failed

The best place to fix this is probably pciback_reset_device() in drivers/xen/pciback/pciback_ops.c. The inspiration comes from c/s 1070 (see comment 0): it covers the "destroyed guest" case there too. We should introduce an extern function called msi_prune_pci_irq_vectors() in drivers/pci/msi-xen.c, and call that from pciback_reset_device().

Missed an important call in comment #33:

> The mapping happens like this (following from above):
>
> domU: igbvf_set_interrupt_capability()
>         pci_enable_msix()
>           pci_frontend_enable_msix()
>             do_pci_op(pdev, XEN_PCI_OP_enable_msix)
> hv:   ...
> dom0: pci_enable_msix()
>         msix_capability_init()
>           msi_map_vector()            <---- called from the loop
>             msi_map_pirq_to_vector()
>               HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq)

Then, if msi_map_vector() succeeded, we extend the set of (MSI-X vector, entry number) pairs for this PCI device:

>           attach_pirq_entry()         <---- also called from the loop

Yet another piece of the puzzle I didn't understand before: wherever I wrote "hv: ..." above, that doesn't actually involve the hypervisor. In those places there seems to be direct event channel signalling from the domU, drivers/xen/pcifront/pci_op.c: do_pci_op(), to dom0, drivers/xen/pciback/pciback_ops.c: pciback_do_op().

I think I *might* have found the way the domU can zero out the dom0 dev->msix_enabled field. Correcting/extending comment 25:

    domU: igbvf_shutdown()
            igbvf_suspend()
              pci_disable_device()
                pcibios_disable_device()
                  pcibios_disable_resources()
                    pci_write_config_word(
                      dev, PCI_COMMAND,
                      orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
                    )
                      ...
                      pcifront_bus_write()
                        do_pci_op(XEN_PCI_OP_conf_write)
    dom0: pciback_do_op()
            pciback_config_write()
              conf_space_write()
                command_write()  [via PCI_COMMAND funcptr]
                  pci_disable_device()
                    disable_msi_mode()
                      dev->msix_enabled = 0;

In short, when the domU manipulates PCI config space, that's forwarded to dom0. In pciback/dom0, many fields in the config space are "hooked" and handled specially. If both PCI_COMMAND_IO and PCI_COMMAND_MEMORY are masked off the PCI_COMMAND field, then dom0 disables the device and sets the msix_enabled flag to false, even though the MSI-X vectors are still allocated. That's why c/s 1070 is not enough in itself.
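The dom0-side hook at the end of that chain behaves roughly like this (a hedged sketch of pciback's PCI_COMMAND handler, reconstructed from the call chain above -- not verbatim RHEL-5 source):

    /* Hedged sketch of pciback's command_write() hook, reconstructed from
     * the chain above; not verbatim source. When the guest clears both
     * decode bits, dom0 disables the real device, and pci_disable_device()
     * -> disable_msi_mode() zeroes dev->msix_enabled even though the MSI-X
     * vectors are still allocated. */
    static int command_write(struct pci_dev *dev, int offset, u16 value, void *data)
    {
            if (!dev->is_enabled &&
                (value & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))) {
                    pci_enable_device(dev);
            } else if (dev->is_enabled &&
                       !(value & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))) {
                    pci_disable_device(dev);  /* -> disable_msi_mode() */
            }

            /* ... bus-mastering handling elided ... */
            return pci_write_config_word(dev, offset, value);
    }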
Created attachment 502139 [details]
backport c/s 1070 for the abrupt termination case (xm destroy)
Created attachment 502141 [details]
introduce msi_prune_pci_irq_vectors()
extend pciback_reset_device() with the following functionality (msi_prune_pci_irq_vectors()):
- remove all (MSI-X vector, entry number) pairs that might still belong, in dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might otherwise be leaked.
The function touches neither the hypervisor nor the PCI device; it only trims the list used by dom0 for filtering. The justification: when this is called, either the owner domain is already gone (or very close to being gone), or the device was already correctly detached.
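Structurally, the new function can be pictured like this -- a hedged sketch modeled after msi_remove_pci_irq_vectors(), with struct and field names approximated from the analysis above, not the literal attachment 502141:

    /* Hedged sketch of msi_prune_pci_irq_vectors() in msi-xen.c; names are
     * approximations, not the literal attachment. Only dom0's filtering
     * list is trimmed; neither the hypervisor nor the device is touched. */
    void msi_prune_pci_irq_vectors(struct pci_dev *dev)
    {
            struct msi_dev_list *msi_dev;   /* per-device entry in msi_dev_head */
            struct msi_pirq_entry *pirq_entry, *tmp;
            unsigned long flags;

            msi_dev = get_msi_dev_pirq_list(dev); /* finds or inserts the entry */

            spin_lock_irqsave(&msi_dev->pirq_list_lock, flags);
            list_for_each_entry_safe(pirq_entry, tmp,
                                     &msi_dev->pirq_list_head, list) {
                    /* drop the stale (MSI-X vector, entry number) pair */
                    list_del(&pirq_entry->list);
                    kfree(pirq_entry);
            }
            spin_unlock_irqrestore(&msi_dev->pirq_list_lock, flags);

            /* release the space saved for the MSI-X table, if any */
            if (msi_dev->mask_base) {
                    iounmap(msi_dev->mask_base);
                    msi_dev->mask_base = NULL;
            }
    }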
(In reply to comment #40) > Created attachment 502141 [details] > introduce msi_prune_pci_irq_vectors() modeled after msi_remove_pci_irq_vectors() brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=3361515 (In reply to comment #41) > (In reply to comment #40) > > Created attachment 502141 [details] > > introduce msi_prune_pci_irq_vectors() > > modeled after msi_remove_pci_irq_vectors() discussion with upstream: http://lists.xensource.com/archives/html/xen-devel/2011-06/msg00036.html Considering the consequences of attachment 502141 [details] for HVM passthrough: Since pciback is needed for device assignment even if the targeted domU is HVM, pciback_reset_device() and consequently introduce msi_prune_pci_irq_vectors() will be called for PCI devices originally assigned to HVM domains as well. (See callers of pciback_reset_device() in [1].) I intend to verify this with a host kernel that prints something in pciback_reset_device(). On the other hand, when an HVM domU enables MSI-X for a device passed through to it (see for example igbvf_set_interrupt_capability() in comment 33 for igbvf), the guest's pci_enable_msix() call will not end up in msi-xen.c and travel to dom0: domU: igbvf_set_interrupt_capability() pci_enable_msix() [msi-xen.c] pci_frontend_enable_msix() [pci_op.c] do_pci_op(pdev, XEN_PCI_OP_enable_msix) [pci_op.c] hv: ... dom0: pci_enable_msix() [msi-xen.c] msix_capability_init() [msi-xen.c] <---- filtering attach_pirq_entry() [msi-xen.c] <---- list manipulation Instead, the guest's pci_enable_msix() call will go directly to the hardware: domU: igbvf_set_interrupt_capability() pci_enable_msix() [msi.c] msix_capability_init() [msi.c] <----- hardware access That is, a PCI device passed through to an HVM guest does not hit dom0's "msi_dev_head" list at all. - pci_enable_msix() in PV domU: - device is added to dom0's "msi_dev_head" list if not yet there (note that this is separate from whatever accounts pciback keeps!) - MSI-X vector list belonging to the device is used for filtering - newly allocated vectors are saved in the MSI-X vec list - pci_enable_msix() in HVM domU: - doesn't reach dom0 at all - pciback_reset_device() --> msi_prune_pci_irq_vectors(): - runs in dom0 - doesn't care about the type of the device's previous (or next) owner domain - looks up device "msi_dev_head" list -- if it's not there, nothing changes - if it's there, empties the MSI-X vector list belonging to the entry - hardware, pci_dev, hypervisor are not touched Since pciback enforces that a device is owned by at most by one domU at any point in time, pciback_reset_device() --> msi_prune_pci_irq_vectors() can't empty the MSI-X vector set of a device that is assigned to "some other domain". (As said in the upstream thread [1], dom0's "msi_dev_head" list is completely unaware of who owns what.) IOW, when a HVM domU goes away and msi_prune_pci_irq_vectors() is called for a device that was passed through to the domU, two things can happen: - the device is not in the list --> no-op - the device is in the list (from an earlier PV passthrough), but the MSI-X vector set is empty below it --> no-op - the device is in the list with a non-empty vector set --> this should never happen, since when *the* PV domain owning the device went away, msi_prune_pci_irq_vectors() was called on the device and emptied the MSI-X vector set. Even if the list is non-empty, the function only cleans up the list; doesn't touch the hypervisor. 
And the iounmap() block should be idempotent (heh) on two levels: first the pointer is checked for being non-NULL, second it's checked against "vmlist".

[1] http://lists.xensource.com/archives/html/xen-devel/2011-06/msg00054.html

Created attachment 502633 [details]
debug messages for msi_prune_pci_irq_vectors() [ad-hoc testing]

My HVM-related tests seem to confirm comment 43: msi_prune_pci_irq_vectors() is called at other times as well, but it decays to a no-op.

When binding to pciback:

    (XEN) PCI add device 01:10.0
    pciback 0000:01:10.0: seizing device
    PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
    msi_prune_pci_irq_vectors(): enter
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start HVM domain -- OK, igbvf ping successful. Try to start the PV domain while the HVM one is running:

    Error: failed to assign device 0000:01:10.0 that has already been assigned to other domain.

HVM domain shutdown:

    (XEN) hvm.c:524:d1 DOM1/VCPU13: going offline.
    (XEN) hvm.c:524:d1 DOM1/VCPU14: going offline.
    msi_prune_pci_irq_vectors(): enter
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start PV domain: OK, igbvf ping successful. Try to start the HVM domain while the PV one is running:

    Error: failed to assign device 0000:01:10.0 that has already been assigned to other domain.

Reboot PV domain:

    msi_prune_pci_irq_vectors(): enter
    PCI: 0000:01:10.0: cleaning up MSI-X entry 0
    PCI: 0000:01:10.0: cleaning up MSI-X entry 1
    PCI: 0000:01:10.0: cleaning up MSI-X entry 2
    PCI: 0000:01:10.0: cleaning up mask_base
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

igbvf ping successful in the new PV domain. Shutdown PV domain:

    msi_prune_pci_irq_vectors(): enter
    PCI: 0000:01:10.0: cleaning up MSI-X entry 0
    PCI: 0000:01:10.0: cleaning up MSI-X entry 1
    PCI: 0000:01:10.0: cleaning up MSI-X entry 2
    PCI: 0000:01:10.0: cleaning up mask_base
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start HVM domain: OK, igbvf ping successful. Shutdown HVM domain:

    (XEN) hvm.c:524:d10 DOM10/VCPU13: going offline.
    msi_prune_pci_irq_vectors(): enter
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Tried a separate PV destroy: c/s 1070 (--> msi_remove_pci_irq_vectors) runs, and the unmappings fail (the domU is already gone):

    get owner for dev 80 get c
    unmap irq ff failed
    get owner for dev 80 get c
    unmap irq fe failed
    get owner for dev 80 get c
    unmap irq fd failed

msi_prune_pci_irq_vectors() follows immediately:

    msi_prune_pci_irq_vectors(): enter
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Message-Id: <1307089198-3818-1-git-send-email-lersek>

Created the following test plan to verify dynamic attach/detach:
    owner domain        next owner domain
    ----------------    -----------------
    rhel56-64bit-pv     rhel56-64bit-pv2
    rhel56-64bit-pv2    rhel56-64bit-hvm
    rhel56-64bit-hvm    rhel56-64bit-pv
    rhel56-64bit-pv     rhel56-64bit-hvm
    rhel56-64bit-hvm    rhel56-64bit-pv2
    rhel56-64bit-pv2    rhel56-64bit-pv
This should check all possible owner transitions. I started all three guests without assigning the PCI device to any of them. Then I tried to assign it to rhel56-64bit-pv (the PCI device was already seized by pciback):
[root@intel-s3ea2-08 ~]# xm pci-attach rhel56-64bit-pv 0000:01:10.0
Error: only HVM guest support pci attach
Retried with virsh:
[root@intel-s3ea2-08 ~]# cat dev.xml
<hostdev mode='subsystem' type='pci' managed='no'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
  </source>
</hostdev>
[root@intel-s3ea2-08 ~]# virsh attach-device rhel56-64bit-pv dev.xml
error: Failed to attach device from dev.xml
error: Requested operation is not valid: Xm driver only supports modifying persistent config
[root@intel-s3ea2-08 ~]# rpm -q libvirt xen xen-libs
libvirt-0.8.2-20.el5
xen-3.0.3-131.el5
xen-libs-3.0.3-131.el5
Beaker Test information:
HOSTNAME=intel-s3ea2-08.lab.bos.redhat.com
JOBID=90237
RECIPEID=184994
DISTRO=RHEL-5.7-Server-Beta-1.2
ARCHITECTURE=x86_64
So I think dynamic assignment to PV domains is not supported.
Testing HVM: since the guest is RHEL-5 (-5.6, -238), it's important to load the "acpiphp" module in it first; otherwise hotplugging the passed-through PCI device won't work.
Repeated the following twice:
- xm pci-attach rhel56-64bit-hvm 0000:01:10.0
- dhcp on igbvf in HVM guest
- verify with ping
- ifdown in guest
- xm pci-detach rhel56-64bit-hvm 0000:01:10.0
The pci-detach step triggers the call to msi_prune_pci_irq_vectors() in dom0 (these entry/exit log messages are written only by the debug printk patch, attachment 502633 [details]):
msi_prune_pci_irq_vectors(): enter
msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list
I left the HVM domain running, and booted, rebooted and shut down the PV guest (with persistent PCI passthrough config, checking igbvf with ping). The two PV domain teardowns logged the following in dom0:
PCI: 0000:01:10.0: cleaning up MSI-X entry 0
PCI: 0000:01:10.0: cleaning up MSI-X entry 1
PCI: 0000:01:10.0: cleaning up MSI-X entry 2
PCI: 0000:01:10.0: cleaning up mask_base
Finally re-attached the device to the HVM domain and confirmed igbvf availability with ping.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Patch(es) available in kernel-2.6.18-283.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

Reproduced the bug on kernel-xen-2.6.18-274.el5, and verified the bug on kernel-xen-2.6.18-300.el5 and xen-3.0.3-135.el5. Steps:

1. Assign one VF to a RHEL5.8 PV guest.
2. Reboot the guest.

On kernel-xen-2.6.18-274.el5, the following messages show up in the Domain0 dmesg:

    blktap: ring-ref 9, event-channel 18, protocol 1 (x86_64-abi)
    PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
    PCI: Setting latency timer of device 0000:01:10.0 to 64
    msix entry 0 for dev 01:10:0 are not freed before acquire again.
    msix entry 1 for dev 01:10:0 are not freed before acquire again.
    msix entry 2 for dev 01:10:0 are not freed before acquire again.

On kernel-xen-2.6.18-300.el5 there are no such messages in the Domain0 dmesg, and the VF is initialized successfully; the Domain0 dmesg reads as follows:

    tap tap-2-51712: 2 getting info
    pciback: vpci: 0000:01:10.0: assign to virtual slot 0
    blktap: ring-ref 9, event-channel 18, protocol 1 (x86_64-abi)
    PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
    PCI: Setting latency timer of device 0000:01:10.0 to 64
    get owner for dev 80 get 2
    get owner for dev 80 get 2
    get owner for dev 80 get 2

Based on the above info, moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html