Bug 688673

Summary: PCI Virtual Function Passthrough - SR-IOV, Paravirt Guest fails to obtain IRQ after reboot
Product: Red Hat Enterprise Linux 5
Reporter: Rob
Component: kernel-xen
Assignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: unspecified
Version: 5.6
CC: drjones, jarod, leiwang, lersek, mrezanin, pasik, pbonzini, pcao, prarit, qwan, sassmann, xen-maint, yuzhang, yuzhou
Target Milestone: rc
Target Release: 5.8
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.18-283.el5
Doc Type: Bug Fix
Last Closed: 2012-02-21 03:32:11 UTC
Bug Blocks: 514489
Attachments:
  fix msi(x) handling (flags: none)
  "configurable" igbvf_shutdown() (flags: none)
  backport c/s 1070 for the abrupt termination case (xm destroy) (flags: none)
  introduce msi_prune_pci_irq_vectors() (flags: none)
  debug messages for msi_prune_pci_irq_vectors() [ad-hoc testing] (flags: none)

Description Rob 2011-03-17 17:11:08 UTC
Description of problem:
A passed-through PCI Virtual Function is detected and works perfectly on the first boot of a paravirt domU. If the paravirt domU is rebooted, or on any subsequent boot of the domU, detection of the passed-through Virtual Function fails. An HVM domU with a passed-through PCI VF always works fine.

Version-Release number of selected component (if applicable):
5u6

How reproducible:
Always

Steps to Reproduce:
1. Boot the 2.6.18-247 Dom0 Xen kernel with iommu=pv and the usual Red Hat SR-IOV options
2. Start a PV domU with PCI Virtual Function passthrough; it works fine
3. Reboot the domU; now it's broken
  
Actual results:
On domU reboot, the PCI Virtual Function fails to initialise in the domU.
Dom0 log shows:
msix entry 0 for dev 09:10:0 are not freed before acquire again.
msix entry 1 for dev 09:10:0 are not freed before acquire again.
msix entry 2 for dev 09:10:0 are not freed before acquire again.

Expected results:
On domU reboot, the passed-through PCI VF device should be detected just as it was on the initial domU boot.

Additional info:
This is almost certainly because the 2.6.18-247 Xen kernel needs this patch:
http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/2994d2997f9d

Please could you apply it.

Comment 2 Andrew Jones 2011-03-18 09:36:48 UTC
Hi Rob,

How certain are you about the patch you've pointed out? Did you get the WARN() that it describes? Would you mind testing a kernel with the patch that I can build for you?

Thanks,
Drew

Comment 3 Rob 2011-03-18 10:47:26 UTC
Hi Drew,


I've just finished building pciback from the 247 kernel source with the patch. The patch applied cleanly and partially fixes the problem.

Abrupt paravirt VM restart is now OK (e.g. "xm destroy testvm && xm create testvm" now works).
A clean restart, or a clean shutdown of the VM followed by "xm create", still shows the problem.

Additional information appears on the Dom0 console when the failures occur. The VM has 2 Intel 82576 VF assignments from 2 different Intel 82576 LAN chips, and the extra info on the dom0 console looks like this:
msix entry 0 for dev 05:10:0 are not freed before acquire again.
msix entry 1 for dev 05:10:0 are not freed before acquire again.
msix entry 2 for dev 05:10:0 are not freed before acquire again.
msix entry 0 for dev 09:10:0 are not freed before acquire again.
msix entry 1 for dev 09:10:0 are not freed before acquire again.
msix entry 2 for dev 09:10:0 are not freed before acquire again.
unmap irq ff failed
unmap irq fe failed
unmap irq fd failed
unmap irq fc failed
unmap irq fb failed
unmap irq fa failed

What is responsible for unassigning the MSI-X IRQs when the VM is shut down or rebooted? I'm using the igbvf driver that is included with the 247 kernel.

Comment 4 Rob 2011-03-18 11:29:48 UTC
Forgot to add: I don't know if I got the WARN() before the patch, because I'm not sure whether I needed to enable additional debugging, or even where this WARN() is meant to appear (domU log?).

Comment 5 Andrew Jones 2011-03-18 11:58:40 UTC
Yeah, it's not clear which exact WARN() it would be, but you'd find it in the guest console output during boot or the guest dmesg. We'll set up a test system here to try and reproduce the issue as well.

Comment 6 Rob 2011-03-18 16:12:43 UTC
Hi Andrew,

Please let me know your results. All of the above is with the igb and igbvf drivers. The igb driver in Dom0 behaves the same regardless of whether I used the included version or the latest Intel 2.4.3 version.

I have had less luck with the X520-DA2 (ixgbe) card. It refuses to detect a link, although I can ping between the Dom0s on it fine. Strangely enough, I can compile and install the Intel 1.0.7 ixgbevf driver on the same paravirt domU; it gives some more info in dmesg but behaves the same.

Comment 7 Rob 2011-03-21 08:35:01 UTC
There's a pcifront patch here, "Fix Freeing of Device", that won't apply cleanly but sounds like it might be what's needed to sort out the reboot and clean shutdown issues:
http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/5dd203fba380

Comment 8 Andrew Jones 2011-03-22 16:56:07 UTC
Hi Rob,

We got a test machine set up and were able to reproduce the problem. We got the same msix messages. After the second boot, when it wasn't working, I attempted a 'modprobe -r igbvf' to reinit the module. This "fixed" it. I'm still poking at the logs from both the host and the guest of the whole process to try and sort out why we're not properly disabling MSI-X in the first place. pcifront might be the right place to look in the end, but in light of the 'modprobe -r' experiment, I'll start with the driver and work down.

Drew

Comment 9 Rob 2011-03-22 17:26:21 UTC
Hi Drew,

Thanks for the update. For some reason I couldn't compile the updated igbvf driver from Intel's site; the updated ixgbevf driver and the Dom0 drivers compile OK.

Do you have an Intel 82599 card (uses the ixgbe/ixgbevf driver) to test with?


Rob

Comment 11 Laszlo Ersek 2011-05-24 13:25:58 UTC
Here's a wild guess:

(In reply to comment #3)

> Additional information is appearing in Dom console when the failures occur,
> the VM has 2 Intel 82576 VF assignments from 2 different Intel 82576 lan
> chips and the increased info in dom0 console looks like this:
> msix entry 0 for dev 05:10:0 are not freed before acquire again.
> msix entry 1 for dev 05:10:0 are not freed before acquire again.
> msix entry 2 for dev 05:10:0 are not freed before acquire again.
> msix entry 0 for dev 09:10:0 are not freed before acquire again.
> msix entry 1 for dev 09:10:0 are not freed before acquire again.
> msix entry 2 for dev 09:10:0 are not freed before acquire again.

We have two loops here (one for each device), each running from 0 (inclusive) to 3 (exclusive). This closely resembles the igbvf driver's igbvf_set_interrupt_capability() function. Quote:

        /* we allocate 3 vectors, 1 for tx, 1 for rx, one for pf messages */
        adapter->msix_entries = kcalloc(3, sizeof(struct msix_entry),
                                        GFP_KERNEL);
        if (adapter->msix_entries) {
                for (i = 0; i < 3; i++)
                        adapter->msix_entries[i].entry = i;

                err = pci_enable_msix(adapter->pdev,
                                      adapter->msix_entries, 3);
        }

pci_enable_msix() is implemented in "drivers/pci/msi-xen.c", which is also the only place that logs messages of the above form.

igbvf_reset_interrupt_capability() seems to undo this (consistently with the patch pointed to in comment 0):

        if (adapter->msix_entries) {
                pci_disable_msix(adapter->pdev);
                kfree(adapter->msix_entries);
                adapter->msix_entries = NULL;
        }

igbvf_reset_interrupt_capability() is called:
- on the igbvf_set_interrupt_capability() error/cleanup path,
- on the igbvf_probe() error/cleanup path,
- I think it is also called (indirectly) at suspend time, via igbvf_suspend() ->
  pci_disable_device() -> disable_msi_mode(),
- and most importantly by igbvf_remove(), which handles "rmmod" (see comment 8)
  or hot-unplugging the device.

Perhaps it is not invoked at shutdown.

Tangentially, the ixgbe driver provides an IntMode parameter, so that one can force MSI (=1) or legacy (=0) interrupts.

(In reply to comment #7)
> Theres a pcifront patch here that "Fix Freeing of Device", wont apply cleanly
> but sounds like it might be whats needed to sort the reboot and clean shutdown
> issues:
> http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/5dd203fba380

I think that's unrelated. It makes sure that the dynamic IRQ and the underlying event channel are correctly disassociated, and that both are correctly released.

OTOH, RHEL-5's pcifront is not even interrupt-driven; we only have an event channel that we manually trigger and poll in do_pci_op().

Comment 13 Laszlo Ersek 2011-05-25 22:06:20 UTC
Side note:

(In reply to comment #0)

> Steps to Reproduce:
> 1. Boot 2.6.18-247 Dom0 Xen Kernel with iommu=pv and usual Redhat SR-IOV
> options

iommu=pv is not a RHEL-5 xen cmdline option

from -264 drivers/passthrough/iommu.c::parse_iommu_param(): the recognized values are off/no/false/0/disable, force/required, passthrough, snoop, no-snoop, no-intremap, amd-iommu-debug, amd-iommu-perdev-intremap.

/*
 * The 'iommu' parameter enables the IOMMU.  Optional comma separated
 * value may contain:
 *
 *   off|no|false|disable       Disable IOMMU (default)
 *   force|required             Don't boot unless IOMMU is enabled
 *   passthrough                Bypass VT-d translation for Dom0
 *   snoop                      Utilize the snoop control for IOMMU (default)
 *   no-snoop                   Dont utilize the snoop control for IOMMU
 *   amd-iommu-debug            Turn on debug info for AMD IOMMU
 */

Comment 14 Laszlo Ersek 2011-05-25 22:07:50 UTC
(In reply to comment #13)

> iommu=pv is not a RHEL-5 xen cmdline option

... but it enables the iommu functionality just the same as iommu=on (which is also not a distinguished option argument)

Comment 15 Laszlo Ersek 2011-05-25 23:33:03 UTC
Reproduced on 2.6.18-264.el5xen (both host and guest version).

xen cmdline:

  iommu=on

dom0 cmdline:

  pci_pt_e820_access=on

new /etc/modprobe.conf entry:

  options igb max_vfs=1

After module reinsert or host reboot, get a list of VFs provided by igb type cards:

  lspci -nn -d 8086:10ca -D

Pick one, eg.

  UNB=0000:01:10.0

and set it up for passthrough:

  echo "$UNB" > /sys/bus/pci/drivers/igbvf/unbind
  modprobe pciback hide="($UNB)"

alternatively:

  modprobe pciback
  echo "$UNB" > /sys/bus/pci/drivers/pciback/new_slot
  echo "$UNB" > /sys/bus/pci/drivers/pciback/bind

check if device is indeed assignable:

  xm pci-list-assignable-devices

add to the guest while guest is shut down:

  echo "pci = [ \"$UNB\" ]" >> /etc/xen/rhel56-64bit-pv

Boot the guest, then reboot it. Dom0 log:

  msix entry 0 for dev 01:10:0 are not freed before acquire again.
  msix entry 1 for dev 01:10:0 are not freed before acquire again.
  msix entry 2 for dev 01:10:0 are not freed before acquire again.

Removing now the igbvf module in the guest, dom0/Xen log:

  (XEN) irq.c:934: dom2: pirq 255 not mapped
  unmap irq ff failed
  (XEN) irq.c:934: dom2: pirq 254 not mapped
  unmap irq fe failed
  (XEN) irq.c:934: dom2: pirq 253 not mapped
  unmap irq fd failed

Comment 16 Laszlo Ersek 2011-05-27 09:41:56 UTC
Tracing onward from drivers/base/power/shutdown.c::device_shutdown() yields:


...
igbvf 0000:00:00.0: shutdown via bus op ffffffff80352999
igbvf_suspend: enter
igbvf_suspend: right before pci_disable_device()
pci_disable_device: enter
pci_disable_device: exit
igbvf_suspend: exit: 0
 pci0000:00: no shutdown method
pcifront pci-0: shutdown via bus op ffffffff803be8ee
...

$ addr2line -fie vmlinux-syms-2.6.18.test <<<ffffffff80352999
pci_device_shutdown
/root/src/rhel5/kernel/drivers/pci/pci-driver.c:388

drivers/pci/pci.c::pci_disable_device() looks like this:

   871  void
   872  pci_disable_device(struct pci_dev *dev)
   873  {

   877    printk(KERN_INFO "%s: enter\n", __FUNCTION__);

   883    if (dev->msi_enabled)
   884      disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
   885        PCI_CAP_ID_MSI);
   886    if (dev->msix_enabled) {
   887      printk(KERN_INFO "%s: disabling msix\n", __FUNCTION__);
   888      disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
   889        PCI_CAP_ID_MSIX);
   890    }

   902    printk(KERN_INFO "%s: exit\n", __FUNCTION__);
   903  }

The shutdown functions are invoked correctly, but "dev->msix_enabled" is false for some reason, so disable_msi_mode() is not called. Conversely, as written in comment 11, when the module is removed:

igbvf_remove()
  -> igbvf_reset_interrupt_capability() -- checks "adapter->msix_entries"
    -> pci_disable_msix() -- in drivers/pci/msi-xen.c

Now "msi-xen.c" also provides a disable_msi_mode() function, therefore I think once we can set "dev->msix_enabled" correctly, the code to disable MSIX under Xen should be already there.

Comment 17 Laszlo Ersek 2011-05-27 10:42:31 UTC
The problem is in RHEL-5's drivers/pci/msi-xen.c::pci_enable_msix(). igbvf_set_interrupt_capability() calls it.

In RHEL-5 (as of -264), pci_enable_msix() does not set dev->msix_enabled to 1, whereas upstream (as of 1086:415a9b435fef) does. Comparing the two flavors of pci_enable_msix() and blaming upstream for the lines the diff adds, the backport candidates are as follows:

c/s 680 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/583086d5cd26
pci-msi: Dynamically map into Linux irq space.

c/s 750 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/63a878f8851b
Fix buggy mask_base in saving/restoring MSI-X table during S3

c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
xen/PCI/MSI-X: Don't disable when it was never enabled

The last one adds "dev->msix_enabled = 1". The other two may be necessary as a basis.
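
To make it concrete, here is a rough sketch (not the actual changeset diff; surrounding context is elided and only approximate) of what the msi-xen.c pci_enable_msix()/pci_disable_msix() pair ends up doing once c/s 956's flag handling is in place:

  int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
  {
          int status;

          /* ... existing argument checks and the dom0/domU split elided ... */
          status = msix_capability_init(dev, entries, nvec);
          if (!status)
                  dev->msix_enabled = 1;   /* the flag c/s 956 starts setting */
          return status;
  }

  void pci_disable_msix(struct pci_dev *dev)
  {
          if (!dev->msix_enabled)          /* "don't disable when never enabled" */
                  return;
          /* ... existing teardown (msi_remove_pci_irq_vectors() etc.) elided ... */
          dev->msix_enabled = 0;
  }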

I'd say c/s 680 does not apply, because we have none of evtchn_get_xen_pirq() / evtchn_map_pirq().

Projecting c/s 750 to the diff between RHEL-5's and upstream's pci_enable_msix(), it takes the blame for a single line inside this diff:

http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/63a878f8851b#l1.148

c/s 750 modified the signature of attach_pirq_entry(), so it had to update the block that c/s 680 added to pci_enable_msix() as well, because that block calls attach_pirq_entry() too.

Since 680 is N/A for us, and further because we already seem to have the other parts of c/s 750 (see commit a8614c21, bug 484227), I claim c/s 956 is the only thing we need to backport from the above three.

In addition, the same needs to be done for MSI; c/s 945 seems to describe and fix the exact problem captured in this BZ. In summary:

c/s 945 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7bae5a000157
xen: Dont call msi_unmap_pirq() if did not enable msi

c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
xen/PCI/MSI-X: Don't disable when it was never enabled

Comment 18 Laszlo Ersek 2011-05-27 10:46:56 UTC
(In reply to comment #17)

> In summary:
> 
> c/s 945 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7bae5a000157
> xen: Dont call msi_unmap_pirq() if did not enable msi
> 
> c/s 956 http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/7ee9b1483445
> xen/PCI/MSI-X: Don't disable when it was never enabled

(plus, from comment 0:)

c/s 1070 http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/2994d2997f9d
pciback: disable MSI/MSI-X when resetting device

Comment 20 Laszlo Ersek 2011-05-27 14:57:30 UTC
Created attachment 501323 [details]
fix msi(x) handling

backport upstream c/s 945, 956, 1070

Comment 22 Laszlo Ersek 2011-05-27 22:25:41 UTC
Found xen-devel thread here:

http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01213.html

Gives me the idea to try "iommu=no-intremap" together with the patch.

Comment 24 Laszlo Ersek 2011-05-28 01:23:16 UTC
(In reply to comment #20)
> Created attachment 501323 [details]
> fix msi(x) handling
> 
> backport upstream c/s 945, 956, 1070

not good enough, problem persists

Comment 25 Laszlo Ersek 2011-05-28 13:49:44 UTC
Comment on attachment 501323 [details]
fix msi(x) handling

After carefully studying Documentation/MSI-HOWTO.txt, the igbvf driver, the
related parts of the PCI subsystem, and msi-xen.c, I have the following to
say.

Here's a distillation/interpretation of MSI-HOWTO.txt, wrt. MSI-X:

* Normally the PCI device starts up in PIN-IRQ assertion (legacy) mode.
  dev->irq holds the pre-assigned IOAPIC vector.

* The driver calls pci_enable_msix(). It is an all-or-nothing function. It
  tries to allocate the requested number of MSI vectors. The request specifies
  the number of needed MSI vectors, and their unique entry positions in the
  (device dependent) MSI-X table. The entry positions seem nothing more than
  unique serial numbers.

  The function allocates an MSI vector for each entry requested. The returned
  MSI vector values are arbitrary and can vary from pci_enable_msix() call to
  pci_enable_msix() call.

  The original dev->irq value is saved by the PCI subsystem. On return, it is
  not modified. Instead the function returns the allocated MSI vectors in the
  request array, each associated with the corresponding, requested entry. The
  device is switched to MSI-X mode and further use of dev->irq is undefined
  behavior. Since there is no storage allocated in *dev for the MSI vectors,
  the driver must save (copy out or just keep) the returned MSI vectors on its
  own.

* After a successful return from pci_enable_msix(), the driver must call
  request_irq() on each MSI vector value, ie. assign a (possibly separate)
  interrupt handler to each.

* When the device driver is *unloading*, all interrupt handlers assigned to
  the MSI vectors must be dissociated with free_irq().

  The driver calls pci_disable_msix(). It marks the MSI vectors as unused, and
  restores PIN-IRQ assertion (legacy) mode. dev->irq becomes defined again.
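
To make the sequence above concrete, here's a minimal, generic sketch of the lifecycle (not igbvf code; "pdev" and "my_irq_handler" are stand-in names, error handling omitted):

  struct msix_entry vecs[3];
  int i, err;

  for (i = 0; i < 3; i++)
          vecs[i].entry = i;                /* unique entry positions */

  err = pci_enable_msix(pdev, vecs, 3);     /* all-or-nothing */
  if (err == 0) {
          /* vecs[i].vector now holds the allocated MSI vectors; dev->irq is
             not updated, so the driver must keep this array itself. */
          for (i = 0; i < 3; i++)
                  request_irq(vecs[i].vector, my_irq_handler, 0, "mydev", pdev);
  }

  /* ... later, when the driver is unloading ... */
  for (i = 0; i < 3; i++)
          free_irq(vecs[i].vector, pdev);
  pci_disable_msix(pdev);                   /* back to PIN-IRQ (legacy) mode */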


Enter the igbvf driver. The driver issues the following calls to
pci_disable_msix():

  igbvf_probe == igbvf_driver.probe
    igbvf_sw_init
      igbvf_set_interrupt_capability (error path)
        igbvf_reset_interrupt_capability
          pci_disable_msix

  igbvf_probe (error path) == igbvf_driver.probe
    igbvf_reset_interrupt_capability
      pci_disable_msix

  igbvf_remove == igbvf_driver.remove
    igbvf_reset_interrupt_capability
      pci_disable_msix


If the igbvf module is removed in the guest before shutdown,
pci_disable_msix() is called correctly. A subsequent restart will complete
successfully; I just tested it and ping worked on the igbvf interface of the
rebooted guest.

The igbvf driver's shutdown() callback method (called from device_shutdown(),
see comment 16) runs like this:

igbvf_shutdown() == igbvf_driver.shutdown
  igbvf_suspend()
    pci_disable_device()

pci_disable_device() does the following (if dev->msix_enabled is true):

* First it searches the PCI device's (= igbvf's) capability list for a
  structure that controls Message Signalled Interrupts.

* Once the position of this MSI capability structure is known,
  pci_disable_device() calls disable_msi_mode(dev, position, PCI_CAP_ID_MSIX)
  on it.

* disable_msi_mode() reads a word from the control register under this
  position (ie. the MSI control register), masks out the PCI_MSIX_FLAGS_ENABLE
  bit in the word (via msix_disable()), then writes it back to the register.

  disable_msi_mode(... PCI_CAP_ID_MSIX) also sets dev->msix_enabled to false.

* pci_disable_device() goes on to turn off bus-mastering (again, control
  register manipulation), and calls pcibios_disable_device() as the final
  step.

* On i386, pcibios_disable_device() turns off IO and memory access
  (pcibios_disable_resources(), control register manipulation), and, since
  we're under Xen  (pcibios_disable_irq == NULL), does nothing else.

Accordingly, igbvf_shutdown() *never* calls pci_disable_msix() -- not even on
bare metal --, therefore the following call chain can never run:

  pci_disable_msix()
    pci_frontend_disable_msix()
      do_pci_op()
        call the XEN_PCI_OP_disable_msix hypervisor operation

On the bare metal this is no problem: once igbvf_shutdown() returns and
device_shutdown() completes walking all the devices, the physical host is
rebooted and the device is returned to PIN-IRQ assertion (legacy) mode.
The hypervisor doesn't forget though.

Looking at the upstream changeset 1070:2994d2997f9d, suggested in comment 0, I
observe the following:

* First note that pci_disable_msix() is explicitly called before
  pci_disable_device(). This corroborates that pci_disable_device() does *not*
  clean up the MSI-X resources -- and igbvf_shutdown() only calls
  pci_disable_device().

* Second, this patch handles the abrupt guest termination case. Unfortunately,
  we can't apply this verbatim to a normally terminated guest: in c/s 1070,
  calling pci_disable_msix() depends on dev->msix_enabled. But -- referring
  back to pci_disable_device() / disable_msi_mode() -- dev->msix_enabled is
  already zero after the guest has shut down in a controlled manner.

  This seems to imply that dev->msix_enabled has mixed responsibilities:

  on one hand, c/s 1070 treats it as if it were keeping track of MSI-X
  interrupt *delivery* (setting up vectors, handlers and so on -- this is what
  pci_disable_msix() undoes),

  while pci_disable_device() / disable_msi_mode() associates it with MSI-X
  interrupt *generation* (ie. whether the PCI device is allowed to emit such
  interrupts).

* Thus a PCI device that was told not to emit MSI-X interrupts can have MSI-X
  vectors and handlers in place; and indeed igbvf has, after igbvf_shutdown()
  returns.

  In this sense, upstream changesets 945 and 956 (see comment 17) can even be
  detrimental; they stop pci_disable_msi() / pci_disable_msix() from working
  if the corresponding disable_msi_mode() was called on the device previously.


I think this is an igbvf problem -- after all, if a physical device is passed
through to a domU, the domU takes responsibility. (A VF is a physical device
in this respect.)

* For the kernel-xen component, this could be a CANTFIX.

  (As the patches mentioned thus far can't fix the most common case, that is,
  the normal reboot of the guest, I'm obsoleting attachment 501323 [details].)

* For igbvf, there already might be a bug submitted. We could close this BZ as
  a dup of that then.

* I recommend adding "rmmod igbvf" to one of the kill scripts in such a domU.
  If this will be our official standpoint, then backporting changeset 1070
  (see comment 0) may make sense. (The added cleanup still won't work if the
  guest crashes after it has called disable_msi_mode() and set
  dev->msix_enabled to false.)

Comment 26 Laszlo Ersek 2011-05-28 19:06:16 UTC
Created attachment 501508 [details]
"configurable" igbvf_shutdown()

Here's my adventurous excursion to device driver land.

This patch adds the domU_shutdown module parameter to igbvf. Since I'm not sure what order of operations is allowed, I tried to make it configurable. Since igbvf_suspend() calls pci_disable_device() (see comment 25), after which no operation on the device seems to be allowed, setting domU_shutdown to 1 appears very wrong.

Setting it to 2 seems wrong again, because at that time free_irq() was not yet called, and releasing the MSI-X vectors before dissociating their IRQ handlers is invalid.

Anyway, I added these only in case domU_shutdown=3 had not worked -- I wanted to be able to try 1 and 2 without a recompile. Luckily, domU_shutdown=3, ie. calling igbvf_remove() instead of igbvf_suspend() from igbvf_shutdown(), worked for me twice in a row: I rebooted the guest twice, and each time the new domain brought up the igbvf interface (I verified it with ping).

I'm certain that calling igbvf_remove() from igbvf_shutdown(), ie. device_shutdown() context violates a bunch of pci-dev invariants unknown to me; furthermore, it's not really nice to litter the igbvf driver with Xen-specific exceptions. Nonetheless, the patch still demonstrates IMO what operation is missing from the shutdown sequence.
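
For illustration, here is a sketch of the idea only (not the attached patch itself; the parameter plumbing and the exact suspend signature are assumptions), i.e. a module parameter that selects what igbvf_shutdown() does:

  static int domU_shutdown;                 /* 0 = default suspend path */
  module_param(domU_shutdown, int, 0444);
  MODULE_PARM_DESC(domU_shutdown, "3 = run the remove path at shutdown");

  static void igbvf_shutdown(struct pci_dev *pdev)
  {
          if (domU_shutdown == 3) {
                  /* igbvf_remove() ends up in igbvf_reset_interrupt_capability(),
                     which calls pci_disable_msix() -- the missing operation. */
                  igbvf_remove(pdev);
                  return;
          }
          igbvf_suspend(pdev, PMSG_SUSPEND); /* current behaviour */
  }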

Setting "review?" because I'd like to run the idea by device driver experts.

Comment 27 Laszlo Ersek 2011-05-28 19:13:10 UTC
Rob,

can you please build a PV guest kernel with the patch in comment 26 and re-run your tests? Please do it like this:
- start out with a "clean" host, ie. one under which the problematic igbvf
  messages never showed up before,
- make sure that the VF is not assigned to the PV guest,
- boot the PV guest with its current kernel,
- install the patched kernel,
- add "options igbvf domU_shutdown=3" to /etc/modprobe.conf,
- shut down the guest,
- assign the VF to the guest,
- start the reboot loop.

I'm setting NEEDINFO. Thanks!

Comment 28 Rob 2011-05-29 12:56:14 UTC
Hi Laszlo,

I appreciate the work you have done in identifying and hopefully fixing this, but I no longer have access to the system involved with this bug. The production system was on a deadline and we were forced to use Squeeze (which did not suffer from this bug or from broken Intel 82559 NIC passthrough).

Comment 29 Laszlo Ersek 2011-05-30 13:28:49 UTC
Hi Rob,

(In reply to comment #28)
> we were forced to use Squeeze

did you install Squeeze on the host, or on the guest? Or even both? Thanks.

Comment 30 Rob 2011-05-30 14:27:46 UTC
Hi Laszlo,


Both. Squeeze Dom0 and paravirt Squeeze guest. The config was VF passthrough to multiple guests for 2 ports on an 82576EB, Dom0-only for 2 ports on another 82576EB, and passthrough of both ports on an 82599 10Gbit card (traditional passthrough on the 10Gbit ports because jumbo frames were needed for these). I used a script to set the MACs on all the VFs in Dom0 on bootup, but otherwise nothing unusual was needed.

Comment 33 Laszlo Ersek 2011-05-31 13:29:13 UTC
I tried to look at *how* things go wrong when they go wrong, in more detail.
Now I also think this is a dom0 bug, not domU and/or igbvf. Please correct
anything wrong below.

It's about how dom0 mediates MSI-X vector (de)assignment on domU's behalf.
Upstream and RHEL-5 are quite different in this aspect I think, see c/s 680 in
comment 17.

Building on comment 25 and comment 11:

domU: igbvf_set_interrupt_capability()
        pci_enable_msix()
          pci_frontend_enable_msix()
            do_pci_op(pdev, XEN_PCI_OP_enable_msix)
hv:           ...
dom0:         pci_enable_msix()
                msix_capability_init()

So, domU's MSI-X enablement request is forwarded via the hv to dom0. It's dom0
running msix_capability_init() that logs the messages in comment 0.

dom0 maintains a (static) list of MSI-related PCI devices -- msi_dev_head
(drivers/pci/msi-xen.c). Per c/s 680, upstream seems to manipulate this
list in domU as well, but in RHEL-5 the list is only ever accessed in dom0, I
believe.

This list can only grow. It collects those PCI devices dom0 has ever touched
wrt. MSI(-X) interrupts. get_msi_dev_pirq_list() looks up a PCI device in this
list -- if the device is not yet in the list, the function adds it and returns
a pointer to it. Any such list entry owns a sub-list of (MSI-X-vector, entry
number) assignments that are in effect for the PCI device. The "msi_dev_head"
list implements a mapping, from PCI device to the set of (MSI-X vector, entry
number) pairs active for the device.

PCI device --> { (MSI-X vector, entry number) }

msix_capability_init() does the following (again, this runs only in dom0):
- It grabs (after possibly inserting) the "msi_dev_list" entry for the PCI
  device being MSI-X-activated.
- It compares the *requested* entry numbers, one by one, against the set of
  *active* entry numbers.
- For any requested entry number that's already maintained in the list, it
  only prints the warning.
- Any requested entry that's not yet maintained is mapped (see below) and
  added to the set with attach_pirq_entry().

The mapping happens like this (following from above):

domU: igbvf_set_interrupt_capability()
        pci_enable_msix()
          pci_frontend_enable_msix()
            do_pci_op(pdev, XEN_PCI_OP_enable_msix)
hv:           ...
dom0:         pci_enable_msix()
                msix_capability_init()
                  msi_map_vector()                  <---- called from the loop
                    msi_map_pirq_to_vector()
                      HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq)

That is, dom0 asks the HV on behalf of domU to do the "real" MSI-X allocation,
based on:
- domid of the domU owning the PCI device,
- PCI device,
- entry number.

The returned "pirq"'s are batched up in msix_capability_init() (all or
nothing), then passed back to the domU via the above stack. Those "pirq" MSI-X
interrupt vectors will be generated for the domU; in this case, igbvf is the
driver that installs interrupt handlers for them with request_irq().

The MSI-X vector teardown goes similarly:

domU: igbvf_reset_interrupt_capability()
        pci_disable_msix()
          pci_frontend_disable_msix()
            do_pci_op(pdev, XEN_PCI_OP_disable_msix)
hv:           ...
dom0:           pci_disable_msix()
                  msi_remove_pci_irq_vectors() -- loops:
                    msi_unmap_pirq()
                      HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq)
                    list_del()

The unmapping also relies on the domid of the domU owning the PCI device.
Acquiring this domid happens via pciback_get_owner(), indirectly.

So what happens when the domU simply goes away, without calling
pci_disable_msix(), and without dom0 calling pci_disable_msix() for it? Two
things:

(1) The hypervisor *must* obviously delete all PIRQ assignments that were made
    towards this domU (ie. emulate the effects of a PHYSDEVOP_unmap_pirq
    loop). I have no doubt this actually occurs.

(2) The "msi_dev_head" entry, stored in dom0, that corresponds to the PCI
    device will keep the full set of (MSI-X vector, entry number) pairs. This
    set is now stale; it should be empty.

When the next PV guest (any guest) tries to set up the MSI-X vectors for this PCI
device, the above filtering in msix_capability_init() will run on stale data,
and omit any PHYSDEVOP_map_pirq hypercalls. Two consequences:

- since the hv assigns no pirq's to the domU, igbvf has no chance to work,

- when we then try to detach the device or remove the igbvf module in the
  domU, the unmapping loop in msi_remove_pci_irq_vectors(), running on stale
  data, will see all PHYSDEVOP_unmap_pirq hypercalls fail, and produce these
  hv / dom0 logs (see comment 15):

  (XEN) irq.c:934: dom2: pirq 255 not mapped
  unmap irq ff failed
  (XEN) irq.c:934: dom2: pirq 254 not mapped
  unmap irq fe failed
  (XEN) irq.c:934: dom2: pirq 253 not mapped
  unmap irq fd failed
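
To illustrate the effect, here is a small self-contained userspace model (hypothetical code, not from msi-xen.c) of the per-device entry set and the filtering loop; it prints a message analogous to the dom0 log above and shows why the second enable request maps nothing:

  #include <stdio.h>

  #define MAX_ENTRIES 8

  struct dev_entry_set {                    /* models one "msi_dev_head" entry */
          int entry[MAX_ENTRIES];           /* entry numbers believed mapped */
          int count;
  };

  static int already_listed(struct dev_entry_set *s, int entry)
  {
          int i;
          for (i = 0; i < s->count; i++)
                  if (s->entry[i] == entry)
                          return 1;
          return 0;
  }

  /* models the filtering loop in dom0's msix_capability_init() */
  static void enable_msix(struct dev_entry_set *s, int nvec)
  {
          int i;
          for (i = 0; i < nvec; i++) {
                  if (already_listed(s, i)) {
                          printf("msix entry %d not freed before acquire again\n", i);
                          continue;         /* no PHYSDEVOP_map_pirq issued */
                  }
                  printf("mapping entry %d\n", i);  /* hypercall would go here */
                  s->entry[s->count++] = i;
          }
  }

  int main(void)
  {
          struct dev_entry_set dev = { {0}, 0 };

          enable_msix(&dev, 3);   /* first guest boot: all three entries map */
          /* guest goes away without pci_disable_msix(); the set stays stale */
          enable_msix(&dev, 3);   /* next boot: nothing is mapped */
          return 0;
  }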


The best place to fix this is probably pciback_reset_device() in
drivers/xen/pciback/pciback_ops.c. The inspiration comes from c/s 1070 (see
comment 0): it covers the "destroyed guest" case there too. We should
introduce an extern function called msi_prune_pci_irq_vectors() to
drivers/pci/msi-xen.c, and call that from pciback_reset_device().

Comment 35 Laszlo Ersek 2011-05-31 13:46:15 UTC
Missed an important call in comment #33:

> The mapping happens like this (following from above):
> 
> domU: igbvf_set_interrupt_capability()
>         pci_enable_msix()
>           pci_frontend_enable_msix()
>             do_pci_op(pdev, XEN_PCI_OP_enable_msix)
> hv:           ...
> dom0:         pci_enable_msix()
>                 msix_capability_init()
>                   msi_map_vector()                  <---- called from the loop
>                     msi_map_pirq_to_vector()
>                       HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq)

Then, if msi_map_vector() succeeded, we extend the set of (MSI-X vector, entry number) pairs for this PCI device:

>                   attach_pirq_entry            <---- also called from the loop

Comment 36 Laszlo Ersek 2011-05-31 16:43:48 UTC
Yet another piece of the puzzle I didn't understand before: whenever I wrote "hv: ..." above, that doesn't involve the hypervisor. In those places there seems to be direct event channel signalling from the domU, drivers/xen/pcifront/pci_op.c: do_pci_op(), to dom0, drivers/xen/pciback/pciback_ops.c: pciback_do_op().

Comment 38 Laszlo Ersek 2011-06-01 00:36:27 UTC
I think I *might* have found how the domU can zero out the dom0
dev->msix_enabled field. Correcting/extending comment 25:

domU: igbvf_shutdown()
        igbvf_suspend()
          pci_disable_device()
            pcibios_disable_device()
              pcibios_disable_resources()
                pci_write_config_word(
                    dev, PCI_COMMAND,
                    orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
                )
                  ...
                    pcifront_bus_write()
                      do_pci_op(XEN_PCI_OP_conf_write)
dom0:                   pciback_do_op()
                          pciback_config_write()
                            conf_space_write()
                              command_write() [via PCI_COMMAND funcptr]
                                pci_disable_device()
                                  disable_msi_mode()
                                    dev->msix_enabled = 0;

In short, when the domU manipulates PCI config space, that's forwarded to
dom0. In pciback/dom0, many fields in the config space are "hooked" and
handled specially. If both PCI_COMMAND_IO and PCI_COMMAND_MEMORY are masked
off the PCI_COMMAND field, then dom0 disables the device and sets the
msix_enabled flag to false, even though the MSI-X vectors are still allocated.
That's why c/s 1070 is not enough in itself.

Comment 39 Laszlo Ersek 2011-06-01 00:40:08 UTC
Created attachment 502139 [details]
backport c/s 1070 for the abrupt termination case (xm destroy)

Comment 40 Laszlo Ersek 2011-06-01 00:51:32 UTC
Created attachment 502141 [details]
introduce msi_prune_pci_irq_vectors()

extend pciback_reset_device() with the following functionality (msi_prune_pci_irq_vectors()):
- remove all (MSI-X vector, entry number) pairs that might still belong, in
  dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might otherwise
  be leaked

The function touches neither the hypervisor nor the PCI device; it only trims the list used by dom0 for filtering. The justification: when this is called, either the owner domain is gone already (or very close to being gone), or the device was already correctly detached.
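
A rough sketch of the shape of such a function (the lookup helper, list node and field names below are placeholders, not the real msi-xen.c identifiers):

  void msi_prune_pci_irq_vectors(struct pci_dev *dev)
  {
          struct msi_dev_entry *d;          /* placeholder type: one list entry */
          struct pirq_pair *p, *tmp;        /* placeholder type: (vector, entry) */

          d = find_msi_dev_entry(dev);      /* placeholder for the msi_dev_head
                                               lookup; NULL if never touched */
          if (!d)
                  return;                   /* nothing to prune */

          list_for_each_entry_safe(p, tmp, &d->pirq_list, list) {
                  list_del(&p->list);       /* drop the stale pair, no hypercall */
                  kfree(p);
          }

          if (d->mask_base) {               /* space saved for the MSI-X table */
                  iounmap(d->mask_base);
                  d->mask_base = NULL;
          }
  }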

Comment 41 Laszlo Ersek 2011-06-01 00:55:18 UTC
(In reply to comment #40)
> Created attachment 502141 [details]
> introduce msi_prune_pci_irq_vectors()

modeled after msi_remove_pci_irq_vectors()

brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=3361515

Comment 42 Laszlo Ersek 2011-06-01 14:53:30 UTC
(In reply to comment #41)
> (In reply to comment #40)
> > Created attachment 502141 [details]
> > introduce msi_prune_pci_irq_vectors()
> 
> modeled after msi_remove_pci_irq_vectors()

discussion with upstream:
http://lists.xensource.com/archives/html/xen-devel/2011-06/msg00036.html

Comment 43 Laszlo Ersek 2011-06-02 11:54:41 UTC
Considering the consequences of attachment 502141 [details] for HVM passthrough:

Since pciback is needed for device assignment even if the targeted domU is HVM, pciback_reset_device() -- and consequently msi_prune_pci_irq_vectors() -- will be called for PCI devices originally assigned to HVM domains as well. (See callers of pciback_reset_device() in [1].) I intend to verify this with a host kernel that prints something in pciback_reset_device().

On the other hand, when an HVM domU enables MSI-X for a device passed through to it (see for example igbvf_set_interrupt_capability() in comment 33 for igbvf), the guest's pci_enable_msix() call will not end up in msi-xen.c and travel to dom0:

domU: igbvf_set_interrupt_capability()
        pci_enable_msix() [msi-xen.c]
          pci_frontend_enable_msix() [pci_op.c]
            do_pci_op(pdev, XEN_PCI_OP_enable_msix) [pci_op.c]
hv:           ...
dom0:         pci_enable_msix() [msi-xen.c]
                msix_capability_init() [msi-xen.c]   <---- filtering
                  attach_pirq_entry()  [msi-xen.c]   <---- list manipulation

Instead, the guest's pci_enable_msix() call will go directly to the hardware:

domU: igbvf_set_interrupt_capability()
        pci_enable_msix() [msi.c]
          msix_capability_init() [msi.c]  <----- hardware access

That is, a PCI device passed through to an HVM guest does not hit dom0's "msi_dev_head" list at all.

- pci_enable_msix() in PV domU:
  - device is added to dom0's "msi_dev_head" list if not yet there (note that
    this is separate from whatever accounts pciback keeps!)
  - MSI-X vector list belonging to the device is used for filtering
  - newly allocated vectors are saved in the MSI-X vec list

- pci_enable_msix() in HVM domU:
    - doesn't reach dom0 at all

- pciback_reset_device() --> msi_prune_pci_irq_vectors():
  - runs in dom0
  - doesn't care about the type of the device's previous (or next) owner
    domain
  - looks up device "msi_dev_head" list -- if it's not there, nothing changes
  - if it's there, empties the MSI-X vector list belonging to the entry
  - hardware, pci_dev, hypervisor are not touched

Since pciback enforces that a device is owned by at most one domU at any point in time, pciback_reset_device() --> msi_prune_pci_irq_vectors() can't empty the MSI-X vector set of a device that is assigned to "some other domain". (As said in the upstream thread [1], dom0's "msi_dev_head" list is completely unaware of who owns what.) IOW, when an HVM domU goes away and msi_prune_pci_irq_vectors() is called for a device that was passed through to the domU, two things can happen:
- the device is not in the list --> no-op
- the device is in the list (from an earlier PV passthrough), but the MSI-X
  vector set is empty below it --> no-op

- the device is in the list with a non-empty vector set --> this should never
  happen, since when *the* PV domain owning the device went away,
  msi_prune_pci_irq_vectors() was called on the device and emptied the MSI-X
  vector set. Even if the list is non-empty, the function only cleans up the
  list; doesn't touch the hypervisor. And the iounmap() block should be
  idempotent (heh) on two levels: first it's checked for being non-NULL,
  second it's checked against "vmlist".

[1] http://lists.xensource.com/archives/html/xen-devel/2011-06/msg00054.html

Comment 45 Laszlo Ersek 2011-06-02 20:23:28 UTC
Created attachment 502633 [details]
debug messages for msi_prune_pci_irq_vectors() [ad-hoc testing]

My HVM related tests seem to confirm comment 43: msi_prune_pci_irq_vectors() is called at other times as well, but it decays to a no-op.

When binding to pciback:
  (XEN) PCI add device 01:10.0
  pciback 0000:01:10.0: seizing device
  PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
  msi_prune_pci_irq_vectors(): enter
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start HVM domain -- OK, igbvf ping successful.

Try to start PV domain while HVM is running:
  Error: failed to assign device 0000:01:10.0 that has already been assigned
  to other domain.

HVM domain shutdown:
  (XEN) hvm.c:524:d1 DOM1/VCPU13: going offline.
  (XEN) hvm.c:524:d1 DOM1/VCPU14: going offline.
  msi_prune_pci_irq_vectors(): enter
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start PV domain: OK, igbvf ping successful.

Try to start HVM domain while PV is running:
  Error: failed to assign device 0000:01:10.0 that has already been assigned
  to other domain.

Reboot PV domain:
  msi_prune_pci_irq_vectors(): enter
  PCI: 0000:01:10.0: cleaning up MSI-X entry 0
  PCI: 0000:01:10.0: cleaning up MSI-X entry 1
  PCI: 0000:01:10.0: cleaning up MSI-X entry 2
  PCI: 0000:01:10.0: cleaning up mask_base
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

igbvf ping successful in new PV domain.

Shutdown PV domain:
  msi_prune_pci_irq_vectors(): enter
  PCI: 0000:01:10.0: cleaning up MSI-X entry 0
  PCI: 0000:01:10.0: cleaning up MSI-X entry 1
  PCI: 0000:01:10.0: cleaning up MSI-X entry 2
  PCI: 0000:01:10.0: cleaning up mask_base
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Start HVM domain: OK, igbvf ping successful.

Shutdown HVM domain:
  (XEN) hvm.c:524:d10 DOM10/VCPU13: going offline.
  msi_prune_pci_irq_vectors(): enter
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Tried a separate PV destroy:

  c/s 1070 (--> msi_remove_pci_irq_vectors) runs, unmappings fail
  (domU's already gone):
    get owner for dev 80 get c 
    unmap irq ff failed
    get owner for dev 80 get c 
    unmap irq fe failed
    get owner for dev 80 get c 
    unmap irq fd failed

  msi_prune_pci_irq_vectors() follows immediately:
    msi_prune_pci_irq_vectors(): enter
    msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

Comment 46 Laszlo Ersek 2011-06-03 08:19:36 UTC
Message-Id: <1307089198-3818-1-git-send-email-lersek>

Comment 47 Laszlo Ersek 2011-06-06 09:53:02 UTC
Created the following test plan to verify dynamic attach/detach:

owner domain      next owner domain
----------------  -----------------
rhel56-64bit-pv   rhel56-64bit-pv2
rhel56-64bit-pv2  rhel56-64bit-hvm
rhel56-64bit-hvm  rhel56-64bit-pv
rhel56-64bit-pv   rhel56-64bit-hvm
rhel56-64bit-hvm  rhel56-64bit-pv2
rhel56-64bit-pv2  rhel56-64bit-pv

This should check all possible owner transitions. I started all three guests without assigning the PCI device to any of them. Then I tried to assign it to rhel56-64bit-pv (the PCI device was already seized by pciback):

[root@intel-s3ea2-08 ~]# xm pci-attach rhel56-64bit-pv 0000:01:10.0

Error: only HVM guest support pci attach

Retried with virsh:

[root@intel-s3ea2-08 ~]# cat dev.xml 

<hostdev mode='subsystem' type='pci' managed='no'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
  </source>
</hostdev>

[root@intel-s3ea2-08 ~]# virsh attach-device rhel56-64bit-pv dev.xml 
error: Failed to attach device from dev.xml
error: Requested operation is not valid: Xm driver only supports modifying persistent config

[root@intel-s3ea2-08 ~]# rpm -q libvirt xen xen-libs

libvirt-0.8.2-20.el5
xen-3.0.3-131.el5
xen-libs-3.0.3-131.el5

      Beaker Test information:
                         HOSTNAME=intel-s3ea2-08.lab.bos.redhat.com
                            JOBID=90237
                         RECIPEID=184994
                           DISTRO=RHEL-5.7-Server-Beta-1.2
                     ARCHITECTURE=x86_64

So I think dynamic assignment to PV domains is not supported.

Testing HVM: since the guest is RHEL-5 (-5.6, -238), it's important to load the "acpiphp" module in it first; otherwise hotplugging the passed-through PCI device won't work.

Repeated the following twice:
- xm pci-attach rhel56-64bit-hvm 0000:01:10.0
- dhcp on igbvf in HVM guest
- verify with ping
- ifdown in guest
- xm pci-detach rhel56-64bit-hvm 0000:01:10.0

The pci-detach step triggers the call to msi_prune_pci_irq_vectors() in dom0 (these entry/exit log messages are written only by the debug printk patch, attachment 502633 [details]):

  msi_prune_pci_irq_vectors(): enter
  msi_prune_pci_irq_vectors(): 0000:01:10.0: exit after in-list

I left the HVM domain running, and booted, rebooted and shut down the PV guest (with persistent PCI passthrough config, checking igbvf with ping). The two PV domain teardowns logged the following in dom0:

  PCI: 0000:01:10.0: cleaning up MSI-X entry 0
  PCI: 0000:01:10.0: cleaning up MSI-X entry 1
  PCI: 0000:01:10.0: cleaning up MSI-X entry 2
  PCI: 0000:01:10.0: cleaning up mask_base

Finally re-attached the device to the HVM domain and confirmed igbvf availability with ping.

Comment 50 RHEL Program Management 2011-08-17 13:30:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 52 Jarod Wilson 2011-08-30 19:22:07 UTC
Patch(es) available in kernel-2.6.18-283.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 54 Yuyu Zhou 2011-12-12 08:33:56 UTC
Reproduced the bug on kernel-xen-2.6.18-274.el5, and verified the bug on kernel-xen-2.6.18-300.el5 and xen-3.0.3-135.el5.

Steps:
1. Assign one VF to a RHEL5.8 PV guest.
2. Reboot the guest.

On kernel-xen-2.6.18-274.el5, the following log shows up in Domain0 dmesg:

blktap: ring-ref 9, event-channel 18, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:01:10.0 to 64
msix entry 0 for dev 01:10:0 are not freed before acquire again.
msix entry 1 for dev 01:10:0 are not freed before acquire again.
msix entry 2 for dev 01:10:0 are not freed before acquire again.

On kernel-xen-2.6.18-300.el5, there are no such messages in Domain0 dmesg and the VF is initialized successfully; Domain0 dmesg is as follows:

tap tap-2-51712: 2 getting info
pciback: vpci: 0000:01:10.0: assign to virtual slot 0
blktap: ring-ref 9, event-channel 18, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:01:10.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:01:10.0 to 64
get owner for dev 80 get 2 
get owner for dev 80 get 2 
get owner for dev 80 get 2 

According to the above info, changing this bug to VERIFIED.

Comment 56 errata-xmlrpc 2012-02-21 03:32:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html