Bug 849223
Description
Pasi Karkkainen
2012-08-17 18:10:41 UTC
Hi, is this a regression according to your knowledge? Please try with "pci=nomsi" on the guest kernel command line. Please upload - guest kernel dmesg (with "ignore_loglevel"), - dom0 dmesg (ditto), - hypervisor serial console output ("loglvl=all guest_loglvl=all"), - xend.log from dom0, - "lspci -vvv" from host & guest. The version/component will probably change to RHEL-5 kernel-xen or RHEL-6 kernel. Thanks! Laszlo (In reply to comment #1) > Hi, > Hello! > is this a regression according to your knowledge? > Not sure.. earlier in RHEL <= 5.7 there was other SR-IOV related bugs, so I haven't been able to test this properly earlier. > Please try with "pci=nomsi" on the guest kernel command line. > I actually already tried that earlier but forgot to mention about it. pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself when I do "ifconfig eth1 up" inside the guest.. > Please upload > - guest kernel dmesg (with "ignore_loglevel"), > - dom0 dmesg (ditto), > - hypervisor serial console output ("loglvl=all guest_loglvl=all"), > - xend.log from dom0, > - "lspci -vvv" from host & guest. > Ok, will do next week. > The version/component will probably change to RHEL-5 kernel-xen or RHEL-6 > kernel. > > Thanks! > Laszlo Yep, thanks! (In reply to comment #3) > I actually already tried that earlier but forgot to mention about it. > pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself > when I do "ifconfig eth1 up" inside the guest.. Maybe a guest regression then... I can see some ixgbevf/SR-IOV related changes between 6.2 and 6.3. (In reply to comment #4) > (In reply to comment #3) > > > I actually already tried that earlier but forgot to mention about it. > > pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself > > when I do "ifconfig eth1 up" inside the guest.. > > Maybe a guest regression then... I can see some ixgbevf/SR-IOV related > changes between 6.2 and 6.3. > I quickly tried with 6.2 kernel, and behaviour was the same. no interrupts received, all the interrupt counts are and stay zero for the VF. I didn't forget about the logs, but it'll take a couple of days before I can fix the serial console etc. Created attachment 606043 [details]
rhel58 x64 xen hypervisor serial console log
Created attachment 606044 [details]
rhel58 x64 xen dom0 linux kernel dmesg log
Created attachment 606045 [details]
rhel58 x64 xen dom0 lspci -vvv
Created attachment 606046 [details]
rhel58 x64 xen dom0 xend log
Created attachment 606050 [details]
rhel63 x64 xen hvm guest linux kernel dmesg log
Created attachment 606051 [details]
rhel63 x64 xen hvm guest lspci -vvv
(In reply to comment #1) > > Please upload > - guest kernel dmesg (with "ignore_loglevel"), > - dom0 dmesg (ditto), > - hypervisor serial console output ("loglvl=all guest_loglvl=all"), > - xend.log from dom0, > - "lspci -vvv" from host & guest. > Done. (In reply to comment #11) > Created attachment 606046 [details] > rhel58 x64 xen dom0 xend log This bug seems to be a duplicate of bug 735890. See especially bug 735890 comment 16. (Note that the dup candidate is about PV passthrough, but I think that should make no difference for the MSI(-X) range's availability.) From comment 0, we're passing through 03:10.2. From comment 10 (extract): 03:10.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Region 0: [virtual] Memory at de404000 (64-bit, non-prefetchable) [size=16K] Region 3: [virtual] Memory at de504000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable- Count=3 Masked- Vector table: BAR=3 offset=00000000 /* #1 */ PBA: BAR=3 offset=00002000 /* #2 */ From the xend.log: (pciquirk:91) NO quirks found for PCI device [8086:10ed:8086:7a11] (pciquirk:131) Permissive mode NOT enabled for PCI device [8086:10ed:8086:7a11] (pciif:378) pci: enabling iomem 0xde404000/0x4000 pfn 0xde404/0x4 (pciif:378) pci: enabling iomem 0xde504000/0x4000 pfn 0xde504/0x4 These correspond to Region 0 and Region 3 above. (pciif:398) pci-msix: remove permission for 0xde504000/0x3000 0xde504/0x3 This is MSI-X range #1 ("Vector table") inside Region 3 ("BAR=3"), offset 0: 0xde504000 == 0xde504000 + 0. (pciif:398) pci-msix: remove permission for 0xde506000/0x1000 0xde506/0x1 This is MSI-X range #2 ("PBA") inside Region 3 ("BAR=3"), offset 0x2000: 0xde506000 == 0xde504000 + 0x2000. (In reply to comment #15) > This bug seems to be a duplicate of bug 735890. See especially bug 735890 > comment 16. (Note that the dup candidate is about PV passthrough, but I > think that should make no difference for the MSI(-X) range's availability.) Actually I may be very wrong about this... the HVM guest would try to access these ranges via the IOMMU. (XEN) [VT-D]iommu.c:1241:d32767 domain_context_mapping:PCIe: bdf = 3:10.2 Let's try to attack it from another side (*) -- when the guest crashes with "pci=nomsi" (comment 3), does it dump the stack to its serial console? Does Xen or dom0 log anything? (I'd like to reproduce this and get a vmcore myself, but I'm still waiting on a Beaker box with such a card.) (*) There may be a "common" IRQ setup problem, and the "normal" PCI interrupt path could be less complicated to debug. Can you please boot a bare-metal kernel on this machine and run acpidump --table DMAR --binary -o DMAR.dump and attach "DMAR.dump"? Thanks. Can you please also retry with "iommu=no-intremap" on the xen.gz command line? Thanks! Created attachment 606365 [details]
Dell R510 DMAR dump from acpidump
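(Side note, referring back to the MSI-X ranges worked out from the xend.log excerpts above: the two "remove permission" addresses are simply Region 3's base plus the vector-table/PBA offsets reported by lspci. A trivial standalone C check, written here only for illustration, with the constants copied from those excerpts:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long bar3  = 0xde504000UL; /* Region 3 of 03:10.2, from lspci */
        unsigned long table = bar3 + 0x0;    /* MSI-X vector table, offset 0    */
        unsigned long pba   = bar3 + 0x2000; /* MSI-X PBA, offset 0x2000        */

        printf("vector table at %#lx\n", table); /* 0xde504000, matches xend.log */
        printf("PBA at          %#lx\n", pba);   /* 0xde506000, matches xend.log */
        return 0;
    }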
(In reply to comment #17)
> Can you please boot a bare-metal kernel on this machine and run
>
> acpidump --table DMAR --binary -o DMAR.dump
>
> and attach "DMAR.dump"? Thanks.
>

Done.

(In reply to comment #18)
> Can you please also retry with "iommu=no-intremap" on the xen.gz command
> line? Thanks!
>

I tried it, but unfortunately it didn't help.. still the same problem.

Created attachment 606370 [details]
rhel63 x64 xen hvm guest linux kernel crash with pci=nomsi
(In reply to comment #16) > (In reply to comment #15) > > > This bug seems to be a duplicate of bug 735890. See especially bug 735890 > > comment 16. (Note that the dup candidate is about PV passthrough, but I > > think that should make no difference for the MSI(-X) range's availability.) > > Actually I may be very wrong about this... the HVM guest would try to access > these ranges via the IOMMU. > > (XEN) [VT-D]iommu.c:1241:d32767 domain_context_mapping:PCIe: bdf = 3:10.2 > > Let's try to attack it from another side (*) -- when the guest crashes with > "pci=nomsi" (comment 3), does it dump the stack to its serial console? Does > Xen or dom0 log anything? (I'd like to reproduce this and get a vmcore > myself, but I'm still waiting on a Beaker box with such a card.) > > (*) There may be a "common" IRQ setup problem, and the "normal" PCI > interrupt path could be less complicated to debug. > Ok, I booted rhel6.3 x64 hvm guest with with pci=nomsi on the guest kernel cmdline, and when I do "ifconfig eth1 up" for the VF in the HVM guest I get the attached crash. Created attachment 606639 [details] decompiled Dell-R510-DMAR.dsl I have no idea what could be going wrong. The DMAR doesn't seem to violate anything described in "Intel(r)_VT_for_Direct_IO.pdf". Both IO-APIC's found in the MADT are listed in the DMAR/DRHD. I can neither prove nor disprove there's a mismatch between hardware & the DMAR. RMRR's indeed point into reserved RAM. This kind of IOMMU bug is hard (see bug 760007, bug 512617 etc...) Whenever a device to be passed through is down one (or more) PCI-to-PCI bridges, we suck at passing it through. (You might want to check that with "lspci -tv" in a bare-metal kernel.) I see (XEN) io_apic.c:2161: (XEN) ioapic_guest_write: apic=0, pin=3, old_irq=3, new_irq=3 (XEN) ioapic_guest_write: old_entry=000000f2, new_entry=000100f2 (XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ! in the hypervisor log, but this kind of message is printed all the time without adverse effects. Also the DRHD reports FED90000 as register base address; see the attachment plus: (XEN) [VT-D]dmar.c:477: found ACPI_DMAR_DRHD (XEN) [VT-D]dmar.c:336: dmaru->address = fed90000 and dom0 logs pnp: 00:0b: iomem range 0xfed90000-0xfed91fff could not be reserved but this may not mean anything if dom0 is not supposed to access the DMA remapping unit directly. Perhaps try "iommu=passthrough" on the xen.gz command line, but it's just shotgun experimentation now. Does RHEL-63 HVM work under upstream Xen+dom0? (Even if it does, I've looked at upstream IOMMU patches before, and I can either not pick candidates, or the changes are very invasive). (In reply to comment #22) > Created attachment 606370 [details] > rhel63 x64 xen hvm guest linux kernel crash with pci=nomsi BUG: unable to handle kernel NULL pointer dereference at (null) (gdb) file ixgbevf.ko Reading symbols from /home/lacos/tmp/ixgbevf.ko... Reading symbols from /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/\ kernel/drivers/net/ixgbevf/ixgbevf.ko.debug... done. done. (gdb) list *(ixgbevf_open+0x475) 0x62c5 is in ixgbevf_open (include/linux/interrupt.h:126). 
121 122 static inline int __must_check 123 request_irq(unsigned int irq, irq_handler_t handler, 124 unsigned long flags, const char *name, void *dev) 125 { 126 return request_threaded_irq(irq, handler, NULL, flags, name, dev); 127 } 128 129 extern void exit_irq_thread(void); 130 #else ixgbevf_probe() ixgbevf_init_interrupt_scheme() ixgbevf_set_interrupt_capability() ixgbevf_acquire_msix_vectors() pci_enable_msix() ixgbevf_open() ixgbevf_request_irq() ixgbevf_request_msix_irqs() request_irq() It might be useful to match the above callgraph against the full guest dmesg, but I believe ixgbevf can't work without MSI-X. (In reply to comment #25) > Whenever > a device to be passed through is down one (or more) PCI-to-PCI bridges, we > suck at passing it through. (You might want to check that with "lspci -tv" > in a bare-metal kernel.) ATSR lists these: 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13) (prog-if 00 [Normal decode]) 00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode]) 00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13) (prog-if 00 [Normal decode]) 00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 13) (prog-if 00 [Normal decode]) 00:0a.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 10 (rev 13) (prog-if 00 [Normal decode]) "The ATSR structures identifies PCI Express Root-Ports supporting Address Translation Services (ATS) transactions." The dom0 log / lspci include PCI: Transparent bridge - 0000:00:1e.0 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) (prog-if 01 [Subtractive decode]) The "82599 Ethernet Controller Virtual Function"s have "[virtual] Memory" bases in [de400000..de71c000]; all of which fall into 00:07.0's "Memory behind bridge: de200000-de7fffff". I'm travelling this week, but I'll try the suggestions next week.. > Actually I may be very wrong about this... the HVM guest would try to access > these ranges via the IOMMU. I think the point is that interrupts are processed by the hypervisor and forwarded to the guest. For this reason allowing access to the MSI-X ranges (no matter if via IOMMU or directly) is a no-no. You need instead to emulate those and program the hypervisor appropriately. This is what tools/ioemu/hw/pt-msi.c does. Problem is, our QEMU with the upstream qemu-xen tree are so different that from a quick look I hardly can tell if we have the relevant commits upstream (mostly commit 7551a51, passthrough: use devfn instead of slots as the unit for pass-through, 2009-06-25). It seems like we do (see bug 581655). Pasi, can you: 1) attach the qemu-dm logs too? 2) try passing the whole NIC to the guest, and then bring up the VF? (In reply to comment #29) > 2) try passing the whole NIC to the guest, and then bring up the VF? Hmm right, I recall repeated recommendations from QE to pass through all functions, whenever I played with SR-IOV before. (In reply to comment #25) > > Perhaps try "iommu=passthrough" on the xen.gz command line, but it's just > shotgun experimentation now. > Unfortunately that didn't seem to help. > Does RHEL-63 HVM work under upstream Xen+dom0? (Even if it does, I've looked > at upstream IOMMU patches before, and I can either not pick candidates, or > the changes are very invasive). > I'll try this later. Created attachment 610520 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest
(In reply to comment #29) > > Pasi, can you: > > 1) attach the qemu-dm logs too? > Done. (In reply to comment #30) > (In reply to comment #29) > > > 2) try passing the whole NIC to the guest, and then bring up the VF? > > Hmm right, I recall repeated recommendations from QE to pass through all > functions, whenever I played with SR-IOV before. > Ok, so I blacklisted ixgbe and hid the PF PCI ids in dom0, and passed thru one PF to the HVM guest. Loading ixgbe driver for the PF works in RHEL 6.3 HVM guest, and the PF works OK. I can see interrupt count increasing for the PF while pinging: [root@rhel63x64hvm ~]# grep eth1 /proc/interrupts 48: 348 PCI-MSI-edge eth1 So far all good. Adding "max_vfs=8" option in the HVM guest for the ixgbe module is where the problems begin: [root@c63x64hvm ~]# dmesg | grep ixgbe ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.6.7-k ixgbe: Copyright (c) 1999-2012 Intel Corporation. ixgbe 0000:00:06.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40 ixgbe 0000:00:06.0: setting latency timer to 64 ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: -19 ixgbe 0000:00:06.0: (unregistered net_device): ATR is not supported while multiple queues are disabled. Disabling Flow Director ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: Multiqueue Disabled: Rx Queue count = 1, Tx Queue count = 1 ixgbe 0000:00:06.0: (PCI Express:5.0GT/s:Width x8) 00:2b:31:77:9e:1c ixgbe 0000:00:06.0: MAC: 2, PHY: 8, SFP+: 3, PBA No: E81283-002 ixgbe 0000:00:06.0: Intel(R) 10 Gigabit Network Connection ixgbe 0000:00:06.0: eth1: detected SFP+: 3 ixgbe 0000:00:06.0: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX So especially this: ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: -19 Any ideas? (In reply to comment #34) > So especially this: > ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: > -19 The direct reason could be ixgbe_enable_sriov() pci_enable_sriov() if (!dev->is_physfn) return -ENODEV; or ixgbe_enable_sriov() pci_enable_sriov() sriov_enable() if (iov->link != dev->devfn) { pdev = pci_get_slot(dev->bus, iov->link); if (!pdev) return -ENODEV; pci_dev_put(pdev); if (!pdev->is_physfn) return -ENODEV; But that doesn't tell me much. What if you don't pass through the physical device (03:00.*), but pass through *all* the VFs instead (03:10.*)? Can you please repeat your original test with pci = [ '03:10.0', '03:10.1', ..., '03:10.7' ] in the vm config file? Thanks. (In reply to comment #32) > Created attachment 610520 [details] > rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest "pt_pci_read_config: Warning: Return ALL F from libpci read. [00:06.0][Offset:00h][Length:2]" I changed to max_vfs=2,2 for ixgbe in dom0, and then passed thru all the four VFs to the HVM guest: [root@dom0 ~]# grep pci /etc/xen/rhel63x64hvm xen_platform_pci=0 pci = [ '03:10.0', '03:10.1', '03:10.2', '03:10.3' ] The interesting part is that I can only see 2 VFs in the guest! [root@rhel63x64hvm ~]# lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:01.3 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 00:02.0 VGA compatible controller: Device 1234:1111 00:05.0 SCSI storage controller: XenSource, Inc. 
Xen Platform Device (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) [root@rhel63x64hvm ~]# dmesg | grep -i ixgbe ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X ixgbevf 0000:00:07.0: setting latency timer to 64 ixgbevf 0000:00:07.0: irq 51 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 52 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 53 for MSI/MSI-X Also the VFs don't work. I configured an IP to them, tried pinging, but it doesn't work. Also the interrupt counters stay at zero in /proc/interrupts. [root@rhel63x64hvm ~]# grep eth1 /proc/interrupts 48: 0 PCI-MSI-edge eth1-rx-0 49: 0 PCI-MSI-edge eth1-tx-0 50: 0 PCI-MSI-edge eth1:mbx Created attachment 611505 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with multiple VFs
Four (all) VFs were passed through, but only two of them are visible in the HVM guest.
I also tried with max_vfs=4,4 in dom0, so that gives 8 VFs total, and I passed all 8 to the HVM guest. Inside the RHEL6 HVM guest still only 2 were visible in lspci. Please get similar logs for a RHEL5 guest, too. Thanks! Ok, I just tried with RHEL5.8 x64 HVM guest. The first time I did "ifconfig eth1 up" inside the HVM guest the guest crashed with a kernel panic! Unfortunately I didn't have serial console set up then, so I couldn't capture the guest kernel crash. The second time I tried it, the VF actually works ! I tried rebooting the guest again, and it still works. Dunno what was wrong on the first time.. [root@rhel58x64hvm ~]# lspci -vvv 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Subsystem: Intel Corporation Device 7a11 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64 Region 0: Memory at f4020000 (64-bit, non-prefetchable) [size=16K] Region 3: Memory at f4024000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable+ Count=3 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Kernel driver in use: ixgbevf Kernel modules: ixgbevf [root@rhel58x64hvm ~]# cat /proc/interrupts CPU0 0: 341156 IO-APIC-edge timer 1: 9 IO-APIC-edge i8042 6: 2 IO-APIC-edge floppy 8: 1 IO-APIC-edge rtc 9: 0 IO-APIC-level acpi 12: 111 IO-APIC-edge i8042 14: 4043 IO-APIC-edge ide0 15: 47 IO-APIC-edge ide1 169: 41 IO-APIC-level uhci_hcd:usb1 177: 2019 IO-APIC-level eth0 193: 4337 PCI-MSI-X eth1-rx-0 201: 26 PCI-MSI-X eth1-tx-0 209: 24 PCI-MSI-X eth1:mbx 217: 376 IO-APIC-level xen-platform-pci NMI: 0 LOC: 341675 ERR: 0 MIS: 0 [root@rhel58x64hvm ~]# ethtool -i eth1 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:00:06.0 I'll attach qemu-dm.log for the rhel5.8 hvm guest. Created attachment 613668 [details]
rhel58 x64 xen qemu-dm log for rhel58 x64 hvm guest
Hmm, for the working rhel5.8 HVM guest the ixgbevf interrupts are PCI-MSI-X, while for the non-working rhel6.3 HVM guest the interrupts are PCI-MSI-edge, is that relevant? Hello, Laszlo I find a machine very close to Pasi's env: R510, intel-e5606 But till now I am not sure whether the bug is reproducible in that machine: The 82599 NIC is not plugged in with fiber network line because we have limited number of fiber network port in our office. So I get following info in the guest: ifconfig eth0 up ixgbevf: Unable to start - perhaps the PF Driver isn't up yet SIOCSIFFLAGS: Network is down lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:02.0 VGA compatible controller: Cirrus Logic GD 5446 00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) cat /proc/interrupts CPU0 CPU1 0: 153 0 IO-APIC-edge timer 1: 685 21 IO-APIC-edge i8042 4: 1313 119 IO-APIC-edge serial 8: 0 1 IO-APIC-edge rtc0 9: 0 0 IO-APIC-fasteoi acpi 12: 693 169 IO-APIC-edge i8042 14: 204 236 IO-APIC-edge ata_piix 15: 0 0 IO-APIC-edge ata_piix 28: 5865 5757 IO-APIC-fasteoi xen-platform-pci 510: 5849 0 xen-dyn-event blkif 511: 75 0 xen-dyn-event xenbus NMI: 0 0 Non-maskable interrupts LOC: 24338 11525 Local timer interrupts SPU: 0 0 Spurious interrupts PMI: 0 0 Performance monitoring interrupts IWI: 0 0 IRQ work interrupts RES: 1421 2804 Rescheduling interrupts CAL: 149 316 Function call interrupts TLB: 306 720 TLB shootdowns TRM: 0 0 Thermal event interrupts THR: 0 0 Threshold APIC interrupts MCE: 0 0 Machine check exceptions MCP: 2 2 Machine check polls ERR: 0 MIS: 11 The eng-ops guy is off duty now, so I need to try this again tomorrow if I can get the fiber link. (In reply to comment #48) > > So I get following info in the guest: > > ifconfig eth0 up > ixgbevf: Unable to start - perhaps the PF Driver isn't up yet > SIOCSIFFLAGS: Network is down > Is the Physical Function (PF) interface "up" in dom0 ? So first you need to "ifconfig ethX up" the PF in dom0, and after that "ifconfig ethX up" the VF in the VM. Btw I'm using DA (SFP+ Direct Attach) NICs and cables, so no fiber, if that makes a difference.. (In reply to comment #48) > > The eng-ops guy is off duty now, so I need to try this again tomorrow if I > can get the fiber link. I think the problem could be reproduced after plug the fiber link into the 82599 card. When pass-through the 82599 card into RHEL6 (kernel-2.6.32-279) guest, it could not get IP address (guest will crash and keep rebooting when with "pci=nomsi" in kernel cmd line). The same test works with RHEL5.9 guest (kernel-2.6.18-339). 
On RHEL6.3 guest: # lspci -D | grep 82599 0000:00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 0000:00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) # lspci -vvv 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Subsystem: Intel Corporation Device 7a11 Physical Slot: 1 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64 Region 0: Memory at f4000000 (64-bit, non-prefetchable) [size=16K] Region 3: Memory at f4004000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable+ Count=3 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Kernel driver in use: ixgbevf Kernel modules: ixgbevf # grep eth0 /proc/interrupts 48: 0 0 PCI-MSI-edge eth0-rx-0 49: 0 0 PCI-MSI-edge eth0-tx-0 50: 0 0 PCI-MSI-edge eth0:mbx # dmesg | grep -i ixgbe ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X ixgbevf 0000:00:07.0: setting latency timer to 64 ixgbevf 0000:00:07.0: PF still in reset state, assigning new address ixgbevf 0000:00:07.0: irq 51 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 52 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 53 for MSI/MSI-X ixgbevf: Unable to start - perhaps the PF Driver isn't up yet (In reply to comment #47) > Hmm, for the working rhel5.8 HVM guest the ixgbevf interrupts are PCI-MSI-X, > while for the non-working rhel6.3 HVM guest the interrupts are PCI-MSI-edge, > is that relevant? Probably... (In reply to comment #50) > I think the problem could be reproduced after plug the fiber link into the > 82599 card. > > When pass-through the 82599 card into RHEL6 (kernel-2.6.32-279) guest, it > could not get IP address (guest will crash and keep rebooting when with > "pci=nomsi" in kernel cmd line). > > The same test works with RHEL5.9 guest (kernel-2.6.18-339). I assume this means the RHEL5.9 guest works *without* "pci=nomsi". (While the RHEL6.3 guest doesn't work without it, and crashes with it.) (In reply to comment #46) > Created attachment 613668 [details] > rhel58 x64 xen qemu-dm log for rhel58 x64 hvm guest Grepping attachment 610520 [details], attachment 611505 [details] and attachment 613668 [details] for "first_map=", the qemu-dm logs for the RHEL-6 guest(s) only contain "first_map=1" entries, whereas the qemu-dm log for the RHEL-5 guest also has "first_map=0" lines. These are logged by pt_iomem_map() [tools/ioemu/hw/pass-through.c] in qemu-dm. The add_msix_mapping() call depends on (first_map==0). 
> void pt_iomem_map(PCIDevice *d, int i, uint32_t e_phys, uint32_t e_size, > int type) > { > struct pt_dev *assigned_device = (struct pt_dev *)d; > uint32_t old_ebase = assigned_device->bases[i].e_physbase; > int first_map = ( assigned_device->bases[i].e_size == 0 ); > int ret = 0; > > assigned_device->bases[i].e_physbase = e_phys; > assigned_device->bases[i].e_size= e_size; > > PT_LOG("e_phys=%08x maddr=%lx type=%d len=%d index=%d first_map=%d\n", > e_phys, (unsigned long)assigned_device->bases[i].access.maddr, > type, e_size, i, first_map); > > if ( e_size == 0 ) > return; > > if ( !first_map && old_ebase != -1 ) > { > add_msix_mapping(assigned_device, i); > /* Remove old mapping */ > ret = xc_domain_memory_mapping(xc_handle, domid, > old_ebase >> XC_PAGE_SHIFT, > assigned_device->bases[i].access.maddr >> XC_PAGE_SHIFT, > (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT, > DPCI_REMOVE_MAPPING); > if ( ret != 0 ) > { > PT_LOG("Error: remove old mapping failed!\n"); > return; > } > } > > /* map only valid guest address */ > if (e_phys != -1) > { This branch should run for each first_map=1 line though, for those e_phys is never UINT_MAX. > /* Create new mapping */ > ret = xc_domain_memory_mapping(xc_handle, domid, > assigned_device->bases[i].e_physbase >> XC_PAGE_SHIFT, > assigned_device->bases[i].access.maddr >> XC_PAGE_SHIFT, > (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT, > DPCI_ADD_MAPPING); > > if ( ret != 0 ) > { > PT_LOG("Error: create new mapping failed!\n"); > } > > ret = remove_msix_mapping(assigned_device, i); I don't understand what & why we remove here, right after the addition. > if ( ret != 0 ) > PT_LOG("Error: remove MSI-X mmio mapping failed!\n"); > > if ( old_ebase != e_phys && old_ebase != -1 ) > pt_msix_update_remap(assigned_device, i); > } Most of this function comes from commit f39cc738 ("xen-3.0.3-86.el5"), but these last lines are from 694b84d3 ("MSI-X mask bit acceleration"). > } Also,
* rhel58-x64-xen-qemu-dm-log-for-rhel58-x64-hvm.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
pt_msix_update_one: now update msix entry 0 with pirq ff gvec c1
pt_msix_update_one: now update msix entry 1 with pirq fe gvec c9
pt_msix_update_one: now update msix entry 2 with pirq fd gvec d1
* rhel58-x64-xen-qemu-dm-log-for-rhel63-x64-hvm.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
(4 times, no "now update msix entry" msgs)
* rhel58-x64-xen-qemu-dm-log-for-rhel63-x64-hvm-02.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
(8 times, no "now update msix entry" msgs)
* "now update msix entry" is printed by pt_msix_update_one(); call sites:
pt_iomem_map()
pt_msix_update_remap()
pt_msix_update()
pt_msix_update_one()
pt_msixctrl_reg_write() <- registered for "MSI-X Capability Structure reg
group" / "Message Control reg" in pt_config_init
pt_msix_update()
pt_msix_update_one()
pci_msix_writel() <- registered for iomem writes in pt_msix_init
pt_msix_update_one()
All three logs describe calls to pt_msixctrl_reg_write(), which is able to
skip the call to pt_msix_update():
> /* write Message Control register for MSI-X */
> static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
> struct pt_reg_tbl *cfg_entry,
> uint16_t *value, uint16_t dev_value, uint16_t valid_mask)
> {
> struct pt_reg_info_tbl *reg = cfg_entry->reg;
> uint16_t writable_mask = 0;
> uint16_t throughable_mask = 0;
> uint16_t old_ctrl = cfg_entry->data;
>
> /* modify emulate register */
> writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> cfg_entry->data = ((*value & writable_mask) |
> (cfg_entry->data & ~writable_mask));
>
> PT_LOG("old_ctrl:%04xh new_ctrl:%04xh\n", old_ctrl, cfg_entry->data);
>
> /* create value for writing to I/O device register */
> throughable_mask = ~reg->emu_mask & valid_mask;
> *value = ((*value & throughable_mask) | (dev_value & ~throughable_mask));
>
> /* update MSI-X */
> if ((*value & PCI_MSIX_ENABLE) && !(*value & PCI_MSIX_MASK))
> pt_msix_update(ptdev);
>
> ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
>
> return 0;
> }
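To make that check easier to follow when reading the qemu-dm logs attached below, here is a tiny standalone decoder (my own illustration, not qemu-dm code) for the three Message Control values that show up later in this bug (0002h, c002h, 8002h). The bit values PCI_MSIX_ENABLE=0x8000 and PCI_MSIX_MASK=0x4000 are the ones quoted further down in the analysis.

    #include <stdio.h>

    #define PCI_MSIX_ENABLE 0x8000  /* MSI-X Enable bit in Message Control  */
    #define PCI_MSIX_MASK   0x4000  /* MSI-X Function Mask ("mask all") bit */

    /* pt_msixctrl_reg_write() only calls pt_msix_update() -- i.e. flushes
       the entries to the hypervisor -- when ENABLE is set and MASK is clear. */
    static void decode(unsigned int ctrl)
    {
        int flush = (ctrl & PCI_MSIX_ENABLE) && !(ctrl & PCI_MSIX_MASK);
        printf("ctrl=%04x enable=%d maskall=%d -> pt_msix_update() %s\n",
               ctrl, !!(ctrl & PCI_MSIX_ENABLE), !!(ctrl & PCI_MSIX_MASK),
               flush ? "called" : "skipped");
    }

    int main(void)
    {
        decode(0x0002); /* RHEL-6, 1st write: ENABLE clear         -> skipped */
        decode(0xc002); /* RHEL-6, 2nd write: ENABLE + MASKALL set -> skipped */
        decode(0x8002); /* final write, both guests: ENABLE only   -> called  */
        return 0;
    }

Only the last pattern reaches pt_msix_update(); whether an entry actually gets reprogrammed there is then decided by entry->flags, which is where the two guests diverge (see the log comparison below).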
Created attachment 614372 [details]
additional debug messages for qemu-dm
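The attachment above is the qemu-dm debug patch whose output the next two logs contain. It is not reproduced verbatim here, but judging from the messages it prints, the instrumentation added to pt_msixctrl_reg_write() is roughly of the following shape (a reconstruction/guess only -- the attachment is the authoritative version; all identifiers are from the function quoted above):

    /* sketch only, reconstructed from the log messages */
    PT_LOG("emu_mask=%04x ro_mask=%04x valid_mask=%04x writable_mask=%04x\n",
           reg->emu_mask, reg->ro_mask, valid_mask, writable_mask);
    PT_LOG("value=%04x dev_value=%04x\n", *value, dev_value);
    /* ... the existing old_ctrl/new_ctrl PT_LOG() stays where it is ... */
    PT_LOG("throughable_mask=%04x new_value=%04x\n", throughable_mask, *value);

    if ( (*value & PCI_MSIX_ENABLE) && !(*value & PCI_MSIX_MASK) )
    {
        PT_LOG("1\n");            /* shows up as "pt_msixctrl_reg_write: 1" */
        pt_msix_update(ptdev);
    }

    ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
    PT_LOG("msix_enabled=%d\n", ptdev->msix->enabled);

Similar numbered PT_LOG() markers ("pci_msix_writel: 1".."4", "pt_msix_update: 1"/"2", "pt_msix_update_one: 1".."4") trace which branches are taken in the other functions.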
Created attachment 614373 [details]
qemu-dm log (with debug patch) about successful msi-x initialization in RHEL-5 guest
Created attachment 614374 [details]
qemu-dm log (with debug patch) about failed msi-x initialization in RHEL-6 guest
I think this is the interesting part of the RHEL-5 --> RHEL-6 diff (made between comment 60 and comment 61). Of course I've missed a newline in my debug patch, I'll suplement it here. +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=0002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=0002 +pt_msixctrl_reg_write: msix_enabled=0 +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=c002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=c002 +pt_msixctrl_reg_write: msix_enabled=1 This is done only by RHEL-6; two invocations of pt_msixctrl_reg_write(). The first call ends up with new_value=0002, so "nothing happens", the second call ends up with new_value=c002 (has both PCI_MSIX_ENABLE=0x8000 and PCI_MSIX_MASK=0x4000 set). Referring back to comment 58, this means that msix->enabled will be set at the end of pt_msixctrl_reg_write(), but pt_msix_update() is *not* called. (A message saying "pt_msixctrl_reg_write: 1" should be present between "throughable_mask" and "msix_enabled".) pci_msix_writel: 1 Both RHEL-5 and RHEL-6 trigger pci_msix_writel() at this point. -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 2 RHEL-5 continues to massage this register. Values are not logged, unfortunately, but we can say that for some calls (pci_msix_writel: 2), the following block is triggered: if ( offset != 3 && entry->io_mem[offset] != val ) { PT_LOG("2\n"); entry->flags = 1; } which corresponds to "dev->msix->msix_entry[entry_nr].flags = 1". ("pci_msix_writel: 1" alone just logs entry to the function and changing "entry->io_mem[offset]".) +pci_msix_writel: 3 pci_msix_writel: 1 +pci_msix_writel: 3 pci_msix_writel: 1 -pci_msix_writel: 2 +pci_msix_writel: 3 RHEL-6 *instead* (not in addition) massages the following block: if ( offset == 3 ) { PT_LOG("3\n"); if ( msix->enabled && !(val & 0x1) ) { PT_LOG("4\n"); pt_msix_update_one(dev, entry_nr); } mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1); } Note that "pci_msix_writel: 4" is never logged, thus pt_msix_update_one() is not called. "msix->enabled" must be "true" in RHEL-6, from above, but bit#0 is apparently set in "val". Until now we've set "dev->msix->msix_entry[entry_nr].flags" for the RHEL-5 guest, but have not for the RHEL-6 guest. pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 -pt_msixctrl_reg_write: value=8002 dev_value=0002 +pt_msixctrl_reg_write: value=8002 dev_value=c002 pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: throughable_mask=ffff new_value=8002 pt_msixctrl_reg_write: 1 Both guest kernels write new_value=8002 to the control register (the previous value in the device is different, according to the guests' different pasts -- RHEL-5 has not touched the control register, RHEL-6 left c002 there). PCI_MSIX_ENABLE is set in the new value, but PCI_MSIX_MASK is clear, therefore pt_msixctrl_reg_write() calls pt_msix_update() for both guests: pt_msix_update: 1 ... Function entered and we're past an xc_physdev_set_device_msixtbl() hypercall... 
pt_msix_update: 2: 0 Starting loop that calls pt_msix_update_one() for each MSI-X entry. pt_msix_update_one: 1 Entry#0: pt_msix_update_one() is entered for both kernels. The first thing this function does is check entry->flags; if it's unset, we return early without doing anything. -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 0 with pirq ff gvec b9 -pt_msix_update_one: 4 That's exactly what happens for RHEL-6 -- we've set entry->flags only for RHEL-5. pt_msix_update: 2: 1 pt_msix_update_one: 1 -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 1 with pirq fe gvec c1 -pt_msix_update_one: 4 pt_msix_update: 2: 2 pt_msix_update_one: 1 -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 2 with pirq fd gvec c9 -pt_msix_update_one: 4 Lather, rinse, repeat for entries #1 and #2. pt_msixctrl_reg_write: msix_enabled=1 After the loop completes in pt_msix_update(), we return to pt_msixctrl_reg_write(), set dev->msix->enabled, and we're done. For RHEL-6, - we fail to set "entry->flags" in pci_msix_writel(), - in the same function, we fail to call pt_msix_update_one() immediately (... even if we did, it wouldn't help: the latter function still depends on entry->flags) I think this is an MSI-X emulation bug in qemu-dm (= xen-userspace). RHEL-6's access pattern differs from that of RHEL-5, and we don't serve it correctly. Thoughts? Thanks. I think the three pt_msixctrl_reg_write() invocations in comment 61 (and comment 62), on behalf of RHEL-6, can be matched against the three PCI_MSIX_FLAGS write accesses in the guest kernel: ixgbevf_acquire_msix_vectors() [drivers/net/ixgbevf/ixgbevf_main.c] pci_enable_msix() [drivers/pci/msi.c] msix_capability_init() > static int msix_capability_init(struct pci_dev *dev, > struct msix_entry *entries, int nvec) > { > int pos, ret; > u16 control; > void __iomem *base; > > pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); > pci_read_config_word(dev, pos + PCI_MSIX_FLAGS, &control); > > /* Ensure MSI-X is disabled while it is set up */ > control &= ~PCI_MSIX_FLAGS_ENABLE; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=0002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=0002 +pt_msixctrl_reg_write: msix_enabled=0 > > /* Request & Map MSI-X table region */ > base = msix_map_region(dev, pos, multi_msix_capable(control)); > if (!base) > return -ENOMEM; > > ret = msix_setup_entries(dev, pos, base, entries, nvec); > if (ret) > return ret; > > ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); > if (ret) > goto error; > > /* > * Some devices require MSI-X to be enabled before we can touch the > * MSI-X registers. We need to mask all the vectors to prevent > * interrupts coming in before they're fully set up. 
> */ > control |= PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=c002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=c002 +pt_msixctrl_reg_write: msix_enabled=1 (but masked) > > msix_program_entries(dev, entries); > > ret = populate_msi_sysfs(dev); > if (ret) { > ret = 0; > goto error; > } > > /* Set MSI-X enabled bits and unmask the function */ > pci_intx_for_msi(dev, 0); > dev->msix_enabled = 1; > > control &= ~PCI_MSIX_FLAGS_MASKALL; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 -pt_msixctrl_reg_write: value=8002 dev_value=0002 +pt_msixctrl_reg_write: value=8002 dev_value=c002 pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: throughable_mask=ffff new_value=8002 pt_msixctrl_reg_write: 1 > > return 0; > > error: > if (ret < 0) { > /* > * If we had some success, report the number of irqs > * we succeeded in setting up. > */ > struct msi_desc *entry; > int avail = 0; > > list_for_each_entry(entry, &dev->msi_list, list) { > if (entry->irq != 0) > avail++; > } > if (avail != 0) > ret = avail; > } > > free_msi_irqs(dev); > > return ret; > } On the qemu-dm side, the mis-programming happens between the second and third PCI_MSIX_FLAGS config word accesses, so we should look at msix_program_entries() and pci_intx_for_msi() in the RHEL-6 guest kernel. I think I understand why the RHEL-5 guest works, but I don't understand how the RHEL-6 guest can work on the bare metal at all! :) So, RHEL-5 msix_capability_init() has calls like writel(address_lo, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); writel(address_hi, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); writel(data, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_DATA_OFFSET); PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET == 0 PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET == 4 PCI_MSIX_ENTRY_DATA_OFFSET == 8 Now, after division by 4 (see pci_msix_writel() in qemu-dm) these become 0, 1, 2; therefore updates to the lower address offset, upper address offset, and data offset of any MSI-X entry kick the following in qemu-dm: if ( offset != 3 && entry->io_mem[offset] != val ) { entry->flags = 1; } Ie. such a change will indeed mark the MSI-X entry for update in the emulator, and once the PCI_MSIX_ENABLE flag is set, and the PCI_MSIX_MASK flag is cleared in the control register, a batched update will be flushed to the hypervisor. So what does the RHEL-6 kernel do? msix_capability_init() msix_program_entries() -- loops over all entries computes an offset for reading from PCI_MSIX_ENTRY_VECTOR_CTRL (12) readl() msix_mask_irq(entry, 1) __msix_mask_irq(desc, 1) computes same PCI_MSIX_ENTRY_VECTOR_CTRL offset writel() with lowest value bit set to "1" (from the second param) In the emulator (pci_msix_writel()), this vector control (?) write, 12/4==3, corresponds to if ( offset == 3 ) { if ( msix->enabled && !(val & 0x1) ) pt_msix_update_one(dev, entry_nr); mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1); } Since the LSB is set, we don't do anything except masking. The entry is not marked for later update, and we don't update it right now. (a) I guess I could add an "else" branch to the above "is LSB set?" 
check, and if the LSB is in fact set, just say "entry->flags = 1"; ie. schedule a later update. But! (b) what I don't understand is this: *when* does the RHEL-6 kernel program: - PCI_MSIX_ENTRY_LOWER_ADDR == 0, - PCI_MSIX_ENTRY_UPPER_ADDR == 4, - PCI_MSIX_ENTRY_DATA == 8? *at all*? (Here I used the RHEL-6 macro names.) I grepped the RHEL-6 tree for them, but the only write accesses are in write_msi_msg_desc(). Possible call trees: __pci_restore_msi_state write_msi_msg() [drivers/pci/msi.c] write_msi_msg_desc() __pci_restore_msix_state write_msi_msg() [drivers/pci/msi.c] write_msi_msg_desc() arch_setup_msi_irqs() [arch/x86/kernel/apic/io_apic.c] setup_msi_irq() write_msi_msg() write_msi_msg_desc() I'll ignore the first two (they both come from pci_restore_state() -> pci_restore_msi_state(), which doesn't seem to be relevant). The third (arch_setup_msi_irqs()) is interesting though. ... We've seen it in comment 63, but msix_capability_init() calls it between the *first* and *second* accesses to PCI_MSIX_FLAGS (not between second & third), and qemu-dm logs nothing at all there. I'll have to add debug messages to the RHEL-6 guest kernel. (In reply to comment #12) > Created attachment 606050 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log alloc irq_desc for 48 on node -1 <-----+ alloc kstat_irqs on node -1 <-----|--+ ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X <-----|--|--+ alloc irq_desc for 49 on node -1 <-----+ | | alloc kstat_irqs on node -1 <-----|--+ | ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X <-----|--|--+ alloc irq_desc for 50 on node -1 <-----+ | | alloc kstat_irqs on node -1 <-----|--+ | ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X <-----|--|--+ | | | arch_setup_msi_irqs() [arch/x86/kernel/apic/io_apic.c] | | | foreach MSI-X entry: | | | create_irq_nr() | | | irq_to_desc_alloc_node() ------+ | | init_one_irq_desc() | | init_kstat_irqs() ---------+ | setup_msi_irq() | write_msi_msg() | write_msi_msg_desc() | dev_printk() ------------+ Thus setup_msi_irq() positively calls write_msi_msg(), which calls write_msi_msg_desc(). Created attachment 614596 [details]
add debug messages to write_msi_msg_desc() -- debug patch for kernel-2.6.32-279.5.2.el6
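As an aside, the "else branch" idea (a) from the analysis above would amount to roughly the following in pci_msix_writel() [tools/ioemu/hw/pt-msi.c]. This is an untested sketch of that suggestion only, not a patch that was applied anywhere -- as the rest of this bug shows, the investigation moved on to the guest kernel's power-state check instead:

    if ( offset == 3 )
    {
        if ( msix->enabled && !(val & 0x1) )
            pt_msix_update_one(dev, entry_nr);
        else if ( val & 0x1 )
            /* vector is being masked right now: remember that it was
               touched, so the batched pt_msix_update() flushes it once
               MSI-X is unmasked via the control register */
            entry->flags = 1;
        mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1);
    }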
Pasi, our repro env is gone for ten days; we'll have it back on Sep 30th, until Oct 7th. Until then, can you please - rebuild your RHEL-6.3.z guest kernel (2.6.32-279.5.2.el6) with the patch from comment 69 (*), - using this kernel, repeat your VF test (single VF would be best) and upload the guest dmesg (passing "ignore_loglevel" to the guest), - still with this kernel, repeat your PF test (comment 34) and upload the guest dmesg (passing "ignore_loglevel" to the guest). (*) I'm investigating how I can build source & binary RPMs that I'm at liberty to share with you as a customer. Until then, please add the debug patch to the spec file and rebuild the RPM. Thanks! Laszlo We've been re-granted access to another reproducer machine (dell-per820-02.lab.bos.redhat.com), and I think I managed to track the problem a bit further (configuring and passing through a single VF). After I booted the host, the PF (eth0) was not brought up automatically. I booted the guest in this state, with the debug patch from comment 69. This is what I saw in the guest dmesg: ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: PF still in reset state, assigning new address alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X alloc irq_desc for 49 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X alloc irq_desc for 50 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X However, no "PCI-MSI-edge" interrupts appeared in /proc/interrupts, ie. at this point I did not yet reproduce the original problem. So I tried to bring up the VF in the guest, with ifconfig. It was refused (link down), with the following message in the guest dmesg: ixgbevf: Unable to start - perhaps the PF Driver isn't up yet corresponding to what I've written at the beginning of this comment -- the PF had not been brought up in the host. Thus I did just that with ifconfig in the host, which succeeded. Then I retried upping the VF (called "rename2" due to udev magic) inside the guest. That succeeded too, with the following two symptoms: (a) the following appeared in /proc/interrupts: 48: 0 PCI-MSI-edge rename2-rx-0 49: 0 PCI-MSI-edge rename2-tx-0 50: 0 PCI-MSI-edge rename2:mbx (b) The following messages were logged *again* in the guest dmesg (produced by my debug patch in comment 69): ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now At this point I can repeatedly bring down & up the VF in the guest. The "down" operation produces a message like this in the host, for the PF: ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 0 (The passed-through VF is identified as 0000:08:10.0 in the host.) The "up" operation logs the guest messages under (b) each time I run it, and (a) remains unchanged. ----o---- Verdict: the write_msi_msg_desc() guest kernel function, which is a core PCI-MSI-X configuration function, elects not to touch the lower address / upper address / data registers (ie. 
actually configure MSI-X), because it finds that the VF PCI device is not in power saving state D0. D0 means "Fully-On" (see ACPISpec 5.0, 2.3 Device Power State Definitions): This state is assumed to be the highest level of power consumption. The device is completely active and responsive, and is expected to remember all relevant context continuously. "/sys/devices/pci0000:00/0000:00:03.0/0000:08:10.0/power/state" contains "0" on the host side. In the guest, "/sys/devices/pci0000:00/0000:00:06.0/power/wakeup" is the only file in that directory, and it has no contents. According to "Documentation/power/devices.txt", this means that the VF device and/or driver don't physically support wakeup events. As for the current power saving state of the device, I'm unable to locate it. RHEL-5's msix_capability_init() doesn't seem to care about the device's power state. ----o---- The branch in RHEL-6 write_msi_msg_desc() that makes MSI-X register access dependent on PCI power state comes from RHEL-6 commit 20a80eaa: [pci] MSI: Remove unsafe and unnecessary hardware access which has been made for bug 696511, first built in kernel-2.6.32-182.el6. Neighboring minor RHEL-6 releases: RHEL-6.1 2.6.32-131.el6 RHEL-6.2 2.6.32-220.el6 Therefore it might be considered a regression from RHEL-6.1. (In this BZ we have only checked RHEL-6.2, but that release already has the commit.) (CC'ing Don Zickus :)) ----o---- We have two choices here: - we could implement a guest kernel kludge whereby the PCI_D0 check is skipped for Xen HVM guests, - we could backport or fix PCI power state emulation in xen-userspace. Honestly, the thought of it freaks me out. Created attachment 615391 [details] rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru Created attachment 615392 [details] rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru (In reply to comment #71) > > - rebuild your RHEL-6.3.z guest kernel (2.6.32-279.5.2.el6) with the patch > from > comment 69 (*), > - using this kernel, repeat your VF test (single VF would be best) and upload > the guest dmesg (passing "ignore_loglevel" to the guest), Done. "rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru". > - still with this kernel, repeat your PF test (comment 34) and upload the > guest > dmesg (passing "ignore_loglevel" to the guest). > Done. "rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru". ixgbevf_probe() pci_enable_device() __pci_enable_device_flags() do_pci_enable_device() pci_set_power_state(PCI_D0) -- errors returned by this func are fatal in do_pci_enable_device() *except* -EIO which is ignored __pci_start_power_transition() -- retval ignored pci_platform_power_transition() <------+ platform_pci_set_power_state() | pci_platform_pm -> set_state() | pci_update_current_state() | accesses PCI_PM_CTRL config word | pci_raw_set_power_state() | accesses PCI_PM_CTRL config word | pci_restore_bars() -- possibly | pcie_aspm_pm_state_change() -- possibly | __pci_complete_power_transition() | pci_platform_power_transition() ---- see here ----+ Ugh, this is a mess. In xen-userspace, there's a function called pt_pmcsr_reg_write(), "write Power Management Control/Status register", it could be the culprit. Andy, Stefan, is platform_pci_power_manageable() supposed to return true for ixgbevf? Also, is pci_dev.pm_cap nonzero for ixgbevf? 
("PM capability offset in the configuration space".) Thanks! (In reply to comment #76) > Created attachment 615391 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log > 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru Confirms branch 1 in guest for VF: alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X alloc irq_desc for 49 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X alloc irq_desc for 50 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X [...] ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now (In reply to comment #77) > Created attachment 615392 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log > 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru Confirms branch 3 in guest for PF: alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0180 r=0 ixgbe 0000:00:06.0: wrote msgctl=0180 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004161 r=0 r2=0 ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0181 r=0 ixgbe 0000:00:06.0: wrote msgctl=0181 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004161 r=0 r2=0 ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0180 r=0 ixgbe 0000:00:06.0: wrote msgctl=0180 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004169 r=0 r2=0 ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0181 r=0 ixgbe 0000:00:06.0: wrote msgctl=0181 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004169 r=0 r2=0 Thanks for testing! (In reply to comment #75) > > The branch in RHEL-6 write_msi_msg_desc() that makes MSI-X register access > dependent on PCI power state comes from RHEL-6 commit 20a80eaa: > > [pci] MSI: Remove unsafe and unnecessary hardware access > > which has been made for bug 696511, first built in kernel-2.6.32-182.el6. > Neighboring minor RHEL-6 releases: > > RHEL-6.1 2.6.32-131.el6 > RHEL-6.2 2.6.32-220.el6 > > Therefore it might be considered a regression from RHEL-6.1. (In this BZ we > have only checked RHEL-6.2, but that release already has the commit.) > > (CC'ing Don Zickus :)) > I just tried with 6.1 kernel (2.6.32-131.0.15.el6.x86_64) but I'm seeing the same problem there. Zero interrupts for the VF IRQs and they're PCI-MSI-edge. 
Based on comment 80, the guest kernel does see PCI_D0 when the passed through device is a physical function; this check fails only for the VF. (The host-side PM control emulation, pt_pmcsr_reg_write(), is the same.) This makes me wonder if the root cause is in fact an ixgbevf or core pci driver issue. What I've seen while mapping the callgraph in comment 79 makes me think that devices/drivers not supporting actual power management should just fake the requested PCI_D0 state. See especially: - pci_platform_power_transition(), - pci_update_current_state(). Consider the following condition for pci_platform_power_transition(): !platform_pci_power_manageable(ixgbevf) && (dev->pm_cap != 0) --> dev->current_state will not be set to PCI_D0. Alternatively, platform_pci_power_manageable(ixgbevf) && platform_pci_set_power_state() < 0 --> pci_update_current_state() is not called, "dev->current_state" is not set. Setting needinfo wrt. the question at the end of comment 79. Thanks! :) (In reply to comment #81) > (In reply to comment #75) > > which has been made for bug 696511, first built in kernel-2.6.32-182.el6. > > Neighboring minor RHEL-6 releases: > > > > RHEL-6.1 2.6.32-131.el6 > > RHEL-6.2 2.6.32-220.el6 > > > > Therefore it might be considered a regression from RHEL-6.1. (In this BZ we > > have only checked RHEL-6.2, but that release already has the commit.) > > > > (CC'ing Don Zickus :)) > > > > I just tried with 6.1 kernel (2.6.32-131.0.15.el6.x86_64) but I'm seeing the > same problem there. Zero interrupts for the VF IRQs and they're PCI-MSI-edge. Not a regression then, but the cause preventing passthru ixgbevf from working in 6.1 might be different. (For example, ixgbevf has been updated to upstream version 2.2.0-k from 6.2 to 6.3; see comment 4.) Woah, see upstream linux commit b51306c6. Hmm, are you sure b51306c6 is the correct one? I can't find anything matching that from linus's linux.git or from google.. (In reply to comment #85) > Hmm, are you sure b51306c6 is the correct one? I can't find anything > matching that from linus's linux.git or from google.. Yes, it looks like a good candidate. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=b51306c6 I'll backport the patch & build a test kernel for you in a few hours. Created attachment 615482 [details]
[1/1] PCI: Set device power state to PCI_D0 for device without native PM support
Backport upstream Linux commit b51306c63449d7f06ffa689036ba49eb46e898b5,
minus the hunk reverting upstream Linux commit
47e9037ac16637cd7f12b8790ea7ce6680e42168, because we haven't backported
the latter.
---
drivers/pci/pci.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
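For readers who don't open the attachment: as far as I can tell, the three added lines boil down to the following fallback in the platform power-transition path of drivers/pci/pci.c (the attachment is the authoritative version):

    /* Fall back to PCI_D0 if native PM is not supported */
    if (!dev->pm_cap)
        dev->current_state = PCI_D0;

That is, a device that exposes no PCI Power Management capability -- like the 82599 VF, whose lspci output above lists only an MSI-X capability -- is simply assumed to be in D0, so the current_state check in write_msi_msg_desc() no longer prevents the MSI-X table from being programmed.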
Thanks for the link. No idea why I didn't find that.. not enough coffee :) And yes, good find, that definitely sounds like it could fix the problem! The patch in comment 87 fixes the problem for me. I had to pass through all ports to the guest (two in total, one virtual function per port, I think), but (a) now I can see the interrupt counters increasing: 48: 114 PCI-MSI-edge eth1-rx-0 49: 135 PCI-MSI-edge eth1-tx-0 50: 29 PCI-MSI-edge eth1:mbx (b) one port/VF has a live cable, the other not, and ifconfig in the guest can see that (SIOCSIFFLAGS: Network is down). With the patch from comment #87 I'm seeing the following: - Passthru 1 VF: The VF doesn't work, interrupt counters stay at zero. - Passthru 2 VFs: First VF works OK, the second VF doesn't, interrupt counters are zero for it. - Passthru 4 VFs: Only two first VFs are visible in "lspci" in the guest, and both of them fail - interrupt counters are zero for both of the VFs. In dom0 both PF ports are 'UP' and connected to a switch. So there's still something wrong.. (The "Passthru 4 VFs, only 2 visible in the guest" issue is probably another separate bug, should I open a new bug about that?) I tried to do some research; any input and/or corrections are welcome. (1) Per default, RHEL-5 qemu-dm supports at most two hotpluggable (passthrough) PCI devices. See vl.h: /* PCI slot 6~7 support ACPI PCI hot plug */ #define PHP_SLOT_START (6) #define PHP_SLOT_END (8) (2) Passed through functions (that may or may not share a device on the host side) show up as separate single-function devices (00:06.0, 00:07.0) in the guest. This may have been improved upstream (see eg. [1] [2]), but it's very unlikely we would touch this in RHEL-5. (3) Based on references [3], [4] and [5]: when passing through a function from a PCI device, you may have to pass through all functions of that device. An exception might be if the device supports FLR (function level reset). We can derive that - passing through more than two VFs probably won't work per default (2) (1), - configuring more than two VFs for the same NIC port, and then passing through at most two of those (ie. "not all") VFs will not work (3). From this point we should talk specific BDFs (bus-device-function triplets), using attachment 606045 [details] from comment 10. Passing through 03:00.0 (PF) on its own did work. I think it's due to "FLReset+" in the "DevCap" section. Same for 03:00.1. The ports (PFs) of the NIC share the host PCI device, but they support function level reset. The VFs (03:10.[0-7] for PF 03:00.0, and separately, 03:11.[0-7] for PF 03:00.1) do not support FLR. Therefore, if any 03:10.x is passed through, all existing, sibling VFs must be passed through to the same guest. The set of all VFs, for both ports together, controlled by the the max_vfs ixgbe option, must not consist of more than 2 elements per default, because of (1). Please test the following configuration with the comment 87 patch: - Module option for ixgbe: max_vfs=1 This should produce one VF per PF. - Guest passthrough stanza (and corresponding : pci = [ "0000:03:10.0", "0000:03:10.1" ] (or whatever BDF the ixgbe driver assigns to the single VF of each port.) Pciback should hide the same BDFs. - One NIC port should be accessible in the guest under 00:06.0 (see "ethtool -i ethX"), the other under 00:07.0. There's at least one way to lift the default limit of 2 on passed-through VFs. 
Please see bug 835768 comment 14 ("xen_emul_unplug=ide-disks" guest command line parameter, somewhat described in "Documentation/kernel-parameters.txt" too). Unplugging some emulated devices frees up guest BDFs for PCI passthrough. Hence please repeat the above test with "max_vfs=2" (and dependencies updated) in the host, and "xen_emul_unplug=ide-disks" specified in the guest. Thanks! [1] http://www.lca2010.org.nz/programme/schedule/view_talk/50048 [2] http://www.lca2010.org.nz/slides/50048.pdf [3] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#I_get_.22Error:_pci:_0000:02:06.0_must_be_co-assigned_to_the_same_guest_with_0000:02:05.0.22_error_when_trying_to_start_the_guest [4] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#How_can_I_check_if_PCI_device_supports_FLR_.28Function_Level_Reset.29_.3F [5] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#passing_multiple_PCI_devices (In reply to comment #96) > I tried to do some research; any input and/or corrections are welcome. > > (1) Per default, RHEL-5 qemu-dm supports at most two hotpluggable > (passthrough) PCI devices. See vl.h: > > /* PCI slot 6~7 support ACPI PCI hot plug */ > #define PHP_SLOT_START (6) > #define PHP_SLOT_END (8) > Hmm, OK. That explains why I can see only 2 pass through devices in the VM. > (2) Passed through functions (that may or may not share a device on the host > side) show up as separate single-function devices (00:06.0, 00:07.0) in the > guest. This may have been improved upstream (see eg. [1] [2]), but it's very > unlikely we would touch this in RHEL-5. > Too bad :( > (3) Based on references [3], [4] and [5]: when passing through a function > from a PCI device, you may have to pass through all functions of that > device. > The whole point of SR-IOV is to be able to pass through VFs to different/multiple VMs.. I'm pretty certain this works properly in upstream Xen. I need to test/verify that. Also I'll try VFs with multiple RHEL5 PV domUs. > An exception might be if the device supports FLR (function level > reset). > The VFs have "FLReset-" in dom0.. that's weird. > > Please test the following configuration with the comment 87 patch: > Ok, will do. > > There's at least one way to lift the default limit of 2 on passed-through > VFs. Please see bug 835768 comment 14 ("xen_emul_unplug=ide-disks" guest > command line parameter, somewhat described in > "Documentation/kernel-parameters.txt" too). > > Unplugging some emulated devices frees up guest BDFs for PCI passthrough. > Hence please repeat the above test with "max_vfs=2" (and dependencies > updated) in the host, and "xen_emul_unplug=ide-disks" specified in the > guest. > Ok, will try this aswell. (In reply to comment #97) > (In reply to comment #96) > > > (3) Based on references [3], [4] and [5]: when passing through a function > > from a PCI device, you may have to pass through all functions of that > > device. > > > > The whole point of SR-IOV is to be able to pass through VFs to > different/multiple VMs.. I'm pretty certain this works properly in upstream > Xen. I need to test/verify that. > > Also I'll try VFs with multiple RHEL5 PV domUs. > I tried SR-IOV with the same RHEL 5.8 dom0, and the following RHEL 5.8 x64 guests all running simultaneously: - hvm01: 1 VF, works OK. - hvm02: 1 VF, works OK. - hvm03: 2 VFs, both VFs work OK. - pv01: 1 VF, works OK. - pv02: 1 VF, works OK. - pv03: 2 VFs, both VFs work OK. So 8x VFs total, assigned and spread among to 6x RHEL 5.8 VMs, everything working OK! 
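To make the two-slot limit from comment 96 concrete, here is a tiny stand-alone illustration. The PHP_SLOT_START/PHP_SLOT_END values are copied from the vl.h excerpt above; the rest is invented and is not qemu-dm code.

#include <stdio.h>

/* Illustration only: with slots [PHP_SLOT_START, PHP_SLOT_END) reserved for
 * ACPI PCI hotplug, a default RHEL-5 qemu-dm guest has exactly two
 * passthrough slots.  This is why passed-through VFs show up as 00:06.0 and
 * 00:07.0 in the guest, and why a third VF has nowhere to go. */
#define PHP_SLOT_START 6
#define PHP_SLOT_END   8        /* exclusive: slots 6 and 7 */

int main(void)
{
        int slot;

        printf("hotpluggable passthrough slots: %d\n",
               PHP_SLOT_END - PHP_SLOT_START);
        for (slot = PHP_SLOT_START; slot < PHP_SLOT_END; slot++)
                printf("a passed-through function lands at 00:%02x.0\n", slot);
        return 0;
}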
It seems to me that there's a bug in the RHEL6 kernel that causes the following issues: - when 1 VF assigned it doesn't work at all, no interrupts received. - when 2 VFs assigned only one (first) of them works, the other VF doesn't receive any interrupts. Let me know if you want me to do further tests with RHEL6 guests. Thanks! Hi, (In reply to comment #99) > It seems to me that there's a bug in the RHEL6 kernel that causes the > following issues: > - when 1 VF assigned it doesn't work at all, no interrupts received. > - when 2 VFs assigned only one (first) of them works, the other VF doesn't > receive any interrupts. May I ask if these were precisely the two tests described in comment 96? If those tests do not work, I'd like to investigate more. Although I'm not sure how I'm going to debug them, since they worked in my test environment: - the max_vfs=1 case as per comment 92 - I just tested the max_vfs=2 case too, and it works as well. (Setup & results in next comment.) If those precise tests work in your environment, I'd like to post the patch internally and move this BZ to POST state. > Let me know if you want me to do further tests with RHEL6 guests. I think further problems should be reported as separate BZs. I've asked our Quality Engineering team about their VF passthrough test cases, in order to get a picture of what we support exactly. Thank you, Laszlo (In reply to comment #101) > May I ask if these were precisely the two tests described in comment 96? > > If those tests do not work, I'd like to investigate more. Although I'm not > sure how I'm going to debug them, since they worked in my test environment: > - the max_vfs=1 case as per comment 92 > - I just tested the max_vfs=2 case too, and it works as well. (Setup & > results in next comment.) Host (2.6.18-308.el5xen x86_64): grub entry: kernel /xen.gz-2.6.18-308.el5 dom0_mem=2048M iommu=1 loglvl=all \ guest_loglvl=all bootscrub=0 com1=115200,8n1 module /vmlinuz-2.6.18-308.el5xen ro root=/dev/VolGroup00/LogVol00 \ console=ttyS0,115200n81 pci_pt_e820_access=on ignore_loglevel /etc/modprobe.conf: options ixgbe max_vfs=2 options pciback \ hide="(0000:08:10.0)(0000:08:10.1)(0000:08:10.2)(0000:08:10.3)" /etc/modprobe.d/blacklist.conf: blacklist ixgbevf NIC: [root@dell-per820-02 ~]# ethtool -i eth0 driver: ixgbe version: 3.4.8-k firmware-version: 0.9-3 bus-info: 0000:08:00.0 [root@dell-per820-02 ~]# ethtool -i eth1 driver: ixgbe version: 3.4.8-k firmware-version: 0.9-3 bus-info: 0000:08:00.1 [root@dell-per820-02 ~]# ifconfig eth0 up [root@dell-per820-02 ~]# ifconfig eth1 up vm config: disk = [ "file:/var/lib/xen/images/guest.img,hda,w", ",hdc:cdrom,r" ] pci = [ "0000:08:10.0", "0000:08:10.1", "0000:08:10.2", "0000:08:10.3" ] Guest (2.6.32-279.5.2.el6.bz849223_pci_d0_Z x86_64): kernel cmdline: ... 
ignore_loglevel console=tty console=ttyS0,115200n81 \ xen_emul_unplug=ide-disks lspci: 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) ethtool reports for eth1..eth4 (in this order): commonly driver: ixgbevf version: 2.2.0-k firmware-version: specifically bus-info: 0000:00:06.0 bus-info: 0000:00:07.0 bus-info: 0000:00:04.0 bus-info: 0000:00:03.0 when bringing them up with ifconfig (in this order), the host reports ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 0 ixgbe 0000:08:00.1: eth1: VF Reset msg received from vf 0 ixgbe 0000:08:00.1: eth1: VF Reset msg received from vf 1 ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 1 [root@dhcp47-109 ~]# grep PCI-MSI /proc/interrupts 48: 370 PCI-MSI-edge eth4-rx-0 49: 376 PCI-MSI-edge eth4-tx-0 50: 11 PCI-MSI-edge eth4:mbx 51: 371 PCI-MSI-edge eth3-rx-0 52: 377 PCI-MSI-edge eth3-tx-0 53: 11 PCI-MSI-edge eth3:mbx 54: 374 PCI-MSI-edge eth1-rx-0 55: 379 PCI-MSI-edge eth1-tx-0 56: 11 PCI-MSI-edge eth1:mbx 57: 373 PCI-MSI-edge eth2-rx-0 58: 378 PCI-MSI-edge eth2-tx-0 59: 11 PCI-MSI-edge eth2:mbx ... Then I repeated this same test, with the only change that pci = [ "0000:08:10.0" ] was specified in the vm config. The VF worked, interrupts kept increasing. I shut down the guest, removed pciback in dom0, and checked the four VFs for FLReset. None of them support FLR. I just tried with Fedora 17 HVM guests aswell: - F17 HVM with 1 VF assigned: works OK. - F17 HVM with 2 VFs assigned: both VFs work OK. I'll re-test with RHEL6 HVM guests. ( Side point, in reply to comment #97, > The VFs have "FLReset-" in dom0.. that's weird. I've just found this pearl in the xend source (commit e094c492 "Use PCIe FLR for VF of Intel 82599 10GbE Controller", bug 581655): # Quirk for the VF of Intel 82599 10GbE Controller. # We know it does have PCIe FLR capability even if it doesn't # report that (dev_cap.PCI_EXP_DEVCAP_FLR is 0). # See the 82599 datasheet. ) (In reply to comment #104) > ( > Side point, in reply to comment #97, > > > The VFs have "FLReset-" in dom0.. that's weird. > > I've just found this pearl in the xend source (commit e094c492 "Use PCIe FLR > for VF of Intel 82599 10GbE Controller", bug 581655): > > # Quirk for the VF of Intel 82599 10GbE Controller. > # We know it does have PCIe FLR capability even if it doesn't > # report that (dev_cap.PCI_EXP_DEVCAP_FLR is 0). > # See the 82599 datasheet. > ) That's a good find! It very well explains why I'm able to pass VFs to many/multiple VMs at the same time.. and that's how it's supposed to be :) More info soon.. New tests.. rhel 5.8 x64 dom0 (2.6.18-308.13.1.el5xen): - max_vfs=1 in /etc/modprobe.conf, other settings similar to comment #102. - Reboot the physical server before starting tests. - 2 VFs visible in dom0 lspci & bound to pciback. - "ifconfig ethX up" both PF ports. rhel 6.3 x64 hvm guest (2.6.32-279.5.2.el6.bz849223_pci_d0_Z.x86_64): - start the guest with 1 VF passed through. - Configure an IP to the VF eth-interface: "ifconfig ethX <ip> netmask <netmask> up". - Run "ethtool ethX" and verify "Link detected: yes". - Try pinging the default gateway IP, no replies. - Try running "tcpdump -i ethX -nn", no packets visible. 
- Notice how rx/tx packet counters stay at zero in "ifconfig ethX" output. - Notice how interrupt counters stay at zero in /proc/interrupts. - shutdown the guest and start it again. - repeat the tests and notice it still won't work. - shutdown the guest and start it again. - repeat the tests and notice it still won't work. Ok, so the VF doesn't work in rhel6 hvm guest (which had the pci_d0 patched kernel). Next I tried with rhel 5.8 x64 hvm guest (2.6.18-308.13.1.el5): - Start the guest with the same 1 VF passed through. - Notice the VF interface name is "__tmp254339888" inside the rhel 5.8 hvm guest. # ethtool -i __tmp254339888 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:00:06.0 - Run "ifconfig __tmp254339888 <ip> netmask <netmask> up". - The console window of the rhel5 guest disappears and the guest kernel crashes, but the guest is still visible in "xm list" output. - I'll capture the guest kernel crash / stack trace and attach it later. - "xm destroy <rhel5hvm>". - start the rhel5 hvm guest again. - notice the VF interface is now actually called "ethX", like it should. - "ifconfig ethX <ip> netmask <netmask> up". - ping the gateway and notice the VF works OK. - Check the "ifconfig ethX" output and notice how rx/tx counters increase. - Check "/proc/interrupts" and notice how interrupt counters increase for the VF. - shutdown the rhel5 hvm guest and re-start it. - repeat the tests and notice the VF still works OK. - shutdown the rhel5 hvm guest and re-start it once again. - repeat the tests and notice the VF still works OK. So.. after trying to use the non-working rhel6 hvm guest the VF is left in some bad state, and rhel5 hvm guest crashes while trying to use it for the first time. On the second try the VF starts working OK in the rhel5 hvm guest. And then when the VF is working OK in the rhel5 hvm guest, it keeps working OK even if I reboot or shutdown + restart the rhel5 hvm guest multiple times. When the VF is in a working state (after running the rhel5.8 hvm guest) I tried booting into the rhel6 guest again, but the VF still fails there.. interrupt counters stay at zero, and the VF won't work. Next I passed the same VF to Fedora 17 HVM guest, and it works OK there. I also tried rebooting the physical server again, and starting the rhel5.8 hvm guest as the *first* guest - then the VF works immediately and I don't get any guest crashes. So the rhel5 guest kernel crash I'm seeing is related to rhel6 hvm guest leaving the VF to some bad state. Summary: - 1 VF works OK in rhel5 PV guest. - 1 VF works OK in rhel5 HVM guest. - 1 VF works OK in F17 HVM guest. - 1 VF doesn't work in RHEL6 HVM guest, interrupt counters stay at zero. Created attachment 616740 [details]
rhel58 x64 xen hvm guest ixgbevf_msix_clean_tx crash log stack trace
This is the rhel 5.8 hvm guest kernel crash that I get when trying to use an SR-IOV VF that has been put into some kind of "bad state" by the non-working rhel6.3 hvm guest.
(In reply to comment #101) > > If those precise tests work in your environment, I'd like to post the patch > internally and move this BZ to POST state. > The patch helps (and thus is probably needed), but unfortunately it doesn't fix all the problems in my system. The remaining problems with the rhel 6.3 hvm guest are: - passthrough 1 VF: the VF doesn't work, interrupt counters stay at zero. - passthrough 2 VFs: Only the first VF works, the second VF doesn't - the interrupt counters stay at zero. And like already mentioned these problems are RHEL6.3 guest specific - RHEL5.8 PV, RHEL5.8 HVM and F17 HVM guests do work OK for both of those test cases. Thanks a lot for all the help! I'll build a guest kernel with both attachment 615482 [details] and attachment 614596 [details]. Created attachment 617217 [details] rhel63 x64 xen hvm guest with 1 vf does not work 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg Created attachment 617218 [details] rhel63 x64 xen hvm guest with 2 vfs only first vf works 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg (In reply to comment #109) > I'll build a guest kernel with both attachment 615482 [details] and > attachment 614596 [details]. Thanks. New guest dmesg logs attached with kernel 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg. - 1 VF passthrough: the VF doesn't work, interrupt counters stay at zero. - 2 VF passthrough: the first VF works OK. The second VF doesn't work, interrupt counters stay at zero for the second VF. Thanks! Thanks. It's interesting that for PFs, branch 3 is invoked (comment 80), while for VFs, branch 2. Created attachment 617260 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with 1 vf
Created attachment 617261 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with 2 vfs
I attached qemu-dm logs for both 1vf and 2vfs testcases. If you take a look at the 2vfs case (where the first vf works, and the second vf doesn't), you can see these differences: First VF: pt_pci_read_config: Warning: Return ALL F from libpci read. [00:06.0][Offset:00h][Length:4] pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msix_update_one: now update msix entry 0 with pirq ff gvec 59 pt_msix_update_one: now update msix entry 1 with pirq fe gvec 61 pt_msix_update_one: now update msix entry 2 with pirq fd gvec 69 Second VF: pt_pci_read_config: Warning: Return ALL F from libpci read. [00:07.0][Offset:00h][Length:4] pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. .. and the test case with 1 vf passed through doesn't have *any* mention of either "pci_msix_writel" or "pt_msix_update_one" .. (In reply to comment #120) > .. and the test case with 1 vf passed through doesn't have *any* mention of > either "pci_msix_writel" or "pt_msix_update_one" .. Yes, let's focus on this one first. The corresponding kernel log (comment 113) testifies about three batches of MSI-X updates. The batches are identical. I can't see any reason to repeat the batch. The ixgbevf driver doesn't seem to do it in a loop. I have a theory involving module removal, based on what I've read on the net. udev is going crazy renaming these interfaces. In the mailing list thread someone claimed that udev modifies its persistent net rules and then reloads the driver at rename time. This would certainly explain the multiple batches of MSI-X initialization. If module removal does not tear down MSI-X to qemu-dm's liking, it could be an explanation. I shall extend the kernel debug patch with WARN invocations (in order to get stackdumps). I'll also upload a set of xen packages with more logging (to the tune of attachment 614372 [details]). (In reply to comment #119) > First VF: > [pt_msixctrl_reg_write x 3] > > Second VF: > [pt_msixctrl_reg_write x 3] > [pci_msix_writel x 9] The distribution of these log entries between the two VFs seems a bit different actually. The guest log in comment 114 has 15 entries (5 batches) of MSI-X setup, matching qemu-dm.log as follows: msix_capability_init() call for VF1 (see comment 63 & comment 65): > pt_pci_read_config: Warning: Return ALL F from libpci read. > [00:06.0][Offset:00h][Length:4] > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h 1st "write_msi_msg_desc: branch 2" batch for 00:06.0 (VF 1) happens here. 
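As a rough model of the guard that the "can not update msix entry ... since MSI-X is already function now" lines point to: the assumption is that qemu-dm refuses to retarget an MSI-X table entry once the function's MSI-X is enabled and the vector is unmasked, and only buffers the write. All names and structures below are invented; this is not the real pass-through code.

#include <stdio.h>

/* Toy model only.  Real qemu-dm keeps per-entry state for each passed-through
 * function; a single function with three vectors is enough to show why the
 * guest's table writes are rejected once MSI-X is already active. */
struct toy_msix_entry { int masked; unsigned long addr, data; };
struct toy_msix { int enabled; struct toy_msix_entry entry[3]; };

static void toy_msix_writel(struct toy_msix *msix, int nr,
                            unsigned long addr, unsigned long data)
{
        if (msix->enabled && !msix->entry[nr].masked) {
                printf("can not update msix entry %d: MSI-X already active\n",
                       nr);
                return;                  /* the write is not applied */
        }
        msix->entry[nr].addr = addr;     /* otherwise buffer the update */
        msix->entry[nr].data = data;
}

int main(void)
{
        struct toy_msix m = { .enabled = 1 };   /* vectors left unmasked */

        toy_msix_writel(&m, 0, 0xfee00000UL, 0x4059UL);
        toy_msix_writel(&m, 1, 0xfee00000UL, 0x4061UL);
        return 0;
}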
> pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msix_update_one: now update msix entry 0 with pirq ff gvec 59 > pt_msix_update_one: now update msix entry 1 with pirq fe gvec 61 > pt_msix_update_one: now update msix entry 2 with pirq fd gvec 69 msix_capability_init() call for VF2: > pt_pci_read_config: Warning: Return ALL F from libpci read. > [00:07.0][Offset:00h][Length:4] > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h 1st "write_msi_msg_desc: branch 2" batch for 00:07.0 (VF 2) happens here, with no effect. > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h The following updates don't come from msix_capability_init() -- no control register access: > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. This is the 2nd batch for VF1, and the 2nd and 3rd batches for VF2, all interleaved. They are triggered by write_msi_msg_desc(), which could be called from __pci_restore_msix_state() (see comment 65). Created attachment 617528 [details]
write_msi_msg_desc() debug messages (v2), now with WARN
Created attachment 617541 [details]
qemu-dm debug messages (v2), with PCI BDFs
for xen-3.0.3-135.el5_8.5
trailing whitespace was stripped from "tools/ioemu/hw/pass-through.c" first
Pasi, can you please do the following tests?

+-------------------------+-------------+---------------------------+
| host (pciback in sync)  | max_vfs=1   | max_vfs=2                 |
+-------------------------+-------------+-------------+-------------+
| # of passed-through VFs | 1           | 1           | 2           |
+-------------------------+------+------+------+------+------+------+
| guest                   | 5.8  | 6.3  | 5.8  | 6.3  | 5.8  | 6.3  |
+-------------------------+------+------+------+------+------+------+
| results thus far (with  | pass | FAIL | pass | FAIL | pass | FAIL |
| comment reference)      | c106 | c106 | c108 | c113 | c108 | c114 |
+-------------------------+------+------+------+------+------+------+
| requesting qemu-dm log  | c124 | c124 | c124 | c124 | c124 | c124 |
+-------------------------+------+------+------+------+------+------+
| requesting guest log    | base | c123 | base | c123 | base | c123 |
+-------------------------+------+------+------+------+------+------+

6 tests, 12 log files; please feel free to upload the logs in a tarball.

Please
- reboot the host between the tests,
- specify ignore_loglevel everywhere,
- guests are all HVM.

Thanks!

... and please describe the end result of each test, ie. whether all, or some, or no VFs work. Thanks!

Created attachment 617718 [details]
sr-iov vf passthru test results to el5.8 and el6.3 hvm guests
+-------------------------+-------------+---------------------------+
| host (pciback in sync)  | max_vfs=1   | max_vfs=2                 |
+-------------------------+-------------+-------------+-------------+
| # of passed-through VFs | 1           | 1           | 2           |
+-------------------------+------+------+------+------+------+------+
| guest                   | 5.8  | 6.3  | 5.8  | 6.3  | 5.8  | 6.3  |
+-------------------------+------+------+------+------+------+------+
| results of the test     | pass | FAIL | pass | FAIL | pass | PART |
+-------------------------+------+------+------+------+------+------+

Notes about the test results:
- "FAIL": the VF doesn't work, no interrupts received for the VF in /proc/interrupts.
- "PART": partial success; the first VF works OK, the second VF fails and doesn't get any interrupts in /proc/interrupts.
- "pass": the VF interface has a name like "__tmp1960421532" in the el5.8 guest, and doing "ifconfig __tmp1960421532 up" crashes the guest kernel. If the guest has 2 VFs passed through, this happens only for the first VF. After restarting the el5.8 guest it works OK without problems. The crash log/stack trace is in comment #107. So in all el5.8 guest tests I had to reboot/restart the guest once before doing the actual test and capturing the dmesg log.

And I forgot to mention about "pass" for the el5.8 guest with 2 VFs passed through: both of the VFs worked OK (after restarting the guest once).

And I rebooted the physical server 6 times; once before every test.

I think I might have found a clue.

$ grep squash *
max_vfs1-1vf-el63-qemu-dm-log.txt:squash iomem [f4024000, f4024030).
max_vfs2-1vf-el63-qemu-dm-log.txt:squash iomem [f4024000, f4024030).
max_vfs2-2vf-el63-qemu-dm-log.txt:squash iomem [f402c000, f402c030).

The squashed iomem region is exactly the one that is used for programming the *last VF* passed through. This is the reason why there are no pci_msix_writel() messages for the last VF passed through, ie. why a buffered update is not prepared and then flushed -- even though the guest issues those writes by now (due to the PCI_D0 patch), qemu-dm simply doesn't have a handler for the range.

In the max_vfs=2, rhel63 guest, 1vf->2vf transition, the squashed iomem "moves" (see above), and in the second case the pci_msix_writel() messages show up for the first VF, now that the squashed region "moved over" to the range belonging to the 2nd (= last) VF.

In the qemu-dm logs for the 5.8 guests, there are no such messages. In the qemu-dm logs for the 6.3 guests, the message always shows up in a block like this:

+Unknown PV product 3 loaded in guest
+PV driver build 1
+region type 0 at [f4000000,f4020000).
+squash iomem [f4024000, f4024030).
+region type 1 at [c200,c240).

The "squash iomem" message is printed on the following call path:

  platform_fixed_ioport_write2  [tools/ioemu/hw/xen_platform.c]
    pci_unplug_netifs           [tools/ioemu/hw/pci.c]
      unregister_iomem          [tools/ioemu/target-i386-dm/exec-dm.c]

The platform_fixed_ioport_write2() --> pci_unplug_netifs() call depends on UNPLUG_ALL_NICS, which I think is something that the RHEL-6 guest requests.

... Confirmed, see xen_unplug_emulated_devices() [arch/x86/xen/platform-pci-unplug.c] in the RHEL-6 kernel. In the absence of the "xen_emul_unplug" command line parameter, a default value is used, which is composed to have XEN_UNPLUG_ALL_NICS. (See the rhel63 guest dmesgs; all three contain the "unplug emulated NICs" message.) The RHEL-5 guest has no xen_emul_unplug support.

pci_unplug_netifs() iterates over all netifs, but it will not unplug one if test_pci_slot() returns 1 for it.
I'll have to dig deeper into that test. "xen_emul_unplug=never" (or "xen_emul_unplug=ide-disks" too, see comment 96) prevents this, but I think qemu-dm should not unplug the VF. We backported that check for bug 665032. Our version of test_pci_slot() will happily allow pci_unplug_netifs() to unplug any ethernet device not in { 00:06.0, 00:07.0 }. When we want to pass through more than two PCI devs, we have to use xen_emul_unplug=..., so that the UNPLUG_ALL_NICS default is not in effect, and we don't reach pci_unplug_netifs(). When we pass through <= 2 devices, test_pci_slot() nonetheless returns 0 for the last one, its dpci_infos.php_devs[php_slot].valid entry is "false". That flag is set to 1 in __insert_to_pci_slot(), and I don't see why, based on the qemu-dm logs and register_real_device(). I'll have to extend the debug patch. ... commits ea4860c1 and f3460ff8 from <git://xenbits.xen.org/qemu-xen-unstable.git> seem somewhat relevant, but I think they're too intrusive. (In reply to comment #133) > In the qemu-dm logs for the 6.3 guests, the message always shows up in such > a block: > > +Unknown PV product 3 loaded in guest > +PV driver build 1 > +region type 0 at [f4000000,f4020000). > +squash iomem [f4024000, f4024030). > +region type 1 at [c200,c240). I re-checked an older qemu-dm log from a RHEL-6.3 guest here, from comment 61. (At that time we were still looking for PCI_D0 in the guest, but it's irrelevant wrt. "squash iomem" in qemu-dm.) It only has Unknown PV product 3 loaded in guest PV driver build 1 no "squash iomem". I've reviewed some others as well: - comment 32 : 1 VF, 1 squash - comment 38 : 4 VFs, 3 squashes (although only two squashes match VFs) - comment 117: 1 VF, 1 squash (guest had PCI_D0 patch) - comment 118: 2 VFs, 1 squash (ditto) qemu-dm works differently in our respective environments in this regard. So hmm.. do you want me to try something? disable unplug on the guest cmdline, perhaps? Thanks. Yes, if you could re-run the three 6.3 guest tests (same RPMs as last time), with "xen_emul_unplug=ide-disks" (or "xen_emul_unplug=never") on the guest cmdline, that would be great, just to verify the theory. But I'll instrument qemu-dm some more and upload a new build shortly. Created attachment 618009 [details]
qemu-dm debug messages (v3), track "valid" flag too
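For orientation, a toy sketch of the unplug decision being instrumented here: pci_unplug_netifs() keeps a NIC only when test_pci_slot() says its slot holds a registered passthrough device. The slot numbers and the "valid" flag mirror the qemu-dm log fields quoted above; the data structures are invented and are not the real RHEL-5 qemu-dm code.

#include <stdio.h>

/* Toy sketch only.  Slot 4 stands for the emulated e1000 NIC, slot 6 for the
 * passed-through VF registered by register_real_device(). */
#define PHP_SLOT_START 6
#define PHP_SLOT_END   8

struct toy_php_dev { int valid; };
static struct toy_php_dev php_devs[PHP_SLOT_END - PHP_SLOT_START];

/* Returns 1 if the slot holds a registered passthrough device (keep it),
 * 0 otherwise (an emulated NIC in that slot may be unplugged). */
static int toy_test_pci_slot(int slot)
{
        if (slot < PHP_SLOT_START || slot >= PHP_SLOT_END)
                return 0;
        return php_devs[slot - PHP_SLOT_START].valid;
}

int main(void)
{
        php_devs[0].valid = 1;   /* VF at guest 00:06.0 */

        printf("slot 4 (emulated e1000):    keep=%d\n", toy_test_pci_slot(4));
        printf("slot 6 (passed-through VF): keep=%d\n", toy_test_pci_slot(6));
        return 0;
}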
Pasi, (In reply to comment #140) > Created attachment 618009 [details] > qemu-dm debug messages (v3), track "valid" flag too please repeat the "max_vfs1-1vf-el63" test with this Xen package. (Reusing the guest kernel from comment 123, ie. from the most recent tests.) Please make sure that "xen_emul_unplug" is absent from the guest cmdline. Thanks! (In reply to comment #139) > Yes, if you could re-run the three 6.3 guest tests (same RPMs as last time), > with "xen_emul_unplug=ide-disks" (or "xen_emul_unplug=never") on the guest > cmdline, that would be great, just to verify the theory. > I just quickly did the "max_vfs1_1vf_el63" test with "xen_emul_unplug=never" on the guest kernel cmdline, and now the VF works in the guest !! (I used the previous version of xen - I'll try the latest instrumented xen soon). Created attachment 618145 [details] sr-iov vf passthru to el6.3 test results for comment #142 (In reply to comment #142) > Pasi, > > (In reply to comment #140) > > Created attachment 618009 [details] > > qemu-dm debug messages (v3), track "valid" flag too > > please repeat the "max_vfs1-1vf-el63" test with this Xen package. (Reusing > the guest kernel from comment 123, ie. from the most recent tests.) > > Please make sure that "xen_emul_unplug" is absent from the guest cmdline. > > Thanks! Done and logs uploaded. Now the VF didn't work in the el6.3 guest, as expected. I didn't use xen_emul_unplug. (In reply to comment #145) > Done and logs uploaded. Now the VF didn't work in the el6.3 guest, as > expected. > I didn't use xen_emul_unplug. Thanks, we're getting closer. pci_unplug_netifs: x=32: ethernet controller test_pci_slot: 1: slot=4 region type 0 at [f4000000,f4020000). squash iomem [f4024000, f4024030). region type 1 at [c200,c240). This log segment is generated when qemu-dm unplugs the emulated NIC, 00:04.0. pci_unplug_netifs: x=48: ethernet controller test_pci_slot: 1: slot=6 test_pci_slot: 2: php_slot=0 valid=1 This log segment is generated when qemu-dm (pci_unplug_netifs()) investigates and correctly skips (ie. does not unplug) the VF, 00:06.0. The iomem required to set up MSI-X for the VF (00:06.0) is squashed when qemu-dm (correctly) unplugs the emulated card (00:04.0). "Region type 0 at [f4000000, f4020000)" is from #define PNPMMIO_SIZE 0x20000 in [tools/ioemu/hw/e1000.c] -- see "vif = [ '..., model=e1000' ]" in comment 0. Note that these memory ranges don't overlap: region type 0 at [f4000000,f4020000) <--- e1000 squash iomem [f4024000, f4024030) <--- ixgbevf The bug is in unregister_iomem() [tools/ioemu/target-i386-dm/exec-dm.c]. See the following two references: http://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=commitdiff;h=8cc8a365 http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1805 I'll build a xen package with that patch backported soon. Created attachment 618240 [details]
[1/2] Backport a single hunk from qemu-xen-unstable commit 13669683
commit 13669683830d4508b6c8ed87de088785fa95ed3c
Author: Ian Jackson <ian.jackson.com>
Date: Mon Mar 16 13:47:18 2009 +0000
Post-merge compilation fixes
Signed-off-by: Ian Jackson <ian.jackson.com>
as a dependency for the next patch.
---
tools/ioemu/target-i386-dm/exec-dm.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
Created attachment 618241 [details] [2/2] qemu-dm: fix unregister_iomem() Backport of qemu-xen-unstable... commit 8cc8a3651c9c5bc2d0086d12f4b870fc525b9387 Author: Jan Beulich <JBeulich> Date: Tue Feb 7 18:42:56 2012 +0000 This function (introduced quite a long time ago in e7911109f4321e9ba0cc56a253b653600aa46bea - "disable qemu PCI devices in HVM domains") appears to be completely broken, causing the regression reported in http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1805 (due to the newly added caller of it in 56d7747a3cf811910c4cf865e1ebcb8b82502005 - "qemu: clean up MSI-X table handling"). It's unclear how the function can ever have fulfilled its purpose: the value returned by iomem_index() is *not* an index into mmio[]. Additionally, fix two problems: - unregister_iomem() must not clear mmio[].start, otherwise cpu_register_physical_memory() won't be able to re-use the previous slot, thus causing a leak - cpu_unregister_io_memory() must not check mmio[].size, otherwise it won't properly clean up entries (temporarily) squashed through unregister_iomem() Signed-off-by: Jan Beulich <jbeulich> Tested-by: Stefano Stabellini <stefano.stabellini.com> Tested-by: Yongjie Ren <yongjie.ren> --- tools/ioemu/target-i386-dm/exec-dm.c | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) Pasi, as I wrote in my email, please - pick a guest kernel with the PCI_D0 patch in comment 87 / comment 92, - optionally with the v2 debug patch in comment 123, and - pick a xen userspace with the series in comment 147 - comment 148, - optionally with the v3 debug patch in comment 140. Then please repeat the three 6.3 test from comment 129: - please reboot the host again between tests, - do not specify the xen_emul_unplug cmdline param in the guest. Thanks! New tests with PCI_D0 patched and debug enabled el6.3 guest kernel (87+92+123) and with patched + debug-enabled (140+147+148) xen/qemu-dm rpms: +-------------------------+-------------+---------------------------+ | host (pciback in sync) | max_vfs=1 | max_vfs=2 | +-------------------------+-------------+-------------+-------------+ | # of passed-through VFs | 1 | 1 | 2 | +-------------------------+------+------+------+------+------+------+ | guest | 5.8 | 6.3 | 5.8 | 6.3 | 5.8 | 6.3 | +-------------------------+------+------+------+------+------+------+ | results of the test | pass | pass | pass | pass | pass | pass | +-------------------------+------+------+------+------+------+------+ Both rhel5.8 and rhel6.3 HVM guests work OK now ! (ok, almost, rhel5 still has the weird kernel crash during the first time the VM is started, but that's a separate issue, and I'll file a separate bug about that). So it looks like solving this bug needs: - rhel6 kernel patch for the PCI_D0 issue. - rhel5 xen qemu-dm patch for the nic unplug / iomem issue. Thanks a lot ! Created attachment 618312 [details] sr-iov vf passthru test results for comment 150 to el5.8 and el6.3 hvm guests Great job Laszlo, and thanks Pasi for the collaboration!!! I don't think we need to fix RHEL5 passthrough though. I cloned this bug to bug 861349 for the RHEL5 qemu-dm fix, and requested an exception for RHEL5.9. Thank you for the persistent testing! I've cloned this BZ to bug 861352 for the userspace patch. That was a great race condition, 861352 - 861349 = 3 :) Since Paolo was first, I'm closing my clone as a duplicate of his. 
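For reference, a toy reconstruction of the unregister_iomem() failure mode described in the [2/2] commit message above. The array layout and helpers are invented for illustration; the real code lives in tools/ioemu/target-i386-dm/exec-dm.c. The two points modeled are the ones named in the commit message: the buggy version indexed mmio[] with a value that is not actually a position in that table (squashing the wrong range), and a correct version must look the range up by start address and clear only its size so the slot stays reusable.

#include <stdio.h>

#define MAX_MMIO 8

struct mmio_range { unsigned long start, size; };
static struct mmio_range mmio[MAX_MMIO];   /* registered MMIO handlers */

/* Buggy flavour: 'index' comes from something that is NOT a position in
 * mmio[], so unplugging the emulated e1000 can squash the VF's MSI-X range
 * instead of the e1000's own range -- the "squash iomem [f4024000, ...)"
 * lines seen in the qemu-dm logs. */
static void unregister_iomem_buggy(unsigned int index)
{
        printf("squash iomem [%lx, %lx).\n", mmio[index].start,
               mmio[index].start + mmio[index].size);
        mmio[index].start = mmio[index].size = 0;
}

/* Fixed flavour, per the commit message: find the range by its start address
 * and clear only the size, so the slot can be re-used later. */
static void unregister_iomem_fixed(unsigned long start)
{
        unsigned int i;

        for (i = 0; i < MAX_MMIO; i++)
                if (mmio[i].size && mmio[i].start == start) {
                        mmio[i].size = 0;
                        return;
                }
}

int main(void)
{
        mmio[0] = (struct mmio_range){ 0xf4000000UL, 0x20000UL }; /* e1000    */
        mmio[1] = (struct mmio_range){ 0xf4024000UL, 0x30UL };    /* VF MSI-X */

        unregister_iomem_buggy(1);              /* wrong range squashed      */
        unregister_iomem_fixed(0xf4000000UL);   /* only the e1000 range goes */
        return 0;
}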
Thanks a lot to everyone involved, it took a while and a lot of testing, but luckily it's figured out now :) btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? (In reply to comment #156) > btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? I posted the pci_d0 patch with reference to this BZ (bug 849223), and the short qemu-dm series for the clone bug 861349. Originally we couldn't decide if this BZ should belong to RHEL-6, component kernel, or RHEL-5, component xen. Ultimately both had to be modified. By the time we figured it out, I had moved this bug to RHEL-6, component kernel (see comment 50 and comment 51, and click the History link and look for the Component/Version change), so the clone was made for RHEL-5, component xen. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. (In reply to comment #157) > (In reply to comment #156) > > > btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? > > I posted the pci_d0 patch with reference to this BZ (bug 849223), and the > short qemu-dm series for the clone bug 861349. > > Originally we couldn't decide if this BZ should belong to RHEL-6, component > kernel, or RHEL-5, component xen. Ultimately both had to be modified. By the > time we figured it out, I had moved this bug to RHEL-6, component kernel > (see comment 50 and comment 51, and click the History link and look for the > Component/Version change), so the clone was made for RHEL-5, component xen. > Yep, makes sense. Thanks again for the big amount of debugging / instrumenting / research work for this bug ! FYI: I added the RHEL5.8 HVM guest "kernel crash when running ifup" issue as a separate bug #862862: https://bugzilla.redhat.com/show_bug.cgi?id=862862 Patch(es) available on kernel-2.6.32-318.el6 This bug reproduced on the same machine, verify it with: Version: Host(RHEL5.9): - kernel version: 2.6.18-343.el5xen - Xen version: xen-3.0.3-142.el5 - machine/CPU: dell-per510/Intel Xeon Guest(RHEL6.4): - Kernel version: 2.6.32-335 Steps: 1. enable VFs in host 2. assign VFs to guest 3. 
ping each vf of guest from host

Results:

[in guest]
[root@dhcp-8-202 ~]# lspci | grep 82599
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:08.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:09.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0a.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0b.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0c.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0d.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0e.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0f.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:12.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

[in host]
ping each vf of guest successfully from host

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html