Bug 865736

Summary: Only 2 VF can be seen in RHEL5.9 PV guest
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.9
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: bfan
Assignee: Laszlo Ersek <lersek>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ddutile, drjones, hhuang, leiwang, lersek, pasik, plougher, wshi, xen-maint
Target Milestone: rc
Fixed In Version: kernel-2.6.18-360.el5
Doc Type: Bug Fix
Last Closed: 2013-09-30 23:38:19 UTC
Type: Bug
Bug Blocks: 514489
Attachments (description; flags):
- warn if acpi_os_map_memory() fails in acpi_tb_find_rsdp(); none
- debug messages in pci_scan_slot(); none
- (proposed dom0 patch) xen PV passthru: assign SR-IOV virtual functions to separate virtual slots; none
- (proposed dom0 patch, v2) xen PV passthru: assign SR-IOV virtual functions to separate virtual slots; none

Description bfan 2012-10-12 09:39:09 UTC
Description of problem:
After assigning 14 VFs to a RHEL5.9 PV guest, only 2 VFs can be seen in the guest.

This is not a duplicate of Bug 835768 - SR-IOV: Given 14 VFs to RHEL7 guest but only 2 enabled

That bug is for a RHEL7 HVM guest and is related to emul_xen_unplug,
whereas this one is for a RHEL5 PV guest, which does not have that option.

Version-Release number of selected component (if applicable):
Host:  RHEL5.9 2.6.18-343.el5xen, xen-3.0.3-142.el5
Guest: RHEL5.9 2.6.18-343.el5xen

How reproducible:
100%

Steps to Reproduce:
[in host]:
1. Enable VFs in Domain0 with max_vfs=7.
2. Bind the PCI devices to the pciback driver:
  # echo ${VF_device_ID} > /sys/bus/pci/drivers/igbvf/unbind
  # echo ${VF_device_ID} > /sys/bus/pci/drivers/pciback/new_slot
  # echo ${VF_device_ID} > /sys/bus/pci/drivers/pciback/bind
3. Create a guest with the VFs:
  # xm cr ${file.cfg} pci=${VF_device_ID} pci=${VF_device_ID} ... pci=${VF_device_ID}
4. Check the VFs with "xm pci-list":
  # xm pci-list ${domainID}

[in guest]:
1. Check the VFs with "lspci":
  # lspci

2. Check the VFs with "ifconfig":
  # ifconfig

Actual results:
Only 2 VFs can be seen in the guest:
# lspci | grep 82576
00:00.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:01.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

Expected results:
All VFs assigned to the guest can be seen.

Additional info:
Same issue with RHEL5.8, so it is not a regression.

Comment 3 Laszlo Ersek 2012-10-16 07:38:22 UTC
From the guest dmesg:

  pcifront pci-0: Installing PCI frontend
  pcifront pci-0: Creating PCI Frontend Bus 0000:00
  ACPI Error (tbxfroot-0512): Could not map memory at 0000040E for length 2
                              [20060707]
  ACPI Exception (tbxfroot-0400): AE_NO_MEMORY, RSDP structure not found -
                                  Flags=8 [20060707]
  ACPI: System description tables not found
  pci 0000:00:00.0: reg 10: [mem 0xf4248000-0xf424bfff 64bit]
  pci 0000:00:00.0: reg 1c: [mem 0xf4268000-0xf426bfff 64bit]


pcifront_scan_root() [drivers/xen/pcifront/pci_op.c] prints "Creating PCI Frontend Bus". I'll have to dig into it.

pcifront_scan_root
  pci_scan_bus_parented
    pci_create_bus
    pci_scan_child_bus
      pci_scan_slot /* for each slot */
        pci_scan_single_device /* for each function */
          pci_scan_device
            pci_setup_device
              pci_read_bases
                __pci_read_base
                  prints "pci 0000:00:00.0: reg 10: [mem..." etc
          pci_device_add
          pci_scan_msi_device
  pci_walk_bus
    pcifront_claim_resource
      pci_claim_resource
        insert_resource
          __request_resource
  pci_bus_add_devices
    pci_bus_add_device

When scanning the pcifront PCI bus, only one passthru VF is found.

Additionally, some time during the scan, an attempt is made to find the ACPI Root System Description Pointer. The attempt fails, which is why the tbxfroot-0512 / tbxfroot-0400 errors are printed by acpi_tb_find_rsdp() and acpi_find_root_pointer(). Address 0000040E is ACPI_EBDA_PTR_LOCATION.

The mapping attempt for that address is made with acpi_os_map_memory(), which ultimately ends up in __ioremap() [arch/i386/mm/ioremap-xen.c]. I notice that "xm dmesg" contains

(XEN) mm.c:630:d2 Non-privileged (2) attempt to map I/O space 00000000

I'll add a WARN() to acpi_tb_find_rsdp() to get a stack trace and see where exactly it is called during the bus scan. I have a fleeting suspicion that this ACPI error gets in the way of enumerating the rest of the devices.

Two side points:
- ACPI is disabled in PV domU's (domU dmesg: "ACPI: Interpreter disabled."; see "acpi_disabled" in "arch/x86_64/kernel/setup-xen.c"),
- a web search for the tbxfroot errors at the top turns up a few hits, even Xen-related, but nothing usable.

Comment 4 Laszlo Ersek 2012-10-16 10:37:32 UTC
Created attachment 628089
warn if acpi_os_map_memory() fails in acpi_tb_find_rsdp()

pcifront pci-0: Installing PCI frontend
pcifront pci-0: Creating PCI Frontend Bus 0000:00
WARNING: at drivers/acpi/tables/tbxfroot.c:508 acpi_tb_find_rsdp()

Call Trace:
 [<ffffffff803870fb>] acpi_find_root_pointer+0x63/0x20f
 [<ffffffff803729a0>] acpi_os_get_root_pointer+0x9/0x26
 [<ffffffff803872fa>] acpi_get_firmware_table+0x53/0x278
 [<ffffffff8039607f>] acpi_hest_firmware_first_pci+0x42/0x1e4
 [<ffffffff80353e59>] __pci_bus_find_cap+0x48/0x57
 [<ffffffff80352b04>] pci_setup_device+0xd5/0x2d2
 [<ffffffff80352dff>] pci_scan_single_device+0xfe/0x12f
 [<ffffffff80352e4e>] pci_scan_slot+0x1e/0x51
 [<ffffffff8035332b>] pci_scan_child_bus+0x23/0x99
 [<ffffffff80353437>] pci_scan_bus_parented+0x16/0x21
 [<ffffffff803c5a22>] pcifront_scan_root+0x98/0x117
...
ACPI Error (tbxfroot-0513): Could not map memory at 0000040E for length 2 [20060707]
ACPI Exception (tbxfroot-0400): AE_NO_MEMORY, RSDP structure not found - Flags=8 [20060707]
ACPI: System description tables not found
pci 0000:00:00.0: reg 10: [mem 0xf4248000-0xf424bfff 64bit]
pci 0000:00:00.0: reg 1c: [mem 0xf4268000-0xf426bfff 64bit]

Comment 5 Laszlo Ersek 2012-10-16 11:06:58 UTC
pcifront_scan_root
  pci_scan_bus_parented
    pci_create_bus
    pci_scan_child_bus
      pci_scan_slot /* for each slot */
        pci_scan_single_device /* for each function */
          pci_scan_device
            pci_setup_device
              set_pci_aer_firmware_first        <---- traversed now
                acpi_hest_firmware_first_pci
                  acpi_get_firmware_table("HEST")
                    acpi_os_get_root_pointer
                      acpi_find_root_pointer
                        acpi_tb_find_rsdp
              pci_read_bases
                __pci_read_base
                  prints "pci 0000:00:00.0: reg 10: [mem..." etc
          pci_device_add
          pci_scan_msi_device
  pci_walk_bus
    pcifront_claim_resource
      pci_claim_resource
        insert_resource
          __request_resource
  pci_bus_add_devices
    pci_bus_add_device

set_pci_aer_firmware_first() has return type "void", and it only decides
about "pdev->aer_firmware_first". So its failure seems to be unrelated to
the premature end of the bus scan.

Comment 6 Laszlo Ersek 2012-10-16 11:41:31 UTC
Created attachment 628112
debug messages in pci_scan_slot()

The scan terminates because the device reports itself as non-multi-function and the first found function is func 0.

pcifront pci-0: Installing PCI frontend
pcifront pci-0: Creating PCI Frontend Bus 0000:00
pci_scan_slot: scan_all_fns=1
pci_scan_slot: func=0 devfn=0
pci 0000:00:00.0: reg 10: [mem 0xf4248000-0xf424bfff 64bit]
pci 0000:00:00.0: reg 1c: [mem 0xf4268000-0xf426bfff 64bit]
pci_scan_slot: dev=ffff88018058b000
pci 0000:00:00.0: pci_scan_slot: nr=0 multifunction=0

			/*
			 * If this is a single function device,
			 * don't scan past the first function.
			 */
			if (!dev->multifunction) {
				if (func > 0) {
					dev->multifunction = 1;
				} else {
					break;
				}
			}

dev->multifunction comes from pci_setup_device() (which is called, through several layers, inside the above loop); paraphrasing:

  u8 hdr_type;

  pci_read_config_byte(dev, PCI_HEADER_TYPE, &hdr_type);
  dev->multifunction = !!(hdr_type & 0x80);
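
The effect of that check can be reproduced outside the kernel. What follows is a minimal user-space sketch, not the kernel code itself: `model_fn`, `is_multifunction()` and `model_scan_slot()` are hypothetical names, and the handling of absent functions (the `scan_all_fns` branch) is an assumption about the part of the loop that the excerpt above does not show. It illustrates why a slot whose function 0 reports header type 0x00 yields exactly one device, no matter how many further functions the backend exposes in that slot.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_FUNCS_PER_SLOT      8
#define HDR_TYPE_MULTIFUNCTION  0x80    /* bit 7 of the header type byte */

/* Hypothetical model of one function in a slot (not a kernel structure). */
struct model_fn {
    bool    present;    /* does a config read find a device here? */
    uint8_t hdr_type;   /* value read from config offset 0x0e */
};

/* Same decoding as the paraphrased pci_setup_device() lines. */
bool is_multifunction(uint8_t hdr_type)
{
    return (hdr_type & HDR_TYPE_MULTIFUNCTION) != 0;
}

/*
 * Simplified model of the slot scan: returns how many functions are found.
 * Assumption: an absent function 0 ends the scan unless scan_all_fns is set.
 */
int model_scan_slot(const struct model_fn fns[PCI_FUNCS_PER_SLOT],
                    bool scan_all_fns)
{
    int nr = 0;

    for (int func = 0; func < PCI_FUNCS_PER_SLOT; func++) {
        if (!fns[func].present) {
            if (func == 0 && !scan_all_fns)
                break;
            continue;
        }
        nr++;
        /* A single-function device at function 0 stops the scan early. */
        if (!is_multifunction(fns[func].hdr_type) && func == 0)
            break;
    }
    return nr;
}
```

With seven present functions that all report header type 0x00, `model_scan_slot()` returns 1. Setting bit 7 in function 0's header type lets the same loop reach all seven.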

I'll have to see why the device is reported as single-function, when the host clearly constructs it as multi-function. From the dom0 dmesg:

pciback: vpci: 0000:03:10.0: assign to virtual slot 0
pciback: vpci: 0000:03:10.1: assign to virtual slot 0 func 1
pciback: vpci: 0000:03:10.2: assign to virtual slot 0 func 2
pciback: vpci: 0000:03:10.3: assign to virtual slot 0 func 3
pciback: vpci: 0000:03:10.4: assign to virtual slot 0 func 4
pciback: vpci: 0000:03:10.5: assign to virtual slot 0 func 5
pciback: vpci: 0000:03:10.6: assign to virtual slot 0 func 6

(Printed by pciback_add_pci_dev(), file "drivers/xen/pciback/vpci.c".)

Comment 7 Laszlo Ersek 2012-10-16 12:56:17 UTC
The non-multifunction setting seems to come straight from the igbvf device.

When the PV domU goes through

  pci_scan_slot
    pci_scan_single_device
      pci_scan_device
        pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, &l)
        pci_setup_device
          pci_read_config_byte(dev, PCI_HEADER_TYPE, &hdr_type)
          dev->multifunction = !!(hdr_type & 0x80);

dom0 reports (with pciback.verbose_request=1)

  pciback: 0000:03:10.0: read 4 bytes at 0x0
  pciback: 0000:03:10.0: read 4 bytes at 0x0 = 10ca8086
  pciback: 0000:03:10.0: read 1 bytes at 0xe
  pciback: 0000:03:10.0: read 1 bytes at 0xe = 0

(PCI_VENDOR_ID == 0x00, PCI_HEADER_TYPE == 0x0e -- offsets in config space).
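
For readers decoding the two reads in the pciback log, here is a small sketch. The helper names are hypothetical. It relies only on the standard config space layout, in which the 32-bit value at offset 0x00 holds the vendor ID in its low 16 bits and the device ID in its high 16 bits, and bit 7 of the byte at offset 0x0e is the multi-function flag.

```c
#include <stdint.h>

/* Low 16 bits of the dword at config offset 0x00: vendor ID. */
uint16_t pci_id_vendor(uint32_t dword0)
{
    return (uint16_t)(dword0 & 0xffffu);
}

/* High 16 bits of the dword at config offset 0x00: device ID. */
uint16_t pci_id_device(uint32_t dword0)
{
    return (uint16_t)(dword0 >> 16);
}

/* Bit 7 of the header type byte (offset 0x0e): multi-function flag. */
int hdr_type_is_multifunction(uint8_t hdr_type)
{
    return (hdr_type & 0x80u) != 0;
}

/* Remaining 7 bits: header layout (0 is the normal, non-bridge layout). */
uint8_t hdr_type_layout(uint8_t hdr_type)
{
    return (uint8_t)(hdr_type & 0x7fu);
}
```

Applied to the log, `10ca8086` splits into vendor 0x8086 and device 0x10ca, and the header type value 0 has the multi-function flag clear.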

The pciback messages are printed by pciback_config_read() [drivers/xen/pciback/conf_space.c], which first retrieves the value from the real device, then modifies it as appropriate, based on the quirks/overlays installed for the given config space offset.

The list of config space header overlays can be found in "drivers/xen/pciback/conf_space_header.c", array "header_common". Offset PCI_HEADER_TYPE is not overlaid.

Checking in dom0:

# lspci -s 03:10.0 -v -x -nn
03:10.0 Ethernet controller [0200]: Intel Corporation 82576 Virtual Function [8086:10ca] (rev 01)
        Subsystem: Intel Corporation Device [8086:a04c]
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at f4248000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at f4268000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: pciback
        Kernel modules: igbvf
00: ff ff ff ff 04 00 10 00 01 00 00 02 00 00 00 00
                                              ^^
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 4c a0
30: 00 00 00 00 70 00 00 00 00 00 00 00 00 00 00 00

Offset 0x0e has value 0x00, which decodes as non-multi-function (MSB is clear), PCI_HEADER_TYPE_NORMAL.

(It is interesting that vendor & device are both reported as 0xFFFF, but "header_common" does have overlays for those.)

Compare the physical function:

# lspci -s 03:00.0 -v -x -nn
03:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
        Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter [8086:a04c]
        Flags: bus master, fast devsel, latency 0, IRQ 21
        Memory at f4200000 (32-bit, non-prefetchable) [size=128K]
        Memory at f4400000 (32-bit, non-prefetchable) [size=4M]
        I/O ports at d000 [size=32]
        Memory at f4240000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-6c-22-d0
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: igb
        Kernel modules: igb
00: 86 80 c9 10 07 05 10 00 01 00 00 02 10 00 80 00
                                              ^^
10: 00 00 20 f4 00 00 40 f4 01 d0 00 00 00 00 24 f4
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 4c a0
30: 00 00 00 00 40 00 00 00 00 00 00 00 05 01 00 00

Here MSB is set (multi-function device).
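
The byte marked with "^^" in both dumps can also be pulled out mechanically. This is a minimal sketch assuming the "lspci -x" row format shown above (a hex base offset, a colon, then up to 16 space-separated hex bytes); `lspci_row_byte()` is a hypothetical helper, not part of pciutils.

```c
#include <stdlib.h>

/*
 * Return the config space byte at `offset` from one "lspci -x" row such as
 * "00: 86 80 c9 10 ...", or -1 if the row does not cover that offset or
 * cannot be parsed.
 */
int lspci_row_byte(const char *row, unsigned long offset)
{
    char *end;
    unsigned long base = strtoul(row, &end, 16);
    unsigned long value = 0;
    const char *p;

    if (end == row || *end != ':')
        return -1;                  /* no "<offset>:" prefix */
    if (offset < base || offset - base >= 16)
        return -1;                  /* offset is not in this row */

    p = end + 1;
    for (unsigned long i = 0; i <= offset - base; i++) {
        value = strtoul(p, &end, 16);
        if (end == p || value > 0xff)
            return -1;              /* row too short or malformed */
        p = end;
    }
    return (int)value;
}
```

Called with offset 0x0e, the first row of the virtual function dump gives 0x00 and the first row of the physical function dump gives 0x80, which matches the single-function versus multi-function reading above.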

Comment 8 Laszlo Ersek 2012-10-16 13:03:57 UTC
(In reply to comment #7)

> (It is interesting that vendor & device are both reported as 0xFFFF, but
> "header_common" does have overlays for those.)

The answer to that is in upstream Linux commit fd5b221b.

Comment 9 Laszlo Ersek 2012-10-16 14:22:23 UTC
Asked for guidance on xen-devel:
http://lists.xen.org/archives/html/xen-devel/2012-10/msg01217.html

Comment 10 Laszlo Ersek 2012-10-16 17:15:57 UTC
Created attachment 628292
(proposed dom0 patch) xen PV passthru: assign SR-IOV virtual functions to separate virtual slots

This patch solves the problem for me. Now dom0 prints:

pciback: vpci: 0000:03:10.0: assign to virtual slot 0
pciback: vpci: 0000:03:10.1: assign to virtual slot 1
pciback: vpci: 0000:03:10.2: assign to virtual slot 2
pciback: vpci: 0000:03:10.3: assign to virtual slot 3
pciback: vpci: 0000:03:10.4: assign to virtual slot 4
pciback: vpci: 0000:03:10.5: assign to virtual slot 5
pciback: vpci: 0000:03:10.6: assign to virtual slot 6

In the guest:

[root@pv-guest ~]# lspci
00:00.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:01.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:02.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:03.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:04.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:05.5 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:06.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

(slot == func is an artifact and neither required nor relevant)

They all get IP addresses over DHCP and are externally pingable.
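
The address pattern in the guest listing can be described with a small model. This is a sketch of the observed output only, not of the pciback implementation: `struct bdf` and `model_assign_separate_slots()` are hypothetical, and the rule "the i-th passed-through VF gets virtual slot i on bus 0 and keeps its physical function number" is inferred from the dom0 and guest output in this comment.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_SLOTS_PER_BUS 32    /* a PCI bus has 32 device (slot) numbers */

/* Hypothetical bus/slot/function triple, used for this model only. */
struct bdf {
    uint8_t bus;
    uint8_t slot;
    uint8_t func;
};

/*
 * Model of the post-patch layout: every device gets its own virtual slot
 * on guest bus 0, and the function number is carried over unchanged.
 * Returns the number of devices placed (at most PCI_SLOTS_PER_BUS).
 */
size_t model_assign_separate_slots(const struct bdf *phys, size_t n,
                                   struct bdf *guest)
{
    size_t i;

    for (i = 0; i < n && i < PCI_SLOTS_PER_BUS; i++) {
        guest[i].bus  = 0;
        guest[i].slot = (uint8_t)i;
        guest[i].func = phys[i].func;
    }
    return i;
}
```

Feeding it 0000:03:10.0 through 0000:03:10.6 reproduces the guest addresses 00:00.0 through 00:06.6 listed above.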

Comment 11 Laszlo Ersek 2012-10-16 17:37:37 UTC
Posted upstream patch:
http://lists.xen.org/archives/html/xen-devel/2012-10/msg01239.html

Comment 12 Don Dutile (Red Hat) 2012-10-16 17:46:42 UTC
(In reply to comment #11)
> Posted upstream patch:
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg01239.html

ack to proposed patch....

Comment 13 Laszlo Ersek 2012-10-17 10:08:03 UTC
Posted upstream v2 patch:
http://lists.xen.org/archives/html/xen-devel/2012-10/msg01291.html

Comment 14 Laszlo Ersek 2012-10-17 12:00:09 UTC
Created attachment 628739
(proposed dom0 patch, v2) xen PV passthru: assign SR-IOV virtual functions to separate virtual slots

Simplified patch as suggested on xen-devel for v1.

The second hunk from upstream v2 is not backported because we don't have <http://xenbits.xensource.com/hg/linux-2.6.18-xen.hg/rev/4b9f2293d750>.

Tested the v2 patch too (locally built dom0, same RHEL-5 domU as before), with results visible in comment 10.

The RHEL-6 PV guest doesn't support Xen pcifront. (Upstream Linux gained it with upstream commit 956a9202, which targeted 2.6.34.)

Comment 15 Laszlo Ersek 2012-10-17 12:24:30 UTC
(In reply to comment #10)

> (slot == func is an artifact and neither required nor relevant)

Note that the fact that this works is *not* blind luck. "scan_all_fns", used by pci_scan_slot(), is invariably 1 in the Xen kernel. See

include/asm-x86_64/mach-xen/asm/pci.h
include/asm-i386/mach-xen/asm/pci.h

/* On Xen we have to scan all functions since Xen hides bridges from
 * us.  If a bridge is at fn=0 and that slot has a multifunction
 * device, we won't find the additional devices without scanning all
 * functions. */
#undef pcibios_scan_all_fns
#define pcibios_scan_all_fns(a, b)	1

Comment 18 RHEL Program Management 2013-05-01 06:40:09 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.

Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.

Comment 21 RHEL Program Management 2013-06-07 14:30:48 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 25 Phillip Lougher 2013-06-11 17:51:08 UTC
Patch(es) available in kernel-2.6.18-360.el5
You can download this test kernel (or newer) from http://people.redhat.com/plougher/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 27 bfan 2013-07-16 06:20:50 UTC
Verified with:
host: 2.6.18-365.el5xen xen-3.0.3-144.el5
guest: 2.6.18-348.el5xen

[in host]
[root@dhcp-9-22 home]# xm cr pv.cfg pci=0000:03:10.0 pci=0000:03:10.1 pci=0000:03:10.2 pci=0000:03:10.3 pci=0000:03:10.4 pci=0000:03:10.5 pci=0000:03:10.6 pci=0000:03:10.7 pci=0000:03:11.0 pci=0000:03:11.1 pci=0000:03:11.2 pci=0000:03:11.3 pci=0000:03:11.4 pci=0000:03:11.5
Using config file "./pv.cfg".
Using <class 'grub.GrubConf.GrubConfigFile'> to parse /grub/menu.lst
Started domain xen-pv-64
[root@dhcp-9-22 home]# xm li
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     4947     8 r-----    201.2
xen-pv-64                                  3     1024     2 -b----      0.3
[root@dhcp-9-22 home]# xm pci-list 3
domain   bus   slot   func
0    3     10     0      
0    3     10     1      
0    3     10     2      
0    3     10     3      
0    3     10     4      
0    3     10     5      
0    3     10     6      
0    3     10     7      
0    3     11     0      
0    3     11     1      
0    3     11     2      
0    3     11     3      
0    3     11     4      
0    3     11     5      


[in guest]
[root@dhcp-8-166 ~]# lspci
00:00.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:01.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:02.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:03.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:04.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:05.5 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:06.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:07.7 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:08.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:09.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:0a.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:0b.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:0c.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
00:0d.5 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

So I am changing the status to VERIFIED.

Comment 29 errata-xmlrpc 2013-09-30 23:38:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1348.html