Bug 1040626

Summary: Error starting domain: internal error: missing IFLA_VF_INFO in netlink response
Product: Red Hat Enterprise Linux 7
Reporter: Alex Williamson <alex.williamson>
Component: libnl3
Assignee: Thomas Graf <tgraf>
Status: CLOSED CURRENTRELEASE
QA Contact: Desktop QE <desktop-qa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.0
CC: acathrow, alex.williamson, dallan, dcbw, dyuan, florin.stingaciu, gsun, honzhang, jiahu, laine, mzhan, rkhan, tgraf, thaller, tpelka, vbenes, xuzhang, ypei
Target Milestone: rc
Keywords: OtherQA, Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libnl3-3.2.21-5.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 09:54:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1067873

Description Alex Williamson 2013-12-11 17:51:14 UTC
Description of problem:

Got the following error attempting to start a domain:

Error starting domain: internal error: missing IFLA_VF_INFO in netlink response

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 100, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 122, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1220, in startup
    self._backend.create()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 69

This occurs after adding the following XML fragment to the VM (cold add):

    <interface type='hostdev' managed='yes'>
      <mac address='02:10:91:73:00:00'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </interface>

This same fragment works on F20.
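
For reference, one way to do the cold add (assuming the fragment above is saved to hostdev-net.xml and the guest is named rhel7-guest; both names are placeholders):

# virsh attach-device rhel7-guest hostdev-net.xml --config

Pasting the fragment into the <devices> section with 'virsh edit rhel7-guest' gives the same result.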

If I instead add the device with this fragment:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>

It works, but now I'm not able to have libvirt program the MAC address of the 82599 VF being assigned.
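
As a stopgap for the MAC problem, the VF MAC can be set by hand on the host before starting the guest with the plain <hostdev> fragment. A sketch, assuming the 82599 PF is named ens1f0 and the device at 01:10.0 is VF 0 (both are placeholders; adjust to the real PF name and VF index):

# ip link set dev ens1f0 vf 0 mac 02:10:91:73:00:00
# virsh start <guest>

This only approximates what libvirt does for <interface type='hostdev'>; it does not restore the original MAC when the guest shuts down.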

Version-Release number of selected component (if applicable):
libvirt-1.1.1-14.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Alex Williamson 2013-12-11 18:18:03 UTC
Note that the failing XML follows the example provided here:

http://libvirt.org/formatdomain.html#elementsNICSHostdev

Comment 7 Alex Williamson 2014-01-20 22:54:12 UTC
An 82599 supports 64 VFs per PF.  A binary search shows that the problem only occurs with 32 or more VFs, the same as the report in comment 5.
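
For anyone repeating the binary search, the VF count on the 82599 PF can be varied from the host. A sketch, assuming the PF is named ens1f0 and supports the sriov_numvfs sysfs interface (on most drivers the count has to be reset to 0 before a new value is accepted):

# echo 0 > /sys/class/net/ens1f0/device/sriov_numvfs
# echo 32 > /sys/class/net/ens1f0/device/sriov_numvfs
# lspci | grep -c 'Virtual Function'

With 31 or fewer VFs the domain starts; with 32 or more it fails with the IFLA_VF_INFO error.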

Comment 8 Alex Williamson 2014-01-20 22:58:47 UTC
Re-adding tgraf needinfo from comment 3

Comment 11 Alex Williamson 2014-01-20 23:31:23 UTC
Just to confirm: with the *4 fix that went into libnl 1.1.4, I can start the VM with the maximum of 63 VFs configured.
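
For anyone checking a RHEL 7 host, the quickest test for whether the corresponding libnl3 fix is installed is to compare the package against the Fixed In Version above:

# rpm -q libnl3

Builds older than libnl3-3.2.21-5.el7 are expected to still hit the truncated netlink response once 32 or more VFs are configured.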

Comment 16 Thomas Graf 2014-02-26 13:17:00 UTC
*** Bug 1069548 has been marked as a duplicate of this bug. ***

Comment 17 Yulong Pei 2014-02-27 07:55:11 UTC
This bug blocks SR-IOV testing of igb and bnx2x NICs, so I am setting the TestBlocker flag.

Comment 18 Xuesong Zhang 2014-03-06 11:14:25 UTC
Tested this bug with the latest build; the status can be changed to VERIFIED now.

package version:
libvirt-1.1.1-26.el7.x86_64
qemu-kvm-rhev-1.5.3-52.el7.x86_64
kernel-3.10.0-105.el7.x86_64
libnl3-3.2.21-5.el7.x86_64

Steps:
1. Find a host with an 82599 SR-IOV card and create the maximum number of VFs on it. Make sure the VF count is larger than 32.
# lspci|grep 82599|wc -l
128

2. Add the following XML to a shut-off guest.
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:0e:09:61'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x44' slot='0x1f' function='0x4'/>
      </source>
    </interface>

3. The guest can be started without any error.
# virsh start a
Domain a started

4. Check the dumpxml of the guest and make sure the interface is present (a host-side check of the programmed MAC is sketched after the dump).
# virsh dumpxml a|grep hostdev -A5
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:0e:09:61'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x44' slot='0x1f' function='0x4'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </interface>
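
Optionally, the programmed MAC can also be confirmed from the host side by listing the per-VF entries on the PF. A sketch (the PF name is a placeholder; substitute the actual 82599 PF netdev):

# ip link show dev <PF> | grep -i 52:54:00:0e:09:61

The matching "vf N MAC 52:54:00:0e:09:61" line is built from the kernel's per-VF attributes (IFLA_VF_INFO), i.e. the same data libvirt reads through libnl3.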

Comment 21 Ludek Smid 2014-06-13 09:54:48 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 23 florin.stingaciu 2015-02-23 21:42:39 UTC
(In reply to Ludek Smid from comment #21)
> This request was resolved in Red Hat Enterprise Linux 7.0.
> 
> Contact your manager or support representative in case you have further
> questions about the request.

I am experiencing this same issue while trying to boot a VM. I'm using a Mellanox ConnectX-3 configured with 8 VFs on a hypervisor running CentOS 7.

01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
01:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.3 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.4 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.5 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.6 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:00.7 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
01:01.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

Here are the relevant package versions:
libnl3-3.2.21-6.el7.x86_64
kernel-3.10.0-123.el7.x86_64
libvirt-1.1.1-29.el7_0.7.x86_64
qemu-kvm-1.5.3-60.el7_0.11.x86_64

The configuration for the PCI interface on the VM:
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:c0:34:2b'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

This configuration fails upon boot with the following error:
error: internal error: missing IFLA_VF_INFO in netlink response

If I define a PCI device in the following manner, the VM boots up fine and I can see the interface:
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev> 

One thing worth mentioning is that the VFs are on top of an InfiniBand interface. I've been troubleshooting this for a couple of days now without any luck. I've also brought this to the attention of the libvirt mailing list. Any help would be greatly appreciated.
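
One host-side check that may help narrow this down (a suggestion, not a confirmed diagnosis; ib0 is just a guess for the ConnectX-3 PF netdev name):

# ip link show dev ib0

If the output shows no "vf 0" through "vf 7" lines for the PF, the mlx4 driver is not reporting per-VF information (IFLA_VF_INFO) over netlink in this configuration, which could explain why <interface type='hostdev'> fails while plain <hostdev> passthrough works.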

Comment 24 Laine Stump 2015-02-23 22:33:35 UTC
Mellanox cards are a bit different from other SR-IOV cards, and their drivers are (or at least very recently were) under active development to make them more similar to standard SR-IOV. The problem you are experiencing may have the same symptoms as this BZ, but it is not the same problem.

Comment 25 florin.stingaciu 2015-02-23 22:45:18 UTC
(In reply to Laine Stump from comment #24)
> Mellanox cards are a bit different from other SR-IOV cards, and their
> drivers are (or at least very recently were) under active development to
> make them more similar to standard SR-IOV. The problem you are experiencing
> may have the same symptoms as this BZ, but it is not the same problem.

Should I open a new ticket or should I attempt to get in touch with Mellanox?

Comment 26 Laine Stump 2015-02-23 22:54:26 UTC
I would recommend direct communication with Mellanox.