Bug 1369768 - [RFE] [Neutron] SRIOV agent does not allow the co-existence of VF and PF on the same node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 10.0 (Newton)
Assignee: Brent Eagles
QA Contact: Yariv
URL:
Whiteboard:
Depends On:
Blocks: 1235009 1392584 1392585
Reported: 2016-08-24 11:28 UTC by Eran Kuris
Modified: 2017-06-28 15:31 UTC (History)

Fixed In Version: openstack-neutron-9.2.0-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1392584 (view as bug list)
Environment:
Last Closed: 2017-06-28 15:31:11 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Launchpad 1616442 None None None 2016-08-24 11:33:19 UTC
Launchpad 1639901 None None None 2016-11-07 19:22:26 UTC
OpenStack gerrit 360447 None None None 2016-11-07 15:40:01 UTC
OpenStack gerrit 442088 None None None 2017-03-09 12:17:43 UTC
Red Hat Product Errata RHBA-2017:1594 normal SHIPPED_LIVE openstack-neutron bug fix advisory 2017-06-28 19:13:28 UTC

Description Eran Kuris 2016-08-24 11:28:32 UTC
Description of problem:
When a neutron port of type direct-physical is assigned to a PF,
the VM boots and becomes active, but the SR-IOV agent log shows errors:

2016-08-24 13:47:11.702 1484 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Agent rpc_loop - iteration:391 started daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:367
2016-08-24 13:47:11.703 1484 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Agent out of sync with plugin!
2016-08-24 13:47:11.703 1484 DEBUG neutron.agent.linux.utils [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'show', 'enp5s0f1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:99
2016-08-24 13:47:11.715 1484 ERROR neutron.agent.linux.utils [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "enp5s0f1" does not exist.

2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Failed executing ip command
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib Traceback (most recent call last):
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/pci_lib.py", line 83, in get_assigned_macs
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib     out = self._as_root([], "link", ("show", self.dev_name))
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 95, in _as_root
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib     log_fail_as_error=self.log_fail_as_error)
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 104, in _execute
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib     log_fail_as_error=log_fail_as_error)
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 138, in execute
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib     raise RuntimeError(msg)
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib RuntimeError: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "enp5s0f1" does not exist.
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib 
2016-08-24 13:47:11.715 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib 

2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-028806d4-c857-4956-9198-230cb0de5e18 - - - - -] Error in agent loop. Devices info: {}
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent Traceback (most recent call last):
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py", line 380, in daemon_loop
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     device_info = self.scan_devices(devices, updated_devices_copy)
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 154, in wrapper
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     return f(*args, **kwargs)
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py", line 199, in scan_devices
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     curr_devices = self.eswitch_mgr.get_assigned_devices_info()
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/eswitch_manager.py", line 277, in get_assigned_devices_info
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     for device in embedded_switch.get_assigned_devices_info():
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/eswitch_manager.py", line 150, in get_assigned_devices_info
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     list(vf_to_pci_slot_mapping.keys()))
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/pci_lib.py", line 87, in get_assigned_macs
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent     reason=e)
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent IpCommandDeviceError: ip command failed on device enp5s0f1: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "enp5s0f1" does not exist.
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent 
2016-08-24 13:47:11.717 1484 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agen
Version-Release number of selected component (if applicable):
RHOS-10 
[root@controller1 ~(keystone_admin)]# rpm -qa |grep neutron 
python-neutron-lib-0.3.0-0.20160803002107.405f896.el7ost.noarch
openstack-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
puppet-neutron-9.1.0-0.20160813031056.7cf5e07.el7ost.noarch
python-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
python-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
python-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
openstack-neutron-ml2-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-metering-agent-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-openvswitch-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
python-neutronclient-5.0.0-0.20160812094704.ec20f7f.el7ost.noarch
openstack-neutron-common-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
[root@controller1 ~(keystone_admin)]# rpm -qa |grep nova
python-novaclient-5.0.1-0.20160724130722.6b11a1c.el7ost.noarch
openstack-nova-api-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
puppet-nova-9.1.0-0.20160813014843.b94f0a0.el7ost.noarch
openstack-nova-common-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-novncproxy-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-conductor-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
python-nova-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-scheduler-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-cert-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-console-14.0.0-0.20160817225441.04cef3b.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Deploy an SR-IOV setup and enable PF functionality. You can use this guide:
https://docs.google.com/document/d/1qQbJlLI1hSlE4uwKpmVd0BoGSDBd8Z0lTzx5itQ6WL0/edit#
2. Boot a VM and assign it to a PF.
3. Check the SR-IOV agent log on the compute node.

Actual results:
Errors appear in the SR-IOV agent log (see the traceback above).

Expected results:
No errors in the SR-IOV agent log.

Additional info:

Comment 2 Assaf Muller 2016-08-29 19:58:54 UTC
Following the upstream Launchpad bug, this is only relevant if we intend to allow users to create both VFs and PFs on the same node. Is that functionality relevant and required? If so this RHBZ should indeed block the RFE, otherwise it shouldn't, I'll leave that decision up to you.

Comment 3 Nir Yechiel 2016-08-30 13:30:16 UTC
(In reply to Assaf Muller from comment #2)
> Following the upstream Launchpad bug, this is only relevant if we intend to
> allow users to create both VFs and PFs on the same node. Is that
> functionality relevant and required? If so this RHBZ should indeed block the
> RFE, otherwise it shouldn't, I'll leave that decision up to you.

Eran/Assaf: I am not sure I am following. Once you enable PF, no VFs will be available to do SR-IOV. And once you enable a single VF, the PF won’t be available anymore in the card. So what do you mean by "allow users to create both VFs and PFs on the same node"? Are you referring to using VFs and PFs on the same host but from separate network adapters?

Thanks,
Nir

Comment 4 Eran Kuris 2016-08-30 14:08:50 UTC
Yes Nir this is exactly what I meant. 
Also, there is a scenario where a network adapter is already configured with VFs, and later, after disassociating all ports from the VFs, I want to switch it to work as a PF.

Comment 5 Nir Yechiel 2016-08-30 14:16:01 UTC
(In reply to Eran Kuris from comment #4)
> Yes Nir this is exactly what I meant. 
> Also there is scenario that  Network adapter is already configured with VFs
> and after  that I want to change it to work with all PF after I disassociate
> all ports from VFs.

And just to understand this bug - you are saying that this is working, but the agent log shows errors?

Thanks,
Nir

Comment 6 Eran Kuris 2016-08-30 14:19:11 UTC
(In reply to Nir Yechiel from comment #5)
> (In reply to Eran Kuris from comment #4)
> > Yes Nir this is exactly what I meant. 
> > Also there is scenario that  Network adapter is already configured with VFs
> > and after  that I want to change it to work with all PF after I disassociate
> > all ports from VFs.
> 
> And just to understand this bug - you are saying that this is working, but
> the agent log shows errors?
> 
> Thanks,
> Nir

You are correct

Comment 7 Nir Yechiel 2016-08-30 14:24:51 UTC
Thanks for the clarifications, Eran. @Assaf, the use case is valid and we should look at the upstream bug. But since the functionality is there, this should not be treated as a high-priority bug or block the RHOSP 10 RFE, IMO.

Thanks,
Nir

Comment 8 Ricardo Noriega 2016-09-01 11:09:52 UTC
Hello guys,

I'm seeing the same stack trace from the SR-IOV agent, but I think there is indeed an issue.

Let's say a compute node has two network adapters (em1 and em2), each with its corresponding VFs configured. If you start a VM with a direct-physical binding, it will take one of these NICs. At that point, the SR-IOV agent starts to show those ERROR messages, with the "device dictionary" completely empty.

As a consequence, you cannot allocate VMs with VFs even though there is still another NIC available.

IMHO, until this issue is fixed, coexistence of VFs and PFs won't be usable.

@Nir, could you reconsider the priority of this issue? Thanks!
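
The failure mode described above can be sketched as follows. This is an illustration only, not the upstream patch: `device_exists`, `scan_assigned_macs`, and the `query_macs` callback are hypothetical names, and the only assumption about the host is that network devices it can see appear under `/sys/class/net`. The idea is that a PF handed to a guest disappears from the host, so a per-device guard lets the scan continue with the remaining NICs instead of aborting with an empty device dictionary.

```python
import os


def device_exists(dev_name, sysfs_root="/sys/class/net"):
    """Return True if the host still sees a network device with this name."""
    return os.path.isdir(os.path.join(sysfs_root, dev_name))


def scan_assigned_macs(dev_names, query_macs, sysfs_root="/sys/class/net"):
    """Collect per-device results, skipping devices that are gone from the host.

    query_macs is a hypothetical callback standing in for whatever runs
    `ip link show <dev>` and parses the VF MAC addresses.
    """
    results = {}
    skipped = []
    for dev in dev_names:
        if not device_exists(dev, sysfs_root):
            # A PF passed through to a guest (direct-physical) no longer
            # exists on the host; skip it rather than fail the whole scan.
            skipped.append(dev)
            continue
        results[dev] = query_macs(dev)
    return results, skipped
```

With em1 passed through as a PF and em2 still on the host, such a scan would return results for em2 only and report em1 as skipped, rather than raising on the first missing device.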

Comment 9 Assaf Muller 2016-09-01 12:21:25 UTC
(In reply to Ricardo Noriega from comment #8)
> Hello guys,
> 
> I'm seeing the same stack trace from the SR-IOV agent, but I think there is
> indeed an issue.
> 
> Let's say a compute node has two network adapters (em1 and em2), each with
> its corresponding VFs configured. If you start a VM with a direct-physical
> binding, it will take one of these NICs. At that point, the SR-IOV agent
> starts to show those ERROR messages, with the "device dictionary" completely
> empty.
> 
> As a consequence, you cannot allocate VMs with VFs even though there is
> still another NIC available.
> 
> IMHO, until this issue is fixed, coexistence of VFs and PFs won't be usable.
> 
> @Nir, could you reconsider the priority of this issue? Thanks!

You'll have to excuse my cheekiness, but I'm going to steal your comment and post it in the Launchpad bug.

Comment 10 Ricardo Noriega 2016-09-01 13:00:09 UTC
(In reply to Assaf Muller from comment #9)
> (In reply to Ricardo Noriega from comment #8)
> > Hello guys,
> > 
> > I'm seeing the same stack trace from the SR-IOV agent, but I think there is
> > indeed an issue.
> > 
> > Let's say a compute node has two network adapters (em1 and em2), each with
> > its corresponding VFs configured. If you start a VM with a direct-physical
> > binding, it will take one of these NICs. At that point, the SR-IOV agent
> > starts to show those ERROR messages, with the "device dictionary" completely
> > empty.
> > 
> > As a consequence, you cannot allocate VMs with VFs even though there is
> > still another NIC available.
> > 
> > IMHO, until this issue is fixed, coexistence of VFs and PFs won't be usable.
> > 
> > @Nir, could you reconsider the priority of this issue? Thanks!
> 
> You'll have to excuse my cheekiness, but I'm going to steal your comment and
> post it in the Launchpad bug.

Your cheekiness is excused. No prob.

Comment 12 Brent Eagles 2016-10-14 15:51:27 UTC
With:
https://review.openstack.org/#/c/360447

and a modified version (as per comments provided in review) of

https://review.openstack.org/#/c/377781/

and a udev rule that calls a script to set the VF count, I got really close to co-existence. In the absence of NetworkManager, I needed to add a script, called from /proc/sys/kernel/hotplug, to set the SR-IOV device's link status to "up" before I could allocate VFs. At the moment, I can flip back and forth between using a device as a PF and as VFs. With both PFs enabled, I can also use one for VFs and the other as a PF. We'll have to examine whether this is the route to take.

Obviously, adding supporting scripting, etc. is outside the scope of this bug and beyond what we can expect from a user, but I mention it to outline a.) how I tested the patches and b.) what we might need to add to tripleo to support this functionality.

Now all we need to do is to help these patches along to get merged u/s.
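
For readers unfamiliar with how a VF count is set, here is a minimal sketch of what such a helper script might do. It is not the script used in the testing above: `set_vf_count` and `_write` are hypothetical names, and the sketch assumes the standard Linux sysfs attribute `/sys/class/net/<dev>/device/sriov_numvfs` and root privileges on a real host. Setting the count to 0 removes the VFs, which matches the "use the device as a PF" side of the flip described above.

```python
import os


def set_vf_count(dev_name, count, sysfs_root="/sys/class/net"):
    """Write the requested VF count to the device's sriov_numvfs attribute."""
    path = os.path.join(sysfs_root, dev_name, "device", "sriov_numvfs")
    with open(path) as f:
        current = int(f.read().strip() or 0)
    if current == count:
        return current
    if current != 0 and count != 0:
        # The kernel rejects changing one non-zero VF count to another,
        # so reset to 0 first.
        _write(path, 0)
    _write(path, count)
    return count


def _write(path, value):
    """Overwrite a sysfs-style attribute file with the given value."""
    with open(path, "w") as f:
        f.write(str(value))
```

The `sysfs_root` parameter exists only so the function can be exercised against a scratch directory; on a real compute node the default would be used.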

Comment 14 Assaf Muller 2016-10-17 14:38:55 UTC
I'm moving this back from 11 to 10.z. If we find that we cannot backport the fix to 10 for whatever reason we'll think about this again.

Comment 24 Jon Schlueter 2017-03-09 12:17:44 UTC
A stable/newton patch has been proposed upstream: https://review.openstack.org/#/c/442088

Comment 29 Ziv Greenberg 2017-06-20 14:34:13 UTC
Verified in puddle 2017-06-15.5.
The /var/log/neutron/sriov-nic-agent.log is clean, with no errors.
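
A check of this kind can be sketched as below. This is an assumption about how one might scan the log, not the procedure QA actually used: `find_error_lines` is a hypothetical helper, and it relies on the line layout seen in the excerpt in this report (date, time, PID, level, logger).

```python
def find_error_lines(log_path):
    """Return log lines whose level field is ERROR."""
    errors = []
    with open(log_path, errors="replace") as log:
        for line in log:
            # Layout seen in this report: "<date> <time> <pid> <LEVEL> <logger> ..."
            fields = line.split(None, 4)
            if len(fields) >= 4 and fields[3] == "ERROR":
                errors.append(line.rstrip("\n"))
    return errors
```

An empty list for /var/log/neutron/sriov-nic-agent.log after booting a VM on a PF would correspond to the "clean" result reported here.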

Comment 31 errata-xmlrpc 2017-06-28 15:31:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1594

