1438828 – Switching from PF to VF usage randomly causes SR-IOV VF use to stop working

Bug 1438828 - Switching from PF to VF usage randomly causes SR-IOV VF use to stop working

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	11.0 (Ocata)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	async
Target Release:	11.0 (Ocata)
Assignee:	Brent Eagles
QA Contact:	Eran Kuris
Docs Contact:
URL:
Whiteboard:
Depends On:	1370047
Blocks:	1479029

Reported:	2017-04-04 13:46 UTC by Eran Kuris
Modified:	2017-09-13 13:24 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Release Note
Doc Text:	To use SR-IOV physical function (PF) and virtual functions (VFs) in the same environment, add the 'nm_controlled' and 'hotplug' parameters to the SR-IOV PF configuration in your compute.yaml heat template:

 - type: interface
   name: nic6
   use_dhcp: false
   nm_controlled: true
   hotplug: true

When an OpenStack instance that was using a direct physical function is destroyed, the PCI device is released back to OpenStack and the host system. The root PCI device is then configured to support the number of virtual functions configured during deployment. This process involves the coordination of the host operating system, NetworkManager, and OpenStack, and may require a short interval of time before the virtual functions are available for use.
Clone Of:
Environment:
Last Closed:	2017-09-13 12:58:47 UTC
Target Upstream Version:
Embargoed:

Attachments
log (24.80 KB, text/plain) 2017-04-04 13:46 UTC, Eran Kuris
setup config files (4.86 KB, application/x-gzip) 2017-04-27 12:55 UTC, Eran Kuris

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1679703	0	None	None	None	2017-04-04 13:54:38 UTC

Description Eran Kuris 2017-04-04 13:46:21 UTC

Description of problem:

I booted a VM with a direct-physical port (the entire PF is associated with the instance).
After deleting the instance, I expected the PF to be available and online again.
Instead, when I try to boot an instance with a direct port (VF),
I get this error message:

VM in error state:
fault | {"message": "Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 102fde1b-22d3-4b05-8246-0f1af520455a. Last exception: internal error: Unable to configure VF 4 of PF 'p1p1' because the PF is not online. Please change host network config", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 524, in build_instances |     filter_properties, instances[0].uuid)  

[root@compute-0 ~]# ifconfig | grep p1p1   ---> PF is not online
It is impossible to create an instance with a direct port (VF).
sosreport:
https://drive.google.com/drive/folders/0B_izhJVSkOTDdnV3SmtNWnUwYUk
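
The "PF is not online" condition can also be checked without ifconfig. The following is a minimal sketch, not part of the original report: it assumes a Linux host where /sys/class/net/<interface>/flags holds the interface flags in hexadecimal and that bit 0x1 (IFF_UP) reflects the administrative "up" state. The function name pf_is_online and the sysfs_root parameter are hypothetical.

```python
from pathlib import Path

IFF_UP = 0x1  # administrative "up" bit in the kernel interface flags


def pf_is_online(ifname, sysfs_root="/sys/class/net"):
    """Return True if the interface exists and has the IFF_UP flag set."""
    flags_file = Path(sysfs_root) / ifname / "flags"
    try:
        flags = int(flags_file.read_text().strip(), 16)
    except (OSError, ValueError):
        # Interface missing or flags unreadable: treat it as not online.
        return False
    return bool(flags & IFF_UP)


if __name__ == "__main__":
    # 'p1p1' is the PF name taken from the error message in this report.
    print("p1p1 online:", pf_is_online("p1p1"))
```

A False result here would correspond to the empty ifconfig output shown above.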

Version-Release number of selected component (if applicable):
[root@controller-0 ~]# rpm -qa |grep neutron 
openstack-neutron-10.0.0-11.el7ost.noarch
python-neutron-lib-1.1.0-1.el7ost.noarch
openstack-neutron-sriov-nic-agent-10.0.0-11.el7ost.noarch
openstack-neutron-ml2-10.0.0-11.el7ost.noarch
python-neutronclient-6.1.0-1.el7ost.noarch
openstack-neutron-common-10.0.0-11.el7ost.noarch
openstack-neutron-openvswitch-10.0.0-11.el7ost.noarch
python-neutron-10.0.0-11.el7ost.noarch
puppet-neutron-10.3.0-2.el7ost.noarch
[root@controller-0 ~]# rpm -qa |grep nova
openstack-nova-common-15.0.2-1.el7ost.noarch
openstack-nova-cert-15.0.2-1.el7ost.noarch
puppet-nova-10.4.0-3.el7ost.noarch
openstack-nova-compute-15.0.2-1.el7ost.noarch
openstack-nova-placement-api-15.0.2-1.el7ost.noarch
openstack-nova-console-15.0.2-1.el7ost.noarch
openstack-nova-novncproxy-15.0.2-1.el7ost.noarch
openstack-nova-conductor-15.0.2-1.el7ost.noarch
openstack-nova-scheduler-15.0.2-1.el7ost.noarch
python-nova-15.0.2-1.el7ost.noarch
openstack-nova-api-15.0.2-1.el7ost.noarch
python-novaclient-7.1.0-1.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an SR-IOV setup with PF support.
2. Boot an instance with a direct-physical port.
3. Delete the VM that is associated with the PF.
4. Boot an instance with a direct port (VF).
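
Steps 2 to 4 can be sketched as openstack CLI invocations. This is an illustration under assumptions, not the exact commands used for this report: it assumes an openstack client that supports "--vnic-type direct-physical", and the helper names, network, flavor, image, and instance names are all hypothetical. The helpers only build argument lists; running them (for example with subprocess.run) is left to the reader.

```python
def port_create_cmd(network, vnic_type, name):
    """Build the command that creates a port of the given vNIC type."""
    return ["openstack", "port", "create",
            "--network", network, "--vnic-type", vnic_type, name]


def server_create_cmd(flavor, image, port_id, name):
    """Build the command that boots an instance attached to an existing port."""
    return ["openstack", "server", "create",
            "--flavor", flavor, "--image", image,
            "--nic", "port-id=" + port_id, name]


def server_delete_cmd(name):
    """Build the command that deletes an instance and waits for completion."""
    return ["openstack", "server", "delete", "--wait", name]


if __name__ == "__main__":
    # Hypothetical names; port IDs come from the output of 'port create'.
    print(port_create_cmd("sriov-net", "direct-physical", "pf-port"))
    print(server_create_cmd("m1.small", "rhel7", "<pf-port-id>", "pf-vm"))
    print(server_delete_cmd("pf-vm"))
    print(port_create_cmd("sriov-net", "direct", "vf-port"))
    print(server_create_cmd("m1.small", "rhel7", "<vf-port-id>", "vf-vm"))
```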

Expected results:
The VM with the direct port should boot. The PF should be released.

Additional info:
Workaround - systemctl restart network

Comment 1 Eran Kuris 2017-04-04 13:46:59 UTC

Created attachment 1268662 [details]
log

Comment 4 Brent Eagles 2017-04-24 18:10:41 UTC

I will also need the os-net-config version used for this test to ensure you have the required patches. Also note that you need to add the following parameters to the SR-IOV physical function configuration in your heat templates:

 nm_controlled = true
 hotplug = true

Comment 5 Eran Kuris 2017-04-24 20:11:37 UTC

(In reply to Brent Eagles from comment #4)
> I will also need the os-net-config version used for this test to ensure you
> have the required patches. Also, note that you need to add the following
> parameters to the SR-IOV physical function configuration in your heat
> templates:
> 
>  nm_controlled = true
>  hotplug = true

Is there any chance of getting the "os-net-config" version from the sosreport?
As for the new parameters, I will try to deploy a setup and check. I will do my best to do it soon.
Please provide the specific path of the config file where I need to add those parameters, so I do not miss anything.
Thanks.

Comment 6 Brent Eagles 2017-04-26 13:47:35 UTC

@Eran, the parameters are applied to the interface configuration in the network templates. For example, in your version of tripleo-heat-templates/network/config/multiple-nics/compute.yaml, you need entries like the following for the PF interfaces:

 - type: interface
   name: nic6
   use_dhcp: false
   nm_controlled: true
   hotplug: true

NetworkManager will take care of bringing the interface back up. The exact path of the file that holds these parameters depends on the test environment.
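
A small check along these lines could catch a PF entry that is missing either flag before deployment. It is only a sketch: the function name missing_pf_flags is hypothetical, and it assumes the interface entry has already been parsed from the template into a Python dict (for example with a YAML loader).

```python
REQUIRED_PF_FLAGS = ("nm_controlled", "hotplug")


def missing_pf_flags(interface):
    """Return the required flags that are not set to True in an interface entry."""
    return [key for key in REQUIRED_PF_FLAGS if interface.get(key) is not True]


if __name__ == "__main__":
    # Mirrors the nic6 entry above, as it would look after parsing.
    nic6 = {
        "type": "interface",
        "name": "nic6",
        "use_dhcp": False,
        "nm_controlled": True,
        "hotplug": True,
    }
    print(missing_pf_flags(nic6))
    print(missing_pf_flags({"type": "interface", "name": "nic6"}))
```

An empty list means both flags are present and enabled.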

With respect to os-net-config - aren't the versions of all of the installed packages kept somewhere?

Comment 7 Eran Kuris 2017-04-26 13:53:49 UTC

(In reply to Brent Eagles from comment #6)
> @Eran, The parameters are applied to the interface configuration on the
> network templates. For example, in your version of
> tripleo-heat-templates/network/config/multiple-nics/compute.yaml, you need
> entries for the PF interfaces 
> 
>  -type: interface
>   name: nic6
>   use_dhcp: false
>   nm_controlled: true
>   hotplug: true
> 
> Network manager will take care of bringing the interface back up. The exact
> path of the relevant file where these parameters would be would depend on
> the test environment. 
> 
> With respect to os-net-config - aren't the versions of all of the installed
> packages kept somewhere?
 
Hmm, I don't know if they are kept somewhere.

Comment 8 Eran Kuris 2017-04-27 12:47:49 UTC

[root@compute-0 ~]# rpm -qa | grep os-net
os-net-config-6.0.0-3.el7ost.noarch

When I deployed my setup with

  nm_controlled: true
  hotplug: true

I did not succeed in booting a VF instance.
I got this error:
{"message": "Build of instance 594cacae-6bdd-45b7-ae1b-4102b1d86cce aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1780, in _do_build_and_run_instance
    filter_properties)
  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1990, in _build_and_run_instance

Comment 9 Eran Kuris 2017-04-27 12:55:49 UTC

Created attachment 1274630 [details]
setup config files

Comment 13 Brent Eagles 2017-05-09 18:34:06 UTC

This seems to be timing-dependent. I was able to create VMs with PF ports, delete them, and create VMs with VF ports on this system most of the time. The failure occurred only when I deleted a VM with a PF port and created the VF-based one shortly (within 30 seconds?) thereafter. I'll study further to see if I can find out where the race(s) lie.
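
Until the race is understood, a test script could wait for the PF to come back before booting the VF instance. The following is a minimal polling sketch, not a fix: the function names, the 60-second timeout, and the 2-second interval are assumptions, as is the use of /sys/class/net/<interface>/flags (bit 0x1 for "up") and /sys/class/net/<interface>/device/sriov_numvfs to decide that the PF is ready.

```python
import time
from pathlib import Path


def pf_ready(ifname, expected_vfs, sysfs_root="/sys/class/net"):
    """Return True if the PF is administratively up and exposes expected_vfs VFs."""
    base = Path(sysfs_root) / ifname
    try:
        flags = int((base / "flags").read_text().strip(), 16)
        numvfs = int((base / "device" / "sriov_numvfs").read_text().strip())
    except (OSError, ValueError):
        # Interface not yet returned to the host, or sysfs not readable.
        return False
    return bool(flags & 0x1) and numvfs == expected_vfs


def wait_until(predicate, timeout=60.0, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # 'p1p1' comes from this report; the VF count of 5 is a placeholder.
    ok = wait_until(lambda: pf_ready("p1p1", expected_vfs=5))
    print("PF ready" if ok else "PF still not ready after timeout")
```

Polling hides the symptom in tests only; it does not address whichever component is slow to restore the PF.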
