Bug 1370047

Summary: Cannot revert from neutron-port PF to VF because the sriov_numvfs parameter is set to "0"
Product: Red Hat OpenStack
Reporter: Eran Kuris <ekuris>
Component: openstack-tripleo
Assignee: James Slagle <jslagle>
Status: CLOSED CURRENTRELEASE
QA Contact: Arik Chernetsky <achernet>
Severity: high
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: atelang, beagles, berrange, dasmith, eglynn, ekuris, fbaudin, jdonohue, jschluet, kchamart, ksundara, mburns, nyechiel, oblaut, rhel-osp-director-maint, rnoriega, sbauza, sferdjao, sgordon, skramaja, srevivo, vchundur, vromanso, yrachman
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-07-14 15:23:32 UTC
Type: Bug
Bug Blocks: 1233921, 1401639, 1401640, 1438828    
Attachments:
pf_test from vladikr env. (flags: none)

Description Eran Kuris 2016-08-25 07:23:12 UTC
Description of problem:
When SR-IOV PFs are managed as Neutron ports, the
/sys/class/net/enp5s0f1/device/sriov_numvfs parameter is set to "0".
After I delete the PF port in order to switch to an SR-IOV direct port (VF), I cannot boot a VM because sriov_numvfs is still "0".
Version-Release number of selected component (if applicable):
 rpm -qa |grep neutron
python-neutron-lib-0.3.0-0.20160803002107.405f896.el7ost.noarch
openstack-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
puppet-neutron-9.1.0-0.20160813031056.7cf5e07.el7ost.noarch
python-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
python-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
python-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
openstack-neutron-ml2-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-metering-agent-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-openvswitch-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
python-neutronclient-5.0.0-0.20160812094704.ec20f7f.el7ost.noarch
openstack-neutron-common-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
[root@controller1 ~(keystone_admin)]# rpm -qa |grep nova
python-novaclient-5.0.1-0.20160724130722.6b11a1c.el7ost.noarch
openstack-nova-api-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
puppet-nova-9.1.0-0.20160813014843.b94f0a0.el7ost.noarch
openstack-nova-common-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-novncproxy-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-conductor-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
python-nova-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-scheduler-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-cert-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-console-14.0.0-0.20160817225441.04cef3b.el7ost.noarch


How reproducible:
always 

Steps to Reproduce:
1. Set up an SR-IOV environment with PF support: https://docs.google.com/document/d/1qQbJlLI1hSlE4uwKpmVd0BoGSDBd8Z0lTzx5itQ6WL0/edit#
2. Boot a VM that is assigned to the PF (Neutron port of type direct-physical). It should boot successfully.
3. Check cat /sys/class/net/enp5s0f1/device/sriov_numvfs (it reads 0).
4. Delete the VM and check sriov_numvfs again (it still reads 0).
5. I expect sriov_numvfs to return to the default value that was configured.
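
The checks in steps 3 and 4 could be scripted roughly as follows. This is only a sketch: the function name vf_counts and the SYSFS_NET override are hypothetical (the override exists so the function can be pointed at a scratch directory instead of the real /sys/class/net), and sriov_totalvfs is read only when the device exposes it.

```shell
# Hypothetical helper: print the current and maximum VF counts of a PF.
# SYSFS_NET is an assumed override; it defaults to the real sysfs location.
vf_counts() {
    local pf="$1"
    local dev="${SYSFS_NET:-/sys/class/net}/${pf}/device"

    # Without sriov_numvfs the device does not expose SR-IOV controls.
    if [ ! -r "${dev}/sriov_numvfs" ]; then
        echo "${pf}: sriov_numvfs not readable" >&2
        return 1
    fi

    local current
    local total="unknown"
    current=$(cat "${dev}/sriov_numvfs")
    if [ -r "${dev}/sriov_totalvfs" ]; then
        total=$(cat "${dev}/sriov_totalvfs")
    fi
    echo "${pf}: numvfs=${current} totalvfs=${total}"
}
```

Running "vf_counts enp5s0f1" before booting the VM, while it is running, and after deleting it would show whether the count ever returns to the configured value.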

Actual results:


Expected results:


Additional info:

Comment 2 Brent Eagles 2016-08-29 19:47:32 UTC
Eran, was this on a system where you were modifying the VF count to 0, or is the sequence
VF >0 
allocate PF
deallocate PF
VF == 0
?

Comment 3 Eran Kuris 2016-08-30 07:00:03 UTC
Brent, I did not change the VF count to 0 manually.
The scenario is:
VF >0 
allocate PF
deallocate PF
VF == 0

Comment 4 Eran Kuris 2016-08-30 07:46:44 UTC
Based on this bug, I think the RFE https://bugzilla.redhat.com/show_bug.cgi?id=1233921
is blocked, because there does not appear to be any dynamic switching between PF and VF.

Comment 5 Brent Eagles 2016-08-30 13:54:33 UTC
When the tripleo SR-IOV support is completed, this should be taken care of because there will be a script that runs when the interface is brought back up and resets the VFs to the expected configured value. Of course, this is contingent on there being an ifup on that PF.

@Eran, can you check the up/down status of the PF once it's been "released" if you get the chance?

Comment 6 Eran Kuris 2016-08-31 09:11:16 UTC
(In reply to Brent Eagles from comment #5)
> When the tripleo SR-IOV support is completed, this should be taken care of
> because there will be a script that runs when the interface is brought back
> up and resets the VFs to the expected configured value. Of course, this is
> contingent on there being an ifup on that PF.
> 
> @Eran, can you check the up/down status of the PF once it's been "released"
> if you get the chance?

Yes, Brent, it was released after I deleted the VM associated with the PF.

Comment 7 Brent Eagles 2016-08-31 16:05:19 UTC
Actually, that's not what I meant to ask. I was referring to whether the interface was up or down. If it is not set to "up" then we cannot rely on the ifup-local hook that we install to resolve the VF count issue. If it is down, can you try bringing it up and seeing if the VFs come back or not.
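
One way to answer the up/down question is to read the interface state from sysfs. The sketch below assumes a hypothetical helper name (link_state) and a hypothetical SYSFS_NET override for testing. It reports the administrative state (the IFF_UP bit, 0x1, of the flags file) separately from the operational state in operstate, since the two can differ.

```shell
# Hypothetical helper: report administrative and operational link state.
# SYSFS_NET is an assumed override; it defaults to the real sysfs location.
link_state() {
    local pf="$1"
    local base="${SYSFS_NET:-/sys/class/net}/${pf}"

    if [ ! -r "${base}/flags" ]; then
        echo "${pf}: no such interface" >&2
        return 1
    fi

    local flags
    local admin="down"
    flags=$(cat "${base}/flags")      # hexadecimal, e.g. 0x1003
    # IFF_UP is bit 0x1 of the interface flags.
    if [ $(( flags & 1 )) -ne 0 ]; then
        admin="up"
    fi
    echo "${pf}: admin=${admin} oper=$(cat "${base}/operstate")"
}
```

If "link_state enp5s0f1" shows admin=down after the PF is released, a hook that only runs on ifup would not have had a chance to restore the VFs.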

Comment 8 Eran Kuris 2016-09-01 05:36:45 UTC
(In reply to Brent Eagles from comment #7)
> Actually, that's not what I meant to ask. I was referring to whether the
> interface was up or down. If it is not set to "up" then we cannot rely on
> the ifup-local hook that we install to resolve the VF count issue. If it is
> down, can you try bringing it up and seeing if the VFs come back or not.

It is set to up after I release the PF.

Comment 9 Brent Eagles 2016-09-01 15:40:21 UTC
Okay thanks. So this means that the persistent VF thing added by tripleo isn't going to help. Vladik, is this something that nova can do when the pci device is released? Alternatively, we'll have to get the SR-IOV agent involved on the compute node.

Comment 10 Vladik Romanovsky 2016-09-02 14:26:59 UTC
Created attachment 1197223
pf_test from vladikr env.

Comment 11 Vladik Romanovsky 2016-09-02 14:33:10 UTC
Could we please reproduce the bug, but this time enable the VFs using the max_vfs parameter rather than the sysfs tunable?

Unload the ixgbe driver (or whichever driver the card uses) and load it again:
modprobe ixgbe max_vfs=X

I can't reproduce this problem with my card. I've attached output from my server.
Regardless, it looks like this is a VFIO or a libvirt issue (hostdev managed=True), rather than nova. In Nova, we are relying on this behaviour as well.

I think we should try reproducing with max_vfs and if it doesn't work we should try consulting Alex Williamson from the kvm team.

Vladik

Comment 12 Ricardo Noriega 2016-09-02 15:19:06 UTC
This is my takeaway from the testing environment for Telefonica.

- sysfs: Using a command like "echo 4 > /sys/class/net/em1/device/sriov_numvfs" does not persist the number of VFs per network adapter. So after allocating/deallocating a PF, the previously configured VFs are gone.

- ixgbe max_vfs parameter: works well. Running the same kind of test as before, the VFs come back. However, the parent interface is left with its link state DOWN, which makes its children (VFs) unavailable for allocation.

Ricky
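
Because the sysfs value does not survive the PF allocate/deallocate cycle, something has to write it again afterwards. A minimal sketch of such a restore step follows, with a hypothetical function name (restore_vfs) and a hypothetical SYSFS_NET override for testing. It relies on the kernel rule that sriov_numvfs cannot be changed from one non-zero value to another directly, so it writes 0 first when VFs are already enabled. It is not the hook that tripleo installs.

```shell
# Hypothetical helper: set the VF count of a PF back to a wanted value.
# SYSFS_NET is an assumed override; it defaults to the real sysfs location.
restore_vfs() {
    local pf="$1"
    local want="$2"
    local f="${SYSFS_NET:-/sys/class/net}/${pf}/device/sriov_numvfs"

    if [ ! -w "$f" ]; then
        echo "${pf}: cannot write ${f}" >&2
        return 1
    fi

    local have
    have=$(cat "$f")
    if [ "$have" = "$want" ]; then
        return 0                      # already at the wanted count
    fi

    # A change between two non-zero counts is rejected by the kernel,
    # so disable the existing VFs before enabling the new count.
    if [ "$have" != "0" ]; then
        echo 0 > "$f" || return 1
    fi
    echo "$want" > "$f"
}
```

For the example above, "restore_vfs em1 4" run as root after the PF is released would bring the four VFs back, provided the PF driver is bound again at that point.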

Comment 14 Ricardo Noriega 2016-09-02 15:38:13 UTC
Latest news!

As per Vladik's instructions, with NetworkManager enabled, the parent interface comes back in the UP state. So the combination of the ixgbe max_vfs parameter and the NetworkManager service does the job.

Ricky

Comment 22 Eran Kuris 2016-11-28 12:28:19 UTC
It still existed as of my last check.

Comment 24 Stephen Gordon 2017-02-23 20:13:32 UTC
(In reply to Eran Kuris from comment #22)
> It still existed as of my last check.

I think we were really expecting you to be a little more expansive here as to the ask, since the upstream comment Nir referred to above said, in effect, "yes, that's right, and this is expected behavior":

https://bugs.launchpad.net/nova/+bug/1616769/comments/1

Comment 28 Brent Eagles 2017-07-14 15:23:32 UTC
This was fixed via changes to tripleo/director. The user needs to enable network management using "nm_controlled: true" on RHEL or "hotplug: true" on CentOS on the relevant interfaces in the network environment files.

Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1392584
https://bugzilla.redhat.com/show_bug.cgi?id=1392585
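
To confirm on a compute node that the setting took effect, the rendered interface configuration can be inspected. The sketch below assumes, without confirmation from this bug, that "nm_controlled: true" and "hotplug: true" end up as the standard NM_CONTROLLED and HOTPLUG keys in the interface's ifcfg file. The helper name ifcfg_get is hypothetical.

```shell
# Hypothetical helper: print the value of KEY from an ifcfg-style file.
# Assumes simple KEY=value lines, optionally with double-quoted values.
ifcfg_get() {
    local file="$1"
    local key="$2"

    if [ ! -r "$file" ]; then
        echo "cannot read ${file}" >&2
        return 1
    fi

    # Use the last assignment of KEY and strip optional double quotes.
    sed -n "s/^${key}=//p" "$file" | tail -n 1 | tr -d '"'
}
```

For example, "ifcfg_get /etc/sysconfig/network-scripts/ifcfg-enp5s0f1 NM_CONTROLLED" would be expected to print "yes" on a RHEL node where the option was applied, if the assumption about the rendered key holds.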