Bug 1589796
| Summary: | OSP13-sriov-HA: After restart to nova compute docker unable to boot VF/PF instance on one of computesriov nodes | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> | |
| Component: | documentation | Assignee: | Irina <igallagh> | |
| Status: | CLOSED DUPLICATE | QA Contact: | RHOS Documentation Team <rhos-docs> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 13.0 (Queens) | CC: | amuller, bcafarel, beagles, bhaley, chrisw, dalvarez, dasmith, dcadzow, eglynn, ekuris, jhakimra, jraju, kchamart, lyarwood, mariel, mbooth, oblaut, pkesavar, pveiga, sbauza, sclewis, sgordon, skramaja, smooney, srevivo, stephenfin, tfreger, vromanso | |
| Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | update / upgrade unsupported for the Release Candidate | Story Points: | --- | |
| Clone Of: | ||||
| : | 1590716 1593290 (view as bug list) | Environment: | ||
| Last Closed: | 2020-09-29 09:21:00 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1878201 | |||
| Bug Blocks: | 1590716, 1593290, 1615656 | |||
We'll need to retest to confirm, but there may be an issue with libvirt's list of known devices on the compute node. I reattempted creating an instance that used the direct port and it failed the same way as in the report. Looking at Nova's scheduler logs, Nova thinks there are no nodes available that can service the PCI request. I then logged into the problem compute node, restarted the nova_compute and nova_libvirt containers, and reattempted creating the instance. This second attempt, after restarting the nova containers, succeeded. We should have someone from the Compute or NFV DFGs look into this.

Can you provide logs from the compute node, especially the Nova log in debug mode? That way we will be able to verify that:
1) nova.conf is correctly configured with a pci_whitelist
2) the PCI resources are correctly collected

Eran, using 'devname' for the pci/passthrough_whitelist option means that only the specified device is allowed to be used by Nova.
If you configure devname=p1p1, it means only that device can be attached to a VM. On the compute host, p1p1 is the PF. I find it hard to see how you have been able to boot any instance using a VF.
You could use 'devname', but for each VF you would have to create an entry like devname=p1p1_0, devname=p1p1_1, and so on. Alternatively, you could use 'vendor_id' and 'product_id'.
Something like:
[pci]
passthrough_whitelist={"vendor_id":"8086","product_id":"154d", physical_network":"datacentre"},{"vendor_id":"8086","product_id":"10ed", physical_network":"datacentre"}
Or even simpler:
[pci]
passthrough_whitelist={"vendor_id":"8086","physical_network":"datacentre"}
The pci/passthrough_whitelist option needs to be configured on all compute nodes based on the hardware. It's important to note that the issue might be in OSPd.
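Each whitelist entry above is a plain JSON object, one per device filter. As a rough illustration of how such entries can be checked for well-formedness (this is a hedged sketch, not Nova's actual parsing code; `parse_whitelist` and `ALLOWED_KEYS` are illustrative names):

```python
import json

# Hypothetical key set for this sketch; Nova's real spec accepts a similar
# set of keys (devname, vendor_id, product_id, address, physical_network).
ALLOWED_KEYS = {"devname", "vendor_id", "product_id", "physical_network", "address"}

def parse_whitelist(entries):
    """Parse raw passthrough_whitelist strings into device-spec dicts."""
    specs = []
    for raw in entries:
        spec = json.loads(raw)              # each entry must be valid JSON
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError("unknown keys in whitelist entry: %s" % sorted(unknown))
        specs.append(spec)
    return specs

specs = parse_whitelist([
    '{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}',
    '{"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}',
])
print([s["product_id"] for s in specs])  # ['154d', '10ed']
```

A malformed entry (e.g. an unbalanced quote) raises a JSON decode error in this sketch, which mirrors how a bad whitelist line keeps the device from ever being whitelisted.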
Sahid / Lee,
I am sorry, but I checked again with my original configuration on the setup that you worked on yesterday. I changed it only on "computesriov-0". According to your explanation, with my configuration I can't boot an instance with a VF direct port.
After changing the configuration and restarting the containers:
$ sudo docker restart neutron_sriov_agent
$ sudo docker restart nova_compute
you can see that I succeeded in booting a VF (direct port) instance:
(overcloud) [stack@undercloud-0 ~]$ nova show VF
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | computesriov-0.localdomain |
| OS-EXT-SRV-ATTR:hostname | vf |
| OS-EXT-SRV-ATTR:hypervisor_hostname | computesriov-0.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-00000069 |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-qx561tdy |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2018-06-14T05:44:29.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-06-14T05:44:03Z |
| description | VF |
| flavor:disk | 10 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | m1.medium |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | 3e0a3165cc44e61e3d95917d9e51cc1c904aef1f522fedb4ea1bc1af |
| host_status | UP |
| id | 2bdd80d7-6119-425e-bacd-cf9f50439d62 |
| image | rhel74 (d00a5c4f-2ce2-4a44-b8bf-815f43dd0e0a) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | VF |
| net-64-1 network | 10.0.1.12, 10.35.166.81 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | ACTIVE |
| tags | [] |
| tenant_id | 7089378fc3b840e49dda23c6e4e81f89 |
| updated | 2018-06-14T05:44:29Z |
| user_id | e807c6609b2041279250a2e739f786ce |
+--------------------------------------+----------------------------------------------------------+
THE CONFIGURATION ON computesriov-0
(overcloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.12
Last login: Thu Jun 14 05:36:02 2018 from 192.168.24.1
[heat-admin@computesriov-0 ~]$ sudo -i
[root@computesriov-0 ~]# ip link show
73: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 76:94:1c:47:e4:52, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC a6:91:61:c4:15:82, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC a6:d8:24:36:e1:be, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 9e:79:1f:44:47:44, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC fa:16:3e:c3:cf:e7, vlan 229, spoof checking on, link-state auto, trust off, query_rss off
74: p1p1_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 76:94:1c:47:e4:52 brd ff:ff:ff:ff:ff:ff
75: p1p1_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a6:91:61:c4:15:82 brd ff:ff:ff:ff:ff:ff
76: p1p1_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a6:d8:24:36:e1:be brd ff:ff:ff:ff:ff:ff
77: p1p1_3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 9e:79:1f:44:47:44 brd ff:ff:ff:ff:ff:ff
[root@computesriov-0 ~]# grep -iR "passthrough_whitelist" /var/lib/config-data/
/var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
/var/lib/config-data/nova_libvirt/etc/nova/nova.conf:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
(overcloud) [stack@undercloud-0 ~]$ neutron port-list |grep fa:16:3e:c3:cf:e7
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| 662eb061-5dfa-4589-83bf-c7482f966434 | direct_sriov | 7089378fc3b840e49dda23c6e4e81f89 | fa:16:3e:c3:cf:e7 | {"subnet_id": "114a87fd-2d0a-48c0-b828-ad2095f3ae7f", "ip_address": "10.0.1.12"}
You can take the system for more debugging; just let me know.
Please take a look at comment 14.

Eran, there is no indication that your instance is attached to a VF except the name of the port.
I checked on the host, nova.conf is configured with:
[pci]
passthrough_whitelist={"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}
If you want to ensure that the instance is using a VF, the best way is to check the libvirt configuration; you should see an 'interface' element with type 'hostdev':
<interface type='hostdev'/>
For a PF, that would be a 'hostdev' element:
<hostdev/>
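To illustrate, such a check can be scripted against the domain XML (a minimal sketch; the XML below is a trimmed, made-up example rather than a real `virsh dumpxml` capture):

```python
import xml.etree.ElementTree as ET

# Illustrative domain XML fragment: an SR-IOV VF attached via PCI
# passthrough shows up as <interface type='hostdev'> under <devices>.
domain_xml = """
<domain>
  <devices>
    <interface type='hostdev' managed='yes'>
      <source><address type='pci' domain='0x0000' bus='0x04' slot='0x10' function='0x0'/></source>
    </interface>
  </devices>
</domain>
"""

root = ET.fromstring(domain_xml)
vf_ifaces = root.findall(".//devices/interface[@type='hostdev']")
print(len(vf_ifaces))  # 1 -> the guest has a VF attached via passthrough
```

In practice the input would come from `virsh dumpxml <instance_name>` on the compute node; a PF would instead appear as a plain `<hostdev>` device element.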
Thanks,
s.
(In reply to Sahid Ferdjaoui from comment #16)
> Eran, There are not indication that your instance is attached with a VF
> except that the name of the port.
>
> I checked on the host, nova.conf is configured with:
>
> [pci]
> passthrough_whitelist={"vendor_id":"8086","product_id":"154d",
> "physical_network":"datacentre"}
>
> If you want ensure that the instance is using a VF, the best is to check
> libvirt configuration, you should see for 'interface' element a type
> 'hostdev'.
>
> <interface type='hostdev'/>
>
> For a PF that would be a 'hostdev' element
>
> <hostdev/>
>
> Thanks,
> s.

As we spoke on IRC, and as you can see in comment 14, I do have an indication that the VM is attached to a VF: the "ip link show" output. Also, with the same configuration I successfully ran a minor update on OSP11 and OSP12 and the system worked well; an upgrade from OSP11 to OSP12 also worked with the same configuration.

After spending some time working out which of the nova.conf files actually takes effect for the compute service, it seems that it is possible to specify the PF as devname and all the related VFs are whitelisted. It was my mistake, sorry for that.
2018-06-14 08:38:37.214 1 DEBUG oslo_service.service [req-487f2adb-3dde-4a9f-8a15-1d3e84bbf523 - - - - -] pci.passthrough_whitelist = ['{"devname":"p1p1","physical_network":"datacentre"}'] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2902
Final resource view: name=computesriov-0.localdomain phys_ram=65486MB used_ram=4096MB phys_disk=465GB used_disk=0GB total_vcpus=24 used_vcpus=0 pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='154d',tags={dev_type='type-PF',physical_network='datacentre'},vendor_id='8086'), PciDevicePool(count=5,numa_node=0,product_id='10ed',tags={dev_type='type-VF',physical_network='datacentre'},vendor_id='8086')]
What I can say is that the issue does not look related to Nova, since it is reporting a correct view of the devices (PF/VFs). We probably need further investigation to identify the root cause.
Checking the attached sosreport, the passthrough_whitelist configuration is applied correctly in the nova_libvirt container. I couldn't find any configuration-related issue.
-------------
[stack@undercloud01 sosreport-computesriov-1-20180611113424]$ grep '^passthrough' -RnH var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf:8714:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
--------------
The last error that I could see in var/log/containers/neutron/sriov-nic-agent.log from the computesriov-1 sosreport is below:
--------------------
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent IpCommandDeviceError: ip command failed on device p1p1: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "p1p1" does not exist.
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent
2018-06-11 10:42:34.492 1010711 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-6982704a-45cd-45d1-8b94-2a9e11b0de59 - - - - -] Agent out of sync with plugin!
--------------------
I don't know what this error means. Brent, could you check whether it is causing the failure?
Whenever a PF-attached VM exists on an overcloud node, restarting the nova_compute and neutron_sriov_agent docker containers results in the pci_status being reported as empty. From then on, no PF/VF-based VMs can be created on that particular node.
Steps to reproduce:
-------------------
1) deployment with 2 sriov computes - sriov0 and sriov1
2) create PF attached VM - created on sriov0 node
3) create VF attached VM - created on sriov1 node
4) restart nova_compute and neutron_sriov_agent containers on both sriov0 and sriov1 nodes
5) delete PF attached VM and VF attached VM
6) create PF attached VM - created on sriov1 node
7) create VF attached VM - results in error
The issue is that when the nova_compute docker container is restarted on the sriov0 node, the pci_stats reported from that node are empty.
-------------
2018-06-15 10:12:28.034 1 INFO nova.compute.resource_tracker [req-4a14fa3b-81e2-439d-a7fd-fcc686d45756 - - - - -] Final resource view: name=computesriov-1.localdomain phys_ram=65454MB used_ram=5120MB phys_disk=465GB used_disk=10GB total_vcpus=24 used_vcpus=1 pci_stats=[]
-------------
At this stage there are no VMs on the sriov0 node; after restarting the nova_compute and neutron_sriov_agent containers, the pci_stats on the same node are reported correctly.
--------------
2018-06-15 10:26:45.602 1 INFO nova.compute.resource_tracker [req-0041d1af-b373-41f1-b9b7-f5deea236377 - - - - -] Final resource view: name=computesriov-0.localdomain phys_ram=65486MB used_ram=4096MB phys_disk=465GB used_disk=0GB total_vcpus=24 used_vcpus=0 pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='154d',tags={dev_type='type-PF',physical_network='datacentre'},vendor_id='8086'), PciDevicePool(count=5,numa_node=0,product_id='10ed',tags={dev_type='type-VF',physical_network='datacentre'},vendor_id='8086')]
--------------
Now, why the nova_compute docker container reports the pci_status this way is yet to be investigated.
Thanks to Eran for chipping in on his weekend to confirm this theory. The update/upgrade testing is done as follows:
1) deploy cluster with 2 sriov computes
2) create VMs with PF and VF
3) update/upgrade
4) existing vms are fine
5) delete the existing vms
6) create the VF and PF vms again
7) one of the nodes sees the failure
After restarting the nova_compute container, it should report the PF device info in the pci_status, as the sriov0 node had an active VM with a PF attached. This is not happening.
And then, when the PF-attached VM is deleted on the sriov0 node, nova_compute should report the pci_stats with both PF and VF details, as sriov0 no longer has any active VMs.
This needs to be investigated further. Sahid, could you check why the pci_status is reported wrongly in this specific case?
Typo: in comment #22, read "pci_status" as "pci_stats".

There is nothing we can do in Nova. When you pass through the PF, it is detached from the host kernel to be attached to the guest. When you restart the nova service, we reset the list of reported devices. For the PF and its VFs to reappear, the VM needs to be destroyed; the PF will then be released and reattached to the host kernel. Then we need to wait for the next run of the Nova periodic task that reports devices.

(In reply to Sahid Ferdjaoui from comment #24)
> There is nothing we can do in Nova. When you passthrough the PF, it's
> detached from host kernel to be attached to the guest.
>
> When you restart nova service we reset the list of devices reported. To see
> the PF and its VFs reappear the VM needs to be destroy, the PF will be
> released and reattached to kenrel host. Then we need to wait next time Nova
> will execute its periodic task which report devices.

Thanks for the explanation. I tried deleting the VM with the PF to see if nova_compute's pci_stats recovers from [] to a valid list. I monitored it for an hour; it does not recover (until the nova_compute docker container is restarted).

Ok, I understand a bit more.
libvirt still reports the devices correctly even when they are detached from the host. The PF is now reported as used via vfio-pci.
My thinking is that we are still facing a configuration issue, as indicated at the beginning of the bug.
passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
Using devname=p1p1 is not recommended, even less so when passing through the PF. That is because when the PF is bound to the vfio-pci driver, the device no longer has a dev name assigned to it, meaning that Nova is no longer able to report the device.
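A toy illustration of why the devname lookup breaks (this is not Nova code; `netdev_name` and the fake sysfs tree below are a simplified model of the behavior described above):

```python
import os
import tempfile

# A PCI device's netdev name exists under /sys/bus/pci/devices/<addr>/net/
# only while a kernel network driver (e.g. ixgbe) is bound. Once the PF is
# bound to vfio-pci for passthrough, that directory disappears, so a
# devname-based lookup finds nothing.
def netdev_name(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    net_dir = os.path.join(sysfs_root, pci_addr, "net")
    try:
        names = os.listdir(net_dir)     # e.g. ["p1p1"] while ixgbe is bound
    except OSError:
        names = []                      # bound to vfio-pci: no netdev at all
    return names[0] if names else None

# Simulate the two states with a fake sysfs tree.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "0000:04:00.0", "net", "p1p1"))  # ixgbe bound
os.makedirs(os.path.join(root, "0000:04:00.1"))                 # vfio-pci bound

print(netdev_name("0000:04:00.0", sysfs_root=root))  # p1p1
print(netdev_name("0000:04:00.1", sysfs_root=root))  # None
```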
The correct configuration is to use product_id + vendor_id as indicated:
passthrough_whitelist={"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}
A correction: the ixgbe driver reports VFs with a different product_id, so the correct configuration would be:
passthrough_whitelist=[{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}, # PF
{"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}] # VFs
(In reply to Sahid Ferdjaoui from comment #29)
> A correction, the ixgbe driver is reporting VFs with a different product_id
> so the good configuration would be:
>
> passthrough_whitelist=[{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}, # PF
> {"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}] # VFs

Sahid, I don't understand how this is related to configuration when the same use case worked on OSP10 through OSP12.

Based on my investigation, we are not in a regression situation. The issue was simply not discovered before; it was present in the previous releases too.
It seems that with a running guest that has a PF attached, restarting the nova-compute service and then destroying the guest creates some strange behaviors:
a) if the device is whitelisted using its dev name, the PF is not released in the database, so it is no longer possible to acquire it. A restart of the compute service seems to resolve the case.
b) if the device is whitelisted using its product_id, the PF is properly released in the database, so it is possible to start a new guest using it, but its VFs are not discovered. A restart of the compute service seems to resolve the case.
A simple patch appears to resolve the issue, but after some discussions with Stephen it seems that it would not be the best solution. We are still working on it.
diff --git a/nova/pci/manager.py b/nova/pci/manager.py
index ad015db510..1eb4186f8b 100644
--- a/nova/pci/manager.py
+++ b/nova/pci/manager.py
@@ -174,7 +174,7 @@ class PciDevTracker(object):
'pci_exception': e.format_message()})
# Note(yjiang5): remove the device by force so that
# db entry is cleaned in next sync.
- existed.status = fields.PciDeviceStatus.REMOVED
+ #existed.status = fields.PciDeviceStatus.REMOVED
else:
# Note(yjiang5): no need to update stats if an assigned
# device is hot removed.
The issue looks to be with the in-memory tree maintaining the PF/VF relationship, which is not kept in sync. Also, it is not clear why, even after the next run of the periodic task responsible for collecting resources, the VFs are still not discovered when we can see that libvirt is returning them.
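The failure mode can be modeled with a toy sketch (deliberately not Nova's PciDevTracker, just an illustration of the orphaned-VF bookkeeping described above; the `Device` class and field names are made up):

```python
# If the tracker force-removes an allocated PF on restart (the line the
# diff above comments out), its VFs stay parented to a device that no
# longer exists, so they are never re-listed as available even after the
# periodic task runs again.

class Device:
    def __init__(self, address, dev_type, parent=None):
        self.address = address
        self.dev_type = dev_type   # 'type-PF' or 'type-VF'
        self.parent = parent
        self.status = 'available'

pf = Device('0000:04:00.0', 'type-PF')
vfs = [Device('0000:04:10.%d' % i, 'type-VF', parent=pf) for i in range(4)]

# Guest boots with the PF: the VFs become unavailable along with their parent.
pf.status = 'allocated'
for vf in vfs:
    vf.status = 'unavailable'

# Service restart force-removes the allocated PF.
pf.status = 'removed'

# After the guest is destroyed, only devices whose parent chain is intact
# are reported back as pools; the orphaned VFs never reappear.
pools = [d for d in [pf] + vfs
         if d.status != 'removed' and (d.parent is None or d.parent.status != 'removed')]
print(len(pools))  # 0 -> empty pci_stats, matching the symptom in the logs
```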
So I tested this scenario on the OSP12 puddle 2018-05-29.1 and it reproduces, which means it is not a regression:
1. deployed OSP12 puddle 2018-05-29.1
2. booted a PF instance
3. restarted the nova docker container
4. deleted the PF instance
5. created a PF instance = ERROR
The issue is not related to the upgrade / minor update.

*** Bug 1590716 has been marked as a duplicate of this bug. ***

The PCI module which is responsible for managing devices is going to be removed, and there is no easy way to fix this issue. A workaround has also been provided: restarting the compute service. I'm closing it as WONTFIX; please reopen if necessary. SME: Sean Mooney <smooney>
Description of problem:
After a minor update of an OSP13 SR-IOV HA setup (3 controllers, 2 computes) I tried to boot new PF and VF instances. On computesriov-1 I could boot any kind of instance (PF/VF), but when I tried to boot one on computesriov-0 it failed. An instance with a normal port (OVS port) booted without any problems.

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks           |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 143bcb1d-9486-461b-b09f-56deb56efbec | Normal | ACTIVE | -          | Running     | net-64-2=10.0.2.13 |
| b33b3de7-4274-4464-ae57-143b8605e0ac | PF     | ACTIVE | -          | Running     | net-64-2=10.0.2.11 |
| a7642e5f-9e57-49bd-8bc6-ee5237b14de7 | VF     | ERROR  | -          | NOSTATE     |                    |
+--------------------------------------+--------+--------+------------+-------------+--------------------+

(overcloud) [stack@undercloud-0 ~]$ nova show VF
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | vf |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000013 |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-11g4pfuo |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-06-11T11:55:42Z |
| description | VF |
| fault | {"message": "No valid host was found. There are not enough hosts available.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 1116, in schedule_and_build_instances instance_uuids, return_alternates=True) File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 716, in _schedule_instances return_alternates=return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/utils.py\", line 726, in wrapped return func(*args, **kwargs) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 53, in select_destinations instance_uuids, return_objects, return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 37, in __run_method return getattr(self.instance, __name)(*args, **kwargs) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py\", line 42, in select_destinations instance_uuids, return_objects, return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py\", line 158, in select_destinations return cctxt.call(ctxt, 'select_destinations', **msg_args) File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call retry=self.retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send timeout=timeout, retry=retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send retry=retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send raise result ", "created": "2018-06-11T11:55:46Z"} |
| flavor:disk | 10 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | m1.medium |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| host_status | |
| id | a7642e5f-9e57-49bd-8bc6-ee5237b14de7 |
| image | rhel74 (5c8bd7b6-0b49-41e2-8a12-297802d4212d) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | VF |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tags | [] |
| tenant_id | ad804165fc554a2299bf1c4c761f1374 |
| updated | 2018-06-11T11:55:45Z |
| user_id | 40b75a00c54a487d8e3ceb05e4e92599 |

Version-Release number of selected component (if applicable):
OSP13 -p 2018-06-08.3

Steps to Reproduce:
0. deploy an OSP13 HA SR-IOV setup
1. create the environment:
wget http://file.tlv.redhat.com/~ekuris/custom_ci_image/rhel-guest-image-7.4-191.x86_64.qcow2
sudo yum -y install libguestfs-tools
sudo yum -y install libvirt && sudo systemctl start libvirtd
virt-customize -a rhel-guest-image-7.4-191.x86_64.qcow2 --root-password password:123456
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^disable_root: 1/disable_root: 0/' /etc/cloud/cloud.cfg
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^ssh_pwauth:\s+0/ssh_pwauth: 1/' /etc/cloud/cloud.cfg
openstack image create --container-format bare --disk-format qcow2 --public --file rhel-guest-image-7.4-191.x86_64.qcow2 rhel74
openstack network create net-64-1
openstack subnet create --subnet-range 10.0.1.0/24 --network net-64-1 --dhcp subnet_4_1
openstack router create Router_eNet
openstack router add subnet Router_eNet subnet_4_1
openstack router set --external-gateway nova Router_eNet
openstack flavor create --public m1.medium --id 3 --ram 1024 --disk 10 --vcpus 1
openstack port create --network net-64-1 --vnic-type direct direct_sriov
openstack port create --network net-64-1 --vnic-type direct-physical PF_sriov
openstack port create --network net-64-1 normal
openstack server create --flavor 3 --image rhel74 --nic port-id=direct_sriov VF
openstack server create --flavor 3 --image rhel74 --nic port-id=PF_sriov PF
openstack server create --flavor 3 --image rhel74 --nic port-id=normal Normal
openstack floating ip create nova
openstack floating ip create nova
openstack floating ip create nova
openstack server add floating ip VF <fip>
openstack server add floating ip PF <fip>
openstack server add floating ip Normal <fip>
openstack security group rule create --protocol icmp --ingress --prefix 0.0.0.0/0 <sec-id>
openstack security group rule create --protocol tcp --ingress --prefix 0.0.0.0/0 <sec-id>
2. run the minor update process:
3. sudo rhos-release 13 -p 2018-06-08.3
4. openstack undercloud upgrade | tee undercloud_upgrade.log
5. add docker-registry.engineering.redhat.com to /etc/sysconfig/docker (sudo vi /etc/sysconfig/docker)
6. sudo systemctl restart docker.service
7. source stackrc
8. openstack overcloud container image prepare --namespace docker-registry.engineering.redhat.com/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-images-file /home/stack/update-container-images.yaml
9. sudo openstack overcloud container image upload --config-file ~/update-container-images.yaml --verbose
10. openstack overcloud container image prepare --namespace 192.168.24.1:8787/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-env-file ~/update-container-params.yaml
11. ansible -i /usr/bin/tripleo-ansible-inventory overcloud -m shell -b -a "yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 13 -p 2018-06-08.3"
12. openstack overcloud update prepare --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --container-registry-file ~/update-container-params.yaml
13. openstack overcloud update run --nodes Controller
14. openstack overcloud update run --nodes computesriov-0
15. openstack overcloud update run --nodes computesriov-1
16. openstack overcloud update converge --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml -e ~/update-container-params.yaml | tee update_converge.log
THT files: https://code.engineering.redhat.com/gerrit/gitweb?p=Neutron-QE.git;a=tree;f=BM_heat_template/ospd-13-vlan-multiple-nic-sriov-hybrid-ha;h=19d6eb0ceb99a4d79cb30d72f4aac093bbe0d041;hb=refs/heads/master
17. After the minor update, remove the old instances
18. try to boot 3 instances as in step 1:
openstack server create --flavor 3 --image rhel74 --nic port-id=direct_sriov VF
openstack server create --flavor 3 --image rhel74 --nic port-id=PF_sriov PF
openstack server create --flavor 3 --image rhel74 --nic port-id=normal Normal