Bug 1589796
| Summary: | OSP13-sriov-HA: After restart to nova compute docker unable to boot VF/PF instance on one of computesriov nodes | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> | |
| Component: | documentation | Assignee: | Irina <igallagh> | |
| Status: | CLOSED DUPLICATE | QA Contact: | RHOS Documentation Team <rhos-docs> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 13.0 (Queens) | CC: | amuller, bcafarel, beagles, bhaley, chrisw, dalvarez, dasmith, dcadzow, eglynn, ekuris, jhakimra, jraju, kchamart, lyarwood, mariel, mbooth, oblaut, pkesavar, pveiga, sbauza, sclewis, sgordon, skramaja, smooney, srevivo, stephenfin, tfreger, vromanso | |
| Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | update / upgrade unsupported for the Release Candidate | Story Points: | --- | |
| Clone Of: | ||||
| : | 1590716 1593290 (view as bug list) | Environment: | ||
| Last Closed: | 2020-09-29 09:21:00 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1878201 | |||
| Bug Blocks: | 1590716, 1593290, 1615656 | |||
We'll need to retest to confirm, but there may be an issue with libvirt's list of known devices on the compute node. I reattempted creating an instance that used the direct port and it failed the same way as in the report. Looking at Nova's scheduler logs, Nova thinks there are no nodes available that can service the PCI request. I then logged into the problem compute node, restarted the nova_compute and nova_libvirt containers, and reattempted creating the instance. This second attempt, after restarting the nova containers, succeeded. We should have someone from the Compute or NFV DFGs look into this.

Can you provide logs from the compute node, especially the Nova log in debug mode? That way we will be able to verify that:
1) nova.conf is correctly configured with a pci_whitelist
2) the PCI resources are correctly collected

Eran, using 'devname' for the pci/passthrough_whitelist option means that only the specified device is allowed to be used by Nova.
If you configure devname=p1p1, it means only that device can be attached to a VM. On the compute host, p1p1 is the PF. I find it hard to see how you have been able to boot any instance using a VF.
You could use 'devname', but for each VF you would have to create an entry like devname=p1p1_0, devname=p1p1_1, and so on. Alternatively, you could use 'vendor_id' and 'product_id'.
Something like:
[pci]
passthrough_whitelist={"vendor_id":"8086","product_id":"154d", physical_network":"datacentre"},{"vendor_id":"8086","product_id":"10ed", physical_network":"datacentre"}
Or even simpler:
[pci]
passthrough_whitelist={"vendor_id":"8086","physical_network":"datacentre"}
The pci/passthrough_whitelist option needs to be configured on all compute nodes based on the hardware. It's important to note that the issue might be in OSPd.
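Each whitelist entry above is a plain JSON object, one per device filter. As a rough illustration of how such entries can be checked for well-formedness (this is a hedged sketch, not Nova's actual parsing code; `parse_whitelist` and `ALLOWED_KEYS` are illustrative names):

```python
import json

# Hypothetical key set for this sketch; Nova's real spec accepts a similar
# set of keys (devname, vendor_id, product_id, address, physical_network).
ALLOWED_KEYS = {"devname", "vendor_id", "product_id", "physical_network", "address"}

def parse_whitelist(entries):
    """Parse raw passthrough_whitelist strings into device-spec dicts."""
    specs = []
    for raw in entries:
        spec = json.loads(raw)              # each entry must be valid JSON
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError("unknown keys in whitelist entry: %s" % sorted(unknown))
        specs.append(spec)
    return specs

specs = parse_whitelist([
    '{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}',
    '{"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}',
])
print([s["product_id"] for s in specs])  # ['154d', '10ed']
```

A malformed entry (e.g. an unbalanced quote) raises a JSON decode error in this sketch, which mirrors how a bad whitelist line keeps the device from ever being whitelisted.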
Sahid / Lee,
I am sorry, but I checked again with my original configuration on the setup that you worked on yesterday. I changed it only on "computesriov-0". According to your explanation, with my configuration I can't boot an instance with a VF direct port.
After changing the configuration and restarting the containers:
$ sudo docker restart neutron_sriov_agent
$ sudo docker restart nova_compute
you can see that I succeeded in booting a VF (direct port) instance:
(overcloud) [stack@undercloud-0 ~]$ nova show VF
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | computesriov-0.localdomain |
| OS-EXT-SRV-ATTR:hostname | vf |
| OS-EXT-SRV-ATTR:hypervisor_hostname | computesriov-0.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-00000069 |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-qx561tdy |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2018-06-14T05:44:29.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-06-14T05:44:03Z |
| description | VF |
| flavor:disk | 10 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | m1.medium |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | 3e0a3165cc44e61e3d95917d9e51cc1c904aef1f522fedb4ea1bc1af |
| host_status | UP |
| id | 2bdd80d7-6119-425e-bacd-cf9f50439d62 |
| image | rhel74 (d00a5c4f-2ce2-4a44-b8bf-815f43dd0e0a) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | VF |
| net-64-1 network | 10.0.1.12, 10.35.166.81 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | ACTIVE |
| tags | [] |
| tenant_id | 7089378fc3b840e49dda23c6e4e81f89 |
| updated | 2018-06-14T05:44:29Z |
| user_id | e807c6609b2041279250a2e739f786ce |
+--------------------------------------+----------------------------------------------------------+
THE CONFIGURATION ON computesriov-0
(overcloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.12
Last login: Thu Jun 14 05:36:02 2018 from 192.168.24.1
[heat-admin@computesriov-0 ~]$ sudo -i
[root@computesriov-0 ~]# ip link show
73: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 76:94:1c:47:e4:52, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC a6:91:61:c4:15:82, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC a6:d8:24:36:e1:be, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 9e:79:1f:44:47:44, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC fa:16:3e:c3:cf:e7, vlan 229, spoof checking on, link-state auto, trust off, query_rss off
74: p1p1_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 76:94:1c:47:e4:52 brd ff:ff:ff:ff:ff:ff
75: p1p1_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a6:91:61:c4:15:82 brd ff:ff:ff:ff:ff:ff
76: p1p1_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a6:d8:24:36:e1:be brd ff:ff:ff:ff:ff:ff
77: p1p1_3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 9e:79:1f:44:47:44 brd ff:ff:ff:ff:ff:ff
[root@computesriov-0 ~]# grep -iR "passthrough_whitelist" /var/lib/config-data/
/var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
/var/lib/config-data/nova_libvirt/etc/nova/nova.conf:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
(overcloud) [stack@undercloud-0 ~]$ neutron port-list |grep fa:16:3e:c3:cf:e7
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| 662eb061-5dfa-4589-83bf-c7482f966434 | direct_sriov | 7089378fc3b840e49dda23c6e4e81f89 | fa:16:3e:c3:cf:e7 | {"subnet_id": "114a87fd-2d0a-48c0-b828-ad2095f3ae7f", "ip_address": "10.0.1.12"}
You can take the system for more debugging; just let me know.
Please take a look at comment 14.

Eran, there is no indication that your instance is attached to a VF except the name of the port.
I checked on the host, nova.conf is configured with:
[pci]
passthrough_whitelist={"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}
If you want to ensure that the instance is using a VF, the best way is to check the libvirt configuration; you should see an 'interface' element with type 'hostdev':
<interface type='hostdev'/>
For a PF, that would be a 'hostdev' element:
<hostdev/>
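To illustrate, such a check can be scripted against the domain XML (a minimal sketch; the XML below is a trimmed, made-up example rather than a real `virsh dumpxml` capture):

```python
import xml.etree.ElementTree as ET

# Illustrative domain XML fragment: an SR-IOV VF attached via PCI
# passthrough shows up as <interface type='hostdev'> under <devices>.
domain_xml = """
<domain>
  <devices>
    <interface type='hostdev' managed='yes'>
      <source><address type='pci' domain='0x0000' bus='0x04' slot='0x10' function='0x0'/></source>
    </interface>
  </devices>
</domain>
"""

root = ET.fromstring(domain_xml)
vf_ifaces = root.findall(".//devices/interface[@type='hostdev']")
print(len(vf_ifaces))  # 1 -> the guest has a VF attached via passthrough
```

In practice the input would come from `virsh dumpxml <instance_name>` on the compute node; a PF would instead appear as a plain `<hostdev>` device element.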
Thanks,
s.
(In reply to Sahid Ferdjaoui from comment #16)
> Eran, There are not indication that your instance is attached with a VF
> except that the name of the port.
>
> I checked on the host, nova.conf is configured with:
>
> [pci]
> passthrough_whitelist={"vendor_id":"8086","product_id":"154d",
> "physical_network":"datacentre"}
>
> If you want ensure that the instance is using a VF, the best is to check
> libvirt configuration, you should see for 'interface' element a type
> 'hostdev'.
>
> <interface type='hostdev'/>
>
> For a PF that would be a 'hostdev' element
>
> <hostdev/>
>
> Thanks,
> s.

As we spoke on IRC, and as you can see in comment 14, I do have an indication that the VM is attached to a VF: the "ip link show" output. Also, with the same configuration I successfully ran a minor update on OSP11 and OSP12 and the system worked well; an upgrade from OSP11 to OSP12 also worked with the same configuration.

After spending some time working out which of the nova.conf files actually takes effect for the compute service, it seems that it is possible to specify the PF as devname and all the related VFs are whitelisted. It was my mistake, sorry for that.
2018-06-14 08:38:37.214 1 DEBUG oslo_service.service [req-487f2adb-3dde-4a9f-8a15-1d3e84bbf523 - - - - -] pci.passthrough_whitelist = ['{"devname":"p1p1","physical_network":"datacentre"}'] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2902
Final resource view: name=computesriov-0.localdomain phys_ram=65486MB used_ram=4096MB phys_disk=465GB used_disk=0GB total_vcpus=24 used_vcpus=0 pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='154d',tags={dev_type='type-PF',physical_network='datacentre'},vendor_id='8086'), PciDevicePool(count=5,numa_node=0,product_id='10ed',tags={dev_type='type-VF',physical_network='datacentre'},vendor_id='8086')]
What I can say is that the issue does not look related to Nova, since it is reporting a correct view of the devices (PF/VFs). We probably need further investigation to identify the root cause.
Checking the attached sosreport, the passthrough_whitelist configuration is applied correctly in the nova_libvirt container. I couldn't find any configuration-related issue.
-------------
[stack@undercloud01 sosreport-computesriov-1-20180611113424]$ grep '^passthrough' -RnH var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf:8714:passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
--------------
The last error that I could see in var/log/containers/neutron/sriov-nic-agent.log from the computesriov-1 sosreport is below:
--------------------
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent IpCommandDeviceError: ip command failed on device p1p1: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "p1p1" does not exist.
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent
2018-06-11 10:42:34.466 1010711 ERROR neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent
2018-06-11 10:42:34.492 1010711 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-6982704a-45cd-45d1-8b94-2a9e11b0de59 - - - - -] Agent out of sync with plugin!
--------------------
I don't know what this error means. Brent, could you check whether it is causing the failure?
Whenever a PF-attached VM exists on an overcloud node, restarting the nova_compute and neutron_sriov_agent docker containers results in the pci_status being reported as empty. From then on, no PF/VF-based VMs can be created on that particular node.
Steps to reproduce:
-------------------
1) deployment with 2 sriov computes - sriov0 and sriov1
2) create PF attached VM - created on sriov0 node
3) create VF attached VM - created on sriov1 node
4) restart nova_compute and neutron_sriov_agent containers on both sriov0 and sriov1 nodes
5) delete PF attached VM and VF attached VM
6) create PF attached VM - created on sriov1 node
7) create VF attached VM - results in error
The issue is that when the nova_compute docker container is restarted on the sriov0 node, the pci_stats reported from that node are empty.
-------------
2018-06-15 10:12:28.034 1 INFO nova.compute.resource_tracker [req-4a14fa3b-81e2-439d-a7fd-fcc686d45756 - - - - -] Final resource view: name=computesriov-1.localdomain phys_ram=65454MB used_ram=5120MB phys_disk=465GB used_disk=10GB total_vcpus=24 used_vcpus=1 pci_stats=[]
-------------
At this stage there are no VMs on the sriov0 node; after restarting the nova_compute and neutron_sriov_agent containers, the pci_stats on the same node are reported correctly.
--------------
2018-06-15 10:26:45.602 1 INFO nova.compute.resource_tracker [req-0041d1af-b373-41f1-b9b7-f5deea236377 - - - - -] Final resource view: name=computesriov-0.localdomain phys_ram=65486MB used_ram=4096MB phys_disk=465GB used_disk=0GB total_vcpus=24 used_vcpus=0 pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='154d',tags={dev_type='type-PF',physical_network='datacentre'},vendor_id='8086'), PciDevicePool(count=5,numa_node=0,product_id='10ed',tags={dev_type='type-VF',physical_network='datacentre'},vendor_id='8086')]
--------------
Now, why the nova_compute docker container reports the pci_status this way is yet to be investigated.
Thanks to Eran for chipping in on his weekend to confirm this theory. The update/upgrade testing is done as follows:
1) deploy cluster with 2 sriov computes
2) create VMs with PF and VF
3) update/upgrade
4) existing vms are fine
5) delete the existing vms
6) create the VF and PF vms again
7) one of the nodes sees the failure
After restarting the nova_compute container, it should report the PF device info in the pci_status, as the sriov0 node had an active VM with a PF attached. This is not happening.
And then, when the PF-attached VM is deleted on the sriov0 node, nova_compute should report the pci_stats with both PF and VF details, as sriov0 no longer has any active VMs.
This needs to be investigated further. Sahid, could you check why the pci_status is reported wrongly in this specific case?
Typo: in comment #22, read "pci_status" as "pci_stats".

There is nothing we can do in Nova. When you pass through the PF, it is detached from the host kernel to be attached to the guest. When you restart the nova service, we reset the list of reported devices. For the PF and its VFs to reappear, the VM needs to be destroyed; the PF will then be released and reattached to the host kernel. Then we need to wait for the next run of the Nova periodic task that reports devices.

(In reply to Sahid Ferdjaoui from comment #24)
> There is nothing we can do in Nova. When you passthrough the PF, it's
> detached from host kernel to be attached to the guest.
>
> When you restart nova service we reset the list of devices reported. To see
> the PF and its VFs reappear the VM needs to be destroy, the PF will be
> released and reattached to kenrel host. Then we need to wait next time Nova
> will execute its periodic task which report devices.

Thanks for the explanation. I tried deleting the VM with the PF to see if nova_compute's pci_stats recovers from [] to a valid list. I monitored it for an hour; it does not recover (until the nova_compute docker container is restarted).

Ok, I understand a bit more.
libvirt still reports the devices correctly even when they are detached from the host. The PF is now reported as used via vfio-pci.
My thinking is that we are still facing a configuration issue, as indicated at the beginning of the bug.
passthrough_whitelist={"devname":"p1p1","physical_network":"datacentre"}
Using devname=p1p1 is not recommended, even less so when passing through the PF. That is because when the PF is bound to the vfio-pci driver, the device no longer has a dev name assigned to it, meaning that Nova is no longer able to report the device.
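A toy illustration of why the devname lookup breaks (this is not Nova code; `netdev_name` and the fake sysfs tree below are a simplified model of the behavior described above):

```python
import os
import tempfile

# A PCI device's netdev name exists under /sys/bus/pci/devices/<addr>/net/
# only while a kernel network driver (e.g. ixgbe) is bound. Once the PF is
# bound to vfio-pci for passthrough, that directory disappears, so a
# devname-based lookup finds nothing.
def netdev_name(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    net_dir = os.path.join(sysfs_root, pci_addr, "net")
    try:
        names = os.listdir(net_dir)     # e.g. ["p1p1"] while ixgbe is bound
    except OSError:
        names = []                      # bound to vfio-pci: no netdev at all
    return names[0] if names else None

# Simulate the two states with a fake sysfs tree.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "0000:04:00.0", "net", "p1p1"))  # ixgbe bound
os.makedirs(os.path.join(root, "0000:04:00.1"))                 # vfio-pci bound

print(netdev_name("0000:04:00.0", sysfs_root=root))  # p1p1
print(netdev_name("0000:04:00.1", sysfs_root=root))  # None
```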
The correct configuration is to use product_id + vendor_id as indicated:
passthrough_whitelist={"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}
A correction: the ixgbe driver reports VFs with a different product_id, so the correct configuration would be:
passthrough_whitelist=[{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}, # PF
{"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}] # VFs
(In reply to Sahid Ferdjaoui from comment #29)
> A correction, the ixgbe driver is reporting VFs with a different product_id
> so the good configuration would be:
>
> passthrough_whitelist=[{"vendor_id":"8086","product_id":"154d","physical_network":"datacentre"}, # PF
> {"vendor_id":"8086","product_id":"10ed","physical_network":"datacentre"}] # VFs

Sahid, I don't understand how this is related to configuration when the same use case worked on OSP10 through OSP12.

Based on my investigation, we are not in a regression situation. The issue was simply not discovered before; it was present in the previous releases too.
It seems that with a running guest that has a PF attached, restarting the nova-compute service and then destroying the guest creates some strange behaviors:
a) if the device is whitelisted using its dev name, the PF is not released in the database, so it is no longer possible to acquire it. A restart of the compute service seems to resolve the case.
b) if the device is whitelisted using its product_id, the PF is properly released in the database, so it is possible to start a new guest using it, but its VFs are not discovered. A restart of the compute service seems to resolve the case.
A simple patch appears to resolve the issue, but after some discussions with Stephen it seems that it would not be the best solution. We are still working on it.
diff --git a/nova/pci/manager.py b/nova/pci/manager.py
index ad015db510..1eb4186f8b 100644
--- a/nova/pci/manager.py
+++ b/nova/pci/manager.py
@@ -174,7 +174,7 @@ class PciDevTracker(object):
'pci_exception': e.format_message()})
# Note(yjiang5): remove the device by force so that
# db entry is cleaned in next sync.
- existed.status = fields.PciDeviceStatus.REMOVED
+ #existed.status = fields.PciDeviceStatus.REMOVED
else:
# Note(yjiang5): no need to update stats if an assigned
# device is hot removed.
The issue looks to be with the in-memory tree maintaining the PF/VF relationship, which is not kept in sync. Also, it is not clear why, even after the next run of the periodic task responsible for collecting resources, the VFs are still not discovered when we can see that libvirt is returning them.
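The failure mode can be modeled with a toy sketch (deliberately not Nova's PciDevTracker, just an illustration of the orphaned-VF bookkeeping described above; the `Device` class and field names are made up):

```python
# If the tracker force-removes an allocated PF on restart (the line the
# diff above comments out), its VFs stay parented to a device that no
# longer exists, so they are never re-listed as available even after the
# periodic task runs again.

class Device:
    def __init__(self, address, dev_type, parent=None):
        self.address = address
        self.dev_type = dev_type   # 'type-PF' or 'type-VF'
        self.parent = parent
        self.status = 'available'

pf = Device('0000:04:00.0', 'type-PF')
vfs = [Device('0000:04:10.%d' % i, 'type-VF', parent=pf) for i in range(4)]

# Guest boots with the PF: the VFs become unavailable along with their parent.
pf.status = 'allocated'
for vf in vfs:
    vf.status = 'unavailable'

# Service restart force-removes the allocated PF.
pf.status = 'removed'

# After the guest is destroyed, only devices whose parent chain is intact
# are reported back as pools; the orphaned VFs never reappear.
pools = [d for d in [pf] + vfs
         if d.status != 'removed' and (d.parent is None or d.parent.status != 'removed')]
print(len(pools))  # 0 -> empty pci_stats, matching the symptom in the logs
```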
So I tested this scenario on the OSP12 puddle 2018-05-29.1 and it reproduces, which means it is not a regression:
1. deployed OSP12 puddle 2018-05-29.1
2. booted a PF instance
3. restarted the nova docker container
4. deleted the PF instance
5. created a PF instance = ERROR
The issue is not related to the upgrade / minor update.

*** Bug 1590716 has been marked as a duplicate of this bug. ***

The PCI module which is responsible for managing devices is going to be removed, and there is no easy way to fix this issue. A workaround has also been provided: restarting the compute service. I'm closing it as WONTFIX; please reopen if necessary. SME: Sean Mooney <smooney>
Description of problem:
After a minor update of an OSP13 SR-IOV HA setup (3 controllers, 2 computes) I tried to boot new PF and VF instances. On computesriov-1 I could boot any kind of instance (PF/VF), but when I tried to boot one on computesriov-0 it failed. An instance with a normal port (OVS port) booted without any problems.

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks           |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 143bcb1d-9486-461b-b09f-56deb56efbec | Normal | ACTIVE | -          | Running     | net-64-2=10.0.2.13 |
| b33b3de7-4274-4464-ae57-143b8605e0ac | PF     | ACTIVE | -          | Running     | net-64-2=10.0.2.11 |
| a7642e5f-9e57-49bd-8bc6-ee5237b14de7 | VF     | ERROR  | -          | NOSTATE     |                    |
+--------------------------------------+--------+--------+------------+-------------+--------------------+

(overcloud) [stack@undercloud-0 ~]$ nova show VF
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | vf |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000013 |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-11g4pfuo |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-06-11T11:55:42Z |
| description | VF |
| fault | {"message": "No valid host was found. There are not enough hosts available.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 1116, in schedule_and_build_instances instance_uuids, return_alternates=True) File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 716, in _schedule_instances return_alternates=return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/utils.py\", line 726, in wrapped return func(*args, **kwargs) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 53, in select_destinations instance_uuids, return_objects, return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 37, in __run_method return getattr(self.instance, __name)(*args, **kwargs) File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py\", line 42, in select_destinations instance_uuids, return_objects, return_alternates) File \"/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py\", line 158, in select_destinations return cctxt.call(ctxt, 'select_destinations', **msg_args) File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call retry=self.retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send timeout=timeout, retry=retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send retry=retry) File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send raise result ", "created": "2018-06-11T11:55:46Z"} |
| flavor:disk | 10 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | m1.medium |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| host_status | |
| id | a7642e5f-9e57-49bd-8bc6-ee5237b14de7 |
| image | rhel74 (5c8bd7b6-0b49-41e2-8a12-297802d4212d) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | VF |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tags | [] |
| tenant_id | ad804165fc554a2299bf1c4c761f1374 |
| updated | 2018-06-11T11:55:45Z |
| user_id | 40b75a00c54a487d8e3ceb05e4e92599 |

Version-Release number of selected component (if applicable):
OSP13 -p 2018-06-08.3

Steps to Reproduce:
0. deploy an OSP13 HA SR-IOV setup
1. create the environment:
wget http://file.tlv.redhat.com/~ekuris/custom_ci_image/rhel-guest-image-7.4-191.x86_64.qcow2
sudo yum -y install libguestfs-tools
sudo yum -y install libvirt && sudo systemctl start libvirtd
virt-customize -a rhel-guest-image-7.4-191.x86_64.qcow2 --root-password password:123456
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^disable_root: 1/disable_root: 0/' /etc/cloud/cloud.cfg
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^ssh_pwauth:\s+0/ssh_pwauth: 1/' /etc/cloud/cloud.cfg
openstack image create --container-format bare --disk-format qcow2 --public --file rhel-guest-image-7.4-191.x86_64.qcow2 rhel74
openstack network create net-64-1
openstack subnet create --subnet-range 10.0.1.0/24 --network net-64-1 --dhcp subnet_4_1
openstack router create Router_eNet
openstack router add subnet Router_eNet subnet_4_1
openstack router set --external-gateway nova Router_eNet
openstack flavor create --public m1.medium --id 3 --ram 1024 --disk 10 --vcpus 1
openstack port create --network net-64-1 --vnic-type direct direct_sriov
openstack port create --network net-64-1 --vnic-type direct-physical PF_sriov
openstack port create --network net-64-1 normal
openstack server create --flavor 3 --image rhel74 --nic port-id=direct_sriov VF
openstack server create --flavor 3 --image rhel74 --nic port-id=PF_sriov PF
openstack server create --flavor 3 --image rhel74 --nic port-id=normal Normal
openstack floating ip create nova
openstack floating ip create nova
openstack floating ip create nova
openstack server add floating ip VF <fip>
openstack server add floating ip PF <fip>
openstack server add floating ip Normal <fip>
openstack security group rule create --protocol icmp --ingress --prefix 0.0.0.0/0 <sec-id>
openstack security group rule create --protocol tcp --ingress --prefix 0.0.0.0/0 <sec-id>
2. run the minor update process:
3. sudo rhos-release 13 -p 2018-06-08.3
4. openstack undercloud upgrade | tee undercloud_upgrade.log
5. add docker-registry.engineering.redhat.com to /etc/sysconfig/docker (sudo vi /etc/sysconfig/docker)
6. sudo systemctl restart docker.service
7. source stackrc
8. openstack overcloud container image prepare --namespace docker-registry.engineering.redhat.com/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-images-file /home/stack/update-container-images.yaml
9. sudo openstack overcloud container image upload --config-file ~/update-container-images.yaml --verbose
10. openstack overcloud container image prepare --namespace 192.168.24.1:8787/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-env-file ~/update-container-params.yaml
11. ansible -i /usr/bin/tripleo-ansible-inventory overcloud -m shell -b -a "yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 13 -p 2018-06-08.3"
12. openstack overcloud update prepare --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --container-registry-file ~/update-container-params.yaml
13. openstack overcloud update run --nodes Controller
14. openstack overcloud update run --nodes computesriov-0
15. openstack overcloud update run --nodes computesriov-1
16. openstack overcloud update converge --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml -e ~/update-container-params.yaml | tee update_converge.log
THT files: https://code.engineering.redhat.com/gerrit/gitweb?p=Neutron-QE.git;a=tree;f=BM_heat_template/ospd-13-vlan-multiple-nic-sriov-hybrid-ha;h=19d6eb0ceb99a4d79cb30d72f4aac093bbe0d041;hb=refs/heads/master
17. After the minor update, remove the old instances
18. try to boot 3 instances as in step 1:
openstack server create --flavor 3 --image rhel74 --nic port-id=direct_sriov VF
openstack server create --flavor 3 --image rhel74 --nic port-id=PF_sriov PF
openstack server create --flavor 3 --image rhel74 --nic port-id=normal Normal