Description of problem: RHOS 10 sriov deployment on RHEL 7.4 deployed incorrectly. On RHEL 7.3 the Version-Release number of selected component (if applicable): RHOS 10 RHEL 7.4 How reproducible: Perform RHOS 10 sriov deployment using rhel 7.4 os. Actual results: The VF is missing on the compute until additional manual restart of the compute node. Expected results: The VF should appear on the compute node. Additional info: Manual reboot of the compute fix the count and add the VF. Compute: -------- VF count on the compute before the reboot: [root@compute-0 ~]# cat /sys/class/net/ens2f0/device/sriov_numvfs 0 [root@compute-0 ~]# cat /sys/class/net/ens2f1/device/sriov_numvfs 0 VF count on the compute after the reboot: [root@compute-0 ~]# cat /sys/class/net/ens2f0/device/sriov_numvfs 5 [root@compute-0 ~]# cat /sys/class/net/ens2f1/device/sriov_numvfs 5 Controller: ----------- Mysql nova database before the compute reboot: MariaDB [nova]> select * from pci_devices; +---------------------+------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+-------------+ | created_at | updated_at | deleted_at | deleted | id | compute_node_id | address | product_id | vendor_id | dev_type | dev_id | label | status | extra_info | instance_uuid | request_id | numa_node | parent_addr | +---------------------+------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+-------------+ | 2017-08-16 15:28:31 | NULL | NULL | 0 | 1 | 1 | 0000:0b:00.0 | 10fb | 8086 | type-PF | pci_0000_0b_00_0 | label_8086_10fb | available | {} | NULL | NULL | 0 | NULL | | 2017-08-16 15:28:31 | NULL | NULL | 0 | 2 | 1 | 0000:0b:00.1 | 10fb | 8086 | type-PF | pci_0000_0b_00_1 | label_8086_10fb | available | {} 
| NULL | NULL | 0 | NULL | +---------------------+------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+-------------+ 2 rows in set (0.00 sec) MariaDB [nova]> select hypervisor_hostname, pci_stats from compute_nodes; +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | hypervisor_hostname | pci_stats | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | compute-0.localdomain | {"nova_object.version": "1.1", "nova_object.changes": ["objects"], "nova_object.name": "PciDevicePoolList", "nova_object.data": {"objects": [{"nova_object.version": "1.1", "nova_object.changes": ["count", "numa_node", "vendor_id", "product_id", "tags"], "nova_object.name": "PciDevicePool", "nova_object.data": {"count": 2, "numa_node": 0, "vendor_id": "8086", "product_id": "10fb", "tags": {"dev_type": "type-PF", "physical_network": "sriov"}}, "nova_object.namespace": 
"nova"}]}, "nova_object.namespace": "nova"} | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) Mysql nova database after the compute reboot: MariaDB [nova]> select * from pci_devices; +---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+--------------+ | created_at | updated_at | deleted_at | deleted | id | compute_node_id | address | product_id | vendor_id | dev_type | dev_id | label | status | extra_info | instance_uuid | request_id | numa_node | parent_addr | +---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+--------------+ | 2017-08-16 15:28:31 | 2017-08-17 06:07:23 | NULL | 0 | 1 | 1 | 0000:0b:00.0 | 10fb | 8086 | type-PF | pci_0000_0b_00_0 | label_8086_10fb | available | {} | NULL | NULL | 0 | NULL | | 2017-08-16 15:28:31 | 2017-08-17 06:07:23 | NULL | 0 | 2 | 1 | 0000:0b:00.1 | 10fb | 8086 | type-PF | pci_0000_0b_00_1 | label_8086_10fb | available | {} | NULL | NULL | 0 | NULL | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 3 | 1 | 0000:0b:10.0 | 10ed | 8086 | type-VF | pci_0000_0b_10_0 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.0 | | 2017-08-17 
06:07:23 | NULL | NULL | 0 | 4 | 1 | 0000:0b:10.1 | 10ed | 8086 | type-VF | pci_0000_0b_10_1 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.1 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 5 | 1 | 0000:0b:10.2 | 10ed | 8086 | type-VF | pci_0000_0b_10_2 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.0 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 6 | 1 | 0000:0b:10.3 | 10ed | 8086 | type-VF | pci_0000_0b_10_3 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.1 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 7 | 1 | 0000:0b:10.4 | 10ed | 8086 | type-VF | pci_0000_0b_10_4 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.0 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 8 | 1 | 0000:0b:10.5 | 10ed | 8086 | type-VF | pci_0000_0b_10_5 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.1 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 9 | 1 | 0000:0b:10.6 | 10ed | 8086 | type-VF | pci_0000_0b_10_6 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.0 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 10 | 1 | 0000:0b:10.7 | 10ed | 8086 | type-VF | pci_0000_0b_10_7 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.1 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 11 | 1 | 0000:0b:11.0 | 10ed | 8086 | type-VF | pci_0000_0b_11_0 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.0 | | 2017-08-17 06:07:23 | NULL | NULL | 0 | 12 | 1 | 0000:0b:11.1 | 10ed | 8086 | type-VF | pci_0000_0b_11_1 | label_8086_10ed | available | {} | NULL | NULL | 0 | 0000:0b:00.1 | +---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+---------------+------------+-----------+--------------+ 12 rows in set (0.00 sec) MariaDB [nova]> select hypervisor_hostname, pci_stats from compute_nodes; 
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | hypervisor_hostname | pci_stats | +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | compute-0.localdomain | {"nova_object.version": "1.1", "nova_object.changes": ["objects"], "nova_object.name": "PciDevicePoolList", "nova_object.data": {"objects": [{"nova_object.version": "1.1", 
"nova_object.changes": ["count", "numa_node", "vendor_id", "product_id", "tags"], "nova_object.name": "PciDevicePool", "nova_object.data": {"count": 2, "numa_node": 0, "vendor_id": "8086", "product_id": "10fb", "tags": {"dev_type": "type-PF", "physical_network": "sriov"}}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["count", "numa_node", "vendor_id", "product_id", "tags"], "nova_object.name": "PciDevicePool", "nova_object.data": {"count": 10, "numa_node": 0, "vendor_id": "8086", "product_id": "10ed", "tags": {"dev_type": "type-VF", "physical_network": "sriov"}}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"} | +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
Created attachment 1314563: Sosreport of the compute before the reboot
On RHEL 7.3 the deployment passed correctly with the same templates.
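The before/after VF check from the report can be scripted so that several interfaces are read in one pass. This is a minimal sketch, assuming the standard sysfs layout quoted in the report; the function name `report_numvfs` and the `SYSFS_NET` override are hypothetical and exist only to make the check easy to try outside a real compute node.

```shell
#!/bin/bash
# Print "<iface> <numvfs>" for each interface passed as an argument.
# SYSFS_NET is a hypothetical override; by default the real sysfs path is used.
report_numvfs() {
    local root="${SYSFS_NET:-/sys/class/net}"
    local iface file
    for iface in "$@"; do
        file="$root/$iface/device/sriov_numvfs"
        if [ -r "$file" ]; then
            # The file holds the number of VFs currently enabled on the PF.
            printf '%s %s\n' "$iface" "$(cat "$file")"
        else
            # No sriov_numvfs file: interface missing or not SR-IOV capable.
            printf '%s no-sriov\n' "$iface"
        fi
    done
}

# Example invocation on the compute node:
#   report_numvfs ens2f0 ens2f1
```

On the node in this report, the values would correspond to the `cat` output shown in the description: 0 for both interfaces before the reboot and 5 for both afterwards.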
There appears to be a problem with the network templates being used. The network configuration files for the interfaces indicated in the report do not contain NM_CONTROLLED=yes, as required on RHEL. On RHEL, unlike on CentOS, the udev rules are not fired when the PCI device is hotplugged "back" into the system. We rely on NetworkManager to recognize the network device and bring it "up", which ultimately causes the allocate_vfs script to be called. It is curious that the same templates worked on RHEL 7.3, as it should have behaved the same way.
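To check which interface files lack the setting mentioned above, something like the following could be run on the compute node. It is only a sketch: the function name `list_not_nm_controlled` and the `CFG_DIR` override are hypothetical, and the default directory is assumed to be the usual RHEL location for ifcfg files.

```shell
#!/bin/bash
# List ifcfg files that do not explicitly set NM_CONTROLLED=yes.
# CFG_DIR is a hypothetical override for the assumed default location.
list_not_nm_controlled() {
    local dir="${CFG_DIR:-/etc/sysconfig/network-scripts}"
    local f
    for f in "$dir"/ifcfg-*; do
        # Skip the literal pattern when no ifcfg files exist.
        [ -e "$f" ] || continue
        # Accept NM_CONTROLLED=yes with or without double quotes.
        grep -Eq '^NM_CONTROLLED="?yes"?[[:space:]]*$' "$f" || basename "$f"
    done
}

# Example invocation:
#   list_not_nm_controlled
```

Any SR-IOV physical function that shows up in the output would be a candidate for the template problem described in this comment.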
Hi Brent,

Verified with Intel NICs also. See compute.yaml:

type: interface
name: p1p1
use_dhcp: false
defroute: false
nm_controlled: true
hotplug: true

And on the compute node, ifcfg-p1p1 contains nm_controlled=yes.

Additional suggestions?
Okay, this differs somewhat from the sosreport attached to the bug, where there are no interfaces with NM_CONTROLLED=yes. Can you verify that you are getting the same behavior and, if so, provide a sosreport or similar for your test environment? We are mainly looking for the contents of the interface files and the message logs, particularly PCI plugging and NetworkManager.
I did a simple test against a RHEL 7.4 install that included the types of scripts and interface file modifications that tripleo would create. The NetworkManager, ifup-local* and allocate_vfs script mechanisms seem to do the job of re-initializing the VF count. I think it will expedite things if I can get access to a system that is exhibiting the problem behavior.
I've located the problem and am testing a potential fix. The cause was a regression introduced by a recent fix I made to allow updates on compute nodes where guest instances had "consumed" a physical function and the PCI device was not available.
Some clarifications with respect to this bug:
- this is a regression introduced by https://review.openstack.org/#/c/478503/
- the bug is that puppet no longer writes the VF counts defined in the heat variables to the interface's sriov_numvfs file (e.g. /sys/class/net/ens2f0/device/sriov_numvfs)
- the regression was backported throughout current versions and consequently will also need to be fixed throughout
- there is a workaround that does not require a compute node reboot: ifdown/ifup of the affected interfaces
- this is not specific to any particular version of RHEL.
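The ifdown/ifup workaround listed above could be applied to several interfaces with a small helper. This is a sketch under assumptions: the function name `bounce_ifaces` and the `DRY_RUN` switch are hypothetical, and it assumes the `ifdown` and `ifup` commands from the RHEL network scripts are available on the compute node.

```shell
#!/bin/bash
# Bounce each named interface so that the ifup path runs again.
# With DRY_RUN=1 (hypothetical switch) the commands are printed, not executed.
bounce_ifaces() {
    local iface
    for iface in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ifdown $iface"
            echo "ifup $iface"
        else
            # Only bring the interface back up if taking it down succeeded.
            ifdown "$iface" && ifup "$iface"
        fi
    done
}

# Example invocation (as root on the compute node):
#   bounce_ifaces ens2f0 ens2f1
```

After bouncing the interfaces, reading /sys/class/net/<interface>/device/sriov_numvfs again should show whether the VF count was restored without a reboot.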
Setting blocker flag to '+' following discussion on rhos-pgm ML.
Verified On NFV-CI for RHOS 10
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2654