Description of problem:

Migration from ML2OVS to OVN fails when some PF/VF ports are in use.

From ~/overcloud-deploy-ovn.sh.log:

[2020/09/15 05:32:07 PM] [ERROR] nic enp6s0f3 not found in available nics (eno1, eno2, eno3, eno4, enp5s0f0, enp5s0f1, enp5s0f2, enp5s0f3, enp6s0f0, enp6s0f1, enp6s0f2, ib0, ib1)

This environment is configured to use the following compute interfaces for SR-IOV:
At compute-0: enp6s0f3 (with 5 VFs, 0 to 4)
At compute-1: enp7s0f3 (with 5 VFs, 0 to 4)

A PF VM is running on computesriov-0, so enp6s0f3 is not available because it is in use:

[root@computesriov-0 ~]# ip link show enp6s0f3
Device "enp6s0f3" does not exist.

If the VM were removed, the interface would become available again.

Something similar happens on computesriov-1 with the VM that is using a VF port (in this case, enp7s0f3v4):

[root@computesriov-1 ~]# ip link show enp7s0f3
59: enp7s0f3: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether f8:f2:1e:16:df:c6 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether de:34:7f:5b:95:ec brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state enable, trust off
    vf 4 link/ether fa:16:3e:c1:42:5f brd ff:ff:ff:ff:ff:ff, vlan 358, spoof checking on, link-state enable, trust off
[root@computesriov-1 ~]# ip link show enp7s0f3v4
Device "enp7s0f3v4" does not exist.

So we need to understand what to recommend to our customers with SR-IOV environments. Will migrating to OVN with running VMs that use SR-IOV ports be supported? What should the prerequisites for such a migration be? Feel free to change the component to a more relevant one.
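For reference, the failure above is an availability pre-check: the configured SR-IOV NIC must appear among the NICs visible on the host, and a PF attached to a running VM is no longer visible. A minimal sketch of that kind of check (the function name is illustrative, not the actual migration-script code; the NIC lists are taken from the log above):

```python
def find_missing_nics(configured, available):
    """Return configured NICs that are not visible on the host."""
    available_set = set(available)
    return [nic for nic in configured if nic not in available_set]

# NICs visible on computesriov-0, per the error message in the log
available_nics = [
    "eno1", "eno2", "eno3", "eno4",
    "enp5s0f0", "enp5s0f1", "enp5s0f2", "enp5s0f3",
    "enp6s0f0", "enp6s0f1", "enp6s0f2", "ib0", "ib1",
]
# The SR-IOV PF configured for this compute node
configured_nics = ["enp6s0f3"]

# enp6s0f3 is reported missing because the PF is attached to a running VM
missing = find_missing_nics(configured_nics, available_nics)
```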
Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200903.n.0

How reproducible:
Tried once and this happened

Steps to Reproduce:
1. Install an HA SR-IOV environment with ML2OVS (3 virtual controllers + 2 bare-metal computes with SR-IOV-compatible interfaces)
2. Launch instances that use SR-IOV ports
3. Adjust templates according to the official documentation [1] and start the migration to ML2OVN

Actual results:
Migration fails due to missing interfaces on the compute nodes

Expected results:
Migration succeeds

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/networking_with_open_virtual_network/migrating-ml2ovs-to-ovn#ml2-ovs-to-ovn-migration-prepare
An additional issue is that the environment has been broken since the moment the migration failed.

1) According to overcloud-deploy-ovn.sh.log (on the undercloud), the migration failed at ~2020/09/15 05:32:07 PM.

2) There are plenty of rabbit errors in nova-compute on computesriov-0 (the PF VM ran on compute-0, where the interface was not released):

[root@computesriov-0 ~]# grep -rc "ERROR oslo.messaging._drivers.impl_rabbit \[-\] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed" /var/log/containers/nova/nova-compute.log*
/var/log/containers/nova/nova-compute.log:4584
/var/log/containers/nova/nova-compute.log.1:1990

The oldest rabbit error happened at approximately the time the migration failed (I checked that all nodes are synchronized, with UTC time zone):

2020-09-15 17:33:01.014 7 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
2020-09-15 17:33:05.497 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.: TimeoutError: [Errno 110] Connection timed out
2020-09-15 17:33:06.104 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.117 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.509 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.125 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.129 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 2.0 seconds): socket.timeout: timed out
2020-09-15 17:33:17.521 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:21.287 7 ERROR oslo.messaging._drivers.impl_rabbit [req-5899d69c-e256-43df-baf9-8c312b160af3 - - - - -] [8efb6c30-e81c-41bb-b349-c36114bb5f47] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:22.139 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:23.150 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:23.532 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 2 seconds.: socket.timeout: timed out

... and so on since then.

3) I tried to delete the VM with the PF port that was 'captured' by libvirt, the PF port itself, and the FIP that was associated with it, but the interface was not released.

[heat-admin@computesriov-0 ~]$ ip link show enp6s0f3
Device "enp6s0f3" does not exist.

Same result in the galera DB.
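For anyone reproducing the triage, the grep above simply counts the "Connection failed" retry lines; a miniature equivalent in Python, run over a couple of lines from the excerpt (the excerpt string here is just sample data from this report):

```python
import re

# Same match as the grep -c above: oslo.messaging rabbit driver logging
# a failed connection attempt (not the "AMQP server ... unreachable" lines).
pattern = re.compile(
    r"ERROR oslo\.messaging\._drivers\.impl_rabbit \[-\] "
    r"Connection failed: timed out"
)

excerpt = """\
2020-09-15 17:33:06.104 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.117 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.509 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
"""

# Counts only the two "Connection failed" lines in this excerpt
failed_count = sum(bool(pattern.search(line)) for line in excerpt.splitlines())
```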
This command shows that 0 PF interfaces are available on computesriov-0:

[root@controller-0 ~]# podman exec -it -uroot galera-bundle-podman-0 mysql --skip-column-names nova -e 'select hypervisor_hostname,pci_stats from compute_nodes where hypervisor_hostname="computesriov-0.localdomain";'
computesriov-0.localdomain
{"nova_object.name": "PciDevicePoolList", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"objects": [{"nova_object.name": "PciDevicePool", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"product_id": "1572", "vendor_id": "8086", "numa_node": 0, "tags": {"dev_type": "type-PF", "physical_network": "datacentre", "trusted": "true"}, "count": 0}, "nova_object.changes": ["numa_node", "vendor_id", "count", "tags", "product_id"]}]}, "nova_object.changes": ["objects"]}
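The pci_stats blob is nova's serialized PciDevicePoolList; the zero in it is easy to miss, so here is a small sketch that decodes the blob shown above and sums the available count for type-PF pools (the helper name is illustrative):

```python
import json

# pci_stats column for computesriov-0.localdomain, copied from the query above
pci_stats_json = '{"nova_object.name": "PciDevicePoolList", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"objects": [{"nova_object.name": "PciDevicePool", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"product_id": "1572", "vendor_id": "8086", "numa_node": 0, "tags": {"dev_type": "type-PF", "physical_network": "datacentre", "trusted": "true"}, "count": 0}, "nova_object.changes": ["numa_node", "vendor_id", "count", "tags", "product_id"]}]}, "nova_object.changes": ["objects"]}'

def available_pf_count(raw):
    """Sum the 'count' field of pools tagged dev_type == type-PF."""
    stats = json.loads(raw)
    total = 0
    for pool in stats["nova_object.data"]["objects"]:
        data = pool["nova_object.data"]
        if data.get("tags", {}).get("dev_type") == "type-PF":
            total += data["count"]
    return total

# Returns 0 here: no PF devices left in the pool on computesriov-0
pf_count = available_pf_count(pci_stats_json)
```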
Note: I retested with a workload that uses only VF ports (no PF ports). The OVN migration passed in this case.
I tested the OVN migration on an SR-IOV environment with existing workload VMs that have VF and PF ports, but with the workload VMs shut off during the migration. The migration passed successfully. After the migration to OVN, the VMs were turned on again and connectivity was OK.
[600d bug triage] As the issue did not appear again, we can close this BZ; feel free to reopen if it appears again.