Bug 1879546 - [OSP16.1]Migration from ML2OVS to ML2OVN fails in case there are active VMs with SRIOV direct-physical (PF) ports
Summary: [OSP16.1]Migration from ML2OVS to ML2OVN fails in case there are active VMs with SRIOV direct-physical (PF) ports
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.1
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks: 2169448
 
Reported: 2020-09-16 13:57 UTC by Roman Safronov
Modified: 2023-09-04 20:25 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2169448 (view as bug list)
Environment:
Last Closed: 2023-04-17 15:15:38 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-501 0 None None None 2021-11-18 14:31:19 UTC

Description Roman Safronov 2020-09-16 13:57:39 UTC
Description of problem:
Migration from ML2OVS to ML2OVN fails when some PF/VF ports are in use.

From ~/overcloud-deploy-ovn.sh.log:
"[2020/09/15 05:32:07 PM] [ERROR] nic enp6s0f3 not found in available nics (eno1, eno2, eno3, eno4, enp5s0f0, enp5s0f1, enp5s0f2, enp5s0f3, enp6s0f0, enp6s0f1, enp6s0f2, ib0, ib1)",



This environment is configured to use the following compute I/Fs for SRIOV:
At computesriov-0: enp6s0f3 (with 5 VFs, 0 to 4)
At computesriov-1: enp7s0f3 (with 5 VFs, 0 to 4)

A VM with a PF port is running on computesriov-0, so enp6s0f3 is not available on the host because it is in use by the VM.
[root@computesriov-0 ~]# ip link show enp6s0f3
Device "enp6s0f3" does not exist.
If the VM is removed, the interface becomes available again.
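
The device disappears from the host because a direct-physical (PF) port hands the whole PF to the guest via PCI passthrough, so the kernel netdev is unbound from the host driver. A minimal way to confirm this on the compute node (a sketch; the PCI address 0000:06:00.3 is only an illustration and would need to be taken from, e.g., the nova pci_devices table or from an 'ethtool -i enp6s0f3' run before the passthrough):

# Sketch: check which driver owns the PF's PCI device; "vfio-pci" means it is assigned to a guest.
lspci -ks 0000:06:00.3
readlink /sys/bus/pci/devices/0000:06:00.3/driver
# The netdev only reappears under /sys/class/net once the device is returned to the host driver.
ls /sys/class/net/ | grep enp6s0f3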


Something similar happens on computesriov-1 with a VM that is using a VF port (in this case, enp7s0f3v4).
[root@computesriov-1 ~]# ip link show enp7s0f3
59: enp7s0f3: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000                                                                                                                      
    link/ether f8:f2:1e:16:df:c6 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3     link/ether de:34:7f:5b:95:ec brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state enable, trust off
    vf 4     link/ether fa:16:3e:c1:42:5f brd ff:ff:ff:ff:ff:ff, vlan 358, spoof checking on, link-state enable, trust off                                                                                                                  
[root@computesriov-1 ~]# ip link show enp7s0f3v4
Device "enp7s0f3v4" does not exist.


So we need to understand what to recommend to our customers with SRIOV environments.
Will it be supported to migrate to OVN while VMs that use SRIOV ports are running?
What should the prerequisites for such a migration be?
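
For reference, one way to build an inventory of the workloads that would be affected is to list every port whose vnic_type is direct (VF) or direct-physical (PF) together with the server it is bound to. A rough sketch using standard openstack CLI calls (the binding_vnic_type / device_id field names are as displayed by python-openstackclient and should be treated as assumptions):

# Sketch: list SRIOV ports and the VMs (device_id) that use them. Assumes an admin rc file is sourced.
for port in $(openstack port list -f value -c ID); do
    vnic=$(openstack port show "$port" -f value -c binding_vnic_type)
    case "$vnic" in
        direct|direct-physical)
            echo "port=$port vnic_type=$vnic server=$(openstack port show "$port" -f value -c device_id)" ;;
    esac
done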

Feel free to change the component to a more relevant one.


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200903.n.0

How reproducible:
Tried once and this happened

Steps to Reproduce:
1. Install HA SRIOV environment with ML2OVS (3 virt controllers + 2 baremetal computes with SRIOV-compatible interfaces)
2. Launch instances that use SRIOV ports
3. Adjust templates according to the official documentation [1] and start migration to ML2OVN


Actual results:
Migration fails due to missing interfaces on compute nodes

Expected results:
Migration succeeds 

Additional info:

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/networking_with_open_virtual_network/migrating-ml2ovs-to-ovn#ml2-ovs-to-ovn-migration-prepare

Comment 2 Roman Safronov 2020-09-17 10:44:01 UTC
An additional issue is that the environment has been broken since the moment the migration failed.

1) According to overcloud-deploy-ovn.sh.log (on the undercloud), the migration failed at approximately 2020/09/15 05:32:07 PM.

2) I see plenty of rabbit errors in computesriov-0's nova-compute log (the PF VM ran on computesriov-0, where the interface has not been released):
[root@computesriov-0 ~]# grep -rc "ERROR oslo.messaging._drivers.impl_rabbit \[-\] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed" /var/log/containers/nova/nova-compute.log*                                  
/var/log/containers/nova/nova-compute.log:4584
/var/log/containers/nova/nova-compute.log.1:1990
The oldest rabbit error happened at approximately the time the migration failed (I checked that all nodes are time-synchronized, using the UTC time zone):

2020-09-15 17:33:01.014 7 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
2020-09-15 17:33:05.497 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.: TimeoutError: [Errno 110] Connection timed out
2020-09-15 17:33:06.104 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.117 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.509 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.125 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.129 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 2.0 seconds): socket.timeout: timed out
2020-09-15 17:33:17.521 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:21.287 7 ERROR oslo.messaging._drivers.impl_rabbit [req-5899d69c-e256-43df-baf9-8c312b160af3 - - - - -] [8efb6c30-e81c-41bb-b349-c36114bb5f47] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:22.139 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:23.150 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:23.532 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 2 seconds.: socket.timeout: timed out
...
And so on since then.
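
For completeness, the earliest occurrence can be pulled out of the rotated logs with something like the following (a sketch; it relies on the log lines starting with the timestamp so that a plain sort orders them chronologically):

# Sketch: earliest rabbit connection error across all rotated nova-compute logs.
grep -h "ERROR oslo.messaging._drivers.impl_rabbit" /var/log/containers/nova/nova-compute.log* | sort | head -n 1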


3) I tried to delete the VM with the PF port that was 'captured' by libvirt, the PF port itself, and the FIP that was associated with it, but the interface was not released.
[heat-admin@computesriov-0 ~]$ ip link show enp6s0f3
Device "enp6s0f3" does not exist.

The same result is visible in the Galera DB. The following command shows that 0 PF devices are available on computesriov-0:
[root@controller-0 ~]# podman exec -it -uroot galera-bundle-podman-0 mysql --skip-column-names nova -e 'select hypervisor_hostname,pci_stats from compute_nodes where hypervisor_hostname="computesriov-0.localdomain";' 
computesriov-0.localdomain | {"nova_object.name": "PciDevicePoolList", "nova_object.namespace": "nova", "nova_object.version": "1.1",
  "nova_object.data": {"objects": [{"nova_object.name": "PciDevicePool", "nova_object.namespace": "nova", "nova_object.version": "1.1",
  "nova_object.data": {"product_id": "1572", "vendor_id": "8086", "numa_node": 0,
  "tags": {"dev_type": "type-PF", "physical_network": "datacentre", "trusted": "true"}, "count": 0},
  "nova_object.changes": ["numa_node", "vendor_id", "count", "tags", "product_id"]}]},
  "nova_object.changes": ["objects"]}
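
A per-device view of the same data can be taken from the nova pci_devices table, which also shows which instance each PF/VF is still allocated to (a sketch; the table and column names below are taken from the Nova DB schema as I understand it and should be treated as assumptions):

# Sketch: show the status and owning instance of every SRIOV device Nova tracks on computesriov-0.
podman exec -it -uroot galera-bundle-podman-0 mysql nova -e \
  'SELECT address, dev_type, status, instance_uuid FROM pci_devices
   WHERE deleted = 0
     AND compute_node_id IN (SELECT id FROM compute_nodes
                             WHERE hypervisor_hostname = "computesriov-0.localdomain" AND deleted = 0);'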

Comment 4 Roman Safronov 2020-09-23 08:44:49 UTC
Note: I retested with a workload that uses only VF ports (no PF ports). OVN migration passed in this case.

Comment 6 Roman Safronov 2022-03-23 11:10:09 UTC
I tested the OVN migration on an SRIOV environment with existing workload VMs that have VF and PF ports, but with the workload VMs shut off during the migration. The migration passed successfully. After the migration to OVN, the VMs were turned on again and connectivity was OK.
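
Based on this, a pragmatic pre-migration step appears to be shutting off every instance that holds a direct or direct-physical port and starting it again once the migration has finished. A rough sketch (the sriov_servers.txt file of server UUIDs is hypothetical; it could be produced with the port inventory loop from the description):

# Sketch: stop the SRIOV workload before the migration and start it again afterwards.
while read -r uuid; do openstack server stop "$uuid"; done < sriov_servers.txt
# ... run the ML2OVS -> ML2OVN migration ...
while read -r uuid; do openstack server start "$uuid"; done < sriov_servers.txt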

Comment 18 Fernando Royo 2023-04-17 15:15:38 UTC
[600d bug triage] As the issue has not appeared again, we can close the BZ; feel free to reopen if it appears again.

