Bug 1879546
| Field | Value |
|---|---|
| Summary | [OSP16.1] Migration from ML2/OVS to ML2/OVN fails when there are active VMs with SR-IOV direct-physical (PF) ports |
| Product | Red Hat OpenStack |
| Component | python-networking-ovn |
| Version | 16.1 (Train) |
| Status | CLOSED NEXTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Roman Safronov <rsafrono> |
| Assignee | Jakub Libosvar <jlibosva> |
| QA Contact | Eran Kuris <ekuris> |
| CC | apevec, froyo, gurpsing, hakhande, jlibosva, lhh, majopela, mblue, oblaut, ralonsoh, scohen, supadhya, ykarel |
| Keywords | Triaged |
| Target Milestone | beta |
| Target Release | 17.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2023-04-17 15:15:38 UTC |
| Bug Blocks | 2169448 |
Description
Roman Safronov 2020-09-16 13:57:39 UTC
An additional issue is that the environment has been broken since the moment the migration failed.

1) According to overcloud-deploy-ovn.sh.log (on the undercloud), the migration failed at approximately 2020/09/15 05:32:07 PM.

2) I see plenty of RabbitMQ errors in nova-compute on computesriov-0 (the node where the PF VM ran and where the interface has not been released):

```
[root@computesriov-0 ~]# grep -rc "ERROR oslo.messaging._drivers.impl_rabbit \[-\] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed" /var/log/containers/nova/nova-compute.log*
/var/log/containers/nova/nova-compute.log:4584
/var/log/containers/nova/nova-compute.log.1:1990
```

The oldest RabbitMQ error occurred at approximately the time the migration failed (I checked that all nodes are time-synchronized, using the UTC time zone):

```
2020-09-15 17:33:01.014 7 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
2020-09-15 17:33:05.497 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.: TimeoutError: [Errno 110] Connection timed out
2020-09-15 17:33:06.104 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.117 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:11.509 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.125 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:16.129 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 2.0 seconds): socket.timeout: timed out
2020-09-15 17:33:17.521 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:21.287 7 ERROR oslo.messaging._drivers.impl_rabbit [req-5899d69c-e256-43df-baf9-8c312b160af3 - - - - -] [8efb6c30-e81c-41bb-b349-c36114bb5f47] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:22.139 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [5c207a7d-9f12-42a1-848d-7eedeb769b36] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: socket.timeout: timed out
2020-09-15 17:33:23.150 7 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-09-15 17:33:23.532 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [277bcf3a-3e76-489b-835c-cc700fe830eb] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 2 seconds.: socket.timeout: timed out
...
```

And so on since then.
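For reference, a minimal sketch of how to pull the earliest AMQP failure out of the rotated nova-compute logs to confirm it lines up with the migration failure time (the grep pattern is illustrative, not taken from the report):

```bash
# Sketch: print the earliest oslo.messaging AMQP error across the rotated
# nova-compute logs; the ISO timestamp prefix makes a plain sort chronological.
grep -h "ERROR oslo.messaging._drivers.impl_rabbit" \
    /var/log/containers/nova/nova-compute.log* | sort | head -n 1
```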
3) I tried to delete the VM whose PF port was 'captured' by libvirt, the PF port itself, and the FIP that was associated with it, but the interface was not released:

```
[heat-admin@computesriov-0 ~]$ ip link show enp6s0f3
Device "enp6s0f3" does not exist.
```

The same picture appears in the Galera DB. The following command shows that 0 PF interfaces are available on computesriov-0:

```
[root@controller-0 ~]# podman exec -it -uroot galera-bundle-podman-0 mysql --skip-column-names nova -e 'select hypervisor_hostname,pci_stats from compute_nodes where hypervisor_hostname="computesriov-0.localdomain";'
| computesriov-0.localdomain | {"nova_object.name": "PciDevicePoolList", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"objects": [{"nova_object.name": "PciDevicePool", "nova_object.namespace": "nova", "nova_object.version": "1.1", "nova_object.data": {"product_id": "1572", "vendor_id": "8086", "numa_node": 0, "tags": {"dev_type": "type-PF", "physical_network": "datacentre", "trusted": "true"}, "count": 0}, "nova_object.changes": ["numa_node", "vendor_id", "count", "tags", "product_id"]}]}, "nova_object.changes": ["objects"]} |
```

Note: I retested with a workload that uses only VF ports (no PF ports); the OVN migration passed in that case.

I also tested the OVN migration on an SR-IOV environment with existing workload VMs that have VF and PF ports, but with the workload VMs shut off during the migration. The migration passed successfully; after the migration to OVN, the VMs were turned on again and connectivity was OK.

[600d bug triage] As the issue has not appeared again, we can close this BZ; feel free to reopen if it appears again.
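For anyone debugging a similar state, a hedged follow-up query, assuming the standard upstream nova schema (the pci_devices table and its columns are not shown in this report), that lists the per-device PCI records behind the pci_stats pool above; a record stuck in status 'allocated' with a stale instance_uuid would explain the pool count of 0:

```bash
# Sketch only, assuming the standard nova schema: show the individual PCI
# device records for computesriov-0 so a leaked "allocated" PF stands out.
podman exec -it -uroot galera-bundle-podman-0 mysql nova -e \
  'SELECT address, dev_type, status, instance_uuid
     FROM pci_devices
    WHERE deleted = 0
      AND compute_node_id IN (SELECT id FROM compute_nodes
                               WHERE hypervisor_hostname = "computesriov-0.localdomain");'
```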
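And a sketch of the workaround implied by the passing retest above (illustrative only; it assumes an admin overcloudrc, and `binding_vnic_type` is how python-openstackclient names the binding:vnic_type port attribute): shut off every VM that has a direct-physical port bound before starting the migration, and power them back on once it completes.

```bash
# Hedged workaround sketch based on the retest above: stop any server that
# has a direct-physical (PF) port bound before running the ML2/OVS -> ML2/OVN
# migration; start the servers again once the migration has finished.
source ~/overcloudrc
for port in $(openstack port list -f value -c ID); do
    vnic_type=$(openstack port show "$port" -f value -c binding_vnic_type)
    if [ "$vnic_type" = "direct-physical" ]; then
        server=$(openstack port show "$port" -f value -c device_id)
        [ -n "$server" ] && openstack server stop "$server"
    fi
done
```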