Bug 2023742 - [OSP16.2] Octavia load balancer is in ERROR state after ovs2ovn migration [NEEDINFO]
Summary: [OSP16.2] Octavia load balancer is in ERROR state after ovs2ovn migration
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: z2
Target Release: 17.1
Assignee: Arnau Verdaguer
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks: 2023747
 
Reported: 2021-11-16 12:26 UTC by Roman Safronov
Modified: 2023-08-07 15:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2023747
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
ifrangs: needinfo? (averdagu)



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-10823 | 0 | None | None | None | 2021-11-16 12:28:23 UTC

Description Roman Safronov 2021-11-16 12:26:40 UTC
Description of problem:
When running the ovs2ovn migration with a workload that includes an Octavia load balancer, the load balancer is in ERROR state at the end of the migration.

The issue happens (see details below) because a failover of the LB is triggered. The failover tries to create a VM plugged into the lb-mgmt-net, and the VM creation fails.
One possible solution (according to Greg, gthiemonge) is to disable the Octavia services (except octavia-api) during the migration.

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20211027.n.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OSP environment with the openvswitch firewall driver.
2. Create a workload that consists of 2 VMs connected to an internal network, each with a FIP on the external network. Create an Octavia load balancer and use the VMs as members. Make sure an Octavia health monitor is also running.
3. Run the ovs2ovn migration according to the official documentation.
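
For step 2, a minimal sketch of the load balancer part of the workload, written as a Python helper that only prints the `openstack loadbalancer` CLI calls. The resource names, subnet name, member IPs, HTTP protocol, port 80 and the health monitor timings are all hypothetical placeholders and are not taken from the original environment.

```python
# Hypothetical resource names; replace them with values from your environment.
LB, LISTENER, POOL = "lb1", "listener1", "pool1"
SUBNET = "internal-subnet"               # assumed name of the internal subnet
MEMBER_IPS = ["10.0.0.11", "10.0.0.12"]  # assumed fixed IPs of the two VMs


def workload_commands():
    """Return the openstack CLI calls (as argv lists) that build the load balancer."""
    cmds = [
        ["openstack", "loadbalancer", "create", "--name", LB,
         "--vip-subnet-id", SUBNET],
        ["openstack", "loadbalancer", "listener", "create", "--name", LISTENER,
         "--protocol", "HTTP", "--protocol-port", "80", LB],
        ["openstack", "loadbalancer", "pool", "create", "--name", POOL,
         "--lb-algorithm", "ROUND_ROBIN", "--listener", LISTENER,
         "--protocol", "HTTP"],
        ["openstack", "loadbalancer", "healthmonitor", "create",
         "--delay", "5", "--timeout", "5", "--max-retries", "3",
         "--type", "HTTP", POOL],
    ]
    # One member per VM.
    for ip in MEMBER_IPS:
        cmds.append(["openstack", "loadbalancer", "member", "create",
                     "--subnet-id", SUBNET, "--address", ip,
                     "--protocol-port", "80", POOL])
    return cmds


if __name__ == "__main__":
    # Print only; nothing is executed.
    for cmd in workload_commands():
        print(" ".join(cmd))
```

When running the printed commands by hand, wait for the load balancer to return to ACTIVE between steps, since Octavia rejects changes while it is in a PENDING state.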

Actual results:
Load balancer status is ERROR

Expected results:
Load balancer status is ACTIVE. 

Additional info from Greg (gthiemonge):
A failover of the LB was triggered after the migration because the Octavia health-manager service didn't receive any heartbeat packets from the amphora (I don't know what the expected behavior of the existing VMs is after the migration, but I guess triggering a failover of the load balancers is acceptable).

The failover tries to create a VM plugged into the lb-mgmt-net, but the creation failed:

On networker-0, we can see the exception in the health-manager logs:

/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker Traceback (most recent call last):
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker     result = task.execute(**arguments)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/controller/worker/v1/tasks/compute_tasks.py", line 249, in execute
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker     raise exceptions.ComputeBuildException(fault=fault)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2021-11-10T13:47:10Z', 'message': 'Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6619, in _create_domain_and_network\n    network_info)\n  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__\n    next(self.gen)\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event\n    actual_event = event.wait()\n  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n    result = hub.switch()\n  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n    return self.greenlet.switch()\neventlet.timeout.Timeout: 300 seconds\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2442, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3746, in spawn\n    cleanup_instance_disks=created_disks)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6642, in _create_domain_and_network\n    raise exception.VirtualInterfaceCreateException()\nnova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2168, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File 
"/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2508, in _build_and_run_instance\n    reason=msg)\nnova.exception.BuildAbortException: Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.\n'}
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:14.699 15 ERROR octavia.controller.worker.v1.controller_worker
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker [-] Amphora cc27198b-863d-4a56-aff9-a84986225b39 failover exception: Failed to build compute instance due to: {'code': 500, 'created': '2021-11-10T13:47:10Z', 'message': 'Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6619, in _create_domain_and_network\n    network_info)\n  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__\n    next(self.gen)\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event\n    actual_event = event.wait()\n  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n    result = hub.switch()\n  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n    return self.greenlet.switch()\neventlet.timeout.Timeout: 300 seconds\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2442, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3746, in spawn\n    cleanup_instance_disks=created_disks)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6642, in _create_domain_and_network\n    raise exception.VirtualInterfaceCreateException()\nnova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2168, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File 
"/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2508, in _build_and_run_instance\n    reason=msg)\nnova.exception.BuildAbortException: Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.\n'}: octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2021-11-10T13:47:10Z', 'message': 'Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6619, in _create_domain_and_network\n    network_info)\n  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__\n    next(self.gen)\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event\n    actual_event = event.wait()\n  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n    result = hub.switch()\n  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n    return self.greenlet.switch()\neventlet.timeout.Timeout: 300 seconds\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2442, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3746, in spawn\n    cleanup_instance_disks=created_disks)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6642, in _create_domain_and_network\n    raise exception.VirtualInterfaceCreateException()\nnova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File 
"/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2168, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2508, in _build_and_run_instance\n    reason=msg)\nnova.exception.BuildAbortException: Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.\n'}
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker Traceback (most recent call last):
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/controller/worker/v1/controller_worker.py", line 895, in failover_amphora
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     failover_amphora_tf.run()
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/engine.py", line 247, in run
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     for _state in self.run_iter(timeout=timeout):
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/engine.py", line 340, in run_iter
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     failure.Failure.reraise_if_any(er_failures)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/types/failure.py", line 339, in reraise_if_any
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     failures[0].reraise()
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/types/failure.py", line 346, in reraise
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     six.reraise(*self._exc_info)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     raise value
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     result = task.execute(**arguments)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/controller/worker/v1/tasks/compute_tasks.py", line 249, in execute
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker     raise exceptions.ComputeBuildException(fault=fault)
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2021-11-10T13:47:10Z', 'message': 'Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6619, in _create_domain_and_network\n    network_info)\n  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__\n    next(self.gen)\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event\n    actual_event = event.wait()\n  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n    result = hub.switch()\n  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n    return self.greenlet.switch()\neventlet.timeout.Timeout: 300 seconds\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2442, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3746, in spawn\n    cleanup_instance_disks=created_disks)\n  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6642, in _create_domain_and_network\n    raise exception.VirtualInterfaceCreateException()\nnova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2168, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File 
"/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2508, in _build_and_run_instance\n    reason=msg)\nnova.exception.BuildAbortException: Build of instance 8637dde8-1372-426b-a5e0-a92a8a237ce7 aborted: Failed to allocate the network(s), not rescheduling.\n'}
/var/log/containers/octavia/health-manager.log.5.gz:2021-11-10 13:47:17.006 15 ERROR octavia.controller.worker.v1.controller_worker 
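
To correlate the excerpt above with the nova logs, a small sketch that pulls the amphora ID and the failed instance ID out of health-manager log lines. The two regular expressions are written against the message texts shown above ("Amphora ... failover exception" and "Build of instance ... aborted"); the function name is hypothetical, and other Octavia versions may word these messages differently.

```python
import re

UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
AMPHORA_RE = re.compile(rf"Amphora ({UUID}) failover exception")
INSTANCE_RE = re.compile(rf"Build of instance ({UUID}) aborted")


def failed_failovers(lines):
    """Return (amphora_id, instance_id) pairs for failovers that failed to build a VM."""
    pairs = []
    for line in lines:
        amphora = AMPHORA_RE.search(line)
        if amphora:
            # The instance ID is on the same line when nova reported a build abort.
            instance = INSTANCE_RE.search(line)
            pairs.append((amphora.group(1),
                          instance.group(1) if instance else None))
    return pairs
```

The rotated logs are gzip-compressed, so they can be read with `gzip.open(path, "rt")` and passed to the function line by line.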

Related nova logs on compute-0 (/var/log/containers/nova/nova-compute.log.11.gz):

2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [req-d4e1a790-1578-4ef3-b300-a7e9e0269216 ffa80baa6b574c8ea4d2636eb02090a7 ba759e30bc8b44c58fc15f2a4cac0394 - default default] [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] Instance failed to spawn: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] Traceback (most recent call last):
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6619, in _create_domain_and_network
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     network_info)
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     next(self.gen)
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     actual_event = event.wait()
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     result = hub.switch()
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     return self.greenlet.switch()
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] eventlet.timeout.Timeout: 300 seconds
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] During handling of the above exception, another exception occurred:
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] Traceback (most recent call last):
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2668, in _build_resources
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     yield resources
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2442, in _build_and_run_instance
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     block_device_info=block_device_info)
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3746, in spawn
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     cleanup_instance_disks=created_disks)
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6642, in _create_domain_and_network
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]     raise exception.VirtualInterfaceCreateException()
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2021-11-10 13:47:09.029 8 ERROR nova.compute.manager [instance: 8637dde8-1372-426b-a5e0-a92a8a237ce7]
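
The chain of exceptions in the nova excerpt above (a 300 second timeout while waiting for the instance event, then the virtual interface failure) can be extracted per instance with a short sketch like the following. It assumes the nova-compute line layout shown above; the helper name and the regular expressions are illustrative and not part of nova.

```python
import re

# Matches nova-compute ERROR lines that carry an instance UUID, as in the excerpt above.
LINE_RE = re.compile(
    r"ERROR nova\.compute\.manager .*?\[instance: (?P<instance>[0-9a-f-]{36})\]\s*(?P<msg>.*)$"
)
# The last line of a traceback names the exception: "package.module.ExcName: message".
EXC_RE = re.compile(r"^(?P<exc>[\w.]+(?:Exception|Timeout|Error)): (?P<detail>.*)$")


def exceptions_by_instance(lines):
    """Map each instance UUID to the exception names logged for it, in first-seen order."""
    found = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        exc = EXC_RE.match(m.group("msg"))
        if exc:
            names = found.setdefault(m.group("instance"), [])
            if exc.group("exc") not in names:
                names.append(exc.group("exc"))
    return found
```

For the excerpt above this yields the timeout first and `VirtualInterfaceCreateException` second for instance 8637dde8-1372-426b-a5e0-a92a8a237ce7, which is the order in which nova raised them.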

Comment 1 Roman Safronov 2021-11-16 12:44:31 UTC
Note: please don't confuse this issue with BZ 2005964. BZ 2005964 was found on an environment that used the iptables_hybrid firewall driver and involved the Mellanox driver.

Comment 2 Roman Safronov 2021-11-16 14:33:22 UTC
One more update from Greg:

Roman Safronov <rsafrono> wrote:
> Greg,
> Is there a command for disabling the Octavia services? Or should users just stop the Octavia-related containers?
> Could you please suggest commands that customers should run before the ovs2ovn migration in order to disable the Octavia services properly? I think we need to include such instructions in the official documentation.

"systemctl stop tripleo_octavia_worker tripleo_octavia_housekeeping tripleo_octavia_health_manager tripleo_octavia_driver_agent" on Controllers and/or Networkers should prevent this issue.

Just an additional note: the LB was in ERROR because Octavia detected a connectivity issue during the migration and failed to recover from it, but the LB was fully functional after the end of the migration. There is no way to clear this ERROR status other than a failover, which creates a new VM.
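
A rough sketch of how the two points in this comment could be scripted: stopping the Octavia services before the migration, and failing over any load balancer left in ERROR afterwards. The service names come from the systemctl command quoted above. Everything else is an assumption: starting the services again with `systemctl start` is not stated in this bug, and the failover helper assumes that `openstack loadbalancer list -f json` returns objects with `id` and `provisioning_status` keys, which should be checked against the installed client. The default is a dry run that only prints.

```python
import json
import subprocess

# Service names are taken from the systemctl command quoted in this comment.
OCTAVIA_SERVICES = [
    "tripleo_octavia_worker",
    "tripleo_octavia_housekeeping",
    "tripleo_octavia_health_manager",
    "tripleo_octavia_driver_agent",
]


def systemctl_command(action):
    """Build one systemctl call covering every listed service."""
    # "start" is an assumed inverse of the "stop" suggested in the comment.
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    return ["systemctl", action, *OCTAVIA_SERVICES]


def failover_commands(list_json):
    """Build one failover call per load balancer whose provisioning_status is ERROR."""
    # Assumes the JSON keys "id" and "provisioning_status" (see lead-in).
    return [["openstack", "loadbalancer", "failover", lb["id"]]
            for lb in json.loads(list_json)
            if lb.get("provisioning_status") == "ERROR"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

The systemctl call has to be run on every Controller and/or Networker node that hosts these services. The failover commands need OpenStack credentials loaded in the environment.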

