Bug 1974831
| Summary: | Loadbalancer/amphora stuck in ERROR, possible keepalived issue | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Priscila <pveiga> |
| Component: | openstack-octavia | Assignee: | Gregory Thiemonge <gthiemon> |
| Status: | CLOSED ERRATA | QA Contact: | Bruna Bonguardo <bbonguar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | bbonguar, bperkins, gregraka, gthiemon, ihrachys, lpeer, majopela, njohnston, scohen |
| Target Milestone: | z7 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-octavia-5.0.3-1.20210712123304.8c32d2e.el8ost | Doc Type: | Bug Fix |
| Doc Text: | With this update, a problem is resolved that prevented the RHOSP Load-balancing service (octavia) from failing over load balancers with multiple failed amphorae. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-12-09 20:20:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I checked the logs; this is another amphora_driver_tasks task that is failing:
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker Traceback (most recent call last):
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker result = task.execute(**arguments)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py", line 133, in execute
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker loadbalancer, amphorae[amphora_index], timeout_dict)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 296, in reload
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker self._apply('reload_listener', loadbalancer, amphora, timeout_dict)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 293, in _apply
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker amp, loadbalancer.id, *args)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 899, in _action
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker return exc.check_exception(r)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 44, in check_exception
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker raise responses[status_code]()
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.drivers.haproxy.exceptions.NotFound: Not Found
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker
2021-06-16 11:50:24.579 129 DEBUG octavia.controller.worker.v1.controller_worker [-] Task '1-amphora-reload-listener' (ed7df91a-6153-4a09-be17-c4f8c3a3d48f) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
This traceback corresponds to the `AmphoraIndexListenersReload` class:
https://opendev.org/openstack/octavia/src/branch/stable/train/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L128-L144
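The `NotFound` in the traceback comes from the driver translating HTTP status codes returned by the amphora agent's REST API into exceptions. A minimal sketch of that translation pattern (the class names and status-code mapping here are illustrative stand-ins, not the actual octavia definitions):

```python
# Illustrative sketch of the check_exception pattern: the amphora REST
# driver maps error status codes from the agent to driver exceptions.
# These classes and the mapping are hypothetical, not octavia's code.
class APIException(Exception):
    msg = 'Something unknown went wrong'

class InvalidRequest(APIException):
    msg = 'Invalid request'

class NotFound(APIException):
    msg = 'Not Found'

# Map of error status codes to the exception class to raise.
RESPONSES = {400: InvalidRequest, 404: NotFound}

def check_exception(response):
    """Raise the mapped exception for an error status; pass through otherwise."""
    exc_class = RESPONSES.get(response.status_code)
    if exc_class is not None:
        raise exc_class(exc_class.msg)
    return response
```

Under this pattern, a 404 from the agent (e.g. a listener that no longer exists on the amphora) surfaces in the worker as `NotFound: Not Found`, which is what the traceback shows.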
We should discuss whether we need the same `loadbalancer_repo.get()` call there to refresh the load balancer object, and also check whether other classes have the same issue.
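For context, here is a minimal sketch of why re-fetching from the repository matters: the load balancer object captured by the flow can go stale while the database moves on. The repo and session classes below are hypothetical stand-ins, not octavia's `loadbalancer_repo` or `db_apis.get_session()`.

```python
# Hypothetical stand-ins illustrating the stale-object problem the
# refresh pattern fixes; this is not the actual octavia code.
class LoadBalancer:
    def __init__(self, lb_id, provisioning_status):
        self.id = lb_id
        self.provisioning_status = provisioning_status

class LoadBalancerRepo:
    def __init__(self, db):
        self._db = db  # dict of id -> LoadBalancer

    def get(self, session, id):
        return self._db[id]

db = {'lb-1': LoadBalancer('lb-1', 'ACTIVE')}
repo = LoadBalancerRepo(db)
session = object()  # placeholder for a real DB session

# The flow captured this object when the failover started...
stale_lb = repo.get(session, id='lb-1')

# ...then another task updated the DB record in the meantime.
db['lb-1'] = LoadBalancer('lb-1', 'PENDING_UPDATE')

# Without a refresh, a later task would act on outdated state;
# re-fetching from the repo picks up the current record.
fresh_lb = repo.get(session, id='lb-1')
```

A task that keeps using `stale_lb` would see `ACTIVE` even though the record is now `PENDING_UPDATE`; re-fetching inside the task avoids that.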
I found the issue: there's a missing commit in 16.1. It is on stable/train but it hasn't been backported yet: https://review.opendev.org/c/openstack/octavia/+/761805
The bug cannot be verified until the 16.1 z7 puddle is available.
#Verified in version:
[stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
16.1 -p RHOS-16.1-RHEL-8-20210804.n.0
#Creating HA load balancer:
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --name lb1 --vip-subnet-id external_subnet
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2021-08-23T09:11:00 |
| description | |
| flavor_id | None |
| id | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| listeners | |
| name | lb1 |
| operating_status | OFFLINE |
| pools | |
| project_id | 75228e22f16c4c2087a2ab2324427aa8 |
| provider | amphora |
| provisioning_status | PENDING_CREATE |
| updated_at | None |
| vip_address | 10.0.0.218 |
| vip_network_id | b3650b06-20fb-4bf3-90e2-aa806ee13920 |
| vip_port_id | ff460bec-5283-403d-a644-5e6d9605c79f |
| vip_qos_policy_id | None |
| vip_subnet_id | 5f5aea73-0bc1-452c-aa17-64616b99d953 |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer show lb1
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2021-08-23T09:11:00 |
| description | |
| flavor_id | None |
| id | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| listeners | |
| name | lb1 |
| operating_status | ONLINE |
| pools | |
| project_id | 75228e22f16c4c2087a2ab2324427aa8 |
| provider | amphora |
| provisioning_status | ACTIVE |
| updated_at | 2021-08-23T09:13:05 |
| vip_address | 10.0.0.218 |
| vip_network_id | b3650b06-20fb-4bf3-90e2-aa806ee13920 |
| vip_port_id | ff460bec-5283-403d-a644-5e6d9605c79f |
| vip_qos_policy_id | None |
| vip_subnet_id | 5f5aea73-0bc1-452c-aa17-64616b99d953 |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listener1 --protocol TCP --protocol-port 22 lb1
+-----------------------------+--------------------------------------+
| Field | Value |
+-----------------------------+--------------------------------------+
| admin_state_up | True |
| connection_limit | -1 |
| created_at | 2021-08-23T09:15:49 |
| default_pool_id | None |
| default_tls_container_ref | None |
| description | |
| id | 4ef152f1-018f-4113-8707-aadc81192702 |
| insert_headers | None |
| l7policies | |
| loadbalancers | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| name | listener1 |
| operating_status | OFFLINE |
| project_id | 75228e22f16c4c2087a2ab2324427aa8 |
| protocol | TCP |
| protocol_port | 22 |
| provisioning_status | PENDING_CREATE |
| sni_container_refs | [] |
| timeout_client_data | 50000 |
| timeout_member_connect | 5000 |
| timeout_member_data | 50000 |
| timeout_tcp_inspect | 0 |
| updated_at | None |
| client_ca_tls_container_ref | None |
| client_authentication | NONE |
| client_crl_container_ref | None |
| allowed_cidrs | None |
+-----------------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer pool create --protocol TCP --listener listener1 --lb-algorithm ROUND_ROBIN --session-persistence type=SOURCE_IP
+----------------------+--------------------------------------+
| Field | Value |
+----------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2021-08-23T09:39:44 |
| description | |
| healthmonitor_id | |
| id | 10aa023e-509d-4ee8-8d2b-98e3e461bbcc |
| lb_algorithm | ROUND_ROBIN |
| listeners | 4ef152f1-018f-4113-8707-aadc81192702 |
| loadbalancers | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| members | |
| name | |
| operating_status | OFFLINE |
| project_id | 75228e22f16c4c2087a2ab2324427aa8 |
| protocol | TCP |
| provisioning_status | PENDING_CREATE |
| session_persistence | type=SOURCE_IP |
| | cookie_name=None |
| | persistence_timeout=None |
| | persistence_granularity=None |
| updated_at | None |
| tls_container_ref | None |
| ca_tls_container_ref | None |
| crl_container_ref | None |
| tls_enabled | False |
+----------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb1
{
"loadbalancer": {
"id": "61e24d04-0e21-4cc5-9ba4-94469e1e7028",
"name": "lb1",
"operating_status": "ONLINE",
"provisioning_status": "ACTIVE",
"listeners": [
{
"id": "4ef152f1-018f-4113-8707-aadc81192702",
"name": "listener1",
"operating_status": "ONLINE",
"provisioning_status": "ACTIVE",
"pools": [
{
"id": "10aa023e-509d-4ee8-8d2b-98e3e461bbcc",
"name": "",
"provisioning_status": "ACTIVE",
"operating_status": "ONLINE",
"members": []
}
]
}
]
}
}
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221 | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | MASTER | 172.24.1.193 | 10.0.0.218 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
#Run "openstack loadbalancer amphora failover <ID of one of the amphorae>"
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora failover bf4eecfb-b741-40b2-819c-d22695b1e4a2
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221 | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_DELETE | MASTER | 172.24.1.193 | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_CREATE | None | None | None |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221 | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_DELETE | MASTER | 172.24.1.193 | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | BOOTING | None | None | None |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
(...)
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221 | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | MASTER | 172.24.1.240 | 10.0.0.218 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb1
{
"loadbalancer": {
"id": "61e24d04-0e21-4cc5-9ba4-94469e1e7028",
"name": "lb1",
"operating_status": "ONLINE",
"provisioning_status": "ACTIVE",
"listeners": [
{
"id": "4ef152f1-018f-4113-8707-aadc81192702",
"name": "listener1",
"operating_status": "ONLINE",
"provisioning_status": "ACTIVE",
"pools": [
{
"id": "10aa023e-509d-4ee8-8d2b-98e3e461bbcc",
"name": "",
"provisioning_status": "ACTIVE",
"operating_status": "ONLINE",
"members": []
}
]
}
]
}
}
No ERROR messages were found in the logs.
Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762
Description of problem: Octavia is going into an error state when testing failover.

How reproducible: Always

Steps to Reproduce:
1) Create an HA load balancer:
openstack loadbalancer create --name lb1 --vip-subnet-id public-subnet
openstack loadbalancer listener create --name listener1 --protocol TCP --protocol-port 22 lb1
openstack loadbalancer pool create --protocol TCP --listener listener1 --lb-algorithm ROUND_ROBIN --session-persistence type=SOURCE_IP
2) Run "openstack loadbalancer amphora failover <ID of one of the amphorae>"
3) Another amphora instance is built, the instance it replaces gets deleted, and the primary/secondary load balancers go into an error state.

Actual results:
Amphora agent returned unexpected result code 400 with response ERROR octavia.amphorae.drivers.haproxy.exceptions ... Amphora agent returned unexpected result code 400 with response {'message': 'Invalid request', 'details': "[ALERT] 166/075021 (4837) : Proxy ...': unable to find local peer '...' in peers section '..._peers'.\n[WARNING] 166/075021 (4837) : Removing incomplete section 'peers ..._peers' (no peer named '...').\n[ALERT] 166/075021 (4837) : Fatal errors found in configuration.\n"}
ERROR octavia.controller.worker.v1.tasks.amphora_driver_tasks ... Failed to update listeners on amphora ... Skipping this amphora as it is failing to update due to: Invalid request: octavia.amphorae.drivers.haproxy.exceptions.InvalidRequest: Invalid request

Expected results: amphora failover to work.

Additional info:
The patch was cherry-picked to branch stable/victoria as commit a865c03c3b4573ce5855c1d90aad43b5950ea9c5, and the change has been successfully merged by Zuul: https://review.opendev.org/q/Ia46a05ab9fdc97ed9be699e5b2ae90daca3ab9a2

With the latest patch, the amphora_driver_tasks.py file went from:

```python
class AmpListenersUpdate(BaseAmphoraTask):
    """Task to update the listeners on one amphora."""

    def execute(self, loadbalancer, amphora, timeout_dict=None):
        # Note, we don't want this to cause a revert as it may be used
        # in a failover flow with both amps failing. Skip it and let
        # health manager fix it.
        try:
            self.amphora_driver.update_amphora_listeners(
                loadbalancer, amphora, timeout_dict)
        except Exception as e:
            LOG.error('Failed to update listeners on amphora %s. Skipping '
                      'this amphora as it is failing to update due to: %s',
                      amphora.id, str(e))
            self.amphora_repo.update(db_apis.get_session(), amphora.id,
                                     status=constants.ERROR)
```

to:

```python
class AmpListenersUpdate(BaseAmphoraTask):
    """Task to update the listeners on one amphora."""

    def execute(self, loadbalancer, amphora, timeout_dict=None):
        # Note, we don't want this to cause a revert as it may be used
        # in a failover flow with both amps failing. Skip it and let
        # health manager fix it.
        try:
            # Make sure we have a fresh load balancer object
            loadbalancer = self.loadbalancer_repo.get(db_apis.get_session(),
                                                      id=loadbalancer.id)
            self.amphora_driver.update_amphora_listeners(
                loadbalancer, amphora, timeout_dict)
        except Exception as e:
            LOG.error('Failed to update listeners on amphora %s. Skipping '
                      'this amphora as it is failing to update due to: %s',
                      amphora.id, str(e))
            self.amphora_repo.update(db_apis.get_session(), amphora.id,
                                     status=constants.ERROR)
```

Attached is a cat of that file from the octavia_worker container, which is missing the addition of:

```python
loadbalancer = self.loadbalancer_repo.get(db_apis.get_session(),
                                          id=loadbalancer.id)
```