Bug 1974831 - Loadbalancer/amphora stuck in ERROR, possible keepalived issue
Summary: Loadbalancer/amphora stuck in ERROR, possible keepalived issue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-22 15:54 UTC by Priscila
Modified: 2021-12-09 20:20 UTC (History)
CC List: 9 users

Fixed In Version: openstack-octavia-5.0.3-1.20210712123304.8c32d2e.el8ost
Doc Type: Bug Fix
Doc Text:
With this update, a problem that prevented the RHOSP Load-balancing service (octavia) from failing over load balancers with multiple failed amphorae has been resolved.
Clone Of:
Environment:
Last Closed: 2021-12-09 20:20:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 761805 0 None MERGED Fix load balancers with failed amphora failover 2021-06-24 06:13:39 UTC
OpenStack gerrit 763733 0 None MERGED Fix load balancers with failed amphora failover 2021-06-24 06:16:10 UTC
Red Hat Issue Tracker OSP-5389 0 None None None 2021-11-12 19:05:49 UTC
Red Hat Product Errata RHBA-2021:3762 0 None None None 2021-12-09 20:20:38 UTC

Description Priscila 2021-06-22 15:54:32 UTC
Description of problem:
Octavia is going into an error state when testing failover.

How reproducible: Always


Steps to Reproduce:
1) create HA loadbalancer

openstack loadbalancer create --name lb1 --vip-subnet-id public-subnet
openstack loadbalancer listener create --name listener1 --protocol TCP --protocol-port 22 lb1
openstack loadbalancer pool create --protocol TCP --listener listener1 --lb-algorithm ROUND_ROBIN --session-persistence type=SOURCE_IP

2) run "openstack loadbalancer amphora failover <ID of one of the loadbalancers>"

3) another amphora instance is built and the instance it replaces is deleted, but the load balancer and its primary/secondary amphorae go into an ERROR state (see the polling sketch below)
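
For convenience while reproducing step 3, the following helper polls the same "openstack loadbalancer amphora list" command until the failover settles and reports any amphora left in ERROR. It is an illustrative script, not part of Octavia; the JSON field names are assumed to mirror the lowercase column headers shown in the amphora listings below ("id", "status", ...).

import json
import subprocess
import time


def list_amphorae():
    # "-f json" makes the CLI emit the amphora table as JSON. This polls all
    # amphorae, which is fine for a single-LB reproduction like this one.
    out = subprocess.check_output(
        ["openstack", "loadbalancer", "amphora", "list", "-f", "json"])
    return json.loads(out)


def wait_for_failover(poll_seconds=10):
    """Poll until no amphora is in a transitional state, then report."""
    while True:
        amphorae = list_amphorae()
        # PENDING_CREATE, PENDING_DELETE and BOOTING are transitional;
        # ALLOCATED and ERROR are the end states seen in this bug.
        transitional = [a for a in amphorae
                        if a["status"] not in ("ALLOCATED", "ERROR")]
        errored = [a for a in amphorae if a["status"] == "ERROR"]
        if not transitional:
            if errored:
                print("amphorae stuck in ERROR:",
                      [a["id"] for a in errored])
            else:
                print("failover settled:",
                      [(a["id"], a["status"]) for a in amphorae])
            return
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_failover()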


Actual results: Amphora agent returned unexpected result code 400 with response

ERROR octavia.amphorae.drivers.haproxy.exceptions ... Amphora agent returned unexpected result code 400 with response {'message': 'Invalid request', 'details': "[ALERT] 166/075021 (4837) : Proxy ...': unable to find local peer '...' in peers section '..._peers'.\n[WARNING] 166/075021 (4837) : Removing incomplete section 'peers ..._peers' (no peer named '...').\n[ALERT] 166/075021 (4837) : Fatal errors found in configuration.\n"}

ERROR octavia.controller.worker.v1.tasks.amphora_driver_tasks... Failed to update listeners on amphora .... Skipping this amphora as it is failing to update due to: Invalid request: octavia.amphorae.drivers.haproxy.exceptions.InvalidRequest: Invalid request


Expected results: amphora failover to work


Additional info: The patch https://review.opendev.org/q/Ia46a05ab9fdc97ed9be699e5b2ae90daca3ab9a2 was cherry-picked to branch stable/victoria as commit a865c03c3b4573ce5855c1d90aad43b5950ea9c5, and the change has been successfully merged by Zuul.

It looks like the amphora_driver_tasks.py file went from:

class AmpListenersUpdate(BaseAmphoraTask):
    """Task to update the listeners on one amphora."""

    def execute(self, loadbalancer, amphora, timeout_dict=None):
        # Note, we don't want this to cause a revert as it may be used
        # in a failover flow with both amps failing. Skip it and let
        # health manager fix it.
        try:
            self.amphora_driver.update_amphora_listeners(
                loadbalancer, amphora, timeout_dict)
        except Exception as e:
            LOG.error('Failed to update listeners on amphora %s. Skipping '
                      'this amphora as it is failing to update due to: %s',
                      amphora.id, str(e))
            self.amphora_repo.update(db_apis.get_session(), amphora.id,
                                     status=constants.ERROR)

TO: 

class AmpListenersUpdate(BaseAmphoraTask):
    """Task to update the listeners on one amphora."""

    def execute(self, loadbalancer, amphora, timeout_dict=None):
        # Note, we don't want this to cause a revert as it may be used
        # in a failover flow with both amps failing. Skip it and let
        # health manager fix it.
        try:
            # Make sure we have a fresh load balancer object
            loadbalancer = self.loadbalancer_repo.get(db_apis.get_session(),
                                                      id=loadbalancer.id)
            self.amphora_driver.update_amphora_listeners(
                loadbalancer, amphora, timeout_dict)
        except Exception as e:
            LOG.error('Failed to update listeners on amphora %s. Skipping '
                      'this amphora as it is failing to update due to: %s',
                      amphora.id, str(e))
            self.amphora_repo.update(db_apis.get_session(), amphora.id,
                                     status=constants.ERROR)

with the latest patch. Attached is a cat of that file from the octavia_worker container, which is missing the addition of:

            loadbalancer = self.loadbalancer_repo.get(db_apis.get_session(),
                                                      id=loadbalancer.id)
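
My reading of the haproxy ALERT above is that the peers section is rendered from the amphora list carried on the load balancer object, and that haproxy requires the local amphora to appear in that section. If that assumption holds, working from a stale object during failover explains the error: the list still names the amphora being replaced rather than its replacement. A toy illustration of that assumption (not Octavia code):

def render_peers(lb_id, amphorae, local_amp):
    """Render a toy peers section; the local amphora must appear in it,
    otherwise haproxy aborts with 'unable to find local peer'."""
    lines = ["peers %s_peers" % lb_id]
    for amp in amphorae:
        lines.append("    peer %s %s:1025" % (amp["name"], amp["ip"]))
    if local_amp["name"] not in [a["name"] for a in amphorae]:
        raise RuntimeError(
            "unable to find local peer '%s' in peers section '%s_peers'"
            % (local_amp["name"], lb_id))
    return "\n".join(lines)


old_master = {"name": "amp-old", "ip": "172.24.1.193"}
new_master = {"name": "amp-new", "ip": "172.24.1.240"}
backup = {"name": "amp-backup", "ip": "172.24.2.221"}

# Stale DB object: still lists the amphora that is being replaced.
stale_amphorae = [old_master, backup]
# Refreshed DB object: lists the replacement amphora instead.
fresh_amphorae = [new_master, backup]

# Rendering for the new MASTER from the stale object fails exactly like
# the ALERT in the haproxy output above; the refreshed object succeeds.
try:
    render_peers("lb1", stale_amphorae, new_master)
except RuntimeError as exc:
    print("stale object:", exc)
print(render_peers("lb1", fresh_amphorae, new_master))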

Comment 2 Gregory Thiemonge 2021-06-23 09:35:18 UTC
I checked the logs; it is another amphora_driver_tasks task that is failing:

2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker Traceback (most recent call last):
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     result = task.execute(**arguments)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py", line 133, in execute
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     loadbalancer, amphorae[amphora_index], timeout_dict)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 296, in reload
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     self._apply('reload_listener', loadbalancer, amphora, timeout_dict)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 293, in _apply
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     amp, loadbalancer.id, *args)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 899, in _action
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     return exc.check_exception(r)
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker   File "/usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 44, in check_exception
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker     raise responses[status_code]()
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.drivers.haproxy.exceptions.NotFound: Not Found
2021-06-16 11:50:24.562 129 ERROR octavia.controller.worker.v1.controller_worker 
2021-06-16 11:50:24.579 129 DEBUG octavia.controller.worker.v1.controller_worker [-] Task '1-amphora-reload-listener' (ed7df91a-6153-4a09-be17-c4f8c3a3d48f) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194

This corresponds to the AmphoraIndexListenersReload class:

https://opendev.org/openstack/octavia/src/branch/stable/train/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L128-L144

We have to discuss if we need the same loadbalancer_repo.get() call to refresh the load balancer object.
We can also check if some other classes have the same issue.
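
For illustration, applying the same refresh to AmphoraIndexListenersReload could look roughly like the sketch below, which mirrors the AmpListenersUpdate pattern quoted in the description. It is not the merged change, and the surrounding code is abbreviated:

class AmphoraIndexListenersReload(BaseAmphoraTask):
    """Task to reload all listeners on an amphora (abbreviated sketch)."""

    def execute(self, loadbalancer, amphorae, amphora_index,
                timeout_dict=None):
        try:
            # Same refresh as in AmpListenersUpdate: re-read the load
            # balancer so its amphora list reflects the ongoing failover.
            loadbalancer = self.loadbalancer_repo.get(db_apis.get_session(),
                                                      id=loadbalancer.id)
            self.amphora_driver.reload(
                loadbalancer, amphorae[amphora_index], timeout_dict)
        except Exception as e:
            # Mirror the skip-and-mark-ERROR behaviour quoted above so one
            # failed amphora does not revert the whole failover flow.
            amphora_id = amphorae[amphora_index].id
            LOG.error('Failed to reload listeners on amphora %s. Skipping '
                      'this amphora as it is failing to reload due to: %s',
                      amphora_id, str(e))
            self.amphora_repo.update(db_apis.get_session(), amphora_id,
                                     status=constants.ERROR)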

Comment 6 Gregory Thiemonge 2021-06-24 06:13:41 UTC
I found the issue: a commit is missing in 16.1. It is on stable/train upstream but hasn't been backported yet:

https://review.opendev.org/c/openstack/octavia/+/761805

Comment 13 Bruna Bonguardo 2021-07-22 10:34:26 UTC
Bug cannot be verified until 16.1 z7 puddle is available.

Comment 17 Bruna Bonguardo 2021-08-23 12:30:46 UTC
#Verified in version:

[stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
16.1  -p RHOS-16.1-RHEL-8-20210804.n.0

#Creating HA load balancer:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --name lb1 --vip-subnet-id external_subnet
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2021-08-23T09:11:00                  |
| description         |                                      |
| flavor_id           | None                                 |
| id                  | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| listeners           |                                      |
| name                | lb1                                  |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | 75228e22f16c4c2087a2ab2324427aa8     |
| provider            | amphora                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 10.0.0.218                           |
| vip_network_id      | b3650b06-20fb-4bf3-90e2-aa806ee13920 |
| vip_port_id         | ff460bec-5283-403d-a644-5e6d9605c79f |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | 5f5aea73-0bc1-452c-aa17-64616b99d953 |
+---------------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer show lb1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2021-08-23T09:11:00                  |
| description         |                                      |
| flavor_id           | None                                 |
| id                  | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| listeners           |                                      |
| name                | lb1                                  |
| operating_status    | ONLINE                               |
| pools               |                                      |
| project_id          | 75228e22f16c4c2087a2ab2324427aa8     |
| provider            | amphora                              |
| provisioning_status | ACTIVE                               |
| updated_at          | 2021-08-23T09:13:05                  |
| vip_address         | 10.0.0.218                           |
| vip_network_id      | b3650b06-20fb-4bf3-90e2-aa806ee13920 |
| vip_port_id         | ff460bec-5283-403d-a644-5e6d9605c79f |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | 5f5aea73-0bc1-452c-aa17-64616b99d953 |
+---------------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listener1 --protocol TCP --protocol-port 22 lb1
+-----------------------------+--------------------------------------+
| Field                       | Value                                |
+-----------------------------+--------------------------------------+
| admin_state_up              | True                                 |
| connection_limit            | -1                                   |
| created_at                  | 2021-08-23T09:15:49                  |
| default_pool_id             | None                                 |
| default_tls_container_ref   | None                                 |
| description                 |                                      |
| id                          | 4ef152f1-018f-4113-8707-aadc81192702 |
| insert_headers              | None                                 |
| l7policies                  |                                      |
| loadbalancers               | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| name                        | listener1                            |
| operating_status            | OFFLINE                              |
| project_id                  | 75228e22f16c4c2087a2ab2324427aa8     |
| protocol                    | TCP                                  |
| protocol_port               | 22                                   |
| provisioning_status         | PENDING_CREATE                       |
| sni_container_refs          | []                                   |
| timeout_client_data         | 50000                                |
| timeout_member_connect      | 5000                                 |
| timeout_member_data         | 50000                                |
| timeout_tcp_inspect         | 0                                    |
| updated_at                  | None                                 |
| client_ca_tls_container_ref | None                                 |
| client_authentication       | NONE                                 |
| client_crl_container_ref    | None                                 |
| allowed_cidrs               | None                                 |
+-----------------------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer pool create --protocol TCP --listener listener1 --lb-algorithm ROUND_ROBIN --session-persistence type=SOURCE_IP
+----------------------+--------------------------------------+
| Field                | Value                                |
+----------------------+--------------------------------------+
| admin_state_up       | True                                 |
| created_at           | 2021-08-23T09:39:44                  |
| description          |                                      |
| healthmonitor_id     |                                      |
| id                   | 10aa023e-509d-4ee8-8d2b-98e3e461bbcc |
| lb_algorithm         | ROUND_ROBIN                          |
| listeners            | 4ef152f1-018f-4113-8707-aadc81192702 |
| loadbalancers        | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 |
| members              |                                      |
| name                 |                                      |
| operating_status     | OFFLINE                              |
| project_id           | 75228e22f16c4c2087a2ab2324427aa8     |
| protocol             | TCP                                  |
| provisioning_status  | PENDING_CREATE                       |
| session_persistence  | type=SOURCE_IP                       |
|                      | cookie_name=None                     |
|                      | persistence_timeout=None             |
|                      | persistence_granularity=None         |
| updated_at           | None                                 |
| tls_container_ref    | None                                 |
| ca_tls_container_ref | None                                 |
| crl_container_ref    | None                                 |
| tls_enabled          | False                                |
+----------------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb1
{
    "loadbalancer": {
        "id": "61e24d04-0e21-4cc5-9ba4-94469e1e7028",
        "name": "lb1",
        "operating_status": "ONLINE",
        "provisioning_status": "ACTIVE",
        "listeners": [
            {
                "id": "4ef152f1-018f-4113-8707-aadc81192702",
                "name": "listener1",
                "operating_status": "ONLINE",
                "provisioning_status": "ACTIVE",
                "pools": [
                    {
                        "id": "10aa023e-509d-4ee8-8d2b-98e3e461bbcc",
                        "name": "",
                        "provisioning_status": "ACTIVE",
                        "operating_status": "ONLINE",
                        "members": []
                    }
                ]
            }
        ]
    }
}

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221  | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | MASTER | 172.24.1.193  | 10.0.0.218 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

#Run "openstack loadbalancer amphora failover <ID of one of the amphorae>"

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora failover bf4eecfb-b741-40b2-819c-d22695b1e4a2

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status         | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED      | BACKUP | 172.24.2.221  | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_DELETE | MASTER | 172.24.1.193  | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_CREATE | None   | None          | None       |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status         | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED      | BACKUP | 172.24.2.221  | 10.0.0.218 |
| bf4eecfb-b741-40b2-819c-d22695b1e4a2 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | PENDING_DELETE | MASTER | 172.24.1.193  | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | BOOTING        | None   | None          | None       |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+------------+
(...)
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 7bf6634e-4f56-428d-b1ff-48e45d6099d0 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | BACKUP | 172.24.2.221  | 10.0.0.218 |
| 9ee56b75-ecd8-4efd-9f83-c3b96c0b8be9 | 61e24d04-0e21-4cc5-9ba4-94469e1e7028 | ALLOCATED | MASTER | 172.24.1.240  | 10.0.0.218 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb1
{
    "loadbalancer": {
        "id": "61e24d04-0e21-4cc5-9ba4-94469e1e7028",
        "name": "lb1",
        "operating_status": "ONLINE",
        "provisioning_status": "ACTIVE",
        "listeners": [
            {
                "id": "4ef152f1-018f-4113-8707-aadc81192702",
                "name": "listener1",
                "operating_status": "ONLINE",
                "provisioning_status": "ACTIVE",
                "pools": [
                    {
                        "id": "10aa023e-509d-4ee8-8d2b-98e3e461bbcc",
                        "name": "",
                        "provisioning_status": "ACTIVE",
                        "operating_status": "ONLINE",
                        "members": []
                    }
                ]
            }
        ]
    }
}


No ERROR messages were found in the logs.

Moving the bug to VERIFIED.

Comment 34 errata-xmlrpc 2021-12-09 20:20:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

