Bug 2239459 - Octavia Active Standby Amphora goes to error after failure
Summary: Octavia Active Standby Amphora goes to error after failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 17.1
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
Docs Contact: Greg Rakauskas
URL:
Whiteboard:
Depends On: 2237225
Blocks:
Reported: 2023-09-18 13:41 UTC by Gregory Thiemonge
Modified: 2024-01-16 14:31 UTC (History)

Fixed In Version: openstack-octavia-8.0.2-17.1.20231016170842.907b8b1.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2237225
Environment:
Last Closed: 2024-01-16 14:31:01 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 2033734 0 None None None 2023-10-09 07:55:39 UTC
Launchpad 2033894 0 None None None 2023-10-09 07:55:39 UTC
OpenStack gerrit 893536 0 None MERGED Reduce duration of failovers with amphora in ERROR 2023-11-03 08:31:14 UTC
OpenStack gerrit 893537 0 None MERGED Fix timeout duration in start_vrrp_service during failovers 2023-11-03 08:31:16 UTC
OpenStack gerrit 893612 0 None MERGED Fix amphorae in ERROR during the failover 2023-11-03 08:31:18 UTC
Red Hat Issue Tracker OSP-28823 0 None None None 2023-09-18 13:41:53 UTC
Red Hat Product Errata RHBA-2024:0209 0 None None None 2024-01-16 14:31:03 UTC

Description Gregory Thiemonge 2023-09-18 13:41:19 UTC
+++ This bug was initially created as a clone of Bug #2237225 +++

Description of problem:
The customer built an active/standby load balancer with 2 amphorae.
They then simulated a disaster/outage by shutting down both amphorae (openstack server stop) and observed the recovery.

They noticed that after the master was rebuilt, the backup amphora went into ERROR state:
openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 9007868a-ab80-43f8-aa80-96a6ccff1e9e | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ERROR     | BACKUP | 172.21.2.97   | 10.0.0.214 |
| f3b02d22-0573-468f-9518-5db89b3471b5 | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ALLOCATED | MASTER | 172.21.2.55   | 10.0.0.214 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
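
The listing above can also be checked programmatically. The following is a minimal sketch, not part of the original report: it parses the ASCII table printed by the CLI and returns the amphorae whose status is ERROR. The helper names (parse_cli_table, amphorae_in_error) are hypothetical, and the sample text is the table from this report.

```python
# Sample output copied from the report (openstack loadbalancer amphora list).
SAMPLE = """\
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 9007868a-ab80-43f8-aa80-96a6ccff1e9e | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ERROR     | BACKUP | 172.21.2.97   | 10.0.0.214 |
| f3b02d22-0573-468f-9518-5db89b3471b5 | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ALLOCATED | MASTER | 172.21.2.55   | 10.0.0.214 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
"""


def parse_cli_table(text):
    """Turn an ASCII table into a list of dicts keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Border lines start with "+"; only lines starting with "|" carry cells.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def amphorae_in_error(rows):
    """Return only the rows whose status column is ERROR."""
    return [row for row in rows if row["status"] == "ERROR"]


if __name__ == "__main__":
    for amp in amphorae_in_error(parse_cli_table(SAMPLE)):
        print(amp["id"], amp["role"], amp["status"])
```

Parsing the human-readable table is fragile; it is used here only because that is the format quoted in the report.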

The backup amphora was ultimately rebuilt, but the rebuild took about 40 minutes to complete.

We need your help to understand why it goes to error state and why it takes so long to recover.

A sosreport with Octavia in debug mode is attached to the case.

Version-Release number of selected component (if applicable):
OSP 16.2.4
puppet-octavia-15.5.1-2.20220821005128.a56b33a.el8ost.noarch 
openstack-octavia-common-5.1.3-2.20220927125110.57a6265.el8ost.noarch

How reproducible:
100% can be reproduced at will

Steps to Reproduce:
1. Create a load balancer with the active/standby topology.
2. Shut down both amphora instances (openstack server stop).
3. The standby instance goes into ERROR and recovers only later, up to 40 minutes afterwards.

Actual results:
Long recovery of the amphoras during a disaster situation.

Expected results:
Very quick recovery

Additional info:
sosreport with octavia in debug

--- Additional comment from Gregory Thiemonge on 2023-09-04 12:18:29 UTC ---

There are 2 issues, so I created 2 Launchpad bugs:

- a failover of an ACTIVE_STANDBY LB can take a lot of time in amphorav1 https://bugs.launchpad.net/octavia/+bug/2033894
- a failover of an ACTIVE_STANDBY LB recreates only one amphora when both amps are failing https://bugs.launchpad.net/octavia/+bug/2033734

Note: the amphora in ERROR status can be recreated manually with: openstack loadbalancer amphora failover <amp_id> (a loadbalancer failover can also fix it)
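
As an illustration of the manual workaround in the note above, here is a minimal sketch that builds one "openstack loadbalancer amphora failover" command per amphora in ERROR. It is an assumption-laden example, not the fix shipped in the erratum: it assumes the amphora list was obtained in JSON form (for example with the CLI's "-f json" output formatter) and that the JSON keys match the column names shown in the description (id, status, role). The function name failover_commands is hypothetical, and the sketch only prints the commands rather than running them.

```python
import json
import shlex

# Sample data in the assumed JSON shape, using the values from this report.
SAMPLE_JSON = """
[
  {"id": "9007868a-ab80-43f8-aa80-96a6ccff1e9e",
   "loadbalancer_id": "c10cf3e5-8655-41ce-8e53-9cf7243fea62",
   "status": "ERROR", "role": "BACKUP",
   "lb_network_ip": "172.21.2.97", "ha_ip": "10.0.0.214"},
  {"id": "f3b02d22-0573-468f-9518-5db89b3471b5",
   "loadbalancer_id": "c10cf3e5-8655-41ce-8e53-9cf7243fea62",
   "status": "ALLOCATED", "role": "MASTER",
   "lb_network_ip": "172.21.2.55", "ha_ip": "10.0.0.214"}
]
"""


def failover_commands(amphora_list_json):
    """Return one failover command (as an argv list) per amphora in ERROR."""
    amphorae = json.loads(amphora_list_json)
    return [
        ["openstack", "loadbalancer", "amphora", "failover", amp["id"]]
        for amp in amphorae
        if amp["status"] == "ERROR"
    ]


if __name__ == "__main__":
    # Print the commands for review; an operator decides whether to run them.
    for cmd in failover_commands(SAMPLE_JSON):
        print(shlex.join(cmd))
```

Returning argv lists instead of shell strings keeps the amphora ID from being interpreted by a shell if the commands are later passed to a process launcher.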

Comment 14 errata-xmlrpc 2024-01-16 14:31:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

