
Bug 2237225

Summary: [OSP16.2] Octavia Active Standby Amphora goes to error after failure
Product: Red Hat OpenStack
Reporter: ggrimaux
Component: openstack-octavia
Assignee: Gregory Thiemonge <gthiemon>
Status: CLOSED WONTFIX
QA Contact: Bruna Bonguardo <bbonguar>
Severity: medium
Docs Contact: Greg Rakauskas <gregraka>
Priority: medium
Version: 16.2 (Train)
CC: gthiemon, jsoliman, tweining
Target Milestone: zstream
Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-octavia-5.1.3-2.20231114084819.355f6b1.el8ost
Last Closed: 2024-01-23 10:44:10 UTC
Type: Bug
Bug Blocks: 2239459

Description ggrimaux 2023-09-04 09:48:59 UTC
Description of problem:
The customer is building an active/standby load balancer with 2 amphoras.
He then simulates a disaster/outage by shutting down both amphoras (openstack server stop) and observes the recovery.

He noticed that after the rebuild of the master, the backup amphora goes into ERROR state:
openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 9007868a-ab80-43f8-aa80-96a6ccff1e9e | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ERROR     | BACKUP | 172.21.2.97   | 10.0.0.214 |
| f3b02d22-0573-468f-9518-5db89b3471b5 | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ALLOCATED | MASTER | 172.21.2.55   | 10.0.0.214 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

The backup amphora is ultimately rebuilt, but the rebuild takes about 40 minutes to complete.

We need your help to understand why it goes into ERROR state and why it takes so long to recover.

A sosreport with Octavia in debug mode is attached to the case.

Version-Release number of selected component (if applicable):
OSP 16.2.4
puppet-octavia-15.5.1-2.20220821005128.a56b33a.el8ost.noarch 
openstack-octavia-common-5.1.3-2.20220927125110.57a6265.el8ost.noarch

How reproducible:
100%; it can be reproduced at will

Steps to Reproduce:
1. Create an LB with the active/standby topology.
2. Shut down both amphora instances.
3. The standby instance goes into ERROR and recovers later, up to 40 minutes afterwards.
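
To put a number on step 3, the recovery time can be measured by polling the amphora statuses until both are back to ALLOCATED. The following is a minimal sketch, not part of the original report: the function name wait_until_allocated and its parameters are hypothetical, and fetch_statuses is a caller-supplied callable (for example one that wraps the amphora listing shown in the description) returning the current list of status strings.

```python
import time


def wait_until_allocated(fetch_statuses, timeout_s=3600, interval_s=30,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until every amphora is ALLOCATED.

    Returns the elapsed seconds on success, or None if timeout_s is reached.
    fetch_statuses is a hypothetical callable returning a list of status strings.
    """
    start = clock()
    while True:
        statuses = fetch_statuses()
        elapsed = clock() - start
        # An empty listing is treated as "not recovered yet".
        if statuses and all(s == "ALLOCATED" for s in statuses):
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

The clock and sleep parameters are injectable only so the loop can be exercised without waiting in real time; in normal use the defaults are enough.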

Actual results:
Long recovery of the amphoras during a disaster situation.

Expected results:
Very quick recovery

Additional info:
sosreport with octavia in debug

Comment 1 Gregory Thiemonge 2023-09-04 12:18:29 UTC
There are 2 issues; I created 2 Launchpad bugs:

- failover of ACTIVE_STANDBY LBs can take a lot of time in amphorav1 https://bugs.launchpad.net/octavia/+bug/2033894
- a failover of an ACTIVE_STANDBY LB recreates only one amphora when both amps are failing https://bugs.launchpad.net/octavia/+bug/2033734

Note: the amphora in ERROR status can be recreated manually with: openstack loadbalancer amphora failover <amp_id> (a loadbalancer failover can also fix it)
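
The manual workaround in the note above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes the amphora listing is requested with the CLI's -f json output format and that the JSON keys match the column headers shown in the description (id, status). The function names failover_commands and run_failovers are hypothetical.

```python
import json
import subprocess


def failover_commands(amphora_list_json):
    """Build one 'amphora failover' command per amphora in ERROR status.

    Assumes the input is the JSON form of the amphora listing, with
    'id' and 'status' keys matching the table columns.
    """
    amphorae = json.loads(amphora_list_json)
    return [
        ["openstack", "loadbalancer", "amphora", "failover", amp["id"]]
        for amp in amphorae
        if amp.get("status") == "ERROR"
    ]


def run_failovers():
    """List the amphorae, then fail over each one that is in ERROR."""
    listing = subprocess.run(
        ["openstack", "loadbalancer", "amphora", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for cmd in failover_commands(listing):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_failovers() requires the openstack client and admin credentials in the environment; failover_commands() can be checked on its own against a saved listing.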

Comment 4 Gregory Thiemonge 2023-09-04 12:38:59 UTC
For stable branches only (amphorav1), we may also want to add a few lines that reload the amphora DB object at https://opendev.org/openstack/octavia/src/branch/stable/train/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L394
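
To illustrate what reloading the amphora DB object means in general, here is a small self-contained sketch. It is not the Octavia code at the linked line: Amphora, InMemoryAmphoraRepo and run_task are hypothetical stand-ins, and skipping an amphora that is already in ERROR is an assumption about why a reload would help, not something stated in this bug.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Amphora:
    """Stand-in for an amphora DB record (hypothetical, not Octavia's model)."""
    id: str
    status: str


class InMemoryAmphoraRepo:
    """Hypothetical repository; a dict plays the role of the database."""

    def __init__(self):
        self._rows = {}

    def save(self, amphora):
        self._rows[amphora.id] = amphora

    def get(self, amphora_id):
        return self._rows[amphora_id]


def run_task(repo, amphora):
    """Hypothetical task body that reloads its input before acting on it."""
    # The object handed to the task may be stale, so fetch the current row.
    fresh = repo.get(amphora.id)
    if fresh.status == "ERROR":
        # Assumption: an amphora already marked ERROR is skipped.
        return None
    return "configured " + fresh.id
```

If an earlier step of the same flow marks the amphora as ERROR in the database, a task that only looks at the object it was given would still see the old status; the reload is what lets it notice the change.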