
Bug 2237225

Summary:	[OSP16.2] Octavia Active Standby Amphora goes to error after failure
Product:	Red Hat OpenStack	Reporter:	ggrimaux
Component:	openstack-octavia	Assignee:	Gregory Thiemonge <gthiemon>
Status:	CLOSED WONTFIX	QA Contact:	Bruna Bonguardo <bbonguar>
Severity:	medium	Docs Contact:	Greg Rakauskas <gregraka>
Priority:	medium
Version:	16.2 (Train)	CC:	gthiemon, jsoliman, tweining
Target Milestone:	zstream	Keywords:	Triaged
Target Release:	16.2 (Train on RHEL 8.4)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-octavia-5.1.3-2.20231114084819.355f6b1.el8ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2239459		Environment:
Last Closed:	2024-01-23 10:44:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2239459

Description ggrimaux 2023-09-04 09:48:59 UTC

Description of problem:
The customer is building an active/standby load balancer with 2 amphoras.
He then simulates a disaster/outage by shutting down both amphoras (openstack server stop) and observes the recovery.

He noticed that after the master is rebuilt, the backup amphora goes into ERROR state:
openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
| 9007868a-ab80-43f8-aa80-96a6ccff1e9e | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ERROR     | BACKUP | 172.21.2.97   | 10.0.0.214 |
| f3b02d22-0573-468f-9518-5db89b3471b5 | c10cf3e5-8655-41ce-8e53-9cf7243fea62 | ALLOCATED | MASTER | 172.21.2.55   | 10.0.0.214 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

The backup amphora is ultimately rebuilt, but the rebuild takes about 40 minutes to complete.

We need your help to understand why the backup amphora goes into ERROR state and why it takes so long to recover.

A sosreport with Octavia in debug mode is attached to the case.

Version-Release number of selected component (if applicable):
OSP 16.2.4
puppet-octavia-15.5.1-2.20220821005128.a56b33a.el8ost.noarch 
openstack-octavia-common-5.1.3-2.20220927125110.57a6265.el8ost.noarch

How reproducible:
100%; can be reproduced at will

Steps to Reproduce:
1. Create an LB with the active/standby topology
2. Shut down both amphora instances
3. The standby instance goes into ERROR and recovers later, up to 40 minutes afterwards.
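
To put a number on the recovery delay in step 3, the polling can be sketched as below. This is a minimal sketch, not part of the original report: `wait_for_recovery` and its `list_statuses` argument are hypothetical names, and the caller is assumed to supply a function that returns the current amphora statuses (for example, built from the output of `openstack loadbalancer amphora list`). The clock and sleep functions are injectable only so the helper can be exercised without waiting.

```python
import time
from typing import Callable, Dict


def wait_for_recovery(
    list_statuses: Callable[[], Dict[str, str]],
    timeout: float = 3600.0,
    interval: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until every listed amphora is ALLOCATED; return elapsed seconds.

    list_statuses is a hypothetical callable returning {amphora_id: status}.
    Raises TimeoutError if recovery does not complete within `timeout`.
    """
    start = clock()
    while True:
        statuses = list_statuses()
        # An empty listing is not treated as "recovered".
        if statuses and all(s == "ALLOCATED" for s in statuses.values()):
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"not recovered after {timeout}s: {statuses}")
        sleep(interval)
```

The default timeout of one hour is an assumption chosen to cover the roughly 40 minutes reported here; adjust it to the environment being tested.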

Actual results:
Long recovery of the amphoras during a disaster situation.

Expected results:
Very quick recovery

Additional info:
sosreport with octavia in debug

Comment 1 Gregory Thiemonge 2023-09-04 12:18:29 UTC

There are 2 issues; I created 2 Launchpad bugs:

- failover of ACTIVE_STANDBY LBs can take a lot of time in amphorav1: https://bugs.launchpad.net/octavia/+bug/2033894
- a failover of an ACTIVE_STANDBY LB recreates only one amphora when both amps are failing: https://bugs.launchpad.net/octavia/+bug/2033734

Note: the amphora in ERROR status can be recreated manually with: openstack loadbalancer amphora failover <amp_id> (a loadbalancer failover can also fix it)
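
As an illustration of the manual workaround above, the following sketch reads the table printed by `openstack loadbalancer amphora list` (in the format shown in the description) and builds one failover command per amphora in ERROR status. The helper names `parse_cli_table` and `failover_commands` are hypothetical, and the parser assumes the default table layout. It only generates the command strings and does not run them.

```python
from typing import Dict, List


def parse_cli_table(text: str) -> List[Dict[str, str]]:
    """Parse an OpenStack CLI ASCII table into a list of row dicts."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ borders, blank lines and the command echo
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failover_commands(table_text: str) -> List[str]:
    """Return one amphora failover command per amphora in ERROR status."""
    return [
        f"openstack loadbalancer amphora failover {row['id']}"
        for row in parse_cli_table(table_text)
        if row.get("status") == "ERROR"
    ]
```

Applied to the listing in the description, this would yield a single command targeting the BACKUP amphora 9007868a-ab80-43f8-aa80-96a6ccff1e9e.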

Comment 4 Gregory Thiemonge 2023-09-04 12:38:59 UTC

For stable branches only (amphorav1), we may also want to add a few lines that reload the amphora DB object at https://opendev.org/openstack/octavia/src/branch/stable/train/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L394