Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1874542

Summary: Amphora compute resources are not cleaned up from Octavia's database when they're effectively not running
Product: Red Hat OpenStack
Reporter: Andrea Veri <averi>
Component: openstack-octavia
Assignee: Nate Johnston <njohnston>
Status: CLOSED DUPLICATE
QA Contact: Bruna Bonguardo <bbonguar>
Severity: unspecified
Priority: unspecified
Version: 16.1 (Train)
CC: gthiemon, ihrachys, lpeer, majopela, michjohn, scohen
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-09-02 15:17:54 UTC
Type: Bug

Description Andrea Veri 2020-09-01 14:57:12 UTC
Description of problem:

During a resiliency test in which an entire AZ was power cycled, we noticed a set of misbehaving Octavia load balancers: these LBs were showing ERROR/PENDING_UPDATE states. The problem persisted after all the hypervisors, the networker nodes and the one controller present in that AZ came back up.

What caught my attention was that some of the amphora compute resources were still present in the database even though they were no longer running. Because of this inconsistent state, the LB itself was marked as ERROR/PENDING_UPDATE.

Although we rolled through all the AZs we have configured, the problem occurred during the power cycling of one specific AZ. I'm confident this was the AZ hosting the haproxy virtual IPs for the external API endpoints, so operations may have been in flight, as some of the hypervisors were power cycled *before* the one control plane host.

This is a follow-up to https://access.redhat.com/support/cases/#/case/02737265 as requested by engineering.

Version-Release number of selected component (if applicable):

16.1

How reproducible:
100%

Steps to Reproduce:
1. Power cycle an entire AZ, starting with the hypervisors, then the networkers and the one controller sitting in that AZ
2. Wait 10 to 15 minutes and boot all nodes back up
3. LBs are in inconsistent states (ERROR/PENDING_UPDATE), with non-existent amphora compute resources still showing up in the database

The next resiliency test is happening very soon. We're going to change the order in which we power cycle systems: we'll first shut down the one controller in that AZ, then the hypervisors and networker nodes simultaneously.

Actual results:

LBs in inconsistent states (ERROR/PENDING_UPDATE), with non-existent amphora compute resources still showing up in the database


Expected results:

LBs should perform failovers properly and, most importantly, housekeeping should verify whether the associated amphora compute resources are still running.

Additional info:

I'd expect the Octavia Housekeeping worker to set the DELETED bit in the database for the target amphora(s) whenever the associated compute resource ID no longer exists. Doing that manually does the right thing: the non-existent compute resource is wiped correctly. This looks to me like a bug in the code, as the housekeeping procedures should look at each amphora, verify whether the compute resource ID really exists and, if NOT, mark the amphora as DELETED, which in turn unblocks the load balancer from its inconsistent state.
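
The housekeeping check described above could be sketched roughly as follows. This is only an illustration of the expected behaviour, not Octavia's housekeeping code: `AmphoraRecord`, `mark_orphaned_amphorae` and the in-memory set of compute IDs are hypothetical names invented for this example, and a real implementation would query the compute service and the database instead.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class AmphoraRecord:
    """Hypothetical stand-in for one amphora row in the database."""
    id: str
    compute_id: Optional[str]
    status: str


def mark_orphaned_amphorae(
    amphorae: Iterable[AmphoraRecord],
    existing_compute_ids: Set[str],
) -> List[str]:
    """Mark amphorae whose compute resource no longer exists as DELETED.

    Returns the IDs of the amphorae that were changed.
    """
    marked = []
    for amp in amphorae:
        if amp.status == "DELETED":
            continue  # already cleaned up
        # Assumption: records without a compute ID are left untouched.
        if amp.compute_id is not None and amp.compute_id not in existing_compute_ids:
            amp.status = "DELETED"
            marked.append(amp.id)
    return marked
```

Given the set of compute IDs that actually exist, the function returns the amphorae it had to mark, which is the list an operator would otherwise have to fix by hand.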

Comment 1 Michael Johnson 2020-09-01 21:18:11 UTC
The database state was accurate per the sosreports included on the customer ticket.

The root cause of the issue is https://bugzilla.redhat.com/show_bug.cgi?id=1725189 where nova will fail to release attached resources from an instance when the compute host is down. This causes issues in Octavia as it can't detach the network ports from the instance to reallocate those resources to a replacement instance.

The indication of this is:
health-manager.log:2020-08-26 11:11:21.079 77 ERROR octavia.controller.worker.v1.controller_worker [-] Amphora e077e480-176e-4563-b385-823ba783fd87 failover exception: Port a8328583-43f8-43ff-b586-82240e1661db failed to detach (device_id 667153b6-beb5-4366-ad4d-4f2e14f66b58) within the required time (300 s).: octavia.network.base.TimeoutException: Port a8328583-43f8-43ff-b586-82240e1661db failed to detach (device_id 667153b6-beb5-4366-ad4d-4f2e14f66b58) within the required time (300 s).

As the compute hosts were being shut down, Octavia was attempting to repair the load balancers but was blocked by the above nova behavior.
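
To illustrate why the failover ends in the timeout shown in the log, here is a minimal sketch of a wait-for-detach loop. It is not Octavia's network driver code: `wait_for_port_detach`, the `get_device_id` callback and the local `TimeoutException` class are assumptions made for this example, and only the wording of the error message is taken from the log above.

```python
import time
from typing import Callable, Optional


class TimeoutException(Exception):
    """Local stand-in named after the exception in the log (not imported from Octavia)."""


def wait_for_port_detach(
    port_id: str,
    get_device_id: Callable[[str], Optional[str]],
    timeout: float = 300.0,
    interval: float = 1.0,
) -> None:
    """Poll until the port reports no attached device, or raise on timeout.

    get_device_id is a hypothetical callback that returns the device the
    port is currently attached to, or None once it has been detached.
    """
    deadline = time.monotonic() + timeout
    while True:
        device_id = get_device_id(port_id)
        if not device_id:
            return  # port released, failover can continue
        if time.monotonic() >= deadline:
            raise TimeoutException(
                f"Port {port_id} failed to detach (device_id {device_id}) "
                f"within the required time ({timeout:g} s)."
            )
        time.sleep(interval)
```

If the port is never released while the compute host is down, the callback keeps returning the old device ID and the loop can only end by raising.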

In response to this issue in nova, the Octavia team has created a workaround in support of https://bugzilla.redhat.com/show_bug.cgi?id=1723482 as it is not clear if/when the nova issue will be resolved.

With the associated fix for BZ 1723482, Octavia would either have succeeded in repairing the load balancer or, if no compute hosts were left functional, a follow-up load balancer failover (via the API) would have restored full service to the load balancer.
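
As a rough sketch of how an operator might prepare that follow-up failover, the helper below builds one `openstack loadbalancer failover` command per load balancer in ERROR. The function name, the dictionary layout and the rule of selecting only ERROR load balancers are assumptions made for this example; load balancers in PENDING_UPDATE are deliberately left out because how to handle them is not covered here.

```python
from typing import Dict, Iterable, List


def failover_commands(load_balancers: Iterable[Dict[str, str]]) -> List[List[str]]:
    """Build CLI argument lists to fail over load balancers in ERROR.

    Each input item is a hypothetical record with "id" and
    "provisioning_status" keys, e.g. taken from a load balancer listing.
    """
    return [
        ["openstack", "loadbalancer", "failover", lb["id"]]
        for lb in load_balancers
        if lb.get("provisioning_status") == "ERROR"
    ]
```

The commands are returned rather than executed so they can be reviewed first; each list can be passed to `subprocess.run` as is.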

Comment 2 Andrea Veri 2020-09-02 09:11:06 UTC
Michael, I ran another resiliency test this morning and I can confirm the bug you mention (https://bugzilla.redhat.com/show_bug.cgi?id=1723482) is exactly what we're hitting:

|__Flow 'octavia-failover-amphora-flow': octavia.network.base.TimeoutException: Port eda7e7c2-d0d7-4291-baea-71b6e1f73b7c failed to detach (device_id 4fdf25b0-5999-4b5b-82a8-8e15ba74ae7a) within the required time (300 s).
2020-09-02 08:51:45.565 77 ERROR octavia.controller.worker.v1.controller_worker octavia.network.base.TimeoutException: Port eda7e7c2-d0d7-4291-baea-71b6e1f73b7c failed to detach (device_id 4fdf25b0-5999-4b5b-82a8-8e15ba74ae7a) within the required time (300 s).
2020-09-02 08:51:45.603 77 ERROR octavia.controller.worker.v1.controller_worker [-] Amphora d4f54a14-1fb0-4adb-b82e-585cc6ebc2fd failover exception: Port eda7e7c2-d0d7-4291-baea-71b6e1f73b7c failed to detach (device_id 4fdf25b0-5999-4b5b-82a8-8e15ba74ae7a) within the required time (300 s).: octavia.network.base.TimeoutException: Port eda7e7c2-d0d7-4291-baea-71b6e1f73b7c failed to detach (device_id 4fdf25b0-5999-4b5b-82a8-8e15ba74ae7a) within the required time (300 s).
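
For anyone checking their own logs for the same signature, the following sketch pulls the amphora, port and device IDs out of a "failover exception" line like the last one above. The pattern is derived only from the log lines quoted in this bug, and `parse_detach_failure` is a hypothetical helper name; other Octavia versions may word the message differently.

```python
import re
from typing import Dict, Optional

# Matches the "failover exception" line format quoted in this bug.
DETACH_FAILURE = re.compile(
    r"Amphora (?P<amphora_id>[0-9a-f-]{36}) failover exception: "
    r"Port (?P<port_id>[0-9a-f-]{36}) failed to detach "
    r"\(device_id (?P<device_id>[0-9a-f-]{36})\)"
)


def parse_detach_failure(line: str) -> Optional[Dict[str, str]]:
    """Return the IDs from a port-detach failover error, or None if no match."""
    match = DETACH_FAILURE.search(line)
    return match.groupdict() if match else None
```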

Feel free to close this bug and we can continue on https://bugzilla.redhat.com/show_bug.cgi?id=1723482 to avoid duplicates, thanks!

Comment 3 Gregory Thiemonge 2020-09-02 15:17:54 UTC
Marked as duplicate of BZ 1874927 (which is similar to 1723482 but for OSP16.1)

*** This bug has been marked as a duplicate of bug 1874927 ***