Bug 1992691 - Octavia Housekeeping cannot rotate amphora certificates: TimeOutException: contacting the amphora timed out
Summary: Octavia Housekeeping cannot rotate amphora certificates: TimeOutException: co...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: z8
: 16.1 (Train on RHEL 8.2)
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
: 1966429 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-11 15:06 UTC by ggrimaux
Modified: 2023-06-07 14:17 UTC (History)
8 users (show)

Fixed In Version: openstack-octavia-5.0.3-1.20210817153318.8c32d2e.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-24 11:00:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack Storyboard 2007635 0 None None None 2021-08-12 14:06:46 UTC
OpenStack gerrit 754057 0 None MERGED Ignore DELETED amphorae when performing certificate rotation 2021-12-09 16:57:21 UTC
Red Hat Issue Tracker OSP-6999 0 None None None 2021-11-15 12:58:14 UTC
Red Hat Knowledge Base (Solution) 7017716 0 None None None 2023-06-07 14:17:28 UTC
Red Hat Product Errata RHBA-2022:0986 0 None None None 2022-03-24 11:01:18 UTC

Description ggrimaux 2021-08-11 15:06:16 UTC
Description of problem:
Client deleted a load balancer and while it was actually deleted, the associated amphora were not and are now in error state.

While looking in the logs we saw this:
WARNING octavia.network.drivers.neutron.allowed_address_pairs Can't deallocate VIP because the vip port $UUID cannot be found in neutron. Continuing cleanup.: octavia.network.base.PortNotFound: port not found

I will share the exact log details with UUIDs in the next private comments below.

If it helps, Octavia is being called by kubernetes on top so maybe something started from there (about that neutron port) ?

We have sosreports from the related nodes.


Version-Release number of selected component (if applicable):
OSP16.1.3

How reproducible:
Random

Steps to Reproduce:
1. the LB is deleted
2. Amphora are not deleted and in error state
3.

Actual results:
We have amphora stuck in error state.

Expected results:
Amphora instances should be deleted properly.

Additional info:
We have sosreport from the related nodes.

Comment 2 Gregory Thiemonge 2021-08-12 13:58:26 UTC
I believe this is a known upstream issue and the fix is already in 16.2.
The upstream change that fixes it is:

https://review.opendev.org/c/openstack/octavia/+/754057

Basically, it prevents the octavia housekeeping service to perform certificate rotation on DELETED amphorae.


Here are some findings from the logs:

The load balancer is 2c53638b-f288-4344-b426-3903bc602bbb
It was deleted on 2021-07-29:

2021-07-29 14:38:29.321 134 INFO octavia.controller.queue.v1.endpoints [-] Deleting load balancer '2c53638b-f288-4344-b426-3903bc602bbb'...

During a load balancer deletion, amphorae's status is set to DELETED, those DB entries are deleted from the DB by the housekeeping service after a certain period of time (7 days)

1st amphora: 314a1cb6-87ff-423b-8b9e-969075ec60a4
2nd amphora: 697d1ee9-0253-476e-8666-43025aac26ca

Both amphorae have 'cert_busy=True', which means that octavia (the housekeeping service) is updating the certificates in the amphorae.

In the housekeeping logs, we have:

2021-08-01 11:47:23.695 7 INFO octavia.controller.worker.v1.controller_worker [-] Start amphora cert rotation, amphora's id is: 314a1cb6-87ff-423b-8b9e-969075ec60a4
2021-08-01 11:47:23.761 7 INFO octavia.controller.worker.v1.controller_worker [-] Start amphora cert rotation, amphora's id is: 697d1ee9-0253-476e-8666-43025aac26ca
2021-08-01 12:03:31.429 7 ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 120) exhausted.  The amphora is unavailable. Reason: HTTPSConnectionPool(host='172.24.5.186', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa1f0cfac18>: Failed to establish a new connection: [Errno 113] No route to host',))
2021-08-01 12:03:31.431 7 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out
2021-08-01 12:03:31.493 7 ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 120) exhausted.  The amphora is unavailable. Reason: HTTPSConnectionPool(host='172.24.4.132', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa201b787b8>: Failed to establish a new connection: [Errno 113] No route to host',))
2021-08-01 12:03:31.495 7 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out

So the cert rotation was performed _after_ the deletion of the LB, it failed because the amphorae are not reachable (they no longer exist). And then the status of the amphorae is set to ERROR.

>> Maybe one small bug here: cert_busy should have been reverted to False if cert rotation fails

After looking at the code I've found that the bug has been fixed upstream but is still present in 16.1

The fix is https://review.opendev.org/c/openstack/octavia/+/754057

Without this patch, the housekeeping might perform some actions on DELETED amps which triggers some errors in the logs.

1. The consequence of the bug is that some DELETED amphorae appear as ERROR when listing amphorae.
2. It doesn't affect existing load balancers because only deleted resources trigger errors.


Cleaning those amphorae from the DB:

Since we don't have an API to delete amphorae in Octavia (it will be in OSP17), we have to use mysql to delete them:
  DELETE FROM amphora WHERE id = '314a1cb6-87ff-423b-8b9e-969075ec60a4'
  DELETE FROM amphora WHERE id = '697d1ee9-0253-476e-8666-43025aac26ca'
(Please create a backup of the DB before manually updating the DB!)
  

Additional note:

The following line:
Can't deallocate VIP because the vip port 62098591-664c-41f1-8caa-b5da9ecc0d62 cannot be found in neutron. Continuing cleanup.: octavia.network.base.PortNotFound: port not found (port id: 62098591-664c-41f1-8caa-b5da9ecc0d62).
is just a warning message, the load balancer deletion is not interrupted, this PortNotFound exception is correctly handled by Octavia

Comment 7 Gregory Thiemonge 2021-09-07 15:22:20 UTC
*** Bug 1966429 has been marked as a duplicate of this bug. ***

Comment 9 Gregory Thiemonge 2021-09-07 17:11:14 UTC
A workaround for this issue:

There's a setting in octavia.conf:

[house_keeping]/amphora_expiry_age
the default value is 1 week.

It means that an amphora with a DELETED status will be effectively deleted from the DB after one week.
During this one-week period, the housekeeping service might try to rotate the certificate if the cert_expiration timestamp is reached (certificates are rotated every 30 days) and then the amphorae are set to ERROR (this is the bug).

Setting this value to one hour (3600) will reduce the chance of hitting this bug.

The impact is that it will be more difficult to analyze other problems (having DELETED resources in the DB is really helpful for admins and for support teams).

Note that there's still a _small_ chance that this issue happens if the amphora's cert_expiration timestamp is reached during the next hour that follows the deletion of an amphora.

Comment 16 errata-xmlrpc 2022-03-24 11:00:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986


Note You need to log in before you can comment on or make changes to this bug.