Description of problem: Client deleted a load balancer and while it was actually deleted, the associated amphora were not and are now in error state. While looking in the logs we saw this: WARNING octavia.network.drivers.neutron.allowed_address_pairs Can't deallocate VIP because the vip port $UUID cannot be found in neutron. Continuing cleanup.: octavia.network.base.PortNotFound: port not found I will share the exact log details with UUIDs in the next private comments below. If it helps, Octavia is being called by kubernetes on top so maybe something started from there (about that neutron port) ? We have sosreports from the related nodes. Version-Release number of selected component (if applicable): OSP16.1.3 How reproducible: Random Steps to Reproduce: 1. the LB is deleted 2. Amphora are not deleted and in error state 3. Actual results: We have amphora stuck in error state. Expected results: Amphora instances should be deleted properly. Additional info: We have sosreport from the related nodes.
I believe this is a known upstream issue and the fix is already in 16.2. The upstream change that fixes it is: https://review.opendev.org/c/openstack/octavia/+/754057 Basically, it prevents the octavia housekeeping service to perform certificate rotation on DELETED amphorae. Here are some findings from the logs: The load balancer is 2c53638b-f288-4344-b426-3903bc602bbb It was deleted on 2021-07-29: 2021-07-29 14:38:29.321 134 INFO octavia.controller.queue.v1.endpoints [-] Deleting load balancer '2c53638b-f288-4344-b426-3903bc602bbb'... During a load balancer deletion, amphorae's status is set to DELETED, those DB entries are deleted from the DB by the housekeeping service after a certain period of time (7 days) 1st amphora: 314a1cb6-87ff-423b-8b9e-969075ec60a4 2nd amphora: 697d1ee9-0253-476e-8666-43025aac26ca Both amphorae have 'cert_busy=True', which means that octavia (the housekeeping service) is updating the certificates in the amphorae. In the housekeeping logs, we have: 2021-08-01 11:47:23.695 7 INFO octavia.controller.worker.v1.controller_worker [-] Start amphora cert rotation, amphora's id is: 314a1cb6-87ff-423b-8b9e-969075ec60a4 2021-08-01 11:47:23.761 7 INFO octavia.controller.worker.v1.controller_worker [-] Start amphora cert rotation, amphora's id is: 697d1ee9-0253-476e-8666-43025aac26ca 2021-08-01 12:03:31.429 7 ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 120) exhausted. The amphora is unavailable. Reason: HTTPSConnectionPool(host='172.24.5.186', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa1f0cfac18>: Failed to establish a new connection: [Errno 113] No route to host',)) 2021-08-01 12:03:31.431 7 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out 2021-08-01 12:03:31.493 7 ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 120) exhausted. The amphora is unavailable. Reason: HTTPSConnectionPool(host='172.24.4.132', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa201b787b8>: Failed to establish a new connection: [Errno 113] No route to host',)) 2021-08-01 12:03:31.495 7 ERROR octavia.controller.worker.v1.controller_worker octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out So the cert rotation was performed _after_ the deletion of the LB, it failed because the amphorae are not reachable (they no longer exist). And then the status of the amphorae is set to ERROR. >> Maybe one small bug here: cert_busy should have been reverted to False if cert rotation fails After looking at the code I've found that the bug has been fixed upstream but is still present in 16.1 The fix is https://review.opendev.org/c/openstack/octavia/+/754057 Without this patch, the housekeeping might perform some actions on DELETED amps which triggers some errors in the logs. 1. The consequence of the bug is that some DELETED amphorae appear as ERROR when listing amphorae. 2. It doesn't affect existing load balancers because only deleted resources trigger errors. Cleaning those amphorae from the DB: Since we don't have an API to delete amphorae in Octavia (it will be in OSP17), we have to use mysql to delete them: DELETE FROM amphora WHERE id = '314a1cb6-87ff-423b-8b9e-969075ec60a4' DELETE FROM amphora WHERE id = '697d1ee9-0253-476e-8666-43025aac26ca' (Please create a backup of the DB before manually updating the DB!) Additional note: The following line: Can't deallocate VIP because the vip port 62098591-664c-41f1-8caa-b5da9ecc0d62 cannot be found in neutron. Continuing cleanup.: octavia.network.base.PortNotFound: port not found (port id: 62098591-664c-41f1-8caa-b5da9ecc0d62). is just a warning message, the load balancer deletion is not interrupted, this PortNotFound exception is correctly handled by Octavia
*** Bug 1966429 has been marked as a duplicate of this bug. ***
A workaround for this issue: There's a setting in octavia.conf: [house_keeping]/amphora_expiry_age the default value is 1 week. It means that an amphora with a DELETED status will be effectively deleted from the DB after one week. During this one-week period, the housekeeping service might try to rotate the certificate if the cert_expiration timestamp is reached (certificates are rotated every 30 days) and then the amphorae are set to ERROR (this is the bug). Setting this value to one hour (3600) will reduce the chance of hitting this bug. The impact is that it will be more difficult to analyze other problems (having DELETED resources in the DB is really helpful for admins and for support teams). Note that there's still a _small_ chance that this issue happens if the amphora's cert_expiration timestamp is reached during the next hour that follows the deletion of an amphora.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0986