Bug 2293048 - 17.1 minor update has control plane API outage
Summary: 17.1 minor update has control plane API outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 17.1
Assignee: Lukas Bezdicka
QA Contact: Archana Singh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-06-19 09:07 UTC by Lukas Bezdicka
Modified: 2025-03-22 04:25 UTC

Fixed In Version: puppet-tripleo-14.2.3-17.1.20240821083746.40278e1.el8ost puppet-tripleo-14.2.3-17.1.20240821080808.40278e1.el9ost tripleo-ansible-3.3.1-17.1.20240918093754.8debef3.el8ost tripleo-ansible-3.3.1-17.1.20240918100824.8debef3.el9ost openstack-tripleo-heat-templates-14.3.1-17.1.20240919123748.e7c7ce3.el8ost openstack-tripleo-heat-templates-14.3.1-17.1.20240919130751.e7c7ce3.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-11-21 09:30:33 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-32347 0 None None None 2024-06-19 09:13:31 UTC
Red Hat Product Errata RHSA-2024:9978 0 None None None 2024-11-21 09:30:35 UTC

Description Lukas Bezdicka 2024-06-19 09:07:56 UTC
Description of problem:
During a minor update on OSP 17 we still cause an API outage on the control plane. The CI is simply set to ignore a certain number of failed requests.

Version-Release number of selected component (if applicable):
17.1

How reproducible:
Always

This issue boils down to how packets are routed:
~~~
{request} -> VIP -> haproxy -> roundrobin(backendA,backendB,backendC)
~~~

A minor update migrates the VIP off the controller that is being updated, but haproxy on the next node then still does round robin against all 3 controllers for each endpoint. To solve this we could use the haproxy control socket, but we have it deliberately disabled, and it seems the reasons were security related.
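
To illustrate the routing problem, here is a minimal Python sketch of plain round robin over three backends. It is a toy model, not haproxy code: the function name `requests_hitting_updated_node` is made up for illustration, and the model deliberately ignores health checks, so it only shows the window before haproxy marks the backend as down.

```python
from itertools import cycle

# Backend names taken from the diagram above.
BACKENDS = ["backendA", "backendB", "backendC"]


def requests_hitting_updated_node(updating, total_requests):
    """Count requests that plain round robin sends to the backend being updated.

    Hypothetical helper: assumes no health check has removed the backend yet.
    """
    rotation = cycle(BACKENDS)
    return sum(1 for _ in range(total_requests) if next(rotation) == updating)


if __name__ == "__main__":
    # Moving the VIP away does not change the rotation, so one request
    # in three still lands on the controller under update.
    print(requests_hitting_updated_node("backendA", 90))
```

In this model, moving the VIP has no effect on which backend a request reaches, which is why the update still produces failed requests.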

I propose 2 simple changes to resolve this issue:
- Introduce nftables rules with a 20m timeout that block SYN packets to the controller being updated. The 20m timeout covers the case where a failure during the update would otherwise leave the node disabled. The rules will carry a simple comment, making them easy to clean up in a post-update task. Only SYN is blocked so that requests that already landed on the node can finish, while new connections just trigger the backend availability check in the other two haproxy servers. This change dropped 503s in testing from ~90 to 2-3.

- Change the default backend settings in haproxy to include option redispatch. The haproxy availability check takes a few seconds (up to 5), during which haproxy can still pick the disabled backend. It will then keep retrying 3 times, as we have retries set to 3, but it will never try a different node. On the first run you will still get ~2 503s, as the config first has to be changed, but on subsequent reruns we had 0 503s.
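
The effect of the second change can be sketched with a small Python model. This is a simplified illustration under stated assumptions, not haproxy's implementation: the `send` function is hypothetical, it assumes one initial attempt plus 3 retries, and it assumes that with redispatch enabled only the last retry is sent to a different backend.

```python
BACKENDS = ["backendA", "backendB", "backendC"]
RETRIES = 3  # retry count mentioned in the report


def send(first_choice, down, redispatch):
    """Return the backend that served the request, or None for a 503.

    Hypothetical model: one initial attempt plus RETRIES retries.
    Without redispatch every retry targets the same backend; with it,
    the last retry moves to the next backend in the list.
    """
    target = first_choice
    for attempt in range(RETRIES + 1):
        if target not in down:
            return target
        next_is_last_retry = attempt == RETRIES - 1
        if redispatch and next_is_last_retry:
            idx = BACKENDS.index(target)
            target = BACKENDS[(idx + 1) % len(BACKENDS)]
    return None


if __name__ == "__main__":
    down = {"backendA"}  # controller being updated, not yet marked down
    print(send("backendA", down, redispatch=False))
    print(send("backendA", down, redispatch=True))
```

Without redispatch, a request that was assigned to the blocked backend exhausts its retries there and fails; with redispatch it gets one attempt on another controller.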

Comment 17 errata-xmlrpc 2024-11-21 09:30:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:9978

Comment 18 Red Hat Bugzilla 2025-03-22 04:25:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

