Description of problem:
During a minor update on OSP 17 we still cause an API outage on the control plane. The CI is simply set to ignore a certain number of failed requests.

Version-Release number of selected component (if applicable):
17.1

How reproducible:
Always

This issue boils down to how packets are routed:
~~~
{request} -> VIP -> haproxy -> roundrobin(backendA,backendB,backendC)
~~~

The minor update migrates the VIP off the controller that is being updated, but haproxy on the next node then still does round robin against all 3 controllers for each endpoint.

To solve this we could use the haproxy control socket, but we have it deliberately disabled, and it seems the reasons were security related.

I propose 2 simple changes to resolve this issue:

- Introduce nftables rules with a 20m timeout that block SYN packets to the controller being updated. The 20m timeout covers the case where a failure during the update would otherwise leave the node disabled. The rules will carry a simple comment, making them easy to clean up in a post-update task. Only SYN is blocked so that requests that have already landed on the node can finish, while new connection attempts fail and trigger the backend availability check in the other two haproxy servers. This change dropped 503s in testing from ~90 to 2~3.

- Change the default backend settings in haproxy to include option redispatch. The haproxy availability check takes a few seconds (up to 5), during which haproxy still picks the disabled backend. It then keeps retrying 3 times, as we have retries set to 3, but it never tries a different node. On the first run you will still get ~2 503s, as the config first has to be changed, but on subsequent reruns we had 0 503s.
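The effect of the second proposal can be illustrated with a toy model of the window before the availability check marks the updated controller as down. This is only a sketch, not haproxy code: the function name `send_requests`, its parameters, and the backend labels are hypothetical. It assumes that a failed connection is retried against the same server, and that with redispatch enabled only the last retry may move to another server.

```python
from itertools import cycle


def send_requests(n_requests, backends, unreachable, retries=3, redispatch=False):
    """Count 503 responses in a toy model of round robin with retries.

    The model covers only the period in which the health check has not yet
    marked the unreachable backend as down, so round robin still selects it.
    """
    rr = cycle(backends)
    failures = 0
    for _ in range(n_requests):
        target = next(rr)
        ok = target not in unreachable
        attempt = 0
        while not ok and attempt < retries:
            attempt += 1
            # Assumption: only the last retry may go to a different server,
            # and only when redispatch is enabled.
            if redispatch and attempt == retries:
                target = next(rr)
            ok = target not in unreachable
        if not ok:
            # All retries exhausted: the client sees a 503.
            failures += 1
    return failures
```

Under these assumptions, `send_requests(9, ["A", "B", "C"], {"A"})` counts 3 failures, because every third request is pinned to the unreachable backend, while the same call with `redispatch=True` counts 0.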
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-heat-templates) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:9978
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days