Description of problem:
In environments with a significant number of HA L3 routers (not DVR), there appear to be significant stability issues during failover or maintenance events. With the L3 agent running on 3 controller nodes (the default), the routers operate normally while all agents are online. However, when a controller node is rebooted or one node suffers a network failure, the other L3 agents come under significant load driven by keepalived. The issues seem to be independent of workload network traffic (an idle environment can show the same behavior).

A network interruption between L3 agents can drive significant keepalived load on the other L3 nodes, causing a cascading failure of networking services. This additional keepalived load drives high OVS CPU load. When L3 agents are colocated on controller nodes, this failure can cause outages in other control plane services (e.g. crashing pacemaker services). The keepalived load also seems to cause numerous L3 router instances to flap between MASTER and STANDBY mode. Often multiple master instances are active simultaneously, causing networking delays and outages associated with duplicate IPs and MACs on the same L2 network.

This issue occurs across multiple independent environments and does not seem to be related to the underlying hardware (including networking). Overall, this looks like an issue/bug in keepalived. Once the high load and stability issues occur, killing/restarting keepalived can resolve them. I'll provide additional details in private comments.

Version-Release number of selected component (if applicable):
OSP 13
keepalived-1.3.5-16.el7.x86_64
openstack-neutron-12.1.1-6.el7ost.noarch

How reproducible:
Unknown overall; 100% in these specific environments.

Steps to Reproduce:
1. Deploy a few hundred L3 HA routers.
2. Reboot one L3 node or cause a network disruption.
3. Observe high load on the remaining nodes, driven by keepalived.

Actual results:
Incomplete or failed L3 failover, high system load, and failures.

Expected results:
L3 failover completes with no significant impact on the L3 nodes.
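The multiple-master condition described above could be checked for with something like the following. This is only a minimal sketch, not part of the original report. It assumes each L3 node keeps one directory per HA router under /var/lib/neutron/ha_confs with a "state" file holding the current role; that path, the accepted state strings, and the helper names read_local_states and find_multi_master are assumptions made for illustration. Gathering the per-node results in one place (for example over SSH) is left out.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: per-router state files live here on each L3 node.
HA_CONFS = Path("/var/lib/neutron/ha_confs")

# Assumption: the active role is recorded as "master" (or "primary"
# on newer releases); anything else is treated as standby.
ACTIVE_STATES = {"master", "primary"}


def read_local_states(ha_confs=HA_CONFS):
    """Return {router_id: state} for the HA routers found on this node."""
    states = {}
    for state_file in ha_confs.glob("*/state"):
        router_id = state_file.parent.name
        states[router_id] = state_file.read_text().strip().lower()
    return states


def find_multi_master(states_by_node):
    """Return {router_id: [nodes]} for routers active on more than one node."""
    active_nodes = defaultdict(list)
    for node, states in states_by_node.items():
        for router_id, state in states.items():
            if state in ACTIVE_STATES:
                active_nodes[router_id].append(node)
    return {
        router_id: sorted(nodes)
        for router_id, nodes in active_nodes.items()
        if len(nodes) > 1
    }


if __name__ == "__main__":
    # Hypothetical data: router-a is active on two controllers at once.
    sample = {
        "controller-0": {"router-a": "master", "router-b": "backup"},
        "controller-1": {"router-a": "master", "router-b": "master"},
        "controller-2": {"router-a": "backup", "router-b": "backup"},
    }
    for router_id, nodes in find_multi_master(sample).items():
        print(f"{router_id} is active on: {', '.join(nodes)}")
```

Running read_local_states() on each controller and passing the combined mapping to find_multi_master() would list the routers that currently have more than one active instance, which is the state associated with the duplicate IPs and MACs noted in the description.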
Hey team, I wanted to ask whether we have a plan for this. Thank you.
*** Bug 1869047 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2385
Hi Rodolfo, per my previous comment: is it possible to get a z13 hotfix created for this BZ? Thanks! -Richard
@richard, the package has already been released, and we don't approve a hotfix (HF) for a bug whose fix has already shipped. Please update to the latest RH OSP 13 release, which should include the fix.