Bug 1869355 - OSP 13 - HA L3 router/keepalived stability issues (ML2/OVS)
Summary: OSP 13 - HA L3 router/keepalived stability issues (ML2/OVS)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z15
Target Release: 13.0 (Queens)
Assignee: Rodolfo Alonso
QA Contact: Candido Campos
URL:
Whiteboard:
Duplicates: 1869047
Depends On:
Blocks: 2077016 2096223
 
Reported: 2020-08-17 16:26 UTC by Matt Flusche
Modified: 2024-03-25 16:18 UTC
CC List: 28 users

Fixed In Version: openstack-neutron-12.1.1-38.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2077016
Environment:
Last Closed: 2021-06-16 10:58:53 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Issue Tracker OSP-160 (last updated 2022-02-13 14:42:31 UTC)
Red Hat Knowledge Base (Solution) 3948641 (last updated 2022-03-28 13:19:13 UTC)
Red Hat Product Errata RHBA-2021:2385 (last updated 2021-06-16 10:59:47 UTC)

Description Matt Flusche 2020-08-17 16:26:22 UTC
Description of problem:
In environments with a significant number of HA L3 routers (not DVR), there appear to be serious stability issues during failover or maintenance events. With the L3 agent running on three controller nodes (the default), the routers operate normally while all agents are online. However, during a reboot of one controller node, or a network failure affecting one node, the remaining L3 agents come under significant load driven by keepalived.

The issue appears to be independent of workload network traffic; an idle environment can show the same behavior.

A network interruption between L3 agents can drive significant keepalived load on the surviving L3 nodes, causing a cascading failure of networking services. This additional keepalived load in turn drives high OVS CPU load.
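
To watch both process families at once while reproducing this, a short sampler can help; this is a minimal sketch, assuming psutil is installed on the node and the stock process names (keepalived, ovs-vswitchd), which may differ per deployment:

#!/usr/bin/env python3
"""Sample CPU usage of keepalived and OVS over a short window."""
import time
import psutil

WATCHED = ("keepalived", "ovs-vswitchd")  # stock names; adjust if your deployment differs

procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] in WATCHED]
for p in procs:
    p.cpu_percent(None)  # prime the per-process counter

time.sleep(5)  # sampling window

totals = {}
for p in procs:
    try:
        totals[p.info["name"]] = totals.get(p.info["name"], 0.0) + p.cpu_percent(None)
    except psutil.NoSuchProcess:
        pass  # process exited during the window

for name, pct in sorted(totals.items()):
    print(f"{name:>14}: {pct:6.1f}% CPU over 5s")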

When L3 agents are colocated on controller nodes, this failure can cause outages in other control plane services (for example, crashing Pacemaker-managed services).

The keepalived load also appears to cause numerous L3 router instances to flap between MASTER and STANDBY state. Often, multiple MASTER instances for the same router are active simultaneously, causing network delays and outages associated with duplicate IPs and MAC addresses on the same L2 network.
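
The flapping can be confirmed from outside keepalived by tallying per-router VRRP state on each node; a minimal sketch, assuming the default ha_confs layout (/var/lib/neutron/ha_confs/<router_id>/state containing "master" or "backup"). Run it on all L3 nodes at roughly the same time: any router reported as master on more than one node points at the split-brain described above:

#!/usr/bin/env python3
"""Tally per-router VRRP state on one L3 node."""
from collections import Counter
from pathlib import Path

HA_CONFS = Path("/var/lib/neutron/ha_confs")  # assumed default state_path layout

counts = Counter()
for state_file in HA_CONFS.glob("*/state"):
    state = state_file.read_text().strip() or "unknown"
    counts[state] += 1
    if state == "master":
        print(f"master: {state_file.parent.name}")  # router UUID

print(dict(counts))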

This issue occurs across multiple independent environments and does not appear to be related to the underlying hardware (including networking).

Overall, this appears to be a bug in keepalived itself. Once the high load and instability set in, killing/restarting the keepalived processes can resolve the issue.
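
As a stop-gap, that restart can be scripted per router; a sketch only, under two assumptions not confirmed in this BZ: the router UUID appears on the keepalived command line, and the L3 agent's external process monitor respawns the daemon after it is killed. Verify both on your deployment first:

#!/usr/bin/env python3
"""Recycle the keepalived instance of a single HA router."""
import os
import signal
import subprocess
import sys

def keepalived_pids(router_id):
    """Find keepalived PIDs whose command line mentions the router UUID."""
    out = subprocess.run(["pgrep", "-f", f"keepalived.*{router_id}"],
                         capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

if __name__ == "__main__":
    router = sys.argv[1]  # router UUID
    pids = keepalived_pids(router)
    if not pids:
        sys.exit(f"no keepalived process found for router {router}")
    for pid in pids:
        os.kill(pid, signal.SIGTERM)  # the L3 agent is expected to respawn it
        print(f"sent SIGTERM to keepalived pid {pid}")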

I'll provide additional details in private comments.

Version-Release number of selected component (if applicable):
OSP 13
keepalived-1.3.5-16.el7.x86_64
openstack-neutron-12.1.1-6.el7ost.noarch

How reproducible:
Unknown overall; 100% reproducible in the specific environments described.

Steps to Reproduce:
1. Deploy a few hundred HA L3 routers (a bulk-creation sketch follows these steps)
2. Reboot one L3 node or cause a network disruption
3. Observe high keepalived-driven load on the remaining nodes
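
For step 1, bulk creation can be scripted with openstacksdk; a sketch under assumptions: the cloud name "overcloud" is a placeholder for your clouds.yaml entry, and with l3_ha=True in neutron.conf the explicit HA flag may be redundant:

#!/usr/bin/env python3
"""Create a few hundred HA routers to reproduce the load pattern."""
import openstack

conn = openstack.connect(cloud="overcloud")  # placeholder cloud name

for i in range(300):
    # is_ha maps to the Networking API's "ha" attribute
    conn.network.create_router(name=f"ha-repro-{i}", is_ha=True)
    print(f"created ha-repro-{i}")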

Actual results:
Incomplete or failed L3 failover; high system load and cascading failures.

Expected results:
Clean L3 failover with no significant impact on the L3 nodes.

Comment 8 jhardee 2020-08-31 17:21:25 UTC
Hey team,

Wanted to ask if we have a plan for this?

Thank you.

Comment 49 Slawek Kaplonski 2020-09-21 19:55:48 UTC
*** Bug 1869047 has been marked as a duplicate of this bug. ***

Comment 85 errata-xmlrpc 2021-06-16 10:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385

Comment 87 Richard Barrott 2021-08-24 05:30:37 UTC
Hi Rodolfo,
Per my previous comment, is it possible to get a z13 hotfix created for this BZ?

Thanks!

-Richard

Comment 88 PURANDHAR SAIRAM MANNIDI 2021-08-24 05:38:18 UTC
@Richard,

The package has already been released, and we don't approve a hotfix for a bug whose fix is already released.

Please update to the latest RH OSP 13; that release should contain the fix.

