Bug 1869355

Summary: OSP 13 - HA L3 router/keepalived stability issues (ML2/OVS)
Product: Red Hat OpenStack
Reporter: Matt Flusche <mflusche>
Component: openstack-neutron
Assignee: Rodolfo Alonso <ralonsoh>
Status: CLOSED ERRATA
QA Contact: Candido Campos <ccamposr>
Severity: urgent
Priority: urgent
Version: 13.0 (Queens)
CC: ahyder, alolivei, bcafarel, bdobreli, bperkins, bsawyers, bshephar, chrisw, cluster-maint, dalvarez, dhill, ekuris, eolivare, fleitner, jdolling, jhardee, ldenny, ltamagno, oblaut, pmannidi, pveiga, ralonsoh, rbarrott, rohara, scohen, skaplons, sputhenp, takirby
Target Milestone: z15
Keywords: TestCannotAutomate, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-neutron-12.1.1-38.el7ost
Doc Type: No Doc Update
Clone Of:
: 2077016 (view as bug list)
Last Closed: 2021-06-16 10:58:53 UTC
Type: Bug
Bug Blocks: 2077016, 2096223    

Description Matt Flusche 2020-08-17 16:26:22 UTC
Description of problem:
In environments with a significant number of HA L3 routers (not DVR), there appear to be significant stability issues during failover or maintenance events. With the L3 agent running on 3 controller nodes (the default), the routers operate normally while all agents are online. However, during a reboot of a controller node or a network failure on one node, the other L3 agents come under significant load driven by keepalived.

The issues seem to be independent of workload network traffic (an idle environment can show the same behavior).

A network interruption between L3 agents can drive significant keepalived load on the other L3 nodes, causing a cascading failure of networking services. The additional keepalived load in turn drives high OVS CPU load.
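
For illustration, a minimal sketch of how the CPU split described above might be confirmed on an affected node. It assumes only a standard procps `ps` and the usual process names (keepalived, ovs-vswitchd); it is not part of the original report.

    # Sketch: sum %CPU per process name to check whether keepalived and
    # ovs-vswitchd dominate an affected L3 node. Assumes procps 'ps'.
    import subprocess
    from collections import defaultdict

    out = subprocess.check_output(["ps", "-eo", "pcpu,comm"],
                                  universal_newlines=True)
    totals = defaultdict(float)
    for line in out.splitlines()[1:]:          # skip the header row
        pcpu, _, comm = line.strip().partition(" ")
        totals[comm.strip()] += float(pcpu)

    for name in ("keepalived", "ovs-vswitchd"):
        print("%-15s %6.1f %%CPU" % (name, totals.get(name, 0.0)))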

When L3 agents are colocated on controller nodes, this failure can cause outages of other control plane services (e.g., crashing Pacemaker-managed services).

The keepalived load also seems to cause numerous L3 router instances to flap between MASTER and STANDBY state. Often multiple MASTER instances are active simultaneously, causing network delays and outages associated with duplicate IPs and MACs on the same L2 network.
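
As a hedged illustration (not from the report) of spotting the multi-MASTER symptom from the API side, the following sketch uses python-neutronclient and keystoneauth1 as shipped with Queens. The auth URL and credentials are placeholders, and it assumes the agent-hosting-router API reports ha_state as "active"/"standby".

    # Sketch: flag HA routers that more than one L3 agent reports as active.
    # auth_url and credentials below are placeholders.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from neutronclient.v2_0 import client

    auth = v3.Password(auth_url="http://overcloud.example:5000/v3",
                       username="admin", password="secret",
                       project_name="admin",
                       user_domain_name="Default",
                       project_domain_name="Default")
    neutron = client.Client(session=session.Session(auth=auth))

    for router in neutron.list_routers()["routers"]:
        if not router.get("ha"):
            continue
        agents = neutron.list_l3_agent_hosting_routers(router["id"])["agents"]
        active = [a["host"] for a in agents
                  if a.get("ha_state") in ("active", "master")]
        if len(active) > 1:
            print("router %s reported active on: %s" % (router["id"], active))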

This issue occurs across multiple independent environments and does not seem to be related to the underlying hardware (including networking). 

Overall, this seems to be an issue/bug with keepalived. Once the high load and stability issues occur, killing/restarting keepalived can resolve them.
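
To illustrate the last point, a hedged sketch that maps each running keepalived process to its HA router ID, so individual instances can be restarted rather than all of them. It assumes the default ha_confs path ($state_path/ha_confs) used by the L3 agent and that it runs where the keepalived processes are visible (e.g., inside the neutron-l3-agent container on OSP 13); it is not part of the original report.

    # Sketch: map keepalived PIDs to neutron HA router IDs by parsing the
    # '-f .../ha_confs/<router_id>/keepalived.conf' argument from each
    # command line (default ha_confs path assumed).
    # Note: pgrep exits non-zero if no keepalived process is running.
    import re
    import subprocess

    out = subprocess.check_output(["pgrep", "-a", "keepalived"],
                                  universal_newlines=True)
    pattern = re.compile(r"^(\d+)\s.*ha_confs/([0-9a-f-]{36})/keepalived\.conf")
    for line in out.splitlines():
        m = pattern.search(line)
        if m:
            pid, router_id = m.groups()
            print("keepalived pid %s -> router %s" % (pid, router_id))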

I'll provide additional details in private comments.

Version-Release number of selected component (if applicable):
OSP 13
keepalived-1.3.5-16.el7.x86_64
openstack-neutron-12.1.1-6.el7ost.noarch

How reproducible:
Unknown overall; 100% reproducible in these specific environments.

Steps to Reproduce:
1. Deploy a few hundred L3 HA routers (a scripted sketch follows below)
2. Reboot one L3 node or cause a network disruption
3. Observe high load on the remaining nodes, driven by keepalived
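
For step 1, a minimal scripted sketch that creates a few hundred HA routers as admin; python-neutronclient is assumed as in the earlier sketch, and the credentials, count, and naming are arbitrary placeholders.

    # Sketch for step 1: create a few hundred HA routers as admin.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from neutronclient.v2_0 import client

    auth = v3.Password(auth_url="http://overcloud.example:5000/v3",
                       username="admin", password="secret",
                       project_name="admin",
                       user_domain_name="Default",
                       project_domain_name="Default")
    neutron = client.Client(session=session.Session(auth=auth))

    for i in range(300):
        neutron.create_router({"router": {"name": "ha-repro-%03d" % i,
                                          "ha": True}})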

Actual results:
Incomplete or failed L3 failover; high system load and failures.

Expected results:
Successful L3 failover with no significant impact to the L3 nodes.

Comment 8 jhardee 2020-08-31 17:21:25 UTC
Hey team,

Wanted to ask if we have a plan for this?

Thank you.

Comment 49 Slawek Kaplonski 2020-09-21 19:55:48 UTC
*** Bug 1869047 has been marked as a duplicate of this bug. ***

Comment 85 errata-xmlrpc 2021-06-16 10:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385

Comment 87 Richard Barrott 2021-08-24 05:30:37 UTC
Hi Rodolfo,
Per my previous comment, is it possible to get a z13 hotfix created for this BZ?

Thanks!

-Richard

Comment 88 PURANDHAR SAIRAM MANNIDI 2021-08-24 05:38:18 UTC
@Richard,

The package has already been released, and we don't approve a hotfix for a bug whose fix has already been released.

Please update to the latest RH OSP 13; that should have the fix.