Bug 1281603

Summary: OSP7 with l3_ha=false takes longer to fail over neutron ports when one controller is rebooted
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 7.0 (Kilo)
Target Milestone: ---
Target Release: 7.0 (Kilo)
Hardware: All
OS: Linux
Keywords: ZStream
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: medium
Reporter: Rama <rnishtal>
Assignee: lpeer <lpeer>
QA Contact: Ofer Blaut <oblaut>
Docs Contact:
CC: amuller, chrisw, jdonohue, nyechiel, rnishtal, shivrao, srevivo
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-04 16:09:50 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1191185, 1198809, 1243520    
Attachments:
VM Connectivity tests with l3_ha=false (flags: none)

Description Rama 2015-11-12 21:44:44 UTC
Created attachment 1093472 [details]
VM Connectivity tests with l3_ha=false

Description of problem:
In OSP7 with l3_ha=false set, neutron ports take much longer to fail over to the surviving controller nodes. This results in connectivity loss to the VMs.
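For reference, "setting l3_ha=false" refers to the HA-router option in neutron.conf on the controllers. A minimal sketch of the relevant [DEFAULT] options follows; only l3_ha = False comes from this report, while the other two lines are assumptions (comment 3 below asks whether automatic failover was enabled, and 75 is the upstream default it mentions):

    [DEFAULT]
    # HA (VRRP) routers disabled, as described in this report
    l3_ha = False
    # Assumed enabled so routers can be rescheduled off a dead L3 agent (see comment 3)
    allow_automatic_l3agent_failover = True
    # Upstream default heartbeat threshold used to declare an L3 agent down
    agent_down_time = 75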


Version-Release number of selected component (if applicable):
OSP7 + y1

How reproducible:
Always.
The setup includes the Cisco n1kv plugin, which requires l3_ha to be set to true. It takes anywhere between 2 and 3 minutes to re-establish connectivity to a VM when a controller node is rebooted.

Also, once the node comes back, the routers are unevenly distributed. It isn't clear whether the ports should fail back or remain where they are.

Steps to Reproduce:
1. Preferably configure the n1kv plugin, set l3_ha=false, and restart neutron.
2. Create around 50-100 VMs. While an SSH/ping test is running against the VMs, reboot a controller node and check how long connectivity is lost (a rough measurement sketch follows these steps).
3. Let the rebooted controller node come back and check the namespaces for qrouters.
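Not part of the original report: a minimal sketch, assuming a single test VM reachable at a placeholder floating IP (10.0.0.5 here), of how the connectivity-loss window in step 2 could be timed:

    #!/usr/bin/env python
    # Rough connectivity-loss timer for one test VM (placeholder IP below).
    import os
    import subprocess
    import time

    VM_IP = "10.0.0.5"   # placeholder floating IP of a test VM
    INTERVAL = 1         # seconds between probes

    devnull = open(os.devnull, "w")
    outage_started = None
    while True:
        # One ping probe; alive is True when the VM answers within 1 second.
        alive = subprocess.call(
            ["ping", "-c", "1", "-W", "1", VM_IP],
            stdout=devnull, stderr=devnull) == 0
        now = time.time()
        if not alive and outage_started is None:
            outage_started = now
            print("connectivity lost at %s" % time.ctime(now))
        elif alive and outage_started is not None:
            print("connectivity restored after %.0f seconds" % (now - outage_started))
            outage_started = None
        time.sleep(INTERVAL)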

Actual results:


Expected results:
Expected to fail over within 15-30 seconds, as in OSP6.

Additional info:

Comment 2 Shiva Prasad Rao 2015-11-12 22:05:33 UTC
*Correction to the description: the Cisco n1kv plugin requires l3_ha to be set to FALSE.

Comment 3 Assaf Muller 2015-12-09 21:56:31 UTC
(In reply to Rama from comment #0)
> Created attachment 1093472 [details]
> VM Connectivity tests with l3_ha=false
> 
> Description of problem:
> In OSP7 setting l3_ha=false, neutron ports take much longer time to failover
> to surviving controller nodes. This results in connectivity losses to VM.

... snip ...

> Expected results:
> Expected to fail back in 15-30 seconds as in OSP6.

Are you comparing OSP 6 with neutron.conf:l3_ha = True to OSP 7 with neutron.conf:l3_ha = False and neutron.conf:allow_automatic_l3agent_failover = True? The first provides failover in the range of 10 seconds because the routers are scheduled pre-emptively on multiple agents. The second has to first detect that an agent is down, reschedule its routers, notify the newly assigned agent, and then have that agent configure the routers.

Here's how long I'd estimate a failover to take with neutron.conf:allow_automatic_l3agent_failover = True:

If neutron.conf:agent_down_time is set to the default of 75 seconds, then:
1) The rescheduling loop runs every 75 / 2 seconds, or 38 seconds.
2) The rescheduling loop considers an agent as dead if it hasn't reported in 75 * 2, or 150 seconds.

In the worst case that is 38 + 150 = 188 seconds, or just over 3 minutes, before the routers on the dead host even start being rescheduled.

3) It takes roughly 8 seconds to configure a router on a VM of mine; let's assume 1 second on faster bare metal. The time to recover from a dead agent scales linearly with the number of routers scheduled to that host, so 500 routers can take another 8 minutes or so. I've seen reports of this taking over an hour in real life.
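A small sketch of that back-of-the-envelope arithmetic (the 75-second agent_down_time is the default mentioned above; the router count and per-router time are the assumptions from this comment, not measurements):

    # Worst-case time to restore routers after a controller dies, per the reasoning above.
    def worst_case_failover(agent_down_time=75, routers=500, seconds_per_router=1):
        reschedule_interval = agent_down_time / 2.0  # rescheduling loop period (~38 s)
        dead_threshold = agent_down_time * 2         # agent declared dead after ~150 s
        detection = reschedule_interval + dead_threshold
        reconfiguration = routers * seconds_per_router
        return detection + reconfiguration

    # 500 routers at 1 s each: ~188 s detection + 500 s reconfiguration, about 11.5 minutes.
    print(worst_case_failover())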

L3 HA was built specifically to mitigate the issues with allow_automatic_l3agent_failover.

Comment 4 Joe Donohue 2016-02-11 19:54:19 UTC
Making this bugzilla public with the partner's permission.

Comment 5 Assaf Muller 2016-06-04 16:09:50 UTC
Still waiting on a needinfo from 2015; closing for now. Please re-open if still relevant.

Comment 6 Red Hat Bugzilla 2023-09-14 03:12:56 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days