Bug 1281603 - OSP7 with l3_ha=false takes longer time to fail neutron ports when one Controller is rebooted [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Importance: medium unspecified
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: lpeer
QA Contact: Ofer Blaut
URL:
Whiteboard:
Keywords: ZStream
Depends On:
Blocks: 1191185 1198809 1243520
 
Reported: 2015-11-12 21:44 UTC by Rama
Modified: 2016-06-04 16:09 UTC
CC: 7 users

Clone Of:
Last Closed: 2016-06-04 16:09:50 UTC
amuller: needinfo? (rnishtal)


Attachments
VM Connectivity tests with l3_ha=false (11.56 KB, text/plain)
2015-11-12 21:44 UTC, Rama

Description Rama 2015-11-12 21:44:44 UTC
Created attachment 1093472 [details]
VM Connectivity tests with l3_ha=false

Description of problem:
In OSP7 with l3_ha=false, neutron ports take much longer to fail over to the surviving controller nodes. This results in connectivity loss to VMs.


Version-Release number of selected component (if applicable):
OSP7 + y1

How reproducible:
Always.
The Cisco n1kv plugin requires l3_ha to be set to true. It takes anywhere between 2 and 3 minutes to reestablish connectivity to a VM when a controller node is rebooted.

Also, once the node comes back, the distribution of routers across agents is uneven. It isn't clear whether the routers should fail back or remain where they are.

Steps to Reproduce:
1. Preferably configure the n1kv plugin, set l3_ha=false, and restart neutron.
2. Create around 50-100 VMs. While an SSH/ping test is running against the VMs, reboot a controller node. Check how long connectivity is lost.
3. Let the rebooted controller node come back and check the namespaces for qrouters.
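
For reference, step 1 above corresponds to a neutron.conf fragment along these lines (a sketch only; the exact file layout depends on your deployment tooling, and the agent_down_time value shown is simply the upstream default):

```ini
[DEFAULT]
# Disable L3 HA (VRRP-based router redundancy); required by the Cisco n1kv plugin
l3_ha = false
# With L3 HA off, recovery relies on rescheduling routers away from dead agents
allow_automatic_l3agent_failover = true
# Default agent liveness timeout; drives the rescheduling intervals discussed below
agent_down_time = 75
```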

Actual results:


Expected results:
Expected to fail over in 15-30 seconds, as in OSP6.

Additional info:

Comment 2 Shiva Prasad Rao 2015-11-12 22:05:33 UTC
*Cisco n1kv plugin requires l3_ha set to be FALSE.

Comment 3 Assaf Muller 2015-12-09 21:56:31 UTC
(In reply to Rama from comment #0)
> Created attachment 1093472 [details]
> VM Connectivity tests with l3_ha=false
> 
> Description of problem:
> In OSP7 setting l3_ha=false, neutron ports take much longer time to failover
> to surviving controller nodes. This results in connectivity losses to VM.

... snip ...

> Expected results:
> Expected to fail back in 15-30 seconds as in OSP6.

Are you comparing OSP 6 with neutron.conf:l3_ha = True to OSP 7 with neutron.conf:l3_ha = False and neutron.conf:allow_automatic_l3agent_failover = True? The first provides failover in the range of 10 seconds because standby routers are scheduled pre-emptively. The second has to first detect that an agent is down, reschedule its routers, notify the intended agent, and then have that agent configure each router.

Here's how long I'd estimate a failover to take with conf:allow_automatic_l3agent_failover = True:

If neutron.conf:agent_down_time is set to the default of 75 seconds, then:
1) The rescheduling loop runs every 75 / 2 seconds, or 38 seconds.
2) The rescheduling loop considers an agent as dead if it hasn't reported in 75 * 2, or 150 seconds.

In the worst case that would be 38 + 150 seconds = 188 seconds, or just over 3 minutes to start rescheduling routers on the dead host.

3) It takes roughly 8 seconds to configure a router on a VM of mine; let's assume 1 second on faster bare metal. The total rescheduling time scales linearly with the number of routers scheduled to the dead agent: for 500 routers this can add another 8 minutes. I've seen reports of this taking over an hour in real life.
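
The arithmetic above can be sketched as a small back-of-the-envelope helper (an illustrative estimate only; the function name, defaults, and per-router cost are assumptions of mine, not anything read from Neutron):

```python
# Hedged sketch: worst-case failover estimate when
# allow_automatic_l3agent_failover = True, using the timings from this comment.
# The per-router configuration time is an assumed constant, not a measured value.

def worst_case_failover_seconds(agent_down_time: float = 75,
                                num_routers: int = 1,
                                seconds_per_router: float = 1.0) -> float:
    """Estimate worst-case time until all routers are reconfigured."""
    reschedule_interval = agent_down_time / 2   # rescheduling loop period (~38 s)
    dead_after = agent_down_time * 2            # agent declared dead (~150 s)
    detection = reschedule_interval + dead_after  # ~188 s before rescheduling starts
    return detection + num_routers * seconds_per_router

# Detection alone is about 188 seconds, just over 3 minutes; 500 routers at
# ~1 s each on bare metal add roughly another 8 minutes on top of that.
```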

L3 HA was purpose-built to mitigate the issues with allow_automatic_l3agent_failover.

Comment 4 Joe Donohue 2016-02-11 19:54:19 UTC
making this bugzilla public with partner's permission

Comment 5 Assaf Muller 2016-06-04 16:09:50 UTC
Waiting for needinfo from 2015, closing for now. Please re-open if relevant.

