Created attachment 1093472 [details]
VM Connectivity tests with l3_ha=false
Description of problem:
In OSP7 with l3_ha=false, neutron ports take much longer to fail over to the surviving controller nodes. This results in loss of connectivity to VMs.
Version-Release number of selected component (if applicable):
OSP7 + y1
Including the Cisco n1kv plugin requires l3_ha to be set to true. It takes anywhere from 2 to 3 minutes to re-establish connectivity to a VM if a controller node is rebooted.
Also, once the node comes back, the routers are unevenly distributed across the controllers. It isn't clear whether the ports should fail back or remain where they are.
Steps to Reproduce:
1. Preferably configure the n1kv plugin, set l3_ha=false and restart neutron.
2. Create around 50-100 VMs. While an SSH/ping test is running against the VMs, reboot a controller node. Check how long connectivity is lost.
3. Let the rebooted controller node come back and check the namespaces for qrouters.
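One way to put a number on step 2 is to log each probe with a timestamp and then look for the longest gap between the first failed probe and the next successful one. The sketch below is only an illustration: the function name `longest_outage` and the sample format (a time-sorted list of `(timestamp_seconds, ok)` pairs) are assumptions, not part of any test tool used in this report.

```python
def longest_outage(samples):
    """Return the longest outage, in seconds, seen in a probe log.

    samples: list of (timestamp_seconds, ok) pairs sorted by time
    (hypothetical format). An outage runs from the first failed probe
    to the next successful one. An outage still open at the end of the
    log is ignored because its length is unknown.
    """
    longest = 0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            # First failed probe of a new outage.
            outage_start = ts
        elif ok and outage_start is not None:
            # Connectivity is back; close the outage.
            longest = max(longest, ts - outage_start)
            outage_start = None
    return longest


# Made-up probe log: failures at t=2..4, recovery at t=5.
example = [(0, True), (1, True), (2, False), (3, False),
           (4, False), (5, True), (6, True)]
outage = longest_outage(example)
```

Running this per VM and comparing the results across the 50-100 VMs would show whether the 2-3 minute loss is uniform or depends on which controller hosted the router.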
Expected to fail back in 15-30 seconds as in OSP6.
Correction: the Cisco n1kv plugin requires l3_ha to be set to FALSE.
(In reply to Rama from comment #0)
> Created attachment 1093472 [details]
> VM Connectivity tests with l3_ha=false
> Description of problem:
> In OSP7 setting l3_ha=false, neutron ports take much longer time to failover
> to surviving controller nodes. This results in connectivity losses to VM.
... snip ...
> Expected results:
> Expected to fail back in 15-30 seconds as in OSP6.
Are you comparing OSP 6 with neutron.conf:l3_ha = True to OSP 7 with neutron.conf:l3_ha = False and neutron.conf:allow_automatic_l3agent_failover = True? The first provides failover in the range of 10 seconds because the routers are scheduled pre-emptively. The second has to first detect that an agent is down, reschedule its routers, notify the intended agent, then have the agent configure each router.
Here's how long I'd estimate a failover to take with neutron.conf:allow_automatic_l3agent_failover = True:
If neutron.conf:agent_down_time is set to the default of 75 seconds, then:
1) The rescheduling loop runs every 75 / 2 seconds, or roughly 38 seconds.
2) The rescheduling loop considers an agent dead if it hasn't reported in 75 * 2, or 150 seconds.
In the worst case that would be 38 + 150 seconds = 188 seconds, or just over 3 minutes to start rescheduling routers on the dead host.
3) It takes roughly 8 seconds to configure a router on a VM of mine. Let's assume 1 second on faster bare metal. The time to reconfigure the routers of a dead agent scales linearly with the number of routers scheduled to that host. At 1 second each, 500 routers would take another 8 minutes or so. I've seen reports of this taking over an hour in real life.
L3 HA was purpose-built to mitigate the issues with allow_automatic_l3agent_failover.
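The arithmetic in the estimate above can be written out as a short sketch. The function names are hypothetical, and rounding the loop interval up to whole seconds is an assumption made to match the 38-second figure; this does not reproduce neutron's actual scheduling code.

```python
import math


def worst_case_detection_seconds(agent_down_time=75):
    """Worst-case delay before rescheduling starts for a dead agent.

    Follows the estimate in this comment: the loop runs every
    agent_down_time / 2 seconds and treats an agent as dead after
    agent_down_time * 2 seconds without a report.
    """
    # 75 / 2 = 37.5; rounded up to 38 (rounding is an assumption).
    loop_interval = math.ceil(agent_down_time / 2)
    dead_threshold = agent_down_time * 2
    return loop_interval + dead_threshold


def reconfigure_seconds(router_count, seconds_per_router=1.0):
    """Linear estimate of the time to configure routers on a new agent."""
    return router_count * seconds_per_router


detection = worst_case_detection_seconds(75)   # 38 + 150
bare_metal = reconfigure_seconds(500, 1.0)     # 500 routers at 1 s each
slow_vm = reconfigure_seconds(500, 8.0)        # 500 routers at 8 s each
```

With the default agent_down_time this gives 188 seconds before rescheduling starts, plus 500 seconds (a little over 8 minutes) for 500 routers at 1 second each. At the 8 seconds per router measured on the VM, the same 500 routers would need 4000 seconds, which is consistent with the reports of failovers taking over an hour.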
Making this Bugzilla public with the partner's permission.
Waiting for needinfo from 2015, closing for now. Please re-open if relevant.