Bug 1281603 - OSP7 with l3_ha=false takes longer to fail over neutron ports when one Controller is rebooted [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assigned To: lpeer
QA Contact: Ofer Blaut
Keywords: ZStream
Depends On:
Blocks: 1191185 1198809 1243520
 
Reported: 2015-11-12 16:44 EST by Rama
Modified: 2016-06-04 12:09 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-04 12:09:50 EDT
Type: Bug
Regression: ---
Flags: amuller: needinfo? (rnishtal)


Attachments
VM Connectivity tests with l3_ha=false (11.56 KB, text/plain)
2015-11-12 16:44 EST, Rama
no flags

Description Rama 2015-11-12 16:44:44 EST
Created attachment 1093472 [details]
VM Connectivity tests with l3_ha=false

Description of problem:
In OSP7, with l3_ha=false, neutron ports take much longer to fail over to the surviving controller nodes. This results in loss of connectivity to the VMs.


Version-Release number of selected component (if applicable):
OSP7 + y1

How reproducible:
Always.
The setup includes the Cisco n1kv plugin, which requires l3_ha to be set to true. It takes anywhere between 2 and 3 minutes to re-establish connectivity to a VM when a controller node is rebooted.

Also, once the node comes back, the routers are unevenly distributed across the controllers. It isn't clear whether the ports are expected to fail back or should remain where they are.

Steps to Reproduce:
1. Configure the n1kv plugin (preferably) and set l3_ha=false, then restart neutron (see the config-check sketch after these steps).
2. Create around 50-100 VMs. While an SSH/ping test is running against the VMs, reboot a controller node. Check how long connectivity is lost.
3. Let the rebooted controller node come back and check the namespaces for qrouters.
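
Not part of the original report: a minimal sketch for checking the relevant neutron.conf options before running the test, assuming the default /etc/neutron/neutron.conf path (l3_ha, allow_automatic_l3agent_failover and agent_down_time are the options discussed later in this bug):

# Hedged helper, not from the original report: print the settings relevant to
# this test from neutron.conf, assuming the default config path.
import configparser

CONF = "/etc/neutron/neutron.conf"
OPTIONS = ("l3_ha", "allow_automatic_l3agent_failover", "agent_down_time")

# Interpolation is disabled because neutron.conf values may contain '%'.
cfg = configparser.ConfigParser(interpolation=None)
cfg.read(CONF)
for opt in OPTIONS:
    # Options not set explicitly fall back to Neutron's built-in defaults.
    print(opt, "=", cfg.get("DEFAULT", opt, fallback="<not set, using default>"))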

Actual results:


Expected results:
Expected to fail back in 15-30 seconds as in OSP6.

Additional info:
Comment 2 Shiva Prasad Rao 2015-11-12 17:05:33 EST
* Correction: the Cisco n1kv plugin requires l3_ha to be set to FALSE.
Comment 3 Assaf Muller 2015-12-09 16:56:31 EST
(In reply to Rama from comment #0)
> Created attachment 1093472 [details]
> VM Connectivity tests with l3_ha=false
> 
> Description of problem:
> In OSP7 setting l3_ha=false, neutron ports take much longer time to failover
> to surviving controller nodes. This results in connectivity losses to VM.

... snip ...

> Expected results:
> Expected to fail back in 15-30 seconds as in OSP6.

Are you comparing OSP 6 with neutron.conf:l3_ha = True to OSP 7 with neutron.conf:l3_ha = False and neutron.conf:allow_automatic_l3agent_failover = True? The first provides failover in the range of 10 seconds because the routers are scheduled pre-emptively. The second has to first detect that an agent is down, reschedule its routers, notify the intended agent, and then have that agent configure the routers.

Here's how long I'd estimate a failover to take with neutron.conf:allow_automatic_l3agent_failover = True (a quick sketch of this arithmetic follows the list below):

If neutron.conf:agent_down_time is set to the default of 75 seconds, then:
1) The rescheduling loop runs every 75 / 2 seconds, or about 38 seconds.
2) The rescheduling loop considers an agent as dead if it hasn't reported in 75 * 2, or 150 seconds.

In the worst case that would be 38 + 150 seconds = 188 seconds, or just over 3 minutes to start rescheduling routers on the dead host.

3) It takes roughly 8 seconds to configure a router on a VM of mine. Let's assume 1 second on faster baremetal. The time to recover the routers from a dead agent scales linearly with the number of routers scheduled to that host, so for 500 routers this can take another 8 minutes. I've seen reports of this taking over an hour in real life.
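
A quick sketch of this arithmetic (not part of the original comment; the 75 / 2 loop interval, the 75 * 2 liveness window and the 1 second per router figure are the assumptions stated above, not measurements):

# Back-of-the-envelope worst-case failover estimate for
# allow_automatic_l3agent_failover = True, using the numbers from this comment.

def worst_case_failover(agent_down_time=75, routers_on_dead_host=500,
                        seconds_per_router=1.0):
    # Rescheduling loop runs every agent_down_time / 2 seconds (~38 s by default).
    loop_interval = agent_down_time / 2
    # An agent is only considered dead after 2 * agent_down_time without a report.
    dead_after = 2 * agent_down_time
    # Worst case before rescheduling even starts: ~188 s with the 75 s default.
    start = loop_interval + dead_after
    # Configuring the rescheduled routers scales linearly with their number.
    total = start + routers_on_dead_host * seconds_per_router
    return start, total

start, total = worst_case_failover()
print("rescheduling starts after ~%.0f s; last router recovered after ~%.0f min"
      % (start, total / 60))
# Defaults reproduce the estimate above: ~188 s before rescheduling starts and
# roughly 11 minutes in total for 500 routers at 1 s each.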

L3 HA was purpose-built to mitigate the issues with allow_automatic_l3agent_failover.
Comment 4 Joe Donohue 2016-02-11 14:54:19 EST
Making this bugzilla public with the partner's permission.
Comment 5 Assaf Muller 2016-06-04 12:09:50 EDT
Waiting for needinfo from 2015, closing for now. Please re-open if relevant.
