1281603 – OSP7 with l3_ha=false takes longer time to fail neutron ports when one Controller is rebooted

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	7.0 (Kilo)
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	unspecified
Target Milestone:	---
Target Release:	7.0 (Kilo)
Assignee:	lpeer
QA Contact:	Ofer Blaut
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1191185 1198809 1243520

Reported:	2015-11-12 21:44 UTC by Rama
Modified:	2023-09-14 03:12 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-06-04 16:09:50 UTC
Target Upstream Version:
Embargoed:

Attachments
VM Connectivity tests with l3_ha=false (11.56 KB, text/plain) 2015-11-12 21:44 UTC, Rama

Description Rama 2015-11-12 21:44:44 UTC

Created attachment 1093472 [details]
VM Connectivity tests with l3_ha=false

Description of problem:
In OSP7 with l3_ha=false, neutron ports take much longer to fail over to the surviving controller nodes. This results in loss of connectivity to VMs.


Version-Release number of selected component (if applicable):
OSP7 + y1

How reproducible:
Always.
Including the Cisco n1kv plugin requires l3_ha to be set to true. It takes 2-3 minutes to re-establish connectivity to a VM when a controller node is rebooted.

Also, once the node comes back, the routers are unevenly distributed across the controllers. It isn't clear whether the ports should fail back or remain where they are.

Steps to Reproduce:
1. Preferably configure the n1kv plugin, set l3_ha=false, and restart neutron.
2. Create around 50-100 VMs. While an SSH/ping test is running against the VMs, reboot a controller node. Check how long connectivity is lost.
3. Let the rebooted controller node come back and check the qrouter namespaces.
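
One way the connectivity loss in step 2 could be quantified is sketched below. This is only an illustration: the function name `longest_outage_seconds` and the sample format (a time-ordered list of `(timestamp_seconds, reachable)` pairs, for example one entry per ping) are assumptions and are not taken from the attached test.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest gap between the first failed probe and the next
    successful one. `samples` is assumed to be in time order.
    An outage that is still ongoing at the last sample is not counted."""
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp          # first failed probe of an outage
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # connectivity restored
    return longest
```

Running this per VM and comparing the results between releases would give a like-for-like figure for the failover time.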

Actual results:


Expected results:
Expected to fail back in 15-30 seconds as in OSP6.

Additional info:

Comment 2 Shiva Prasad Rao 2015-11-12 22:05:33 UTC

*Cisco n1kv plugin requires l3_ha set to be FALSE.

Comment 3 Assaf Muller 2015-12-09 21:56:31 UTC

(In reply to Rama from comment #0)
> Created attachment 1093472 [details]
> VM Connectivity tests with l3_ha=false
> 
> Description of problem:
> In OSP7 setting l3_ha=false, neutron ports take much longer time to failover
> to surviving controller nodes. This results in connectivity losses to VM.

... snip ...

> Expected results:
> Expected to fail back in 15-30 seconds as in OSP6.

Are you comparing OSP 6 with neutron.conf:l3_ha = True to OSP 7 with neutron.conf:l3_ha = False and neutron.conf:allow_automatic_l3agent_failover = True? The first provides failover in the range of 10 seconds because the routers are scheduled pre-emptively. The second has to first detect that an agent is down, reschedule its routers, notify the intended agent, then have that agent configure the routers.

Here's how long I'd estimate a failover to take with neutron.conf:allow_automatic_l3agent_failover = True:

If neutron.conf:agent_down_time is set to the default of 75 seconds, then:
1) The rescheduling loop runs every 75 / 2 seconds, or about 38 seconds.
2) The rescheduling loop considers an agent dead if it hasn't reported in 75 * 2, or 150 seconds.

In the worst case that would be 38 + 150 seconds = 188 seconds, or just over 3 minutes to start rescheduling routers on the dead host.

3) It takes roughly 8 seconds to configure a router on a VM of mine. Let's assume 1 second on faster bare metal. The total configuration time scales linearly with the number of routers that were scheduled to the dead host. For 500 routers at 1 second each, this can take another 8 minutes. I've seen reports of this taking over an hour in real life.
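
The arithmetic in the estimate above can be sketched as follows. This is a rough calculator, not Neutron code: the function names are made up for illustration, the half-interval is rounded up only so that it matches the 38-second figure quoted above, and the 1 second per router is the assumption from step 3.

```python
import math


def worst_case_detection_seconds(agent_down_time: int = 75) -> int:
    """Worst-case delay before rescheduling starts, following steps 1 and 2."""
    loop_interval = math.ceil(agent_down_time / 2)  # 75 / 2, rounded up to 38
    dead_threshold = agent_down_time * 2            # 75 * 2 = 150
    return loop_interval + dead_threshold


def worst_case_failover_seconds(routers: int,
                                seconds_per_router: float = 1.0,
                                agent_down_time: int = 75) -> float:
    """Detection delay plus linear per-router configuration time (step 3)."""
    detection = worst_case_detection_seconds(agent_down_time)
    return detection + routers * seconds_per_router


if __name__ == "__main__":
    print(worst_case_detection_seconds())    # 188 seconds with the default
    print(worst_case_failover_seconds(500))  # 688.0 seconds for 500 routers
```

With the default agent_down_time, the detection delay alone already covers the 2-3 minutes of lost connectivity reported in the description.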

L3 HA was purpose-built to mitigate the issues with allow_automatic_l3agent_failover.

Comment 4 Joe Donohue 2016-02-11 19:54:19 UTC

Making this Bugzilla public with the partner's permission.

Comment 5 Assaf Muller 2016-06-04 16:09:50 UTC

Waiting for needinfo from 2015, closing for now. Please re-open if relevant.

Comment 6 Red Hat Bugzilla 2023-09-14 03:12:56 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
