1740608 – Problem with os-net-config restarting all 3 controllers at the same time during a scale out

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	os-net-config
Sub Component:
Version:	10.0 (Newton)
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	RHOS Maint
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-08-13 11:14 UTC by Elf Lewis
Modified:	2023-03-24 15:18 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-11-11 17:43:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-23679	0	None	None	None	2023-03-24 15:18:00 UTC

Description Elf Lewis 2019-08-13 11:14:04 UTC

Description of problem:

During a recent scale out we experienced a full customer dataplane outage.

An engineer had previously added a route by hand to the persistent route file for br-ex on all three controllers. os-net-config detected that the file no longer matched its expected contents and rebuilt all interfaces on all three controllers. On its own this would not have been much of an issue, but the rebuild took place on all three controllers at the same time.

Once the interfaces had been rebuilt and PCS had settled the cluster, the control plane returned to service. Customer data traffic, however, still could not flow until a rolling restart of the neutron services had been done on all three controllers, so Director/os-net-config was not able to recover gracefully from this problem.

We believe this problem would also occur if you add new routes using Director. We do not think it is acceptable for a route addition, whether made manually or via Director, to take the entire control plane down.

How reproducible:

Always.

Steps to reproduce:
1. Build out an OSP10 cloud.
2. Add a new route manually.
3. Perform a scale out via Director.

Actual Results:

At some point during the scale out, all 3 controllers restart networking at the same time, taking the control plane down. You may also need to manually restart neutron on all 3 controllers.

Expected Results:

Changes should be applied serially. If a change requires a neutron restart, the other 2 controllers should wait until the first controller is back in service before the change is applied to them.
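The serial behaviour requested above could look roughly like the following Python sketch. It is an illustration only, not os-net-config or Director code: `apply_serially`, `apply_change` and `is_healthy` are hypothetical names, and the health check is assumed to cover whatever "back in service" means for the deployment (for example, the cluster has settled and neutron is passing traffic again).

```python
import time
from typing import Callable, Iterable, List


def apply_serially(
    nodes: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Apply a change to one node at a time (hypothetical helper).

    The next node is only touched once the previous one reports healthy.
    If a node does not recover within timeout_s, the rollout stops so the
    remaining controllers stay untouched and in service.
    """
    completed: List[str] = []
    for node in nodes:
        apply_change(node)
        deadline = clock() + timeout_s
        while not is_healthy(node):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{node} did not recover within {timeout_s}s; "
                    "stopping before touching the remaining nodes"
                )
            sleep(poll_s)
        completed.append(node)
    return completed
```

With three controllers, a failure on the first one leaves the other two unchanged, so at worst one controller is out of service rather than the whole control plane.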

Additional Info:

Looking at the upstream code for os-net-config, I can see that huge improvements have been made to the way it handles changes. I would ask Red Hat to consider backporting these changes, or fixing os-net-config in some other way, so that this issue does not recur.

I would ask that os-net-config in OSP10 be changed to behave in a more controlled way, without taking all 3 controllers down at the same time. Failing that, please provide us with a way of limiting the impact of os-net-config, or have a deploy fail if it detects changes that could cause os-net-config to behave in this way.

A simple change to a minor file caused us to experience 40 minutes of downtime.
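As a rough idea of the "fail the deploy" option mentioned above, a pre-flight check could compare the route file on each controller with the contents the deployment expects and refuse to continue on any difference. The sketch below is an assumption-laden illustration, not an existing Director or os-net-config feature: `route_drift` and `preflight_check` are hypothetical names, the route file is assumed to use one `ip route` style entry per line (as in /etc/sysconfig/network-scripts/route-br-ex on RHEL 7), and how the expected and on-disk text are obtained is left out.

```python
from typing import List, Set, Tuple


def _normalise(text: str) -> Set[str]:
    """Return the set of route lines, ignoring blanks, comments and spacing."""
    lines = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            lines.add(" ".join(stripped.split()))
    return lines


def route_drift(expected: str, on_disk: str) -> Tuple[List[str], List[str]]:
    """Compare two route files (hypothetical helper).

    Returns (unexpected, missing): routes present on disk but not expected,
    and routes expected but not present on disk.
    """
    exp, act = _normalise(expected), _normalise(on_disk)
    return sorted(act - exp), sorted(exp - act)


def preflight_check(expected: str, on_disk: str) -> None:
    """Raise before the deploy starts if the route file has drifted."""
    unexpected, missing = route_drift(expected, on_disk)
    if unexpected or missing:
        raise RuntimeError(
            "route file drift detected; "
            f"unexpected routes: {unexpected}, missing routes: {missing}"
        )
```

Run against all three controllers before the scale out, a check like this would have flagged the manually added route while the control plane was still up.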

Comment 1 Bob Fournier 2019-08-13 12:40:14 UTC

It appears that this is a backport request for https://bugzilla.redhat.com/show_bug.cgi?id=1650298, which will not restart interfaces when routes change.
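To illustrate the distinction that behaviour relies on, the Python sketch below separates interfaces whose configuration changed (restart needed) from interfaces where only the routes changed (no restart needed). This is a simplified model and not the actual os-net-config implementation: `plan_changes` and the `ifcfg`/`routes` dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical model: interface name -> {"ifcfg": str, "routes": list of str}
InterfaceState = Dict[str, Dict[str, object]]


def plan_changes(old: InterfaceState, new: InterfaceState) -> Tuple[List[str], List[str]]:
    """Decide how each interface in `new` has to be handled.

    Returns (restart, routes_only):
      restart      - interface config changed or interface is new
      routes_only  - config is identical, only the route list differs
    Interfaces with no change at all appear in neither list.
    """
    restart: List[str] = []
    routes_only: List[str] = []
    for name in sorted(new):
        before = old.get(name, {})
        after = new[name]
        if before.get("ifcfg") != after.get("ifcfg"):
            restart.append(name)
        elif before.get("routes") != after.get("routes"):
            routes_only.append(name)
    return restart, routes_only
```

Under this model, the manually added br-ex route from the description would land in `routes_only`, so no interface would need to be rebuilt.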

Comment 2 Bob Fournier 2019-11-11 17:43:16 UTC

This can't be backported to OSP-10. In OSP-10, os-collect-config is used to call os-net-config, while in OSP-13 this is done by a heat hook. The heat hook is necessary to implement the restart interface functionality. The version of heat in OSP-16 does not support this and it's not possible to backport this heat functionality to OSP-10.
