Description of problem:
During a recent scale out we experienced a full customer dataplane outage.
An engineer had previously added a route by hand to the br-ex persistent route file on all three controllers. os-net-config detected that the file did not match the configuration it expected, and rebuilt all interfaces on all 3 controllers. That alone would not have been much of an issue, but the rebuild took place on all three controllers at the same time.
Once the interfaces had been rebuilt and PCS had settled the cluster, the control plane returned to service. Customer data traffic, however, still could not flow until a rolling restart of the neutron services had been done on all three controllers. Director/os-net-config was therefore not able to recover gracefully from this problem.
We believe this problem would also occur if new routes were added using Director. We do not think that taking the entire control plane down because of a route addition, whether made manually or via Director, is acceptable.
Steps to reproduce:
1. Build out an OSP10 cloud.
2. Add a new route manually.
3. Perform a scale out via Director.
At some point during the scale out, all 3 controllers will restart networking at the same time, taking the control plane down. You may also need to manually restart neutron on all 3 controllers.
Changes should be applied serially. If a change requires a neutron restart, the other 2 controllers should wait until the first controller is back in service before proceeding with their own changes.
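To illustrate the serial behaviour requested above, here is a minimal sketch. It is not how Director or os-net-config works; `apply_serially`, `apply_change` and `is_healthy` are hypothetical names, and the health check stands in for whatever "controller is back" would mean in practice (for example, the node rejoined the cluster and neutron is passing traffic).

```python
import time
from typing import Callable, Iterable, List


def apply_serially(
    nodes: Iterable[str],
    apply_change: Callable[[str], None],   # hypothetical: applies the network change on one node
    is_healthy: Callable[[str], bool],     # hypothetical: True once the node is back in service
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply a change to one node at a time, waiting for recovery in between.

    Stops at the first node that does not recover within timeout_s, so the
    remaining nodes are never touched and keep serving traffic.
    """
    done: List[str] = []
    for node in nodes:
        apply_change(node)
        waited = 0.0
        # Block here until this node is healthy again before moving on.
        while not is_healthy(node):
            if waited >= timeout_s:
                raise RuntimeError(
                    f"{node} did not recover after {timeout_s}s; "
                    f"already updated: {done}; remaining nodes left untouched"
                )
            sleep(poll_s)
            waited += poll_s
        done.append(node)
    return done
```

With three controllers, a failure on the first one would leave the other two unchanged, so at worst one controller is out of service at any time.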
Looking at the upstream code for os-net-config, I can see huge improvements have been made to the way os-net-config handles changes. I would ask that Red Hat consider backporting these changes, or fixing os-net-config in some other way, so that this issue does not recur.
I would ask that os-net-config in OSP10 be changed to behave in a more controlled way, without taking all 3 controllers down at the same time, or that we be given a way of limiting the impact of os-net-config. Alternatively, a deploy could fail if it detects changes that would cause os-net-config to behave in this way.
A simple change to a minor file caused us to experience 40 minutes of downtime.
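The "fail the deploy" option could look roughly like the pre-flight check below. This is only a sketch under assumptions: `route_drift` and `check_before_deploy` are hypothetical helpers, not part of os-net-config or Director, and the comparison assumes one route per line, as in an `ip route`-style route file.

```python
from typing import Dict, List, Set


def _normalise(text: str) -> Set[str]:
    """Return the set of route lines, ignoring blanks, comments and spacing."""
    return {
        " ".join(line.split())
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def route_drift(expected: str, actual: str) -> Dict[str, List[str]]:
    """Compare the route file content the deployment expects with what is on disk."""
    exp, act = _normalise(expected), _normalise(actual)
    return {
        "unexpected": sorted(act - exp),  # e.g. routes added by hand
        "missing": sorted(exp - act),     # routes the deployment would add
    }


def check_before_deploy(expected: str, actual: str) -> None:
    """Abort before any interface is touched if the route file has drifted."""
    drift = route_drift(expected, actual)
    if drift["unexpected"] or drift["missing"]:
        raise RuntimeError(f"route file drift detected, refusing to continue: {drift}")
```

Run against every controller before the scale out starts, a check like this would have reported the manually added route instead of triggering a simultaneous interface rebuild.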
It appears that this is a backport request for https://bugzilla.redhat.com/show_bug.cgi?id=1650298, which will not restart interfaces when routes change.
This can't be backported to OSP-10. In OSP-10, os-collect-config is used to call os-net-config, while in OSP-13 this is done by a heat hook. The heat hook is necessary to implement the restart interface functionality. The version of heat in OSP-10 does not support this, and it's not possible to backport this heat functionality to OSP-10.