Description of problem:
After a staged reboot of the Neutron controllers, all active agents (DHCP and L3) tend to consolidate on the longest-running controller, and there is no mechanism to rebalance them automatically. Only new agents will tend to be placed on the two idle controllers, so unless users create and delete networks and routers, all agents may stay on a single controller for a very long time. This leaves the controllers handling uneven load and can potentially lead to more failures than in a well-balanced state.

Version-Release number of selected component (if applicable):
Identified on RHOS 11, but likely to happen on all versions.

How reproducible:
Always on a loaded environment. Not sure on an idle testing environment.

Steps to Reproduce:
On a three-controller cloud:
- Create some DHCP-enabled networks
- Create some routers
- Reboot controller-2
- Reboot controller-1
- Reboot controller-0

Most of the L3/DHCP agents will tend to be active on controller-2.

Note that with a low number of networks and routers and no load, the relocation might be done properly. When the controllers in an environment are misbehaving and you go for a staged reboot as a recovery mechanism, the controllers will not be responsive until they are rebooted, so the only one taking ownership of the routers and networks going down is the controller that has already been rebooted.

Actual results:
90%+ of the agents are active on the controller with the highest uptime after the reboots.

Expected results:
Implement a balancing mechanism that allows the active agents to be rebalanced among the three controllers.

Additional info:
A config flag to enable or disable the rebalancing mechanism would be useful in case some operators prefer the current behavior.
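To put a number on the consolidation described above, here is a minimal Python sketch. It assumes the agent bindings have already been collected from Neutron into plain tuples; the tuple layout, the function name `active_share` and the sample data are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def active_share(bindings, controllers):
    """Return the fraction of active agents hosted on each controller.

    bindings: iterable of (resource_id, host, state) tuples
              (hypothetical format, not a Neutron data structure).
    controllers: all controller hostnames, so idle ones are reported as 0.0.
    """
    counts = Counter(host for _, host, state in bindings if state == "active")
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in controllers}
    return {c: counts.get(c, 0) / total for c in controllers}


# Made-up sample: three of four active routers sit on controller-2.
sample = [
    ("router-a", "controller-2", "active"),
    ("router-a", "controller-0", "standby"),
    ("router-b", "controller-2", "active"),
    ("router-c", "controller-2", "active"),
    ("router-d", "controller-1", "active"),
]
shares = active_share(sample, ["controller-0", "controller-1", "controller-2"])
print(shares)
```

A share close to 1.0 for a single controller would match the "Actual results" reported here.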
I'll jot down some initial thoughts: The DHCP case isn't as urgent, as the majority (84% according to UP) use A/A/A, so rebalancing is not relevant there. Rebalancing L3 HA routers incurs data plane downtime, therefore I don't think automatic rebalancing is desirable. Since downtime is involved, I think it's something operators would like to control. With that in mind, I think we could provide a script that would use the API to rebalance routers (and DHCP if needed). If this boils down to a CLI-driven script shipped in a Neutron or TripleO RPM that operators would invoke manually on demand, would that solve the issue as you see it?
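As a rough illustration of what the planning part of such a script could look like, here is a minimal Python sketch. It is not an existing Neutron or TripleO tool: the function name `plan_moves`, the input mapping and the greedy strategy are all assumptions. It only computes which routers to move and does not call any API, so the operator stays in control of when the data plane downtime happens.

```python
from collections import defaultdict
from math import ceil


def plan_moves(active_host, controllers):
    """Plan router moves so no controller keeps more than ceil(n / k) routers.

    active_host: dict of router id -> controller hosting the active instance
                 (hypothetical input, gathered beforehand by the operator).
    controllers: list of controller hostnames to balance across.
    Returns a list of (router_id, source, target) tuples.
    """
    by_host = defaultdict(list)
    for router, host in sorted(active_host.items()):
        by_host[host].append(router)

    limit = ceil(len(active_host) / len(controllers))
    moves = []
    for src in controllers:
        # Drain overloaded controllers one router at a time,
        # always towards the currently least loaded controller.
        while len(by_host[src]) > limit:
            dst = min(controllers, key=lambda c: len(by_host[c]))
            router = by_host[src].pop()
            by_host[dst].append(router)
            moves.append((router, src, dst))
    return moves


# Made-up sample: six routers, all active on controller-2 after the reboots.
current = {f"router-{i}": "controller-2" for i in range(6)}
moves = plan_moves(current, ["controller-0", "controller-1", "controller-2"])
for router, src, dst in moves:
    print(f"move {router}: {src} -> {dst}")
```

The plan moves only the routers above the per-controller limit, which keeps the number of failovers, and therefore the downtime, as small as possible for an even split.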
Assaf, some tooling like the script you mentioned would work for me; I'm not sure whether this has to be sorted out by a PM. I filed the RFE based on the internal discussion about whether we felt that the current behavior was the best way to manage HA setups, or at least to raise awareness that operators will tend to do staged reboots and, as a result, consolidate most of the agents on one of the controllers.
Approving for OSP 14, under the understanding that we're treating this as a low priority RFE.
According to the email discussion, the Networking DFG will be responsible for running regression testing, while David Manchado will perform functional testing.
Pushing this out of RHOSP 14, given other priorities.