Bug 1550923 - [RFE] Implement a rebalancing mechanism for dhcp/l3 agents to spread the load among all controllers
Summary: [RFE] Implement a rebalancing mechanism for dhcp/l3 agents to spread the load...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: Upstream M2
Target Release: ---
Assignee: OSP Team
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-02 09:53 UTC by David Manchado
Modified: 2021-07-29 04:39 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:



Description David Manchado 2018-03-02 09:53:00 UTC
Description of problem:
After a staged reboot of the Neutron controllers, all active agents (DHCP and L3) tend to consolidate on the longest-running controller, and there is no mechanism to rebalance them automatically.
Only newly scheduled agents tend to land on the two idle controllers, so unless users create and delete networks and routers, all agents may sit on a single controller for a very long time.
As a result the controllers carry uneven load, potentially leading to more failures than in a well-balanced state.

Version-Release number of selected component (if applicable):
Identified on RHOS 11, but likely present in all releases.

How reproducible:
On a loaded environment, always.
On an idle testing environment, not sure.


Steps to Reproduce:
On a three-controller cloud:
- Create some networks with DHCP enabled
- Create some routers
- Reboot controller-2
- Reboot controller-1
- Reboot controller-0

Most of the L3/DHCP agents will end up active on controller-2, the longest-running controller after the staged reboot.

Note that with a small number of networks and routers and no load, the relocation might happen properly. But when the controllers are misbehaving and you do a staged reboot as a recovery mechanism, each controller is unresponsive until it has been rebooted, so the only responsive one ends up taking ownership of the routers and networks as they go down.
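One rough way to observe the skew after the staged reboot (a sketch; the router and network names are placeholders, and the legacy neutron CLI commands shown were still current in the RHOS 11 era):

```
# List all L3 agents and their hosts
openstack network agent list --agent-type l3

# Show which agents host a given router / network
neutron l3-agent-list-hosting-router router-1
neutron dhcp-agent-list-hosting-net net-1
```

If nearly every router and network reports the same host, the consolidation described above has happened.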


Actual results:
90%+ of the agents are active on the controller with the highest uptime after the reboot.

Expected results:
A rebalancing mechanism that spreads the active agents across the three controllers.

Additional info:
Perhaps a config flag to enable or disable the rebalancing mechanism would be useful, in case some operators prefer the current behavior.

Comment 1 Assaf Muller 2018-03-02 16:00:56 UTC
I'll jot down some initial thoughts:

The DHCP case isn't as urgent, as the majority (84% according to UP) use A/A/A, so rebalancing is not relevant there.

Rebalancing L3 HA routers incurs data plane downtime. Therefore I don't think automatic rebalancing is desirable; downtime is something operators would want to control. With that in mind, we could provide a script that uses the API to rebalance routers (and DHCP if needed). If this boils down to a CLI-driven script, shipped in a Neutron or TripleO RPM, that operators invoke manually on demand, would that solve the issue as you see it?
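A minimal sketch of the planning half of such a script, under the assumption that the current router-to-agent mapping has already been fetched from the API. `plan_moves` is a hypothetical helper, not part of Neutron; actually applying the moves would still go through the agent-scheduler API (removing each router from its current L3 agent and adding it to the target one).

```python
from collections import defaultdict

def plan_moves(hosting, agents):
    """Compute (router_id, src_agent, dst_agent) moves that spread
    routers evenly across *agents*.

    hosting: dict mapping router_id -> agent_id currently hosting it.
    agents:  list of all alive agent_ids (may host zero routers).
    """
    # Group routers by their current agent; include idle agents.
    load = defaultdict(list)
    for router, agent in hosting.items():
        load[agent].append(router)
    for agent in agents:
        load.setdefault(agent, [])

    # Each agent should end up with base routers, plus one extra for
    # the busiest agents while the remainder lasts (minimizes moves).
    base, extra = divmod(len(hosting), len(agents))
    ordered = sorted(load, key=lambda a: len(load[a]), reverse=True)
    targets = {a: base + (1 if i < extra else 0)
               for i, a in enumerate(ordered)}

    # Routers beyond an agent's target form the pool to hand out.
    surplus = [(a, load[a][targets[a]:]) for a in ordered
               if len(load[a]) > targets[a]]
    pool = [r for _, routers in surplus for r in routers]
    src = {r: a for a, routers in surplus for r in routers}

    moves = []
    for agent in (a for a in ordered if len(load[a]) < targets[a]):
        for _ in range(targets[agent] - len(load[agent])):
            router = pool.pop()
            moves.append((router, src[router], agent))
    return moves
```

With six routers all on controller-2 and three controllers, this plans four moves, leaving two routers per controller. The same planning would work for DHCP networks by feeding it a network-to-agent mapping instead.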

Comment 2 David Manchado 2018-03-06 10:25:32 UTC
Assaf,

Some tooling like the script you mentioned would work for me; I'm not sure whether this needs to be sorted out by a PM.

I filed the RFE based on an internal discussion about whether the current behavior is the best way to manage HA setups,
or at least to raise awareness that operators tend to do staged reboots and, as a result, consolidate most of the agents on one of the controllers.

Comment 3 Assaf Muller 2018-03-08 15:27:38 UTC
Approving for OSP 14, under the understanding that we're treating this as a low priority RFE.

Comment 4 Toni Freger 2018-04-16 10:51:31 UTC
According to the email discussion, the Networking DFG will be responsible for running regression testing, while David Manchado will perform functional testing.

Comment 7 Nir Yechiel 2018-05-17 13:01:52 UTC
Pushing this out of RHOSP 14, given other priorities.

