Bug 828591 - PRD35 - [RFE] ability to "rebalance" cluster load with a single button
Summary: PRD35 - [RFE] ability to "rebalance" cluster load with a single button
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.0.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Martin Sivák
QA Contact: Lukas Svaty
URL:
Whiteboard: sla
Depends On: plugable-scheduler 975630
Blocks: 1124080 rhev3.5beta 1156165
 
Reported: 2012-06-04 23:42 UTC by Bryan Yount
Modified: 2019-08-15 03:34 UTC
CC List: 14 users

Fixed In Version: vt2.2
Doc Type: Enhancement
Doc Text:
Administrators can now identify the optimal balance of virtual machines within a cluster. In addition, administrators can determine how to place new virtual machine workloads into a cluster with enough total available resources, and avoid scenarios whereby no single host has enough resources for a new virtual machine.
Clone Of:
Environment:
Last Closed: 2015-02-11 17:50:05 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:
sgrinber: Triaged+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 1289123 0 None None None Never
Red Hat Knowledge Base (Solution) 135483 0 None None None 2012-07-30 19:03:45 UTC
Red Hat Product Errata RHSA-2015:0158 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 22:38:50 UTC

Comment 1 Doron Fediuck 2012-09-20 12:31:35 UTC
Hi Bryan,
I assume your cluster has a relevant policy set (even-distribution or power saving).
So what I'd like to understand is why not let it do the job and get what you
ask for?

If we add such a button, it'll have to execute the policy chosen for the cluster,
and it'll take time, since we only migrate one VM at a time and then re-evaluate
the situation to decide what should be done next.
So it looks like such a button would behave just like the current periodic process
which moves VMs based on the cluster policy.
Given the above description, do you think such a button will help?
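
For illustration, here is a minimal sketch of the "one VM at a time, then re-evaluate" loop such a button would effectively run. It is not engine code; the Host class, pick_migration() helper, the fixed 80% threshold, and the load bookkeeping are all invented for the example.

```python
# Minimal sketch of the "migrate one VM, then re-evaluate" loop described above.
# Illustrative only; all names and numbers are hypothetical, not RHEV/oVirt code.
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cpu_load: float                      # measured CPU utilisation, in percent
    vms: list = field(default_factory=list)


def pick_migration(hosts, threshold=80.0):
    """Return (vm, source, destination) for one migration, or None if nothing to do."""
    overloaded = [h for h in hosts if h.cpu_load > threshold and h.vms]
    # only accept destinations that stay under the threshold after one migration
    underused = [h for h in hosts if h.cpu_load + 10.0 <= threshold]
    if not overloaded or not underused:
        return None
    src = max(overloaded, key=lambda h: h.cpu_load)
    dst = min(underused, key=lambda h: h.cpu_load)
    return src.vms[0], src, dst


def rebalance_now(hosts):
    """What the button would effectively do: one migration, then re-evaluate."""
    while True:
        step = pick_migration(hosts)
        if step is None:
            break                        # the policy has nothing more to move
        vm, src, dst = step
        # The real engine would run a live migration and wait for it to finish
        # before re-evaluating; here we only simulate the bookkeeping.
        src.vms.remove(vm)
        dst.vms.append(vm)
        src.cpu_load -= 10.0             # crude stand-in for re-measured load
        dst.cpu_load += 10.0
```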

Comment 2 Bryan Yount 2012-10-15 20:37:41 UTC
(In reply to comment #1)
> Hi Bryan,
> I assume your cluster has a relevant policy set (even-distribution or power
> saving).
> So what I'd like to understand is why not let it do the job and get what you
> ask for?

Correct me if I am wrong, but doesn't the cluster policy only act over time and not immediately? The customer's example was: what if I take a host offline for maintenance and then I bring it back into the cluster, will the host automatically start receiving VM load from the other hosts?

My understanding was that it only starts receiving new VMs that are powered on after that point or VMs that you manually migrate from other hosts.

> If we add such a button, it'll have to execute the policy chosen for the
> cluster, and it'll take time, since we only migrate one VM at a time and then
> re-evaluate the situation to decide what should be done next.
> So it looks like such a button would behave just like the current periodic
> process which moves VMs based on the cluster policy.
> Given the above description, do you think such a button will help?

So, maybe my understanding of how the cluster policy works was incorrect initially (see above) but I will say, the customer wants a button that makes this happen more quickly (perhaps more than 1 VM migrating at a time) or the ability to force the policy to be reevaluated instead of waiting for it to happen naturally.

Comment 3 Itamar Heim 2012-10-15 23:27:01 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Hi Bryan,
> > I assume your cluster has a relevant policy set (even-distribution or power
> > saving).
> > So what I'd like to understand is why not let it do the job and get what you
> > ask for?
> 
> Correct me if I am wrong, but doesn't the cluster policy only act over time
> and not immediately? The customer's example was: what if I take a host
> offline for maintenance and then I bring it back into the cluster, will the
> host automatically start receiving VM load from the other hosts?
> 
> My understanding was that it only starts receiving new VMs that are powered
> on after that point or VMs that you manually migrate from other hosts.
> 
> > If we add such a button, it'll have to execute the policy chosen for the
> > cluster, and it'll take time, since we only migrate one VM at a time and
> > then re-evaluate the situation to decide what should be done next.
> > So it looks like such a button would behave just like the current periodic
> > process which moves VMs based on the cluster policy.
> > Given the above description, do you think such a button will help?
> 
> So, maybe my understanding of how the cluster policy works was incorrect
> initially (see above) but I will say, the customer wants a button that makes
> this happen more quickly (perhaps more than 1 VM migrating at a time) or the
> ability to force the policy to be reevaluated instead of waiting for it to
> happen naturally.

Agreed - today the policy only balances based on hosts being too loaded.
If the hosts are not 'balanced' but are not violating their 'SLA' (>X% CPU over Y minutes), load balancing won't happen.

This can become relevant as more scheduling policies are added.
It could also be relevant if a scheduling policy is defined but not 'enabled' (not a concept we have today);
then balancing would not happen automatically, but only when the admin clicks 'rebalance'.
But we need to remember that current scheduling policies try to migrate only one VM in each cycle (and only if the previous migration finished successfully).
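
To make the "only when violating the SLA" point concrete, a small sketch of that trigger condition (the function, parameter names, and sampling model are hypothetical, not the actual even-distribution implementation):

```python
# Sketch of the trigger described above: balancing only starts when a host has
# exceeded the CPU threshold for the whole configured duration.  Hypothetical helper.
def violates_policy(cpu_samples, high_util_pct, duration_minutes, sample_interval_sec=60):
    """True if every sample in the last `duration_minutes` is above the threshold."""
    needed = (duration_minutes * 60) // sample_interval_sec
    recent = cpu_samples[-needed:]
    return len(recent) >= needed and all(s > high_util_pct for s in recent)


# Example: 85% threshold over 2 minutes of one-minute samples.
print(violates_policy([90.0, 88.5, 91.2], high_util_pct=85.0, duration_minutes=2))  # True
```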

Comment 7 Itamar Heim 2013-01-16 11:34:00 UTC
OK, so IIUC, they want a load-balancing policy that is more aggressive than the current one based on SLA limits; rather something like: calculate the average CPU/core ratio across the cluster, and start moving VMs from hosts that are >X% higher than that average to other hosts that are Y% lower than that average?

This becomes tricky when we want to take more than just CPU into account (RAM, network, I/O).

It also sounds a bit like a task for Drools: find the optimal placement, then act on it (our current scheduler optimizes one VM at a time).

I'm not sure this can make 3.3, but we are working on allowing users to write their own scheduling code.

Doron - sounds like the RFE here would still be to "trigger load balancing X once", which we don't have today.
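
A rough sketch of that average-based selection, only to illustrate the idea; the data shape, the X/Y percentages, and the helper name are made up and this is not the oVirt scheduler:

```python
# Sketch of the average-based balancing idea from this comment (hypothetical helper).
def classify_hosts(hosts, over_pct, under_pct):
    """hosts maps host name -> CPU load per core (percent).

    Returns (sources, destinations): hosts more than `over_pct` above the
    cluster average, and hosts more than `under_pct` below it.
    """
    avg = sum(hosts.values()) / len(hosts)
    sources = [h for h, load in hosts.items() if load > avg * (1 + over_pct / 100)]
    destinations = [h for h, load in hosts.items() if load < avg * (1 - under_pct / 100)]
    return sources, destinations


loads = {"host1": 95.0, "host2": 40.0, "host3": 15.0}
# Average is 50%; with X=20 and Y=20, host1 (>60%) is a source and host3 (<40%) a destination.
print(classify_hosts(loads, over_pct=20, under_pct=20))
```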

Comment 8 Doron Fediuck 2013-01-16 18:07:59 UTC
(In reply to comment #7)
> OK, so IIUC, they want a load-balancing policy that is more aggressive than
> the current one based on SLA limits; rather something like: calculate the
> average CPU/core ratio across the cluster, and start moving VMs from hosts
> that are >X% higher than that average to other hosts that are Y% lower than
> that average?
> 
> This becomes tricky when we want to take more than just CPU into account
> (RAM, network, I/O).
> 
> It also sounds a bit like a task for Drools: find the optimal placement, then
> act on it (our current scheduler optimizes one VM at a time).
> 
> I'm not sure this can make 3.3, but we are working on allowing users to write
> their own scheduling code.
> 
> Doron - sounds like the RFE here would still be to "trigger load balancing X
> once", which we don't have today.

I agree it reads like 'load balance now'.
Bryan - having a similar number of VMs on all hosts may be misleading, as it depends on which load you're actually balancing. If one VM's vCPU goes wild, we may see VM counts closer to 1, 19, 20, and if you add memory into the mix it gets even harder, since you may want to weight every resource differently (not supported yet, but in our future plans).
So asking for immediate balancing based on policy X sounds like the optimal resolution in this case.
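
To show what "weighting every resource differently" could look like, a tiny sketch; the weights, field names, and the idea of a single combined score are illustrative assumptions, not a supported feature:

```python
# Sketch of combining several resources into one host "load" score, as anticipated
# above.  Weights and fields are illustrative only.
def host_score(cpu_pct, mem_pct, net_pct, weights=None):
    """Weighted load score; a higher score means a more loaded host."""
    w = weights or {"cpu": 0.5, "mem": 0.3, "net": 0.2}
    return w["cpu"] * cpu_pct + w["mem"] * mem_pct + w["net"] * net_pct


# A host with one runaway vCPU can dominate on CPU even if memory is fine:
print(host_score(cpu_pct=95, mem_pct=30, net_pct=10))   # 58.5
print(host_score(cpu_pct=40, mem_pct=80, net_pct=10))   # 46.0
```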

Comment 11 Simon Grinberg 2013-02-17 16:49:40 UTC
We'll have to wait for implementation of the infrastructure in bug 912059

Comment 12 Bryan Yount 2013-07-18 19:03:43 UTC
(In reply to Simon Grinberg from comment #11)
> We'll have to wait for implementation of the infrastructure in bug 912059

I guess this means the feature won't happen until *after* 3.3, correct? Or can 912059 and this RFE happen simultaneously in 3.3?

Comment 13 Doron Fediuck 2013-07-21 21:36:47 UTC
(In reply to Bryan Yount from comment #12)
> (In reply to Simon Grinberg from comment #11)
> > We'll have to wait for implementation of the infrastructure in bug 912059
> 
> I guess this means the feature won't happen until *after* 3.3, correct? Or
> can 912059 and this RFE happen simultaneously in 3.3?

Bryan,
bug 912059 will allow users to write their own logic for the balancing policy,
so technically this can be done by the customer in a similar way to a VDSM hook.
However, there is one thing that worries me, and that is the result of 'mass
migrations'. Today's balancing takes one VM from every loaded host and migrates it to an under-utilized host (if one exists). Running several migrations
at the same time may overload your networking, leading to other issues. So
when looking into such balancing logic, you should verify that the system is not
being overloaded. If you had one host in maintenance and all running VMs are
on a different host, such logic would trigger multiple migrations from one host
to the other. Is this what you expect to see?
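
One simple way custom balancing logic could avoid overloading the migration network is a cap on concurrent migrations; a sketch under that assumption (the limit, the migrate() stand-in, and drain_host() are hypothetical, not engine behaviour):

```python
# Sketch of the "don't overload the network" concern: cap how many migrations
# run concurrently.  Purely illustrative; not engine code.
from concurrent.futures import ThreadPoolExecutor
import time

MAX_CONCURRENT_MIGRATIONS = 2   # hypothetical limit to protect the migration network


def migrate(vm, destination):
    """Stand-in for a live migration; the real one moves guest memory over the network."""
    time.sleep(0.1)
    return f"{vm} -> {destination}"


def drain_host(vms, destination):
    """Migrate all VMs off a host, at most MAX_CONCURRENT_MIGRATIONS at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_MIGRATIONS) as pool:
        return list(pool.map(migrate, vms, [destination] * len(vms)))


print(drain_host(["vm1", "vm2", "vm3", "vm4"], "host2"))
```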

Comment 14 Bryan Yount 2013-07-23 00:04:26 UTC
(In reply to Doron Fediuck from comment #13)
> However, there is one thing that worries me, and that is the result of 'mass
> migrations'. Today's balancing takes one VM from every loaded host and
> migrates it to an under-utilized host (if one exists). Running several
> migrations at the same time may overload your networking, leading to other issues.

I completely agree; that worries me too. You wouldn't want the balancing action to happen all at once, but it would be nice to get some indication that things were happening behind the scenes to rebalance things a bit.

I know we already do these calculations for new VMs being started but they want this to happen for already-running VMs.

> So when looking into such balancing logic, you should verify that the system
> is not being overloaded. If you had one host in maintenance and all running
> VMs are on a different host, such logic would trigger multiple migrations
> from one host to the other. Is this what you expect to see?

Something like that, yes. To properly implement this feature, I would imagine a new "Status" would need to be created such as "Scheduled for migration to host X" with a button to cancel said migration. This way, the user would know that something was about to happen to rebalance things.

Obviously that's just a suggestion, and I will leave it up to the actual engineers to figure out the best way to do this, but the customer really just wants a way to rebalance things after taking a hypervisor offline for maintenance.

Comment 18 Doron Fediuck 2014-06-18 06:16:26 UTC
This will be handled using OptaPlanner integration, as a recommendation view.
Note that the implementation expects an already-deployed OptaPlanner instance that we
can integrate with.

Comment 19 Scott Herold 2014-07-21 14:37:58 UTC
In conjunction with bug 877209. This requires an external oVirt Optimizer to be up and running. It should be tested against the upstream OptaPlanner engine.

Comment 24 Lukas Svaty 2014-09-11 12:54:29 UTC
As no automated way of optimizing the cluster exists at the moment, this should be moved as an RFE to 3.6.

At the moment the only way to optimize a cluster is to get suggestions from the optimizer regarding migrations (which have to be performed one by one), or to find an optimal solution for the cluster when the admin is trying to start new VMs and has marked them to be started.

These two scenarios are in these bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=877209
https://bugzilla.redhat.com/show_bug.cgi?id=1093051

moving back to ASSIGNED

Comment 25 Doron Fediuck 2014-09-14 13:00:43 UTC
(In reply to Lukas Svaty from comment #24)
> As no automated way of optimizing the cluster exists at the moment, this
> should be moved as an RFE to 3.6.
> 

The new feature handles two use cases:

1. Allowing a VM to be run in a fragmented cluster by showing what needs to move where in order to run the given VM.

2. Re-balancing a given cluster, as this BZ is asking for.

As you can see in comment 18, the decision was to solve case 2 using a recommendation view, which is what you can get by asking the optimizer to present a re-balanced cluster. The system offers a button to perform one step at a time in order to avoid destabilizing the cluster with multiple concurrent migrations.

In the future we will consider further automation by constantly balancing the clusters, but that is out of scope here.

Based on the above, moving back to ON_QA.
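
A sketch of the "recommendation view, one step at a time" flow, purely to illustrate the shape of the integration: the oVirt Optimizer exposes its results over HTTP, but the endpoint path, response shape, and host name below are hypothetical assumptions, not its documented API.

```python
# Sketch of fetching a recommended migration plan and applying a single step,
# mirroring the one-step-at-a-time button described above.  Endpoint and JSON
# layout are assumptions; check your optimizer deployment's actual API.
import requests

OPTIMIZER_URL = "http://optimizer.example.com:8080"   # hypothetical address


def fetch_recommended_steps(cluster_id):
    """Ask the optimizer for the migrations it recommends for one cluster."""
    resp = requests.get(f"{OPTIMIZER_URL}/clusters/{cluster_id}/solution", timeout=30)
    resp.raise_for_status()
    # Assume a list like [{"vm": "vm-12", "destination_host": "host-3"}, ...]
    return resp.json().get("migrations", [])


def apply_one_step(step):
    """Perform a single recommended migration (here we only report it)."""
    print(f"Would migrate {step['vm']} to {step['destination_host']}")


# Usage (against a real optimizer deployment):
#   steps = fetch_recommended_steps("Default")
#   if steps:
#       apply_one_step(steps[0])   # apply only the first step, then re-check
```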

Comment 26 Yaniv Lavi 2015-01-19 15:11:46 UTC
Original request was:
3. What is the nature and description of the request?
The customer would like a button on the RHEV GUI that says "Balance load" or something similar. This is requested because after performing maintenance on a hypervisor or a few hypervisors, the number of VMs per hypervisor is no longer balanced. Since the VMs have been moved all over the place, they would like to be able to click this button to re-balance everything according to the pre-defined cluster policy.

4. Why does the user need this? (List the business requirements here)
It is a hassle to manually rebalance the cluster load when there are many hundreds of VMs.

5. Functional requirements:
A button on RHEV-M that says "balance load" or "apply cluster policy"

6. For each functional requirement listed in question 5, specify how Red Hat
and the customer can test to confirm the requirement is successfully
implemented.

The button will begin migrating VMs according to the cluster policy already defined. In this example, it would balance the load.

Comment 28 errata-xmlrpc 2015-02-11 17:50:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

Comment 29 Bryan Yount 2015-03-12 15:08:49 UTC
Attaching a kbase article on how to configure the "Optimizer", which provides the functionality requested in this RFE.

