Bug 1323952 - Change migration parameters
Summary: Change migration parameters
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.6
Target Release: 3.6.6
Assignee: Tomas Jelinek
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: migration_improvements 1328636 1339521
 
Reported: 2016-04-05 07:31 UTC by Tomas Jelinek
Modified: 2016-06-07 21:35 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
Migration behavior has changed: at most 2 migrations (instead of 3) are now started in parallel, at a higher speed, to increase the chances that migrations converge instead of timing out. The values can be changed back, or to any other value, in vdsm.conf.
Clone Of:
Clones: 1339521
Environment:
Last Closed: 2016-05-30 10:53:57 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-3.6.z+
mgoldboi: planning_ack+
tjelinek: devel_ack+
rule-engine: testing_ack+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 55676 0 master MERGED migrations: change migration parameters 2016-04-06 20:37:12 UTC
oVirt gerrit 55786 0 ovirt-3.6 MERGED migrations: change migration parameters 2016-04-18 11:32:42 UTC
oVirt gerrit 56558 0 None None None 2016-05-05 12:16:28 UTC

Description Tomas Jelinek 2016-04-05 07:31:43 UTC
It's been a while since the default migration parameters were set, and the current values in many cases prevent migrations from finishing successfully.

Current values:
max_outgoing_migrations=3
migration_max_bandwidth=32 (mbps)

These values are outdated and would require a bigger change to fix properly, but as a simple fix that should already help a lot within the 3.6 timeframe, change the defaults to:
max_outgoing_migrations=2
migration_max_bandwidth=45 (mbps)

Comment 1 Michal Skrivanek 2016-04-06 11:57:07 UTC
https://access.redhat.com/solutions/744423 should be updated as well

Comment 2 Michal Skrivanek 2016-04-14 07:15:47 UTC
finally agreed on: 
max_outgoing/incoming_migrations=2
migration_max_bandwidth=52
migration_progress_timeout=240
migration_downtime_delay=20
migration_downtime_steps=5
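
For reference, a minimal sketch of how these overrides could look in /etc/vdsm/vdsm.conf (an illustration only - it assumes all of these keys, including max_incoming_migrations, are read from the [vars] section and that vdsmd is restarted afterwards):

[vars]
# values agreed above; any of them can be overridden here
max_outgoing_migrations = 2
max_incoming_migrations = 2
# Mbps per migration
migration_max_bandwidth = 52
# progress/stall timeout, in seconds (assumed meaning)
migration_progress_timeout = 240
# downtime enlargement schedule
migration_downtime_delay = 20
migration_downtime_steps = 5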

Comment 3 Israel Pinto 2016-04-18 08:09:37 UTC
Tested with those parameters on:
RHEVM version: 3.6.5.3-0.1.el6
VDSM Version:vdsm-4.17.26-0.el7ev
(set them in vdsm.conf)

I ran load on a VM with the pig load tool, case:
loadTool -v -p 1 -t 1 -m 4000M -l mem -s 500 &
We see an improvement of 2 min in the results:
it takes 2.5 min instead of 4.5 min.
A very good improvement.

Comment 5 Israel Pinto 2016-05-01 07:23:49 UTC
Tested with those parameters on:
RHEVM version:  3.6.6-0.1.el6
VDSM Version:  vdsm-4.17.27-0.el7ev

I ran load on a VM with the pig load tool, case:
loadTool -v -p 1 -t 1 -m 4000M -l mem -s 500 &
We see an improvement of ~2 min in the results:
it now takes 2 min 48 sec instead of 4 min 45 sec.
A very good improvement.

Comment 6 Michal Skrivanek 2016-05-05 12:16:28 UTC
additional change proposed: Change 56558

migrations: change convergence schedule from time to iterations

Currently the convergence schedule reacts to a specific number of seconds of
stalling. This turned out to be incorrect, because the algorithm that detects
stalling does not detect it properly and fundamentally cannot.
The reason is that the algorithm remembers the time of the last progress and
then, if data_remaining is bigger, considers the migration to be stalling.
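
To make this concrete, one way to read the check above as a Python sketch (illustrative names only, not the actual vdsm code):

import time

class StallMonitor(object):
    # Seconds-based stalling check, one reading of the description above
    # (illustrative sketch, not the actual vdsm code).

    def __init__(self, progress_timeout):
        self.progress_timeout = progress_timeout  # e.g. migration_progress_timeout
        self.lowest_remaining = None
        self.last_progress = time.time()

    def is_stalling(self, data_remaining):
        # Progress means data_remaining dropped below the lowest value seen so far.
        if self.lowest_remaining is None or data_remaining < self.lowest_remaining:
            self.lowest_remaining = data_remaining
            self.last_progress = time.time()
            return False
        # Otherwise a stall is assumed once enough seconds have passed, even
        # though qemu may simply be copying a new iteration's worth of dirtied
        # memory - hence the false positives described below.
        return time.time() - self.last_progress > self.progress_timeout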

This does not work properly, because qemu works in iterations and each
iteration takes a significant amount of time.

In the first iteration there can be no stalling: qemu only copies memory and
does not look at how much of it has changed.

After this period of time qemu checks whether it can move the rest of the VM's
memory while paused for the downtime. If not, it starts a new iteration. This
new iteration starts copying all the dirtied memory, which can be a lot.

During the next iteration, the actual remaining data is compared to the minimal
remaining data to determine whether the migration is stalling. Until the next
iteration gets below the level of remaining data from the previous one, it will
be considered stalling.

But this does not actually mean we want to enlarge the downtime during the
iteration, since we don't know whether qemu can or cannot migrate the last
part. This information can be found only between two iterations - e.g. if it is
detected that the copying is in a new iteration, it means the current downtime
was not enough, so enlarging the downtime may make sense.

So, the current patch changes the meaning of the "limit" from the number of
seconds the migration can be stalling to the number of iterations the migration
can be stalling (it may make sense to wait for more than one iteration, to give
qemu a couple of tries before enlarging the downtime).

The detection of new iteration is done this way:
- in each monitoring cycle the current amount of remaining memory is remembered
- in the next cycle the remembered remaining memory is compared to the current
  one
- the result can be:
  - the current amount is smaller than the remembered one => the migration is
    either progressing within the same iteration, or it is already in the next
    iteration but the progress was fast enough to get below the remembered value.
    In this case we can fail to recognize the next iteration, but it does not
    really matter, since the migration progressed in between.
  - the current amount is equal to or higher than the remembered one => it is a
    new iteration, because within one iteration the amount of remaining memory
    can only go down; it cannot stay at one point, let alone grow.

The detection is not perfect, but it is the best we can have until libvirt 1.3,
when this information will be reported to us.
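
For illustration, a rough Python sketch of the iteration-based detection described above (hypothetical names, not the actual patch in gerrit change 56558); each monitoring cycle would feed the current remaining-data value into it:

class IterationStallMonitor(object):
    # Iteration-based stalling check (illustrative sketch, not the actual code).

    def __init__(self, iteration_limit):
        # the "limit": how many new copy iterations are tolerated before the
        # downtime is enlarged
        self.iteration_limit = iteration_limit
        self.prev_remaining = None
        self.new_iterations = 0

    def should_enlarge_downtime(self, data_remaining):
        # Within one iteration the remaining memory can only shrink, so a value
        # equal to or higher than the previous sample means qemu has started a
        # new iteration, i.e. the current downtime was not enough.
        if self.prev_remaining is not None and data_remaining >= self.prev_remaining:
            self.new_iterations += 1
        self.prev_remaining = data_remaining
        # give qemu a couple of tries before enlarging the downtime
        return self.new_iterations >= self.iteration_limit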

