Bug 2013096 - Cluster-wide live migration limits and timeouts are not suitable
Summary: Cluster-wide live migration limits and timeouts are not suitable
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.3
Assignee: Simone Tiraboschi
QA Contact:
URL:
Whiteboard:
Depends On: 2011179
Blocks:
 
Reported: 2021-10-12 06:08 UTC by Israel Pinto
Modified: 2021-10-12 12:21 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2011179
Environment:
Last Closed: 2021-10-12 12:21:41 UTC
Target Upstream Version:
Embargoed:



Description Israel Pinto 2021-10-12 06:08:08 UTC
+++ This bug was initially created as a clone of Bug #2011179 +++

Description of problem:
Right now we have cluster-wide live migration limits and timeout settings,
which cause migrations to fail in some cases, for example when there is load on the VM's memory.

Doc: https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits
Values: 
liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150



We should raise bandwidthPerMigration, and possibly also progressTimeout.

--- Additional comment from Simone Tiraboschi on 2021-10-07 12:05:38 IDT ---

@dvossel do we have better defaults that should generically fit every cluster size?

--- Additional comment from David Vossel on 2021-10-07 19:44:04 IDT ---

> @dvossel do we have better defaults that should generically fit every cluster size?


The "bandwidthPerMigration" setting actually never worked, and has been disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless "bandwidthPerMigration" is being set explicitly by HCO, it has never worked that I am aware of.


As far as better defaults go, it's possible we can do better here. I'd like evidence to support what settings are causing migrations to fail in a default case.

> Right now we have cluster-wide live migration limits and timeout settings,
> which cause migrations to fail in some cases, for example when there is load on the VM's memory.


@ipinto do you have any more information pertaining to why the migrations fail? Is this specifically during cases where the VM's memory is under constant stress? 

If so, it's unclear whether changing any of the default migration parameters would help or not if the memory can never converge to the target node during migration. In situations like this, usage of post copy instead of pre copy might be the only solution that offers a guarantee that migrations will succeed. 
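For illustration only, a minimal sketch of enabling post copy cluster-wide through the KubeVirt CR's migration configuration (the allowPostCopy field is part of the kubevirt.io/v1 migrations configuration; the CR name and namespace assume a default CNV install, and on an HCO-managed cluster such a setting would normally be propagated via HCO rather than edited directly):

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt-kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  configuration:
    migrations:
      # allow the migration to switch to post copy if pre copy cannot converge
      allowPostCopy: true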




1. https://github.com/kubevirt/kubevirt/pull/6007

--- Additional comment from Simone Tiraboschi on 2021-10-07 23:17:46 IDT ---

(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.

HCO is currently setting it to a default of 64Mi just because that was exactly the value we documented in CNV 2.6 as the default on the kubevirt-config configmap; please see:
https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits

--- Additional comment from David Vossel on 2021-10-08 18:25:43 IDT ---

(In reply to Simone Tiraboschi from comment #3)
> (In reply to David Vossel from comment #2)
> > > @dvossel do we have better defaults that should generically fit every cluster size?
> > 
> > 
> > The "bandwidthPerMigration" setting actually never worked, and has been
> > disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> > "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> > that I am aware of.
> 
> HCO is currently setting it with a default of 64Mi just because it was
> exactly the value that we documented in CNV 2.6 as the default on
> kubevirt-config configmap


ah, interesting. If the HCO is setting 64Mi for the migration bandwidth explicitly, then we now have a change in behavior between 4.8 and 4.9 due to the bug that prevented the migration bandwidth from ever being applied correctly until now.

If we have evidence that this bandwidth setting is now causing migrations to fail when we'd expect them to succeed, then we'll need to revise the default that the HCO sets to a more reasonable one. Deciding on a new default setting would need a discussion with supporting data.

If we want to restore functionality back to the previous 4.8 behavior, then removing the bandwidthPerMigration setting on the kubevirt CR entirely would match previous behavior.
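As an illustration (not a confirmed change), the KubeVirt CR's migration configuration without the bandwidth throttle would look roughly like the sketch below; the field names come from the kubevirt.io/v1 migrations configuration, the other values mirror the defaults quoted in the description, and the CR name and namespace assume a default CNV install:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt-kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  configuration:
    migrations:
      # bandwidthPerMigration intentionally omitted: migrations are not throttled
      completionTimeoutPerGiB: 800
      parallelMigrationsPerCluster: 5
      parallelOutboundMigrationsPerNode: 2
      progressTimeout: 150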

--- Additional comment from Simone Tiraboschi on 2021-10-08 19:42:52 IDT ---

(In reply to David Vossel from comment #4)
> ah, interesting. If the HCO is setting 64Mi for the migration bandwidth
> explicitly, then we now have a change in behavior between 4.8 and 4.9 due to
> the bug that prevented the migration bandwidth from ever being applied
> correctly until now.

Just to be more accurate: HCO was already setting that value on CNV 4.8, with exactly the same default, so we should eventually amend it there as well.

--- Additional comment from Israel Pinto on 2021-10-09 21:31:04 IDT ---

(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.
> 
> 
> As far as better defaults go, it's possible we can do better here. I'd like
> evidence to support what settings are causing migrations to fail in a
> default case.
> 
> > Right now we have cluster-wide live migration limits and timeout settings,
> > which cause migrations to fail in some cases, for example when there is load on the VM's memory.
> 
> 
> @ipinto do you have any more information pertaining to why the
> migrations fail? Is this specifically during cases where the VM's memory is
> under constant stress? 

@dvossel
We have a simple memory load test with the following command:
stress-ng --vm 1 --vm-bytes 15% --vm-method all --verify -t 1800s -v --hdd 1 --io 1
This test passed before the bandwidth change; with a bandwidth of 128Mb it passed.
The VMs are not idle, so we need at least 128Mb, and we limit migrations to 2 per node so that we do not load the cluster network.
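For reference, pinning the bandwidth to the value used in this test would look roughly like the following on the HyperConverged CR (illustration only; 128Mi reflects the test above, not an agreed default, and the CR name and namespace assume a default CNV install):

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  liveMigrationConfig:
    # value taken from the stress-ng test above; not a recommended default
    bandwidthPerMigration: 128Mi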
  
> 
> If so, it's unclear whether changing any of the default migration parameters
> would help or not if the memory can never converge to the target node during
> migration. In situations like this, usage of post copy instead of pre copy
> might be the only solution that offers a guarantee that migrations will
> succeed. 
> 
> 
> 
> 
> 1. https://github.com/kubevirt/kubevirt/pull/6007

--- Additional comment from David Vossel on 2021-10-11 16:55:28 IDT ---

My recommendation is that the HCO stops setting the migration bandwidth entirely, which results in no throttling of migration bandwidth. Here's why.

In reality, "unlimited" has effectively been the default for every release so far up until 4.9. Once the setting of 64Mi took hold in 4.9 due to a bug fix on KubeVirt's side, it caused some significant issues.

Customer environments have been using the "unlimited" migration bandwidth for over a year now without issue. I'm concerned that any default we pick won't satisfy everyone and could cause issues during the CNV update path. I'd rather users opt in to setting their own migration bandwidth throttling (we can document some recommendations) than potentially break someone's ability to successfully migrate.

The alternative to disabling migration throttling is to set a default so high we know it won't impact anyone, which in practice is the same as disabling the bandwidth feature by default.

Comment 1 Simone Tiraboschi 2021-10-12 12:21:41 UTC
The value is not effective on 4.8.z due to a kubevirt bug, and the upgrade to 4.9.0 will fix it; fixing this on 4.9.0 is enough.

