Bug 2011179 - Cluster-wide live migration limits and timeouts are not suitable
Summary: Cluster-wide live migration limits and timeouts are not suitable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Simone Tiraboschi
QA Contact: ibesso
URL:
Whiteboard:
Depends On:
Blocks: 2013096
 
Reported: 2021-10-06 08:22 UTC by Israel Pinto
Modified: 2021-11-02 16:01 UTC
CC List: 6 users

Fixed In Version: hco-bundle-registry-container-v4.9.0-249
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 2013096
Environment:
Last Closed: 2021-11-02 16:01:09 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
  Github kubevirt/hyperconverged-cluster-operator pull 1567 (Merged): Avoid setting a default for bandwidthPerMigration and dropping it if == 64Mi (last updated 2021-10-11 20:58:59 UTC)
  Github kubevirt/hyperconverged-cluster-operator pull 1568 (Merged): [release-1.5] Avoid setting a default for bandwidthPerMigration (last updated 2021-10-12 07:29:49 UTC)
  Red Hat Product Errata RHSA-2021:4104 (last updated 2021-11-02 16:01:30 UTC)

Description Israel Pinto 2021-10-06 08:22:13 UTC
Description of problem:
Right now we have cluster-wide live migration limit and timeout settings,
which cause migrations to fail in some cases, for example when the VM's memory is under load.

Doc: https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits
Values: 
liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150



We should raise bandwidthPerMigration, and possibly progressTimeout as well.
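
For reference, a minimal sketch of raising these values by editing the HyperConverged CR (assuming the usual CR name kubevirt-hyperconverged in the openshift-cnv namespace; the numbers below are purely illustrative, not validated recommendations):

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
...
spec:
  liveMigrationConfig:
    bandwidthPerMigration: 128Mi   # illustrative value only
    progressTimeout: 300           # illustrative value only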

Comment 1 Simone Tiraboschi 2021-10-07 09:05:38 UTC
@dvossel do we have better defaults that should generically fit every cluster size?

Comment 2 David Vossel 2021-10-07 16:44:04 UTC
> @dvossel do we have better defaults that should generically fit every cluster size?


The "bandwidthPerMigration" setting actually never worked, and has been disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless "bandwidthPerMigration" is being set explicitly by HCO, it has never worked that I am aware of.


As far as better defaults go, it's possible we can do better here. I'd like evidence to support what settings are causing migrations to fail in a default case.

> Right now we have cluster-wide live migration limit and timeout settings,
> which cause migrations to fail in some cases, for example when the VM's memory is under load.


@ipinto do you have any more information pertaining to why the migrations fail? Is this specifically during cases where the VM's memory is under constant stress?

If so, it's unclear whether changing any of the default migration parameters would help if the memory can never converge to the target node during migration. In situations like this, using post-copy instead of pre-copy might be the only approach that guarantees migrations will succeed.
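
(For illustration only: upstream KubeVirt exposes an allowPostCopy knob in the migrations stanza of the KubeVirt CR, roughly as sketched below; whether and how CNV/HCO surfaces this setting is an assumption here, not something this bug establishes.)

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      allowPostCopy: true   # opt in to post-copy for workloads that cannot converge under pre-copy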




1. https://github.com/kubevirt/kubevirt/pull/6007

Comment 3 Simone Tiraboschi 2021-10-07 20:17:46 UTC
(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.

HCO is currently setting it to a default of 64Mi just because that was exactly the value we documented in CNV 2.6 as the default in the kubevirt-config configmap; please see:
https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits

Comment 4 David Vossel 2021-10-08 15:25:43 UTC
(In reply to Simone Tiraboschi from comment #3)
> (In reply to David Vossel from comment #2)
> > > @dvossel do we have better defaults that should generically fit every cluster size?
> > 
> > 
> > The "bandwidthPerMigration" setting actually never worked, and has been
> > disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> > "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> > that I am aware of.
> 
> HCO is currently setting it to a default of 64Mi just because that was
> exactly the value we documented in CNV 2.6 as the default in the
> kubevirt-config configmap


ah, interesting. If the HCO is setting 64Mi for the migration bandwidth explicitly, then we now have a change in behavior between 4.8 and 4.9 due to the bug that prevented the migration bandwidth from ever being applied correctly until now.

If we have evidence that this bandwidth setting is now causing migrations to fail when we'd expect them to succeed, then we'll need to revise the default that the HCO sets to a more reasonable one. Deciding on a new default setting would need a discussion with supporting data.

If we want to restore functionality back to the previous 4.8 behavior, then removing the bandwidthPerMigration setting on the kubevirt CR entirely would match previous behavior.
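
(A hedged sketch of that removal, assuming the setting is surfaced under spec.liveMigrationConfig of a HyperConverged CR named kubevirt-hyperconverged in openshift-cnv; note that a JSON-patch "remove" errors out if the key is already absent:)

$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
    --type=json \
    -p '[{"op": "remove", "path": "/spec/liveMigrationConfig/bandwidthPerMigration"}]'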

Comment 5 Simone Tiraboschi 2021-10-08 16:42:52 UTC
(In reply to David Vossel from comment #4)
> ah, interesting. If the HCO is setting 64Mi for the migration bandwidth
> explicitly, then we now have a change in behavior between 4.8 and 4.9 due to
> the bug that prevented the migration bandwidth from ever being applied
> correctly until now.

Just to be more accurate: HCO was already setting that value on CNV 4.8, with exactly the same default, so we should eventually amend it there as well.

Comment 6 Israel Pinto 2021-10-09 18:31:04 UTC
(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.
> 
> 
> As far as better defaults go, it's possible we can do better here. I'd like
> evidence to support what settings are causing migrations to fail in a
> default case.
> 
> > Right now we have cluster-wide live migration limit and timeout settings,
> > which cause migrations to fail in some cases, for example when the VM's memory is under load.
> 
> 
> @ipinto do you have any more information pertaining to why the
> migrations fail? Is this specifically during cases where the VM's memory is
> under constant stress?

@dvossel
We have a simple memory load test with the following command:
stress-ng --vm 1 --vm-bytes 15% --vm-method all --verify -t 1800s -v --hdd 1 --io 1
This test passed before the bandwidth change, and it also passed with a bandwidth of 128Mi.
The VMs are not idle, so we need at least 128Mi; we limit migrations to 2 per node so that we do not overload the cluster network.
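
(For reproducibility, a migration under this load is triggered with a VirtualMachineInstanceMigration object; a minimal sketch, where the VMI name is a placeholder:)

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-under-test   # placeholder: name of the stressed VMI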
  
> 
> If so, it's unclear whether changing any of the default migration parameters
> would help if the memory can never converge to the target node during
> migration. In situations like this, using post-copy instead of pre-copy
> might be the only approach that guarantees migrations will succeed.
> 
> 
> 
> 
> 1. https://github.com/kubevirt/kubevirt/pull/6007

Comment 7 David Vossel 2021-10-11 13:55:28 UTC
My recommendation is that the HCO stop setting the migration bandwidth entirely, which results in no throttling of migration bandwidth. Here's why.

In reality, "unlimited" has effectively been the default for every release so far up until 4.9. Once the setting of 64Mi took hold in 4.9 due to a bug fix on KubeVirt's side, it caused some significant issues.

Customer environments have been using the "unlimited" migration bandwidth for over a year now without issue. I'm concerned that any default we pick won't satisfy everyone and could cause issues during the CNV update path. I'd rather users opt in to setting their own migration bandwidth throttling (we can document some recommendations) than potentially break someone's ability to successfully migrate.

The alternative to disabling migration throttling is to set a default so high we know it won't impact anyone, which in practice is the same as disabling the bandwidth feature by default.

Comment 8 ibesso 2021-10-15 18:17:49 UTC
Verification performed on 4.9.0-250
-----------------------------------
"bandwidthPerMigration" does not exist in the spec.liveMigrationConfig stanza:
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  workloadUpdateStrategy:

I can still set it explicitly (even to 64Mi), and after removing the key-value pair it stays absent.
Moving to verified.
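
(For reference, the stanza above can be inspected with something like the following; the CR name and namespace are the usual defaults and an assumption here:)

$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
    -o jsonpath='{.spec.liveMigrationConfig}'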

Comment 11 ibesso 2021-10-16 10:58:07 UTC
In another upgrade scenario (non-default value) from 4.8.3-19 to 4.9.0-250:
1. set bandwidthPerMigration: 128Mi in HCO CR.
2. upgrade to 4.9.0-250.
3. observed that the value was not retained; the `bandwidthPerMigration` key was removed altogether:
  "liveMigrationConfig": {
    "completionTimeoutPerGiB": 800,
    "parallelMigrationsPerCluster": 5,
    "parallelOutboundMigrationsPerNode": 2,
    "progressTimeout": 150
  },

The value should have been retained.

Comment 12 ibesso 2021-10-16 16:42:15 UTC
Please disregard my last comment (comment 11).
Verification of the upgrade with a non-default bandwidthPerMigration value (256Mi) was successful: the value was retained after the upgrade.
The upgrade path was 4.8.3-19 -> 4.9.0-250.

Here is the terminal buffer (before and after upgrade): http://pastebin.test.redhat.com/1001861
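
(A hedged sketch of the pre-upgrade step, setting the non-default value on the HCO CR with a merge patch; the CR name and namespace are assumptions as above:)

$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
    --type=merge \
    -p '{"spec": {"liveMigrationConfig": {"bandwidthPerMigration": "256Mi"}}}'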

Comment 15 errata-xmlrpc 2021-11-02 16:01:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104

