Bug 2013096

Summary: Cluster-wide live migration limits and timeouts are not suitable
Product: Container Native Virtualization (CNV)
Component: Installation
Version: 4.9.0
Target Release: 4.8.3
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Israel Pinto <ipinto>
Assignee: Simone Tiraboschi <stirabos>
CC: cnv-qe-bugs, dbasunag, dvossel, ibesso, kmajcher, stirabos
Hardware: Unspecified
OS: Unspecified
Clone Of: 2011179
Bug Depends On: 2011179
Last Closed: 2021-10-12 12:21:41 UTC

Description Israel Pinto 2021-10-12 06:08:08 UTC
+++ This bug was initially created as a clone of Bug #2011179 +++

Description of problem:
Right now we have cluster-wide live migration limits and timeout settings, which cause migrations to fail in some cases, for example when the VM memory is under load.

Doc: https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits
Values: 
liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150

We should increase bandwidthPerMigration, and possibly also progressTimeout.
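
For reference, these cluster-wide values are set through the HyperConverged custom resource that HCO manages. Below is a minimal sketch of where they live, assuming the spec.liveMigrationConfig stanza and the default resource name and namespace (adjust to the actual deployment):

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi          # current default under discussion in this bug
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150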

--- Additional comment from Simone Tiraboschi on 2021-10-07 12:05:38 IDT ---

@dvossel do we have better defaults that should generically fit every cluster size?

--- Additional comment from David Vossel on 2021-10-07 19:44:04 IDT ---

> @dvossel do we have better defaults that should generically fit every cluster size?


The "bandwidthPerMigration" setting actually never worked, and has been disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless "bandwidthPerMigration" is being set explicitly by HCO, it has never worked that I am aware of.


As far as better defaults go, it's possible we can do better here. I'd like evidence to support what settings are causing migrations to fail in a default case.

> Right now we have cluster-wide live migration limits and timeout settings, which cause
> migrations to fail in some cases, for example when the VM memory is under load.


@ipinto do you have any more information pertaining to why the migrations fail? Is this specifically during cases where the VM's memory is under constant stress?

If so, it's unclear whether changing any of the default migration parameters would help or not if the memory can never converge to the target node during migration. In situations like this, usage of post copy instead of pre copy might be the only solution that offers a guarantee that migrations will succeed. 




1. https://github.com/kubevirt/kubevirt/pull/6007
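
For illustration, post-copy can be allowed as a cluster-wide migration setting on the KubeVirt CR. The sketch below assumes the allowPostCopy field under spec.configuration.migrations; the resource name and namespace are placeholders, and in a CNV deployment this would normally be driven through HCO rather than edited directly:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      allowPostCopy: true   # permit post-copy so migrations of heavily loaded VMs can still complete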

--- Additional comment from Simone Tiraboschi on 2021-10-07 23:17:46 IDT ---

(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.

HCO is currently setting it with a default of 64Mi just because that was exactly the value we documented in CNV 2.6 as the default in the kubevirt-config configmap; please see:
https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits

--- Additional comment from David Vossel on 2021-10-08 18:25:43 IDT ---

(In reply to Simone Tiraboschi from comment #3)
> (In reply to David Vossel from comment #2)
> > > @dvossel do we have better defaults that should generically fit every cluster size?
> > 
> > 
> > The "bandwidthPerMigration" setting actually never worked, and has been
> > disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> > "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> > that I am aware of.
> 
> HCO is currently setting it with a default of 64Mi just because it was
> exactly the value that we documented in CNV 2.6 as the default on
> kubevirt-config confimap


ah, interesting. If the HCO is setting 64Mi for the migration bandwidth explicitly, then we now have a change in behavior between 4.8 and 4.9 due to the bug that prevented the migration bandwidth from ever being applied correctly until now.

If we have evidence that this bandwidth setting is now causing migrations to fail when we'd expect them to succeed, then we'll need to revise the default that the HCO sets to a more reasonable one. Deciding on a new default setting would need a discussion with supporting data.

If we want to restore functionality back to the previous 4.8 behavior, then removing the bandwidthPerMigration setting on the kubevirt CR entirely would match previous behavior.
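
For illustration, matching the pre-4.9 behavior would mean leaving bandwidthPerMigration out of the KubeVirt CR's migration configuration entirely. A minimal sketch, assuming these settings are rendered under spec.configuration.migrations:

spec:
  configuration:
    migrations:
      # bandwidthPerMigration intentionally omitted, so migration bandwidth is not throttled
      completionTimeoutPerGiB: 800
      parallelMigrationsPerCluster: 5
      parallelOutboundMigrationsPerNode: 2
      progressTimeout: 150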

--- Additional comment from Simone Tiraboschi on 2021-10-08 19:42:52 IDT ---

(In reply to David Vossel from comment #4)
> ah, interesting. If the HCO is setting 64Mi for the migration bandwidth
> explicitly, then we now have a change in behavior between 4.8 and 4.9 due to
> the bug that prevented the migration bandwidth from ever being applied
> correctly until now.

Just to be more accurate: HCO was already setting exactly the same default value on CNV 4.8, so we should eventually amend it there as well.

--- Additional comment from Israel Pinto on 2021-10-09 21:31:04 IDT ---

(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
> 
> 
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1].  So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.
> 
> 
> As far as better defaults go, it's possible we can do better here. I'd like
> evidence to support what settings are causing migrations to fail in a
> default case.
> 
> > Right now we have cluster-wide live migration limits and timeout settings, which cause
> > migrations to fail in some cases, for example when the VM memory is under load.
> 
> 
> @ipinto do you have anymore information pertaining to why the
> migrations fail? Is this specifically during cases where the VM's memory is
> under constant stress? 

@dvossel
We have a simple memory load test with the following command:
stress-ng --vm 1 --vm-bytes 15% --vm-method all --verify -t 1800s -v --hdd 1 --io 1
This test passed before the bandwidth change. With a bandwidth of 128Mb it passed.
The VMs are not idle, so we need at least 128Mb. We limit migrations to 2 per node so we do not load the cluster network.
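
For illustration, the 128Mb bandwidth mentioned above would be applied through the same liveMigrationConfig stanza sketched earlier; a minimal example, expressing it as the Kubernetes quantity 128Mi (an assumption about the intended unit):

liveMigrationConfig:
    bandwidthPerMigration: 128Mi          # raised from the 64Mi default
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150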
  
> 
> If so, it's unclear whether changing any of the default migration parameters
> would help or not if the memory can never converge to the target node during
> migration. In situations like this, usage of post copy instead of pre copy
> might be the only solution that offers a guarantee that migrations will
> succeed. 
> 
> 
> 
> 
> 1. https://github.com/kubevirt/kubevirt/pull/6007

--- Additional comment from David Vossel on 2021-10-11 16:55:28 IDT ---

My recommendation is that HCO stop setting the migration bandwidth entirely, which results in no throttling of migration bandwidth. Here's why.

In reality, "unlimited" has effectively been the default for every release so far up until 4.9. Once the setting of 64Mi took hold in 4.9 due to a bug fix on KubeVirt's side, it caused some significant issues.

Customer environments have been using the "unlimited" migration bandwidth for over a year now without issue. I'm concerned that any default we pick won't satisfy everyone and could cause issues during the CNV update path. I'd rather users opt in to setting their own migration bandwidth throttling (we can document some recommendations) than potentially break someone's ability to successfully migrate.

The alternative to disabling migration throttling is to set a default so high we know it won't impact anyone, which in practice is the same as disabling the bandwidth feature by default.
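
For illustration, this recommendation would mean HCO simply stops rendering bandwidthPerMigration in its liveMigrationConfig defaults. A minimal sketch of the resulting stanza (field names as documented above, other values unchanged):

liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
    # no bandwidthPerMigration: no default throttling; users can opt in to a limit if they want one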

Comment 1 Simone Tiraboschi 2021-10-12 12:21:41 UTC
The value is not effective on 4.8.z due to a kubevirt bug, and the upgrade to 4.9.0 will fix it; fixing this on 4.9.0 is enough.