Description of problem:
Right now we have cluster-wide live migration limits and timeout settings that cause migrations to fail in some cases, for example when the VM's memory is under load.

Doc: https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits

Values:
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150

We should raise bandwidthPerMigration, and possibly progressTimeout as well.
@dvossel do we have better defaults that should generically fit every cluster size?
> @dvossel do we have better defaults that should generically fit every cluster size?

The "bandwidthPerMigration" setting actually never worked, and has been disabled by default starting in 0.44.0 (cnv-4.9) [1]. So unless "bandwidthPerMigration" is being set explicitly by HCO, it has never worked that I am aware of.

As far as better defaults go, it's possible we can do better here. I'd like evidence to support what settings are causing migrations to fail in a default case.

> Right now we have cluster-wide live migration limits and timeouts settings
> Which causing migration to failed in same cases like when we have load on VM memory.

@ipinto do you have any more information pertaining to why the migrations fail? Is this specifically during cases where the VM's memory is under constant stress?

If so, it's unclear whether changing any of the default migration parameters would help if the memory can never converge to the target node during migration. In situations like this, using post copy instead of pre copy might be the only solution that offers a guarantee that migrations will succeed.

[1] https://github.com/kubevirt/kubevirt/pull/6007
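The convergence concern above can be illustrated with a rough model. This is only a sketch under simplifying assumptions, not how KubeVirt or QEMU actually schedule a migration: each pre-copy round sends all currently dirty memory at a fixed bandwidth while the guest dirties memory at a constant rate, and the migration can finish once the remaining dirty memory fits a small cutover budget. The function name, the 32 MiB budget, and the example rates are hypothetical.

```python
def precopy_rounds(memory_mib, bandwidth_mib_s, dirty_rate_mib_s,
                   cutover_budget_mib=32, max_rounds=50):
    """Count pre-copy rounds until the dirty set fits the cutover budget.

    Returns the number of rounds, or None if the model does not converge
    within max_rounds. All parameters are illustrative assumptions.
    """
    remaining = float(memory_mib)  # first round copies all guest memory
    for round_no in range(1, max_rounds + 1):
        seconds = remaining / bandwidth_mib_s
        # Memory dirtied while this round was being sent must be re-sent,
        # capped at the total guest memory size.
        remaining = min(float(memory_mib), dirty_rate_mib_s * seconds)
        if remaining <= cutover_budget_mib:
            return round_no
    return None


if __name__ == "__main__":
    # 4 GiB guest, hypothetical dirty rates, 64 vs 128 MiB/s bandwidth.
    for bandwidth, dirty in [(64, 16), (128, 16), (64, 128)]:
        print(bandwidth, dirty, precopy_rounds(4096, bandwidth, dirty))
```

In this model, raising the bandwidth shortens the migration only while the dirty rate stays below the bandwidth. Once the guest dirties memory at least as fast as it can be sent, no bandwidth-independent timeout value makes pre-copy converge, which is the case where post copy is suggested above.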
(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
>
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1]. So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.

HCO currently sets it to a default of 64Mi simply because that is exactly the value we documented in CNV 2.6 as the default in the kubevirt-config ConfigMap. Please see:
https://docs.openshift.com/container-platform/4.7/virt/live_migration/virt-live-migration-limits.html#virt-live-migration-limits-ref_virt-live-migration-limits
(In reply to Simone Tiraboschi from comment #3)
> (In reply to David Vossel from comment #2)
> > > @dvossel do we have better defaults that should generically fit every cluster size?
> >
> > The "bandwidthPerMigration" setting actually never worked, and has been
> > disabled by default now starting in 0.44.0 (cnv-4.9) [1]. So unless
> > "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> > that I am aware of.
>
> HCO is currently setting it with a default of 64Mi just because it was
> exactly the value that we documented in CNV 2.6 as the default on
> kubevirt-config confimap

Ah, interesting. If the HCO is setting 64Mi for the migration bandwidth explicitly, then we now have a change in behavior between 4.8 and 4.9 due to the bug that prevented the migration bandwidth from ever being applied correctly until now.

If we have evidence that this bandwidth setting is now causing migrations to fail when we'd expect them to succeed, then we'll need to revise the default that the HCO sets to a more reasonable one. Deciding on a new default setting would need a discussion with supporting data.

If we want to restore functionality back to the previous 4.8 behavior, then removing the bandwidthPerMigration setting on the kubevirt CR entirely would match previous behavior.
(In reply to David Vossel from comment #4)
> ah, interesting. If the HCO is setting 64Mi for the migration bandwidth
> explicitly, then we now have a change in behavior between 4.8 and 4.9 due to
> the bug that prevented the migration bandwidth from ever being applied
> correctly until now.

Just to be more accurate: HCO was already setting that value in CNV 4.8, with exactly the same default, so we should eventually amend it there as well.
(In reply to David Vossel from comment #2)
> > @dvossel do we have better defaults that should generically fit every cluster size?
>
> The "bandwidthPerMigration" setting actually never worked, and has been
> disabled by default now starting in 0.44.0 (cnv-4.9) [1]. So unless
> "bandwidthPerMigration" is being set explicitly by HCO, it has never worked
> that I am aware of.
>
> As far as better defaults go, it's possible we can do better here. I'd like
> evidence to support what settings are causing migrations to fail in a
> default case.
>
> > Right now we have cluster-wide live migration limits and timeouts settings
> > Which causing migration to failed in same cases like when we have load on VM memory.
>
> @ipinto do you have anymore information pertaining to why the
> migrations fail? Is this specifically during cases where the VM's memory is
> under constant stress?

@dvossel We have a simple memory-load test that runs the following command:

  stress-ng --vm 1 --vm-bytes 15% --vm-method all --verify -t 1800s -v --hdd 1 --io 1

This test passed before the bandwidth change. With a bandwidth of 128Mb it also passed.
The VMs are not idle, so we need at least 128Mb. We already limit migrations to 2 per node so that we do not overload the cluster network.

> If so, it's unclear whether changing any of the default migration parameters
> would help or not if the memory can never converge to the target node during
> migration. In situations like this, usage of post copy instead of pre copy
> might be the only solution that offers a guarantee that migrations will
> succeed.
>
> 1. https://github.com/kubevirt/kubevirt/pull/6007
My recommendation is that the HCO disables setting the migration bandwidth setting entirely, which results in no throttling of migration bandwidth. Here's why. In reality, "unlimited" has effectively been the default for every release so far up until 4.9. Once the setting of 64Mi took hold in 4.9 due to a bug fix on KubeVirt's side, it caused some significant issues. Customer environments have been using the "unlimited" migration bandwidth for over a year now without issue. I'm concerned that any default we pick won't satisfy everyone and could cause issues during the CNV update path. I'd rather users opt in to setting their own migration bandwidth throttling (we can document some recommendations) than potentially break someone's ability to successfully migrate. The alternative to disabling migration throttling is to set a default so high we know it won't impact anyone, which in practice is the same as disabling the bandwidth feature by default.
Verification performed on 4.9.0-250
-----------------------------------
"bandwidthPerMigration" does not exist in the spec.liveMigrationConfig stanza:

  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  workloadUpdateStrategy:

I can still set it (and even with 64Mi), but after removing the key-value pair, it remains nonexistent.

Moving to verified.
In another upgrade scenario (non-default value) from 4.8.3-19 to 4.9.0-250:
1. Set bandwidthPerMigration: 128Mi in the HCO CR.
2. Upgrade to 4.9.0-250.
3. Verify that the value was not retained, and the `bandwidthPerMigration` key was removed altogether:

  "liveMigrationConfig": {
    "completionTimeoutPerGiB": 800,
    "parallelMigrationsPerCluster": 5,
    "parallelOutboundMigrationsPerNode": 2,
    "progressTimeout": 150
  },

The value should have been retained.
Please disregard my last comment (comment 11).
Verification of the upgrade with a non-default bandwidthPerMigration value (256Mi) was successful: the value was retained after the upgrade.
The upgrade path was 4.8.3-19 -> 4.9.0-250.
Here is the terminal buffer (before and after upgrade):
http://pastebin.test.redhat.com/1001861
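For anyone repeating this kind of upgrade check, the before/after comparison can be scripted. The following is only a minimal sketch: it assumes the liveMigrationConfig stanza has already been dumped from the HCO CR into plain dictionaries before and after the upgrade, and the helper name is hypothetical. It reports keys that disappeared and keys whose value changed.

```python
def diff_live_migration_config(before, after):
    """Compare two liveMigrationConfig dicts captured around an upgrade.

    Returns (dropped, changed): the sorted list of keys missing after the
    upgrade, and a dict of key -> (old value, new value) for changed keys.
    """
    dropped = sorted(key for key in before if key not in after)
    changed = {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }
    return dropped, changed


if __name__ == "__main__":
    # Example values taken from the listings in this thread.
    before = {
        "bandwidthPerMigration": "128Mi",
        "completionTimeoutPerGiB": 800,
        "parallelMigrationsPerCluster": 5,
        "parallelOutboundMigrationsPerNode": 2,
        "progressTimeout": 150,
    }
    after = {key: value for key, value in before.items()
             if key != "bandwidthPerMigration"}
    dropped, changed = diff_live_migration_config(before, after)
    print("dropped:", dropped)
    print("changed:", changed)
```

An explicitly set, non-default value is expected to show up in neither list after the upgrade.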
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104