Bug 1811860

Summary: [RFE] Tool or Docs guidance on required network bandwidth for successful live migration of VMs
Product: Red Hat Enterprise Linux 8
Reporter: Germano Veit Michel <gveitmic>
Component: Documentation
Assignee: Jiri Herrmann <jherrman>
Documentation sub component: default
QA Contact:
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: high
CC: ahadas, amusil, germano, jherrman, klaas, lsurette, mperina, mzamazal, rhel-docs, rhoch, sgoodman, srevivo
Version: 8.6
Keywords: Documentation, FutureFeature
Target Milestone: rc
Target Release: 8.6
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-18 08:44:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Docs
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2020-03-10 00:39:14 UTC
Description of problem:

The customer requests a tool (e.g. a web tool on labs), or guidance in the Docs (a table?), for estimating the required network bandwidth for successful live migration of VMs.

It could be based on these factors:
* RHV Migration Policy Selected
* Dirty Ratio during migration
* VM Memory Size and Configuration

For example, it would show an estimate of bandwidth required for a VM to migrate with the following:
+ Minimal Downtime Policy
+ 512GB of Memory
+ Low/Mid/High Dirty Ratio
= estimated_bandwidth required

So that customers can have a better idea of the network requirements for migrations, i.e. whether it is 10G or 40G that they need.

Comment 1 RHEL Program Management 2020-03-10 00:40:57 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Germano Veit Michel 2020-03-11 01:40:59 UTC
The docs could also include a note on how the customer can obtain the dirty ratio, even though it changes constantly:

# virsh -r domjobinfo <VM Name>
Job type:         Unbounded   
Operation:        Outgoing migration
Time elapsed:     6789         ms
Data processed:   517.500 MiB
Data remaining:   27.680 MiB
Data total:       4.345 GiB
Memory processed: 517.500 MiB
Memory remaining: 27.680 MiB
Memory total:     4.345 GiB
Memory bandwidth: 101.060 MiB/s
Dirty rate:       203242       pages/s    <-----------
Page size:        4096         bytes
Iteration:        2           
Constant pages:   1009395     
Normal pages:     130008      
Normal data:      507.844 MiB
Expected downtime: 100          ms
Setup time:       56           ms
Compression cache: 64.000 MiB
Compressed data:  0.000 B
Compressed pages: 0            
Compression cache misses: 504          
Compression overflows: 0
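As a rough illustration (my own back-of-the-envelope sketch, not an official formula): for a pre-copy migration to converge, the network must transfer memory faster than the guest dirties it. Using the "Dirty rate" and "Page size" values from the output above:

```python
# Illustrative sketch only, not an official formula: for pre-copy migration
# to converge, the link must move memory faster than the guest dirties it.
# Input values taken from the `virsh domjobinfo` output above.

dirty_rate_pages = 203242   # "Dirty rate" in pages/s
page_size = 4096            # "Page size" in bytes

dirty_bytes_per_s = dirty_rate_pages * page_size
dirty_gbit_per_s = dirty_bytes_per_s * 8 / 1e9

# This guest dirties ~6.7 Gbit/s, so a 1G link could never converge here,
# while a 10G link leaves only modest headroom.
print(f"Guest dirties ~{dirty_gbit_per_s:.1f} Gbit/s")
```

This is the kind of calculation a docs table or formula could present; the caveat, as noted above, is that the dirty rate fluctuates constantly, so any single sample is only indicative.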

Comment 3 Steve Goodman 2021-11-25 12:23:40 UTC
Martin, please give some input here, either on the idea for a tool (comment 0) or on providing guidance in the Docs (a table?) about estimating the required network bandwidth for successful live migration of VMs.

Comment 4 Martin Perina 2021-11-25 13:04:32 UTC
Ales/Arik, any recommendations?

Comment 5 Arik 2021-11-28 20:57:42 UTC
For post-copy I think it doesn't matter - it depends on the performance degradation the user is willing to accept. Unless the channel disconnects, the migration should complete successfully.

For pre-copy, there are two relevant phases:
1. To reach the watermark in which we pause the guest
2. To copy the remaining dirty pages while the guest is paused

I don't think we should document example values for which the migration would complete successfully, but maybe we can provide more data, e.g., how long we wait to reach the watermark in the former phase, or how long we allow the guest to be paused in the latter phase (e.g., with the minimal-downtime policy it's 500ms). This would assist users in setting the bandwidth while also taking into consideration things like network latency, the number of migrations that will happen simultaneously on that network, and soon also the number of connections.
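To illustrate the second phase with made-up numbers (the 500ms downtime budget for the minimal-downtime policy, and an assumed amount of memory still dirty at pause time), a minimal sketch:

```python
# Illustrative lower bound for the stop-and-copy phase: the memory still
# dirty when the guest is paused must cross the link within the allowed
# downtime. Both input values below are example assumptions.

MIB = 1024 * 1024

remaining_dirty = 27.680 * MIB   # bytes left to copy at pause (example)
downtime_budget = 0.5            # seconds (minimal-downtime policy: 500ms)

min_bytes_per_s = remaining_dirty / downtime_budget
min_gbit_per_s = min_bytes_per_s * 8 / 1e9

print(f"Stop-and-copy needs at least ~{min_gbit_per_s:.2f} Gbit/s")
```

In practice the first phase usually dominates the bandwidth requirement; this bound covers only the final copy and ignores latency and concurrent migrations.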

I wouldn't go with a tool; as Germano wrote in comment 2, it's hard to retrieve those values, and if the tool is expected to apply some calculation to properties specified by the user, then I think it will be simpler to write a formula somewhere.

Milan, what do you think?

Comment 6 Milan Zamazal 2021-11-29 09:00:16 UTC
I agree. It's hard to predict the required bandwidth, but there are things that can be considered. Several factors are involved, some of them dynamic in nature -- besides the dirty rate (which is also dependent on auto-converge!), there is the question of whether a single CPU can saturate the available network bandwidth (we are working on implementing multiple migration connections, which allow using more CPU power for migrations and, at the same time, make predictions even more complicated). There can be theoretical formulas, but I'm afraid that in the end the customer must always test what works best in the given environment. Helping customers understand how things work and what they can realistically expect from various tweaks, so they can make informed decisions, looks like a better idea than providing a tool that's likely to be little more than a gimmick.

BTW, I think there are some efforts on the platform to provide dirty rate information for a VM even when it's not migrating, which could be useful in this context.

Comment 18 Klaas Demter 2023-08-01 07:37:07 UTC
Just for context, because I came back here to look at the solution again and I'm guessing it's hidden in a private comment:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/migrating-virtual-machines_configuring-and-managing-virtualization#migrating-a-virtual-machine-using-the-cli_migrating-virtual-machines

# virsh domdirtyrate-calc vm-name 30
# virsh domstats vm-name --dirtyrate
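A hedged sketch of how such a measurement could feed a link-sizing estimate. The helper below is hypothetical, not part of libvirt; it assumes a dirty rate measured in MB/s (recent libvirt reports a megabytes-per-second figure for `domstats --dirtyrate`, but check your version), and the 2x headroom factor is an arbitrary assumption:

```python
# Hypothetical helper, not part of libvirt: turn a measured dirty rate
# (e.g. the megabytes-per-second figure reported by
# `virsh domstats vm-name --dirtyrate` on recent libvirt) into a rough
# minimum link speed. The 2x headroom factor is an arbitrary assumption.

def suggested_link_gbit(dirty_rate_mb_s: float, headroom: float = 2.0) -> float:
    """Rough minimum link speed in Gbit/s for the given dirty rate."""
    return dirty_rate_mb_s * 1e6 * 8 / 1e9 * headroom

# A guest dirtying 800 MB/s suggests more than a 10G link:
print(suggested_link_gbit(800))   # -> 12.8
```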

Comment 19 Jiri Herrmann 2023-08-01 11:16:31 UTC
That is correct, thank you for the clarification, Klaas!