Bug 1686321

Summary:	The speed doesn't take effect if set postcopy speed limit during post-copy phase
Product:	Red Hat Enterprise Linux Advanced Virtualization	Reporter:	Fangge Jin <fjin>
Component:	qemu-kvm	Assignee:	Dr. David Alan Gilbert <dgilbert>
Status:	CLOSED ERRATA	QA Contact:	Li Xiaohui <xiaohli>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	8.0	CC:	ddepaula, dgilbert, fjin, jinzhao, juzhang, knoel, rbalakri, virt-maint, yuhuang
Target Milestone:	rc	Flags:	knoel: mirror+
Target Release:	8.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	qemu-kvm-4.0.0-3.module+el8.1.0+3265+26c4ed71	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-06 07:13:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Fangge Jin 2019-03-07 09:27:10 UTC

Description of problem:
The postcopy speed limit does not take effect if it is set during the post-copy phase.
If I set it during the pre-copy phase or before the migration starts, it does take effect after switching to post-copy mode.

Version-Release number of selected component (if applicable):
qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64
libvirt-5.0.0-6.module+el8+2860+4e0fe96a.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM and run a stress load inside it.

2. Start a live migration with postcopy enabled and a postcopy bandwidth limit of 10 MiB/s:
# virsh migrate avocado-vt-vm1 qemu+ssh://hp-ml150gen9-01.rhts.eng.bos.redhat.com/system --live --verbose --p2p  --persistent --postcopy --postcopy-bandwidth 10

3. Switch to postcopy mode:
# virsh migrate-postcopy avocado-vt-vm1

4. Get domain job info:
# virsh domjobinfo avocado-vt-vm1
Job type:         Unbounded   
Operation:        Outgoing migration
Time elapsed:     14139        ms
Data processed:   25.847 GiB
Data remaining:   467.805 MiB
Data total:       1.005 GiB
Memory processed: 25.847 GiB
Memory remaining: 467.805 MiB
Memory total:     1.005 GiB
Memory bandwidth: 10.020 MiB/s        =====> around 10 MiB/s
Dirty rate:       28584        pages/s
Page size:        4096         bytes
Iteration:        58          
Postcopy requests: 23          
Constant pages:   152210      
Normal pages:     6762014     
Normal data:      25.795 GiB
Expected downtime: 47792        ms
Setup time:       8            ms

5. Lower the postcopy speed limit to 5 MiB/s:
# virsh migrate-setspeed avocado-vt-vm1 5 --postcopy

# virsh migrate-getspeed avocado-vt-vm1 --postcopy
5

6. Query the domain job info again:
# virsh domjobinfo avocado-vt-vm1
Job type:         Unbounded   
Operation:        Outgoing migration
Time elapsed:     14233        ms
Data processed:   25.848 GiB
Data remaining:   466.766 MiB
Data total:       1.005 GiB
Memory processed: 25.848 GiB
Memory remaining: 466.766 MiB
Memory total:     1.005 GiB
Memory bandwidth: 10.020 MiB/s        =====> still around 10 MiB/s
Dirty rate:       28584        pages/s
Page size:        4096         bytes
Iteration:        58          
Postcopy requests: 25          
Constant pages:   152220      
Normal pages:     6762270     
Normal data:      25.796 GiB
Expected downtime: 47792        ms
Setup time:       8            ms

7. Wait for a while and query the domain job info again: the bandwidth is still around 10 MiB/s.

8. Try raising the speed limit to 20 MiB/s: it still has no effect.

Actual results:
As above

Expected results:
Setting the postcopy speed limit during the post-copy phase should take effect.

Additional info:

Comment 1 Dr. David Alan Gilbert 2019-03-07 09:53:40 UTC

Fangge:
  Can you please describe:
     a) The guest that you're running including the size of the VM (in GB RAM), the program it's running
     b) How you're triggering the postcopy switchover
     c) The network connection between the hosts
     d) how long the postcopy phase is?
     e) Which is the source and destination host machines you're using?

Comment 2 Fangge Jin 2019-03-07 10:13:31 UTC

(In reply to Dr. David Alan Gilbert from comment #1)
> Fangge:
>   Can you please describe:
>      a) The guest that you're running including the size of the VM (in GB
> RAM), the program it's running
The VM has 1048576 KiB (1 GiB) of RAM, and I run stress in it: # stress --cpu 8 --io 4 --vm 4 --vm-bytes 128M

>      b) How you're triggering the postcopy switchover
I use the virsh command "virsh migrate-postcopy $guest", which issues the QMP command migrate-start-postcopy.

>      c) The network connection between the hosts
The maximum network speed is 1000 Mb/s. The NIC details are as follows:
Src: 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 14:58:d0:d3:31:ab brd ff:ff:ff:ff:ff:ff
    inet 10.16.184.37/22 brd 10.16.187.255 scope global dynamic noprefixroute eno1
       valid_lft 73170sec preferred_lft 73170sec

Dest: 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a0:2b:b8:31:26:a4 brd ff:ff:ff:ff:ff:ff
    inet 10.16.65.242/21 brd 10.16.71.255 scope global dynamic noprefixroute eno1
       valid_lft 73352sec preferred_lft 73352sec


>      d) how long the postcopy phase is?
I didn't measure it, but the total migration time is:
# virsh domjobinfo avocado-vt-vm1  --completed
Job type:         Completed   
Operation:        Outgoing migration
Time elapsed:     72105        ms         ==> total time
Time elapsed w/o network: 72097        ms
Data processed:   3.133 GiB
Data remaining:   0.000 B
Data total:       1.005 GiB
Memory processed: 3.133 GiB
Memory remaining: 0.000 B
Memory total:     1.005 GiB
Memory bandwidth: 44.819 MiB/s
Dirty rate:       0            pages/s
Page size:        4096         bytes
Iteration:        7           
Postcopy requests: 594         
Constant pages:   48095       
Normal pages:     819479      
Normal data:      3.126 GiB
Total downtime:   275          ms
Downtime w/o network: 267          ms
Setup time:       28           ms

The post-copy phase was therefore roughly 72105 - 14139 = 57966 ms, although this is only a rough estimate.

>      e) Which is the source and destination host machines you're using?
Src: hp-dl120gen9-01.khw.lab.eng.bos.redhat.com
Dest: hp-ml150gen9-01.rhts.eng.bos.redhat.com

Comment 3 Dr. David Alan Gilbert 2019-03-07 20:23:39 UTC

I've checked the code: the postcopy bandwidth setting is only read at the point of switchover from precopy to postcopy.
Yes, we can fix that so that it can be changed during postcopy.

Comment 4 Dr. David Alan Gilbert 2019-03-07 20:24:17 UTC

We need to fix migrate_params_apply so that it does not call qemu_file_set_rate_limit for max_bandwidth while postcopy is active, and instead calls it in the max_postcopy_bandwidth case when postcopy is active.

Comment 5 Dr. David Alan Gilbert 2019-03-08 10:12:44 UTC

Posted upstream:
   migration/postcopy: Update the bandwidth during postcopy

Comment 6 Dr. David Alan Gilbert 2019-03-26 09:26:34 UTC

Merged upstream as c38c1c142e64901b09f5; it will be in QEMU 4.0.

Comment 10 Li Xiaohui 2019-06-06 07:51:24 UTC

Hi all,
I verified this bug in environment [1], using test steps similar to Polarion case RHEL-150076 [2]. The issue is gone.

environment[1]
src and dst host info: kernel-modules-4.18.0-95.el8.x86_64 & qemu-img-4.0.0-3.module+el8.1.0+3265+26c4ed71.x86_64
guest info: kernel-4.18.0-100.el8.x86_64

polarion case[2]:
https://polarion.engineering.redhat.com/polarion/#/project/RedHatEnterpriseLinux7/workitem?id=RHEL-150076


Best regards,
Li Xiaohui

Comment 12 errata-xmlrpc 2019-11-06 07:13:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723