Bug 1749378 - postcopy migration does not honour speed limits after migrate pause and recovery, consumes entire bandwidth of NIC
Summary: postcopy migration does not honour speed limits after migrate pause and recovery, consumes entire bandwidth of NIC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-05 13:40 UTC by Li Xiaohui
Modified: 2020-12-20 09:02 UTC
CC: 7 users

Fixed In Version: qemu-kvm-4.2.0-1.module+el8.2.0+4793+b09dd2fb
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-05 09:49:41 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2020:2017 (last updated 2020-05-05 09:50:27 UTC)

Description Li Xiaohui 2019-09-05 13:40:55 UTC
Description of problem:
After migrate_pause during the postcopy phase followed by migrate recovery, the postcopy migration consumes the entire bandwidth of the NIC and no longer honours the speed limit.


Version-Release number of selected component (if applicable):
src&dst host: kernel-4.18.0-138.el8.x86_64 & qemu-kvm-4.1.0-6.module+el8.1.0+4164+854d66f5.x86_64
guest info: kernel-4.18.0-141.el8.x86_64


How reproducible:
100%


Steps to Reproduce:
1. boot the guest on the src and dst hosts (the guest on the dst host is started with "-incoming tcp:0:4444")
2. enable postcopy mode on both src and dst hosts, and set max-postcopy-bandwidth on src:
(1)src hmp:
(qemu) migrate_set_capability postcopy-ram on      
(qemu) migrate_set_parameter max-postcopy-bandwidth 5M
(qemu) info migrate_parameters 
...
max-bandwidth: 33554432 bytes/second
downtime-limit: 300 milliseconds
x-checkpoint-delay: 20000
block-incremental: off
multifd-channels: 2
xbzrle-cache-size: 67108864
max-postcopy-bandwidth: 5242880
 tls-authz: '(null)'
(2)dst hmp:
(qemu) migrate_set_capability postcopy-ram on 
3. start postcopy migration and then pause it on the src host:
(qemu) migrate_start_postcopy 
(qemu) info migrate
...
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: postcopy-active
total time: 11569 milliseconds
expected downtime: 317927 milliseconds
setup: 18 milliseconds
transferred ram: 297187 kbytes
throughput: 42.04 mbps                --> the real-time throughput is correct (5 MiB/s = 5,242,880 bytes/s ≈ 41.9 mbps)
remaining ram: 1611264 kbytes
total ram: 4211528 kbytes
duplicate: 581332 pages
skipped: 0 pages
normal: 72877 pages
normal bytes: 291508 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 1440
dirty pages rate: 123652 pages
postcopy request count: 483
(qemu) migrate_pause 
4. recover the postcopy migration
(1)dst host
(qemu) migrate_recover tcp:10.66.8.208:4444
(2)src host
(qemu) migrate -r tcp:10.66.8.208:4444
5. check the migration status after step 4 (a scripted QMP version of these steps is sketched below)


Actual results:
after step 5, the real-time throughput consumes the entire bandwidth of the NIC and does not honour the max-postcopy-bandwidth limit:
(qemu) info migrate
..
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: postcopy-active
total time: 199394 milliseconds
expected downtime: 14424 milliseconds
setup: 18 milliseconds
transferred ram: 562338 kbytes
throughput: 926.57 mbps                                 --> the real-time throughput consumes the entire bandwidth of the NIC
remaining ram: 1324472 kbytes
total ram: 4211528 kbytes
duplicate: 586883 pages
skipped: 0 pages
normal: 139023 pages
normal bytes: 556092 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 29210
dirty pages rate: 123652 pages
postcopy request count: 893
(qemu) info migrate
...
total time: 200242 milliseconds
expected downtime: 14063 milliseconds
setup: 18 milliseconds
transferred ram: 658229 kbytes
throughput: 950.33 mbps                                 --> the real-time throughput consumes the entire bandwidth of the NIC
remaining ram: 1216024 kbytes
total ram: 4211528 kbytes
...
dirty pages rate: 123652 pages
postcopy request count: 1066 
(qemu) info migrate_parameters 
...
max-bandwidth: 33554432 bytes/second
downtime-limit: 300 milliseconds
x-checkpoint-delay: 20000
block-incremental: off
multifd-channels: 2
xbzrle-cache-size: 67108864
max-postcopy-bandwidth: 5242880                          --> the max-postcopy-bandwidth parameter is still set correctly
 tls-authz: '(null)'
(qemu) info migrate
...
total time: 203634 milliseconds
expected downtime: 14082 milliseconds
setup: 18 milliseconds
transferred ram: 1041486 kbytes
throughput: 949.07 mbps                                  --> the real-time throughput consumes the entire bandwidth of the NIC
remaining ram: 798336 kbytes  
total ram: 4211528 kbytes

dirty pages rate: 123652 pages
postcopy request count: 3066


Expected results:
after recovering the postcopy migration, the speed should honour the max-postcopy-bandwidth limit


Additional info:

Comment 1 Peter Xu 2019-09-06 13:20:29 UTC
Posted fix upstream.

https://lists.gnu.org/archive/html/qemu-devel/2019-09/msg01141.html

Comment 5 Ademar Reis 2020-02-05 23:05:04 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 6 Li Xiaohui 2020-02-14 09:18:03 UTC
Verified this bz on hosts (kernel-4.18.0-177.el8.x86_64 & qemu-kvm-4.2.0-9.module+el8.2.0+5699+b5331ee5.x86_64) with the test steps from Comment 0. After recovery the throughput stays around 42 mbps (≈ 5 MiB/s), so the test result is good and this bz is moved to verified:
 

(qemu) migrate -r tcp:10.73.33.186:5555  
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: postcopy-active
total time: 676350 milliseconds
expected downtime: 1505928 milliseconds
setup: 51 milliseconds
transferred ram: 646343 kbytes
throughput: 41.94 mbps
remaining ram: 5343656 kbytes
total ram: 8405832 kbytes
duplicate: 1407747 pages
skipped: 0 pages
normal: 158183 pages
normal bytes: 632732 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 359590
dirty pages rate: 221440 pages
postcopy request count: 803
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: postcopy-active
total time: 677277 milliseconds
expected downtime: 1498033 milliseconds
setup: 51 milliseconds
transferred ram: 650963 kbytes
throughput: 42.17 mbps
remaining ram: 4656988 kbytes
total ram: 8405832 kbytes
duplicate: 1578636 pages
skipped: 0 pages
normal: 158961 pages
normal bytes: 635844 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 3210
dirty pages rate: 221440 pages
postcopy request count: 803

Comment 8 errata-xmlrpc 2020-05-05 09:49:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017

