1289285 – Live Migration dynamic cpu throttling for auto-convergence (qemu-kvm-rhev)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Juan Quintela
QA Contact:	Qianqian Zhu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	migration_improvements 1288337 1289288 1289290 1289291 1305606 1313485 1358141

Reported:	2015-12-07 20:15 UTC by Hai Huang
Modified:	2016-11-07 21:42 UTC (History)
CC List:	12 users (show)
Fixed In Version:	qemu-kvm-rhev-2.5.0-1.el7
Doc Type:	Enhancement
Doc Text:
Clone Of:
Clones:	1289288 1289290 1289291 (view as bug list)
Environment:
Last Closed:	2016-11-07 21:42:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:2673	0	normal	SHIPPED_LIVE	qemu-kvm-rhev bug fix and enhancement update	2016-11-08 01:06:13 UTC

Description Hai Huang 2015-12-07 20:15:06 UTC

Description of problem:

With extremely memory-write-intensive workloads, normal live migration never completes because the guest writes to memory faster than QEMU can transfer the changed pages to the destination system. In this case migration continues indefinitely, never making enough progress to stop the guest and proceed to the non-live "finishing up" phase of migration.

This feature provides a method for slowing down guest execution and with it, hopefully, the guest's memory write rate. As migration proceeds, auto-converge keeps increasing the amount of guest CPU throttling until the memory write rate drops far enough to allow the guest to be stopped and migration to finish.

As of QEMU 2.5, dynamic throttling has been added to auto-converge, dramatically increasing its effectiveness.
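
The escalation described above can be illustrated with a small model. This is only a sketch, not QEMU's implementation: the function names are hypothetical, the 99% ceiling is an assumption, and the guest's dirty rate is assumed to fall linearly with the throttle percentage, which real workloads need not do.

```python
def throttle_schedule(initial, increment, max_pct=99):
    """Yield successive throttle percentages, starting at `initial` and
    growing by `increment`, clamped to `max_pct` (assumed ceiling)."""
    pct = initial
    while True:
        yield min(pct, max_pct)
        if pct >= max_pct:
            return
        pct += increment


def first_converging_throttle(dirty_mb_s, bandwidth_mb_s, initial, increment):
    """Return the first throttle percentage at which the modelled dirty rate
    drops below the migration bandwidth.

    Returns 0 if no throttling is needed and None if even the maximum
    throttle is not enough. Assumes dirty rate scales with (1 - pct/100).
    """
    if dirty_mb_s < bandwidth_mb_s:
        return 0
    for pct in throttle_schedule(initial, increment):
        if dirty_mb_s * (1 - pct / 100) < bandwidth_mb_s:
            return pct
    return None


if __name__ == "__main__":
    # Hypothetical guest dirtying 400 MB/s over a 100 MB/s migration link.
    print(first_converging_throttle(400, 100, initial=20, increment=10))
```

In this model a guest that already dirties memory more slowly than the link can transfer is never throttled, while a faster one is throttled in steps until the modelled dirty rate falls below the bandwidth.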

This feature will be available in RHEL 7.3 qemu-kvm-rhev with the rebase to QEMU 2.5.

The qemu feature page can be found in:
http://wiki.qemu.org/Features/AutoconvergeLiveMigration


Version-Release number of selected component (if applicable):

  qemu-kvm-rhev


How reproducible:
Always.


Steps to Reproduce:
Please refer to the qemu feature page above.


Actual results:
Live migration fails to complete because of the high page dirty rate
(i.e. intensive memory writes).


Expected results:
Live migration completes successfully.


Additional info:

Comment 4 Qianqian Zhu 2016-08-09 07:51:15 UTC

Test with:
host:
hp-dl585g7-05.lab.eng.pek2.redhat.com
hp-dl585g7-04.lab.eng.pek2.redhat.com
NIC Speed: 1000Mb/s

Packages:
qemu-kvm-rhev-2.6.0-19.el7.x86_64
kernel-3.10.0-478.el7.x86_64

Test matrix:
migrate_set_speed 100M
migrate_set_downtime 0.5
Mem stress: stressapptest -M {30, 50, 60, 100}
migrate_set_parameter cpu-throttle-initial {20, 30}
migrate_set_parameter cpu-throttle-increment {10, 5}

Steps:
1. Launch guest on both src and dest:
/usr/libexec/qemu-kvm -name linux -cpu Opteron_G5 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1  -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/nfsmnt/RHEL-Server-7.3-64-virtio-scsi.raw,if=none,id=scsi0,format=raw  -device virtio-scsi-pci,id=scsi0 -device scsi-disk,drive=scsi0,scsi-id=0,lun=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -spice port=5901,disable-ticketing -vga qxl -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x4 -monitor unix:/tmp/hmpmonitor,server,nowait

2. Enable auto-converge and set its parameters:
migrate_set_capability auto-converge on
migrate_set_parameter cpu-throttle-initial $INITIAL
migrate_set_parameter cpu-throttle-increment $INCREMENT

3. Stress the guest with:
stressapptest -M $stress_mem -s 10000

4. Set speed and downtime:
migrate_set_speed 100M
migrate_set_downtime 0.5

5. Start migration:
migrate -d tcp:$DEST_HOST_IP:$DEST_HOST_PORT
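
Steps 2, 4 and 5 can be scripted against the HMP monitor socket from step 1. The following is a minimal sketch, not part of the original test: the function names, the destination address 192.0.2.10 and port 4444 are hypothetical placeholders, and monitor replies are not read or checked.

```python
import socket


def build_hmp_commands(initial, increment, dest_host, dest_port,
                       speed="100M", downtime=0.5):
    """Return the monitor commands from steps 2, 4 and 5, in order."""
    return [
        "migrate_set_capability auto-converge on",
        f"migrate_set_parameter cpu-throttle-initial {initial}",
        f"migrate_set_parameter cpu-throttle-increment {increment}",
        f"migrate_set_speed {speed}",
        f"migrate_set_downtime {downtime}",
        f"migrate -d tcp:{dest_host}:{dest_port}",
    ]


def send_hmp(commands, path="/tmp/hmpmonitor"):
    """Write each command, newline-terminated, to the HMP unix socket.

    The path matches the -monitor option in step 1. Replies are ignored,
    so errors reported by the monitor go unnoticed in this sketch.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        for cmd in commands:
            sock.sendall((cmd + "\n").encode())


if __name__ == "__main__":
    # Placeholder destination; print instead of sending to a live guest.
    for line in build_hmp_commands(20, 10, "192.0.2.10", 4444):
        print(line)
```

Calling send_hmp() with the generated list on the source host would start the migration; step 3 still has to be run inside the guest beforehand.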

Result:
First, the CPU throttling percentage starts at $INITIAL and is increased by $INCREMENT until migration can finish.

With 30M memory stress, normal migration finishes by itself, and auto-converge never kicks in.

With 50M memory stress, normal migration finishes by itself, but with auto-converge on, total migration time drops from 51665 ms to 17419 ms, and the CPU throttling percentage ends up at 40.

With 60M memory stress, normal migration cannot finish; with auto-converge on, migration finishes once the CPU throttling percentage reaches 60-80.

With 100M memory stress, normal migration cannot finish; with auto-converge on, migration finishes once the CPU throttling percentage reaches 90.

Detailed data:
Stress	Autoconverge	cpu throttle initial	cpu throttle increment	cpu throttle percentage final	total time (ms)	transferred ram (kbytes)	speed	downtime (ms)	dirty sync counts	Guest's CPU usage Avg
30M	off	-	-	-	11822	1135558	93.8MB/s	905	18	97.58%
	on	20	10	0	10697	1030037	94MB/s	776	11	96.71%
50M	off	-	-	-	51665	5254388	99.3MB/s	505	278	95.89%
	on	20	10	40	17419	1762094	98.8MB/s	395	57	100%
60M	off	-	-	-	Unable to finish	-	-	-	-	97.73%
	on	20	10	80	34539	3509805	99.2MB/s	382	82	98.45%
	on	30	5	60	36549	3712110	99.2MB/s	398	74	100%
100M	off	-	-	-	Unable to finish	-	-	-	-	100%
	on	20	10	90	39724	4028026	99MB/s	568	35	98.91%
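
The improvement for the runs that finish both with and without auto-converge can be re-derived from the table. The sketch below copies the 30M and 50M rows as-is and computes off/on ratios; the dictionary layout and function name are just for illustration.

```python
# Values copied from the 30M and 50M rows of the table above.
RESULTS = {
    "30M": {
        "off": {"total_time": 11822, "transferred_ram": 1135558, "dirty_syncs": 18},
        "on": {"total_time": 10697, "transferred_ram": 1030037, "dirty_syncs": 11},
    },
    "50M": {
        "off": {"total_time": 51665, "transferred_ram": 5254388, "dirty_syncs": 278},
        "on": {"total_time": 17419, "transferred_ram": 1762094, "dirty_syncs": 57},
    },
}


def improvement(stress, field):
    """Ratio of the auto-converge-off value to the auto-converge-on value."""
    row = RESULTS[stress]
    return row["off"][field] / row["on"][field]


if __name__ == "__main__":
    for stress in RESULTS:
        for field in ("total_time", "transferred_ram", "dirty_syncs"):
            print(f"{stress} {field}: {improvement(stress, field):.2f}x")
```

For the 50M case this gives a reduction of roughly 3x in both total time and transferred RAM, while the 30M case, where throttling never started, changes only slightly.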

Comment 5 Qianqian Zhu 2016-08-09 07:53:11 UTC

Hi Hai,

Do you think we can verify this bug based on the above results?

Thanks,
Qianqian

Comment 6 Hai Huang 2016-08-09 13:48:14 UTC

Yes, the test results (with 50M, 60M, and 100M memory stress)
look good.

Comment 7 Qianqian Zhu 2016-08-10 01:53:40 UTC

Moving to verified as per comment 4 and comment 6.

Comment 9 errata-xmlrpc 2016-11-07 21:42:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html
