Description of problem:
With extremely write-intensive memory workloads, normal live migration never completes because the guest dirties memory faster than qemu can transfer the changes to the destination system. Migration then continues indefinitely, never making enough progress to stop the guest and proceed to the non-live "finishing up" phase of migration.
The auto-converge feature provides a way to slow down guest execution and, with it, the guest's memory write rate. As migration proceeds, auto-converge continually increases the amount of guest CPU throttling until the memory write rate drops enough for the guest to be stopped and migration to finish.
As of qemu 2.5, auto-converge uses dynamic throttling, which dramatically increases its effectiveness.
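As a rough illustration (the exact trigger heuristic is internal to qemu and may change between versions): with cpu-throttle-initial=20 and cpu-throttle-increment=10, the guest vCPUs are first throttled to 20%, then 30%, 40% and so on, each time qemu judges the guest is still dirtying memory faster than it can be transferred, until migration converges or the throttle cap is reached. The current level can be checked from the HMP monitor while migration is running, e.g.:
(qemu) info migrate_capabilities   -> should show "auto-converge: on"
(qemu) info migrate                -> while RAM is being copied, the output gains a
                                      "cpu throttle percentage: <N>" line once throttling
                                      has started (field name as printed by qemu 2.5/2.6;
                                      it may differ in other versions)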
This feature will be available in RHEL 7.3 qemu-kvm-rhev with the rebase to qemu 2.5.
The qemu feature page can be found at:
http://wiki.qemu.org/Features/AutoconvergeLiveMigration
Version-Release number of selected component (if applicable):
qemu-kvm-rhev
How reproducible:
Always.
Steps to Reproduce:
Please refer to the qemu feature page above.
Actual results:
Live migration fails due to high page dirty rate
(i.e. intensive memory writes).
Expected results:
Live migration completes successfully.
Additional info:
Test with:
host:
hp-dl585g7-05.lab.eng.pek2.redhat.com
hp-dl585g7-04.lab.eng.pek2.redhat.com
NIC Speed: 1000Mb/s
Packages:
qemu-kvm-rhev-2.6.0-19.el7.x86_64
kernel-3.10.0-478.el7.x86_64
Test matrix:
migrate_set_speed 100M
migrate_set_downtime 0.5
Mem stress: stressapptest -M {30, 50, 60, 100}
migrate_set_parameter cpu-throttle-initial {20, 30}
migrate_set_parameter cpu-throttle-increment {30, 5}
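For reference, stressapptest's -M flag selects roughly how many megabytes of memory to exercise and -s how many seconds to run, so a single cell of the matrix above, e.g. stress 100 / initial 20 / increment 10, translates into the following commands (the values are simply the ones from the matrix; nothing new is introduced):
stressapptest -M 100 -s 10000
migrate_set_parameter cpu-throttle-initial 20
migrate_set_parameter cpu-throttle-increment 10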
Steps:
1. Launch the guest on both src and dst hosts:
/usr/libexec/qemu-kvm -name linux -cpu Opteron_G5 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/nfsmnt/RHEL-Server-7.3-64-virtio-scsi.raw,if=none,id=scsi0,format=raw -device virtio-scsi-pci,id=scsi0 -device scsi-disk,drive=scsi0,scsi-id=0,lun=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -spice port=5901,disable-ticketing -vga qxl -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x4 -monitor unix:/tmp/hmpmonitor,server,nowait
2. Set the auto-converge capability and throttle parameters, and check that auto-converge is on (see the verification sketch after step 5):
migrate_set_capability auto-converge on
migrate_set_parameter cpu-throttle-initial $INITIAL
migrate_set_parameter cpu-throttle-increment $INCREMENT
3. Stress guest with:
stressapptest -M $stress_mem -s 10000
4. Set speed and downtime:
migrate_set_speed 100M
migrate_set_downtime 0.5
5. Start migration:
migrate -d tcp:$DEST_HOST_IP:$DEST_HOST_PORT
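A sketch of how the settings from step 2 can be verified and how progress can be watched during step 5; the virsh lines assume the guest is libvirt-managed under the domain name "linux" (matching the -name option above), which is an assumption and not part of the original test steps:
(qemu) info migrate_capabilities   -> confirms "auto-converge: on"
(qemu) info migrate_parameters     -> shows cpu-throttle-initial / cpu-throttle-increment
(qemu) info migrate                -> repeated during migration; reports transferred ram,
                                      dirty pages rate and the current cpu throttle percentage
Rough libvirt equivalents for a managed guest (adjust URIs and domain names as needed):
virsh migrate --live --auto-converge linux qemu+ssh://$DEST_HOST_IP/system
virsh qemu-monitor-command linux --hmp 'migrate_set_parameter cpu-throttle-initial 20'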
Result:
First, the cpu throttling percentage starts from $INITIAL and is increased by $INCREMENT until migration can finish.
With 30M memory stress, normal migration finishes by itself, and auto-converge never kicks in (throttle stays at 0).
With 50M memory stress, normal migration also finishes by itself, but with auto-converge on the total migration time drops from 51665 ms to 17419 ms, with the cpu throttling percentage finally reaching 40%.
With 60M memory stress, normal migration cannot finish; with auto-converge on, migration finishes once the cpu throttling percentage reaches 60-80%.
With 100M memory stress, normal migration cannot finish; with auto-converge on, migration finishes once the cpu throttling percentage reaches 90%.
Detailed data:
Stress | Auto-converge | cpu-throttle-initial | cpu-throttle-increment | final cpu throttle % | total time (ms)  | transferred ram (KB) | speed    | downtime (ms) | dirty sync count | guest CPU usage avg
30M    | off           | -                    | -                      | -                    | 11822            | 1135558              | 93.8MB/s | 905           | 18               | 97.58%
30M    | on            | 20                   | 10                     | 0                    | 10697            | 1030037              | 94MB/s   | 776           | 11               | 96.71%
50M    | off           | -                    | -                      | -                    | 51665            | 5254388              | 99.3MB/s | 505           | 278              | 95.89%
50M    | on            | 20                   | 10                     | 40                   | 17419            | 1762094              | 98.8MB/s | 395           | 57               | 100%
60M    | off           | -                    | -                      | -                    | Unable to finish | -                    | -        | -             | -                | 97.73%
60M    | on            | 20                   | 10                     | 80                   | 34539            | 3509805              | 99.2MB/s | 382           | 82               | 98.45%
60M    | on            | 30                   | 5                      | 60                   | 36549            | 3712110              | 99.2MB/s | 398           | 74               | 100%
100M   | off           | -                    | -                      | -                    | Unable to finish | -                    | -        | -             | -                | 100%
100M   | on            | 20                   | 10                     | 90                   | 39724            | 4028026              | 99MB/s   | 568           | 35               | 98.91%
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-2673.html