Description of problem:
When the guest is under heavy memory stress from STRESS, migration (either locally or to a remote host) never completes because pages are dirtied faster than they can be transferred. Sending the migrate_cancel command at this point causes the guest to hang.
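For context, here is a compilable toy model of the non-convergence (all numbers are made up for illustration; this is not qemu code):

#include <stdio.h>

int main(void)
{
    long dirty = 2L * 1024 * 1024;          /* ~8G guest at 4K pages */
    const long sent_per_pass = 50000;       /* pages the link can move per pass */
    const long redirtied_per_pass = 60000;  /* pages STRESS redirties per pass */
    int pass;

    for (pass = 1; pass <= 5; pass++) {
        dirty = dirty - sent_per_pass + redirtied_per_pass;
        printf("pass %d: %ld dirty pages remain\n", pass, dirty);
    }
    /* dirty grows every pass, so the stop-and-copy threshold is never
     * reached and "migrate -d" iterates forever. */
    return 0;
}

Whenever redirtied_per_pass >= sent_per_pass, the dirty set can never drain, which matches the never-ending migration described above.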
Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64
How reproducible:
100%
Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/
Steps to Reproduce:
1. Start the src VM with 4 vCPUs and 8G of memory
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
2. Start the dest VM waiting for an incoming migration
... -incoming tcp:$dest_ip:5830 -name dest
3. Inside the src guest, run STRESS as follows:
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose
4. Migrate src to dest (the migration never completes)
(qemu) migrate -d tcp:$dest_ip:5830
5. Cancel migration
(qemu) migrate_cancel
Actual results:
The guest hung with excessive CPU usage:
PID   USER  PR  NI  VIRT   RES   SHR   S  %CPU   %MEM  TIME+     COMMAND
64597 root  20   0  8724m  8.1g   3104 S  400.3   7.3  53:51.78  qemu-kvm
Expected results:
Migration can be cancelled successfully, and the guest continues running from where it was.
Additional info:
With a breakpoint set at migration.c:do_migrate_cancel, if I continue the process after the breakpoint is hit, migrate_cancel succeeds and the guest does not hang.
Could this be caused by a race condition?
This seems to be related to enabling and disabling dirty tracking. I added these monitor commands for testing:
dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)
Toggling this on and off produces a similar hang, at a similar frequency, to starting and cancelling migration. I also found that if we don't disable dirty tracking when ram_save_live is called with -1, I can't reproduce the failure. Still investigating.
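A compilable toy of the suspected interleaving (everything here except the name cpu_physical_memory_set_dirty_tracking() is hypothetical and greatly simplified; this is not qemu source). The gdb breakpoint in the earlier comment effectively delays the cancel path until the current save pass has finished, which is consistent with the race reading:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int dirty_tracking;       /* stands in for KVM's dirty-log state */
static volatile int migration_active = 1;

static void cpu_physical_memory_set_dirty_tracking(int enable)
{
    dirty_tracking = enable;              /* the real call ioctls into KVM */
}

static void *save_pass(void *arg)         /* iothread: one pre-copy pass */
{
    (void)arg;
    while (migration_active) {
        if (!dirty_tracking) {
            /* Tracking vanished mid-pass; in the real bug this is
             * where the guest hangs with the vcpus spinning at the
             * ~400% CPU seen in top. */
            printf("dirty tracking lost mid-iteration\n");
            return NULL;
        }
        usleep(1000);                     /* pretend to send one dirty page */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_physical_memory_set_dirty_tracking(1);
    pthread_create(&t, NULL, save_pass, NULL);
    usleep(5000);
    /* migrate_cancel path: tearing tracking down before the pass has
     * stopped is the suspect ordering. */
    cpu_physical_memory_set_dirty_tracking(0);
    migration_active = 0;
    pthread_join(t, NULL);
    return 0;
}

If the ordering in main() is reversed (stop the pass first, then disable tracking), the message never prints, which lines up with the observation that delaying the cancel path in gdb avoids the hang.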
Created attachment 469450: kvm_trace log
In this trace, the guest starts out hung. I've found that it can become un-hung by restarting the migration or, as in this case, by re-issuing cpu_physical_memory_set_dirty_tracking(1). The guest then starts working again, but after a few more cycles of toggling dirty tracking, it hangs again.
Reproduced on kernel-2.6.32-94.el6.
Verified on kernel-2.6.32-113.el6.
Following the steps in comment #0, the guest does not hang after migrate_cancel.
This issue has been fixed.
Based on the above, changing status to VERIFIED.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0542.html