Red Hat Bugzilla – Bug 634100
migrate_cancel under STRESS caused guest to hang
Last modified: 2013-01-09 18:08:29 EST
Description of problem:
When a guest is under high memory stress (using the STRESS tool), migration (either local or to a remote host) never completes because pages are dirtied faster than they can be transferred. Issuing the migrate_cancel command at this point causes the guest to hang.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start the source VM with 4 vCPUs and 8G of memory:
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
2. Start the destination VM waiting for migration:
... -incoming tcp:$dest_ip:5830 -name dest
3. Inside the source guest, run STRESS:
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose
4. Migrate the source to the destination (the migration never completes):
(qemu) migrate -d tcp:$dest_ip:5830
5. Cancel the migration:
(qemu) migrate_cancel

Actual results:
The guest hangs with excessive CPU usage:
PID   USER  PR NI VIRT  RES  SHR  S %CPU  %MEM TIME+    COMMAND
64597 root  20 0  8724m 8.1g 3104 S 400.3 7.3  53:51.78 qemu-kvm

Expected results:
The migration is cancelled successfully and the guest continues to run from where it was.
Additional info: If I set a breakpoint at migration.c:do_migrate_cancel, then continue the process after the breakpoint is hit, migrate_cancel succeeds and the guest does not hang. Could this be caused by a race condition?
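The breakpoint experiment above can be scripted so it is repeatable. A minimal sketch (assumptions: qemu-kvm was built with debug symbols, and its PID is available in $QEMU_PID):

```shell
# Hedged sketch of the breakpoint experiment described above.
# Assumes a debuginfo build of qemu-kvm; $QEMU_PID is an assumption.
cat > /tmp/migrate_cancel.gdb <<'EOF'
break do_migrate_cancel
commands
  continue
end
EOF
echo 'Attach with: gdb -p "$QEMU_PID" -x /tmp/migrate_cancel.gdb'
```

With this attached, issuing migrate_cancel in the monitor should hit the breakpoint and immediately continue, which in my testing was enough to avoid the hang.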
This seems to be related to enabling and disabling dirty tracking. I added these monitor commands for testing:
dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)
Toggling these on and off produces a similar hang, with a similar frequency, to starting and stopping migration. I also found that if we don't disable dirty tracking when ram_save_live is called with stage -1, I can't trigger the failure. Still investigating.
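The toggle test described above can be driven non-interactively, roughly as follows. This is a sketch, not the exact harness I used: it assumes the debug monitor commands above are compiled in, that the monitor is exposed on a unix socket whose path is in $MON_SOCK (the original run used -monitor stdio), and that socat is available. When $MON_SOCK is unset it is a dry run that only records the commands it would send.

```shell
#!/bin/sh
# Sketch of the dirty-tracking toggle test. MON_SOCK, socat, and the
# dirtytrack/nodirtytrack commands (from the debug patch) are assumptions.
LOG=/tmp/toggle.log
: > "$LOG"
send_cmd() {
    echo "$1" >> "$LOG"        # record every command we (would) send
    if [ -n "$MON_SOCK" ]; then
        echo "$1" | socat - "UNIX-CONNECT:$MON_SOCK"
    fi
}
i=1
while [ "$i" -le 10 ]; do
    send_cmd dirtytrack        # cpu_physical_memory_set_dirty_tracking(1)
    send_cmd nodirtytrack      # cpu_physical_memory_set_dirty_tracking(0)
    # in a real run, sleep between toggles to let the guest make progress
    i=$((i + 1))
done
```

In my testing a loop like this reproduces the hang without ever starting a migration, which is what points at dirty tracking rather than the migration code itself.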
Created attachment 469450 [details] kvm_trace log In this trace, the guest starts out hung. I've found that it can be un-hung by restarting the migration or, as in this case, by re-issuing cpu_physical_memory_set_dirty_tracking(1). The guest then starts working again, but after a few more cycles of toggling dirty tracking, it hangs again.
Created attachment 469451 [details] dirty tracking patch Crude debug patch that allows toggling dirty tracking independently of migration.
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.
Marcelo has posted a patch upstream that seems to resolve this in my testing. Reassigning.
Patch(es) available on kernel-2.6.32-112.el6
Reproduced on kernel-2.6.32-94.el6. Verified on kernel-2.6.32-113.el6: following the steps in comment #0, the guest does not hang after migrate_cancel. This issue has been fixed. Based on the above, changing status to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html