Bug 634100 - migrate_cancel under STRESS caused guest to hang
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.1
Assigned To: Marcelo Tosatti
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: Rhel6KvmTier1

Reported: 2010-09-15 04:52 EDT by Keqin Hong
Modified: 2013-01-09 18:08 EST
CC List: 12 users

See Also:
Fixed In Version: kernel-2.6.32-112.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-23 16:52:23 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
kvm_trace log (16.00 KB, text/plain), 2010-12-17 16:27 EST, Alex Williamson
dirty tracking patch (2.22 KB, patch), 2010-12-17 16:29 EST, Alex Williamson


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2011:0542
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update
Last Updated: 2011-05-19 07:58:07 EDT

Description Keqin Hong 2010-09-15 04:52:34 EDT
Description of problem:
When the guest is under high memory stress (using STRESS), migration (either locally or to a remote host) never completes because pages are dirtied faster than they are transferred. Sending the migrate_cancel command at this point causes the guest to hang.
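(Illustration only, with made-up rates rather than measurements from this setup: migration can only finish if the link moves dirty pages faster than STRESS produces them, so the remaining-pages estimate here never shrinks.)

/* Sketch with assumed numbers: why this live migration never converges. */
#include <stdio.h>

int main(void)
{
    double dirty_rate_mb = 2048.0; /* assumed: MB/s dirtied by 16 STRESS workers */
    double xfer_rate_mb  = 100.0;  /* assumed: MB/s moved over the migration link */

    if (dirty_rate_mb > xfer_rate_mb) {
        /* qemu keeps iterating over guest RAM and never reaches the final stage */
        printf("never converges: %.0f MB/s dirtied > %.0f MB/s transferred\n",
               dirty_rate_mb, xfer_rate_mb);
    }
    return 0;
}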

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

2. Start dest waiting for migration
... -incoming tcp:$dest_ip:5830 -name dest

3. Inside src guest, run STRESS as follows
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose

4. Migrate src to dest (the migration never completes)
(qemu) migrate -d tcp:$dest_ip:5830

5. Cancel migration
(qemu) migrate_cancel

Actual results:
Guest hung with excessive CPU usage:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
64597 root      20   0 8724m 8.1g 3104 S 400.3  7.3  53:51.78 qemu-kvm
Expected results:
Migration should be cancelled successfully and the guest should continue to run where it left off.

Additional info:
By setting a breakpoint at migration.c:do_migrate_cancel and continuing the process after the breakpoint is hit, migrate_cancel succeeds and the guest does not hang.

So is it caused by some race condition?
Comment 2 Alex Williamson 2010-12-16 18:12:14 EST
This seems to be related to enabling and disabling dirty tracking.  I added these monitor commands for testing:

dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)

Toggling this on and off seems to produce a similar hang with similar frequency to starting and stopping migration.  I also tested that if we don't disable dirty tracking when ram_save_live is called with -1, I can't cause a failure.  Still investigating.
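(Not the attached patch itself, just a rough sketch of what such debug monitor commands might look like in qemu-kvm 0.12; the handler names are made up, and only cpu_physical_memory_set_dirty_tracking() is the real function being exercised.)

/* Sketch: hypothetical monitor handlers that toggle dirty tracking on demand,
 * approximating the "dirtytrack"/"nodirtytrack" commands described above. */
static void do_dirtytrack(Monitor *mon, const QDict *qdict)
{
    /* same call migration makes when it starts syncing guest RAM */
    cpu_physical_memory_set_dirty_tracking(1);
    monitor_printf(mon, "dirty tracking enabled\n");
}

static void do_nodirtytrack(Monitor *mon, const QDict *qdict)
{
    /* same call the cancel/complete path makes when migration stops */
    cpu_physical_memory_set_dirty_tracking(0);
    monitor_printf(mon, "dirty tracking disabled\n");
}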
Comment 3 Alex Williamson 2010-12-17 16:27:09 EST
Created attachment 469450 [details]
kvm_trace log

In this trace, the guest starts hung.  I've found that it can get un-hung by restarting the migration or, as in this case, re-issuing cpu_physical_memory_set_dirty_tracking(1).  The guest then starts working again, but after a few more cycles of toggling dirty tracking, gets hung again.
Comment 4 Alex Williamson 2010-12-17 16:29:45 EST
Created attachment 469451 [details]
dirty tracking patch

Crude debug patch that allows toggling dirty tracking independently of migration.
Comment 5 Alex Williamson 2010-12-17 16:34:15 EST
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.
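(For context, a minimal sketch of the ioctl path involved: KVM turns dirty logging on or off when a memory slot is re-registered with or without KVM_MEM_LOG_DIRTY_PAGES via KVM_SET_USER_MEMORY_REGION. The slot number, guest physical address, and size below are placeholders, not the values qemu-kvm actually uses.)

/* Sketch: toggling dirty logging on one KVM memory slot from userspace.
 * Slot/address/size values are placeholders for illustration only. */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

int set_dirty_logging(int vm_fd, void *hva, int enable)
{
    struct kvm_userspace_memory_region region = {
        .slot            = 0,                        /* placeholder slot */
        .flags           = enable ? KVM_MEM_LOG_DIRTY_PAGES : 0,
        .guest_phys_addr = 0,                        /* placeholder GPA */
        .memory_size     = 1ULL << 30,               /* placeholder: 1G slot */
        .userspace_addr  = (uint64_t)(uintptr_t)hva,
    };

    /* Re-registering the slot with new flags is what enables or disables
     * dirty page tracking in the kernel; this is the path being toggled. */
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}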
Comment 6 Alex Williamson 2010-12-22 13:37:31 EST
Marcelo has posted a patch upstream that seems to resolve this in my testing.  Re-assigning.
Comment 8 Aristeu Rozanski 2011-02-03 12:24:57 EST
Patch(es) available on kernel-2.6.32-112.el6
Comment 11 Mike Cao 2011-02-16 00:10:22 EST
Reproduced on kernel-2.6.32-94.el6
Verified on kernel-2.6.32-113.el6.

Following the steps in comment #0, the guest does not hang after migrate_cancel.
This issue has been fixed.

Based on the above, changing status to VERIFIED.
Comment 12 errata-xmlrpc 2011-05-23 16:52:23 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
