Bug 634100

Summary: migrate_cancel under STRESS caused guest to hang
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.1
Reporter: Keqin Hong <khong>
Assignee: Marcelo Tosatti <mtosatti>
QA Contact: Red Hat Kernel QE team <kernel-qe>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: alex.williamson, arozansk, bcao, gcosta, gleb, knoel, lihuang, michen, mkenneth, mtosatti, tburke, virt-maint
Target Milestone: rc
Target Release: 6.1
Hardware: All
OS: Linux
Fixed In Version: kernel-2.6.32-112.el6
Doc Type: Bug Fix
Last Closed: 2011-05-23 20:52:23 UTC
Bug Blocks: 580951

Attachments:
- kvm_trace log
- dirty tracking patch

Description Keqin Hong 2010-09-15 08:52:34 UTC
Description of problem:
When the guest is under high memory stress using STRESS, migration (either locally or to a remote host) will not complete because pages are dirtied faster than they are transferred. Sending the migrate_cancel command at this point causes the guest to hang.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

2. Start dest waiting for migration
... -incoming tcp:$dest_ip:5830 -name dest

3. Inside src guest, run STRESS as follows
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose

4. Migrate src to dest (the migration never completes)
(qemu) migrate -d tcp:$dest_ip:5830

5. Cancel migration
(qemu) migrate_cancel

Actual results:
Guest hung with excessive CPU usage:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
64597 root      20   0 8724m 8.1g 3104 S 400.3  7.3  53:51.78 qemu-kvm
Expected results:
Migration is cancelled successfully and the guest continues running from where it was.

Additional info:
By setting a breakpoint at migration.c:do_migrate_cancel and then continuing the process after the breakpoint is hit, migrate_cancel succeeds.
No hang in that case.

Could this be caused by a race condition?

Comment 2 Alex Williamson 2010-12-16 23:12:14 UTC
This seems to be related to enabling and disabling dirty tracking.  I added these monitor commands for testing:

dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)

Toggling this on and off seems to produce a similar hang with similar frequency to starting and stopping migration.  I also tested that if we don't disable dirty tracking when ram_save_live is called with -1, I can't cause a failure.  Still investigating.

Comment 3 Alex Williamson 2010-12-17 21:27:09 UTC
Created attachment 469450 [details]
kvm_trace log

In this trace, the guest starts hung.  I've found that it can get un-hung by restarting the migration or, as in this case, re-issuing cpu_physical_memory_set_dirty_tracking(1).  The guest then starts working again, but after a few more cycles of toggling dirty tracking, it gets hung again.

Comment 4 Alex Williamson 2010-12-17 21:29:45 UTC
Created attachment 469451 [details]
dirty tracking patch

Crude debug patch allowing dirty tracking to be toggled independently of migration.

Comment 5 Alex Williamson 2010-12-17 21:34:15 UTC
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.

Comment 6 Alex Williamson 2010-12-22 18:37:31 UTC
Marcelo has posted a patch upstream that seems to resolve this in my testing.  Re-assigning.

Comment 8 Aristeu Rozanski 2011-02-03 17:24:57 UTC
Patch(es) available on kernel-2.6.32-112.el6

Comment 11 Mike Cao 2011-02-16 05:10:22 UTC
Reproduced on kernel-2.6.32-94.el6
Verified on kernel-2.6.32-113.el6.

Following the steps in comment #0, the guest does not hang after migrate_cancel.
This issue has been fixed.

Based on the above, changing status to VERIFIED.

Comment 12 errata-xmlrpc 2011-05-23 20:52:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html