
Bug 634100

Summary: migrate_cancel under STRESS caused guest to hang
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 6.1
Reporter: Keqin Hong <khong>
Assignee: Marcelo Tosatti <mtosatti>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: alex.williamson, arozansk, bcao, gcosta, gleb, knoel, lihuang, michen, mkenneth, mtosatti, tburke, virt-maint
Fixed In Version: kernel-2.6.32-112.el6
Doc Type: Bug Fix
Last Closed: 2011-05-23 20:52:23 UTC
Bug Blocks: 580951
Attachments:
kvm_trace log (flags: none)
dirty tracking patch (flags: none)

Description Keqin Hong 2010-09-15 08:52:34 UTC
Description of problem:
When the guest is under heavy memory stress from the STRESS tool, migration (either local or to a remote host) does not complete because pages are dirtied faster than they are transferred. Sending the migrate_cancel command at this point causes the guest to hang.
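
The non-convergence described above can be illustrated with a rough model of iterative pre-copy. This is only a sketch under stated assumptions, not qemu-kvm's actual algorithm: the function names, the bandwidth, the dirty rates, and the stop threshold are all hypothetical. Only the 8 GiB guest size and the 16 x 256 MiB stress working set come from the reproduction steps.

```python
def precopy_rounds(total_mib, dirty_rate_mibs, bandwidth_mibs,
                   working_set_mib, threshold_mib=32, max_rounds=30):
    """Return the MiB still left to send after each pre-copy round.

    Hypothetical model: each round sends everything currently dirty,
    and the guest keeps dirtying memory while that round is in flight.
    """
    remaining = float(total_mib)  # the first round sends all guest RAM
    history = []
    for _ in range(max_rounds):
        seconds = remaining / bandwidth_mibs
        # Memory dirtied during this round; it cannot exceed the
        # region the guest is actually writing to.
        remaining = min(dirty_rate_mibs * seconds, working_set_mib)
        history.append(remaining)
        if remaining <= threshold_mib:
            break  # small enough to stop the guest and send the rest
    return history


def converges(**kwargs):
    """True if the last round fell to or below the stop threshold."""
    threshold = kwargs.get("threshold_mib", 32)
    return precopy_rounds(**kwargs)[-1] <= threshold


# Assumed numbers: 100 MiB/s link, stressed guest dirtying 2000 MiB/s.
stressed = dict(total_mib=8192, dirty_rate_mibs=2000,
                bandwidth_mibs=100, working_set_mib=16 * 256)
# Assumed numbers: mostly idle guest dirtying 10 MiB/s.
idle = dict(total_mib=8192, dirty_rate_mibs=10,
            bandwidth_mibs=100, working_set_mib=16 * 256)

print("stressed converges:", converges(**stressed))
print("idle converges:", converges(**idle))
```

In this model the stressed guest is left with its full 4096 MiB working set dirty after every round, so the migration keeps iterating until it is cancelled, while the idle guest drops below the threshold within a few rounds.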

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

2. Start dest waiting for migration
... -incoming tcp:$dest_ip:5830 -name dest

3. Inside src guest, run STRESS as follows
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose

4. Migrate src to dest (the migration never completes)
(qemu) migrate -d tcp:$dest_ip:5830

5. Cancel migration
(qemu) migrate_cancel

Actual results:
Guest hung with excessive CPU usage:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
64597 root      20   0 8724m 8.1g 3104 S 400.3  7.3  53:51.78 qemu-kvm

Expected results:
Migration is cancelled successfully and the guest continues running from where it was.

Additional info:
If a breakpoint is set at migration.c:do_migrate_cancel and the process is continued after the breakpoint is hit, migrate_cancel succeeds.
There is no hang in that case.

So is it caused by some race condition?

Comment 2 Alex Williamson 2010-12-16 23:12:14 UTC
This seems to be related to enabling and disabling dirty tracking.  I added these monitor commands for testing:

dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)

Toggling this on and off seems to produce a similar hang, with similar frequency, to starting and stopping migration.  I also confirmed that if we don't disable dirty tracking when ram_save_live is called with a stage of -1, I can't cause a failure.  Still investigating.

Comment 3 Alex Williamson 2010-12-17 21:27:09 UTC
Created attachment 469450 [details]
kvm_trace log

In this trace, the guest starts out hung.  I've found that it can be un-hung by restarting the migration or, as in this case, re-issuing cpu_physical_memory_set_dirty_tracking(1).  The guest then starts working again, but after a few more cycles of toggling dirty tracking, it hangs again.

Comment 4 Alex Williamson 2010-12-17 21:29:45 UTC
Created attachment 469451 [details]
dirty tracking patch

Crude debug patch that allows toggling dirty tracking independently of migration.

Comment 5 Alex Williamson 2010-12-17 21:34:15 UTC
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.

Comment 6 Alex Williamson 2010-12-22 18:37:31 UTC
Marcelo has posted a patch upstream that seems to resolve this in my testing.  Re-assigning.

Comment 8 Aristeu Rozanski 2011-02-03 17:24:57 UTC
Patch(es) available on kernel-2.6.32-112.el6

Comment 11 Mike Cao 2011-02-16 05:10:22 UTC
Reproduced on kernel-2.6.32-94.el6
Verified on kernel-2.6.32-113.el6.

Following the steps in comment #0, the guest does not hang after migrate_cancel.
This issue has been fixed.

Based on the above, changing status to VERIFIED.

Comment 12 errata-xmlrpc 2011-05-23 20:52:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html