Red Hat Bugzilla – Bug 634100
migrate_cancel under STRESS caused guest to hang
Last modified: 2013-01-09 18:08:29 EST
Description of problem:
When a guest is under high memory stress (using the STRESS tool), migration (either local or to a remote host) never completes because pages are dirtied faster than they can be transferred. Issuing the migrate_cancel command at this point causes the guest to hang.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start the source VM with 4 vCPUs and 8G of memory:
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
2. Start the destination VM waiting for migration:
... -incoming tcp:$dest_ip:5830 -name dest
3. Inside the source guest, run STRESS:
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose
4. Migrate the source to the destination (the migration never completes):
(qemu) migrate -d tcp:$dest_ip:5830
5. Cancel the migration:
(qemu) migrate_cancel

Actual results:
The guest hangs with excessive CPU usage:
PID   USER  PR NI VIRT  RES  SHR  S %CPU  %MEM TIME+    COMMAND
64597 root  20 0  8724m 8.1g 3104 S 400.3 7.3  53:51.78 qemu-kvm

Expected results:
The migration is cancelled successfully and the guest continues to run from where it was.
Additional info: If I set a breakpoint at migration.c:do_migrate_cancel, then continue the process after the breakpoint is hit, migrate_cancel succeeds and the guest does not hang. Could this be caused by a race condition?
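The breakpoint experiment above can be scripted so it is repeatable. A minimal sketch (assumptions: qemu-kvm was built with debug symbols, and its PID is available in $QEMU_PID):

```shell
# Hedged sketch of the breakpoint experiment described above.
# Assumes a debuginfo build of qemu-kvm; $QEMU_PID is an assumption.
cat > /tmp/migrate_cancel.gdb <<'EOF'
break do_migrate_cancel
commands
  continue
end
EOF
echo 'Attach with: gdb -p "$QEMU_PID" -x /tmp/migrate_cancel.gdb'
```

With this attached, issuing migrate_cancel in the monitor should hit the breakpoint and immediately continue, which in my testing was enough to avoid the hang.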
This seems to be related to enabling and disabling dirty tracking. I added these monitor commands for testing:
dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)
Toggling these on and off produces a similar hang, with a similar frequency, to starting and stopping migration. I also found that if we don't disable dirty tracking when ram_save_live is called with stage -1, I can't trigger the failure. Still investigating.
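The toggle test described above can be driven non-interactively, roughly as follows. This is a sketch, not the exact harness I used: it assumes the debug monitor commands above are compiled in, that the monitor is exposed on a unix socket whose path is in $MON_SOCK (the original run used -monitor stdio), and that socat is available. When $MON_SOCK is unset it is a dry run that only records the commands it would send.

```shell
#!/bin/sh
# Sketch of the dirty-tracking toggle test. MON_SOCK, socat, and the
# dirtytrack/nodirtytrack commands (from the debug patch) are assumptions.
LOG=/tmp/toggle.log
: > "$LOG"
send_cmd() {
    echo "$1" >> "$LOG"        # record every command we (would) send
    if [ -n "$MON_SOCK" ]; then
        echo "$1" | socat - "UNIX-CONNECT:$MON_SOCK"
    fi
}
i=1
while [ "$i" -le 10 ]; do
    send_cmd dirtytrack        # cpu_physical_memory_set_dirty_tracking(1)
    send_cmd nodirtytrack      # cpu_physical_memory_set_dirty_tracking(0)
    # in a real run, sleep between toggles to let the guest make progress
    i=$((i + 1))
done
```

In my testing a loop like this reproduces the hang without ever starting a migration, which is what points at dirty tracking rather than the migration code itself.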
Created attachment 469450 [details] kvm_trace log In this trace, the guest starts out hung. I've found that it can be un-hung by restarting the migration or, as in this case, by re-issuing cpu_physical_memory_set_dirty_tracking(1). The guest then starts working again, but after a few more cycles of toggling dirty tracking, it hangs again.
Created attachment 469451 [details] dirty tracking patch Crude debug patch that allows toggling dirty tracking independently of migration.
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.
Marcelo has posted a patch upstream that seems to resolve this in my testing. Reassigning.
Patch(es) available on kernel-2.6.32-112.el6
Reproduced on kernel-2.6.32-94.el6. Verified on kernel-2.6.32-113.el6: following the steps in comment #0, the guest does not hang after migrate_cancel. This issue has been fixed. Based on the above, changing status to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html