Bug 634100

Summary: migrate_cancel under STRESS caused guest to hang
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.1
Reporter: Keqin Hong <khong>
Assignee: Marcelo Tosatti <mtosatti>
QA Contact: Red Hat Kernel QE team <kernel-qe>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: alex.williamson, arozansk, bcao, gcosta, gleb, knoel, lihuang, michen, mkenneth, mtosatti, tburke, virt-maint
Target Milestone: rc
Target Release: 6.1
Hardware: All
OS: Linux
Fixed In Version: kernel-2.6.32-112.el6
Doc Type: Bug Fix
Last Closed: 2011-05-23 20:52:23 UTC
Bug Blocks: 580951

Attachments:
- kvm_trace log
- dirty tracking patch

Description Keqin Hong 2010-09-15 08:52:34 UTC
Description of problem:
When the guest is under high memory stress using STRESS, migration (either locally or to a remote host) will not complete because pages are dirtied faster than they are transferred. Sending the migrate_cancel command at this point causes the guest to hang.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64

How reproducible:
100%

Setup:
Download STRESS from http://weather.ou.edu/~apw/projects/stress/

Steps to Reproduce:
1. Start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/home/khong/rhel6-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.86.26:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

2. Start dest waiting for migration
... -incoming tcp:$dest_ip:5830 -name dest

3. Inside src guest, run STRESS as follows
$ stress --cpu 4 --vm 16 --vm-bytes 256M --verbose

4. Migrate src to dest (the migration never completes)
(qemu) migrate -d tcp:$dest_ip:5830

5. Cancel migration
(qemu) migrate_cancel

Actual results:
Guest hung with excessive CPU usage:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
64597 root      20   0 8724m 8.1g 3104 S 400.3  7.3  53:51.78 qemu-kvm
Expected results:
Migration is cancelled successfully and the guest continues running from where it was.

Additional info:
By setting a breakpoint at migration.c:do_migrate_cancel and then continuing the process after the breakpoint is hit, migrate_cancel succeeds.
No hang in that case.

Could this be caused by a race condition?

Comment 2 Alex Williamson 2010-12-16 23:12:14 UTC
This seems to be related to enabling and disabling dirty tracking.  I added these monitor commands for testing:

dirtytrack -> cpu_physical_memory_set_dirty_tracking(1)
nodirtytrack -> cpu_physical_memory_set_dirty_tracking(0)

Toggling this on and off seems to produce a similar hang with similar frequency to starting and stopping migration.  I also tested that if we don't disable dirty tracking when ram_save_live is called with -1, I can't cause a failure.  Still investigating.

Comment 3 Alex Williamson 2010-12-17 21:27:09 UTC
Created attachment 469450 [details]
kvm_trace log

In this trace, the guest starts hung.  I've found that it can get un-hung by restarting the migration or, as in this case, re-issuing cpu_physical_memory_set_dirty_tracking(1).  The guest then starts working again, but after a few more cycles of toggling dirty tracking, it gets hung again.

Comment 4 Alex Williamson 2010-12-17 21:29:45 UTC
Created attachment 469451 [details]
dirty tracking patch

Crude debug patch allowing dirty tracking to be toggled independently of migration.

Comment 5 Alex Williamson 2010-12-17 21:34:15 UTC
Adding some kernel-side experts since this seems to be happening as a result of toggling dirty page tracking via a kvm ioctl.

Comment 6 Alex Williamson 2010-12-22 18:37:31 UTC
Marcelo has posted a patch upstream that seems to resolve this in my testing.  Re-assigning.

Comment 8 Aristeu Rozanski 2011-02-03 17:24:57 UTC
Patch(es) available on kernel-2.6.32-112.el6

Comment 11 Mike Cao 2011-02-16 05:10:22 UTC
Reproduced on kernel-2.6.32-94.el6
Verified on kernel-2.6.32-113.el6.

Following the steps in comment #0, the guest does not hang after migrate_cancel.
This issue has been fixed.

Based on the above, changing status to VERIFIED.

Comment 12 errata-xmlrpc 2011-05-23 20:52:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html