Bug 692763

Summary: Live migration with large guest causes CPU soft lockup
Product: Red Hat Enterprise Linux 6
Reporter: Mike Cao <bcao>
Component: qemu-kvm
Assignee: Juan Quintela <quintela>
Status: CLOSED DUPLICATE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.1
CC: bcao, gcosta, juzhang, michen, mkenneth, shu, tburke, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-03-12 21:53:13 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
  dmesg (Flags: none)

Description Mike Cao 2011-04-01 05:07:29 UTC

Comment 2 Mike Cao 2011-04-01 05:29:30 UTC
Description of problem:
Do a local live migration with -m 256G -smp 64. During migration the guest sometimes
hangs; after migration completes, CPU soft lockups appear in the guest's dmesg.

Version-Release number of selected component (if applicable):
# uname -r
2.6.32-128.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.152.el6.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. Start a VM with -m 256G -smp 64, e.g.:
/usr/libexec/qemu-kvm -m 256G -smp 64 -cpu cpu64-rhel6,+x2apic \
    -usbdevice tablet \
    -drive file=/RHEL-Server-6.1-64-virtio.qcow2,format=qcow2,if=none,id=drive-virtio0,cache=none,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive-virtio0,id=virtio-blk-pci0 \
    -netdev tap,id=hostnet0,script=/etc/qemu-ifup \
    -device virtio-net-pci,netdev=hostnet0,mac=00:00:00:00:02:01,bus=pci.0,addr=0x4 \
    -boot c -uuid 1d4dfe1b-39ed-4d3e-881e-a2b400b63d54 -rtc base=utc
2. Start a listening (destination) instance on the migration port.
3. Do the live migration (see the sketch below).
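A minimal sketch of steps 2 and 3 for a local (same-host) migration; the port number 5890 is only an example:

# destination: the same command line as above, plus an incoming port
/usr/libexec/qemu-kvm <same options as above> -incoming tcp:0:5890

# source qemu monitor: start the migration in the background and watch its progress
(qemu) migrate -d tcp:localhost:5890
(qemu) info migrate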

Actual results:
During migration the guest sometimes hangs; after migration, CPU soft lockup
messages appear in the guest dmesg.

Expected results:

No CPU soft lockup.

Additional info:
#dmesg
psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization,
throwing 2 bytes away.
BUG: soft lockup - CPU#0 stuck for 68s! [swapper:0]
Modules linked in: ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc
parport microcode virtio_net sg i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk
sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix
dm_mod [last unloaded: speedstep_lib]
CPU 0:
Modules linked in: ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc
parport microcode virtio_net sg i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk
sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix
dm_mod [last unloaded: speedstep_lib]
Pid: 0, comm: swapper Not tainted 2.6.32-128.el6.x86_64 #1 KVM
RIP: 0010:[<ffffffff814dc527>]  [<ffffffff814dc527>]
_spin_unlock_irqrestore+0x17/0x20
RSP: 0018:ffff880129e03e58  EFLAGS: 00000282
RAX: 0000000000000000 RBX: ffff880129e03e58 RCX: 0000000000000000
RDX: ffffffff81f40820 RSI: 0000000000000282 RDI: 0000000000000282
RBP: ffffffff8100bc93 R08: 0000000000000000 R09: 0000000000000028
R10: 0000000000000200 R11: 0000000000000282 R12: ffff880129e03dd0
R13: 0000000000000000 R14: 0000000000000010 R15: ffffffff814e1feb
FS:  0000000000000000(0000) GS:ffff880129e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007ff18b058000 CR3: 0000003f035f8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>  [<ffffffff813bae8e>] ? i8042_interrupt+0x10e/0x3a0
 [<ffffffff810d68d0>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff810d8fc6>] ? handle_edge_irq+0xc6/0x160
 [<ffffffff8100df89>] ? handle_irq+0x49/0xa0
 [<ffffffff814e1efc>] ? do_IRQ+0x6c/0xf0
 [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff810362ab>] ? native_safe_halt+0xb/0x10
 [<ffffffff810142ed>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c214a>] ? rest_init+0x7a/0x80
 [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424
 [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
psmouse.c: resync failed, issuing reconnect request

Comment 3 RHEL Program Management 2011-04-04 01:56:17 UTC
Since the RHEL 6.1 External Beta has begun and this bug remains
unresolved, it has been rejected, as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Dor Laor 2011-04-11 14:13:20 UTC
This is expected; it depends on the length of the downtime (which is usually low).
Can you ping the guest during migration and report how many pings were lost?

Comment 5 Mike Cao 2011-04-20 10:53:53 UTC
(In reply to comment #4)
> This is expected; it depends on the length of the downtime (which is usually low).
> Can you ping the guest during migration and report how many pings were lost?

3 pings were lost.

When the migration was nearly complete (the remaining RAM was very small, but the migration could not finish), I raised the migration speed with (qemu) migrate_set_speed 1G, which made the host qemu-kvm process hang; the migration could then finish. After that, checking dmesg in the guest shows the call trace.

I did not change the migration downtime; it kept the default value. I also tried migration with a small amount of memory and the default migration downtime, and there was no call trace in dmesg.
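For reference, the monitor knobs involved here are migrate_set_speed and migrate_set_downtime; the values below are only examples, not the ones used in this test:

(qemu) migrate_set_speed 1G        # raise the bandwidth cap, as done above
(qemu) migrate_set_downtime 2      # allow up to ~2 seconds of guest downtime (left at the default in this test)
(qemu) info migrate                # shows transferred / remaining / total RAM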

Additional info:
The dmesg output is not the same as in comment #0; I will attach it for your information.

Comment 6 Mike Cao 2011-04-20 10:55:42 UTC
Created attachment 493432 [details]
dmesg

Comment 9 Juan Quintela 2012-02-07 13:33:10 UTC
If the guest is busy dirtying memory, we only have two options:
- make the downtime longer (which causes soft lockups)
- keep trying until only a little memory is dirty (so migration may never converge)

What should we do here? The defaults are chosen so that we prefer migration not converging over having soft lockups. We can change the defaults, but we cannot do anything more with current technology. The only real solution would be to move to 100 Gigabit/1 Terabit networking, but that is not available yet O:-)
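A rough back-of-the-envelope illustration (assuming roughly 1 Gbit/s of migration bandwidth, i.e. ~120 MB/s, as in the test setup below; the numbers only show orders of magnitude):
- one full pass over 256 GB of guest RAM takes about 256 GB / 120 MB/s ≈ 2200 s (~36 minutes)
- flushing D bytes of still-dirty RAM in the final stop-and-copy needs about D / 120 MB/s of downtime; e.g. 8 GB dirty -> roughly 68 s with the guest stopped, which is the kind of stall the soft-lockup watchdog complains about in the log above
- so either the dirty set shrinks to a few hundred MB or less (and migration converges), or the final downtime is long enough to trigger soft lockups in the guest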

Comment 10 Juan Quintela 2012-03-06 17:51:59 UTC
The fix for bugzilla 752138 should help this use case. What that fix does is improve the time accounting while we are down, and improve the bandwidth & downtime calculation. My measurements on smaller hardware show that this problem is gone. The complete fix is too intrusive for 6.3 (it will be fixed in 7.0, and we will see if something simpler comes for 6.4).

Comment 11 Shaolong Hu 2012-03-08 10:17:05 UTC
Test with:
------------
host: 80 cores / 512 GB
connection speed: 1Gbit/s
migration_down_time: default
migration_speed: default

Migrating a RHEL6.3 guest with "-M rhel6.3.0 -smp 64 -m 256G":
-------------------------------------------------------------
qemu-kvm-0.12.1.2-2.233.el6.x86_64

1. Migration finishes correctly.
2. CPU soft lockup occurs sometimes.
3. Pinging the guest shows no packet loss.
4. The guest hangs while the blank pages (8G to 257G) are being processed; CPU usage is about 200%-300%, and the max ping response time is more than 100000 ms.

qemu-kvm-0.12.1.2-2.241.el6.x86_64

1. Migration finishes correctly.
2. No CPU soft lockup.
3. Pinging the guest shows no packet loss.
4. The guest does not hang while the blank pages (8G to 257G) are being processed; it just feels slightly unsmooth, as when there is heavy I/O, which is totally acceptable. CPU usage is about 100%, the max ping response time is 8410 ms, and the average is 238 ms.


Conclusion:
-------------------
The patches for 752138 greatly improve the guest experience when there is a large amount of blank pages; I think that even without further improvement this is acceptable for most cases.
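For reference, the ping and soft-lockup checks above can be reproduced with standard tools; a minimal sketch (the guest IP is a placeholder):

# on another host, record round-trip times while the migration runs
ping -i 0.2 <guest-ip> | tee ping-during-migration.log

# inside the guest after migration, look for watchdog complaints
dmesg | grep -i "soft lockup"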

Comment 12 Dor Laor 2012-03-12 21:53:13 UTC
I'll mark it as a duplicate of 752138, although it is not exactly one.
If some other issue turns up, please open a new bug.
Nice testing!

*** This bug has been marked as a duplicate of bug 752138 ***