Description of problem:
Do a local live migration of a guest started with -m 256G -smp 64. During the migration the guest sometimes hangs, and after the migration completes there are CPU soft lockups in the guest's dmesg.

Version-Release number of selected component (if applicable):
# uname -r
2.6.32-128.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.152.el6.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. Start the VM with -m 256G -smp 64, e.g.:
/usr/libexec/qemu-kvm -m 256G -smp 64 -cpu cpu64-rhel6,+x2apic -usbdevice tablet -drive file=/RHEL-Server-6.1-64-virtio.qcow2,format=qcow2,if=none,id=drive-virtio0,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=drive-virtio0,id=virtio-blk-pci0 -netdev tap,id=hostnet0,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,mac=00:00:00:00:02:01,bus=pci.0,addr=0x4 -boot c -uuid 1d4dfe1b-39ed-4d3e-881e-a2b400b63d54 -rtc base=utc
2. Start the destination qemu-kvm listening on a port for the incoming migration.
3. Do the live migration (a sketch of steps 2-3 follows this report).

Actual results:
During the migration the guest sometimes hangs; after the migration there are CPU soft lockups in the guest dmesg.

Expected results:
No CPU soft lockups.

Additional info:
# dmesg
psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away.
BUG: soft lockup - CPU#0 stuck for 68s! [swapper:0]
Modules linked in: ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport microcode virtio_net sg i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
CPU 0:
Modules linked in: ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport microcode virtio_net sg i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
Pid: 0, comm: swapper Not tainted 2.6.32-128.el6.x86_64 #1 KVM
RIP: 0010:[<ffffffff814dc527>]  [<ffffffff814dc527>] _spin_unlock_irqrestore+0x17/0x20
RSP: 0018:ffff880129e03e58  EFLAGS: 00000282
RAX: 0000000000000000 RBX: ffff880129e03e58 RCX: 0000000000000000
RDX: ffffffff81f40820 RSI: 0000000000000282 RDI: 0000000000000282
RBP: ffffffff8100bc93 R08: 0000000000000000 R09: 0000000000000028
R10: 0000000000000200 R11: 0000000000000282 R12: ffff880129e03dd0
R13: 0000000000000000 R14: 0000000000000010 R15: ffffffff814e1feb
FS:  0000000000000000(0000) GS:ffff880129e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007ff18b058000 CR3: 0000003f035f8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 [<ffffffff813bae8e>] ? i8042_interrupt+0x10e/0x3a0
 [<ffffffff810d68d0>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff810d8fc6>] ? handle_edge_irq+0xc6/0x160
 [<ffffffff8100df89>] ? handle_irq+0x49/0xa0
 [<ffffffff814e1efc>] ? do_IRQ+0x6c/0xf0
 [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
 <EOI>
 [<ffffffff810362ab>] ? native_safe_halt+0xb/0x10
 [<ffffffff810142ed>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c214a>] ? rest_init+0x7a/0x80
 [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424
 [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
psmouse.c: resync failed, issuing reconnect request
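For reference, a minimal sketch of steps 2 and 3 on the same host (the port number 4444 and the use of localhost are placeholders, not taken from the original report):

# step 2: start the destination qemu-kvm with the same command line as above,
# plus an incoming migration port
/usr/libexec/qemu-kvm <same options as in step 1> -incoming tcp:0:4444

# step 3: in the source guest's monitor, start the migration to that port
(qemu) migrate -d tcp:localhost:4444
(qemu) info migrate        # poll until the status shows the migration completed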
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
It's expected, depending on the length of the downtime (which is usually low). Can you ping the guest during this and report how many pings were lost?
(In reply to comment #4)
> It's expected, depending on the length of the downtime (which is usually low).
> Can you ping the guest during this and report how many pings were lost?

3 pings were lost. When the migration was nearly complete (the remaining RAM was very small but the migration could not finish), I enlarged the migration speed with (qemu) migrate_set_speed 1G, at which point the host qemu-kvm process hung and the migration could then finish. After that, checking dmesg in the guest shows the call trace.
I did not change the migration downtime; it kept the default value. Migration with a small amount of memory and the default migration downtime produced no call trace in dmesg.

Additional info: the dmesg is not the same as in comment #0. I will attach it for your information.
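For reference, a hedged sketch of the monitor usage described above (the 1G value is the one from this comment; whether this qemu-kvm build accepts the G suffix is an assumption):

(qemu) migrate_set_speed 1G     # raise the bandwidth limit so the last dirty pages can be flushed
(qemu) info migrate             # shows transferred/remaining/total RAM while the migration runs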
Created attachment 493432 [details] dmesg
If the guest is busy dirtying memory, we only have two options:
- make the downtime longer (which causes soft lockups)
- keep trying until there is little dirty memory left (so migration may never converge).

What should we do here? The default values reflect that we prefer migration not converging over having soft lockups. We can change the default, but we can't do anything more with current technology. The only real solution would be to move to 100 Gigabit / 1 Terabit networking, but that is not available yet O:-)
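A rough back-of-the-envelope illustration (numbers are mine, not from this bug): on a 1 Gbit/s link the usable bandwidth is roughly 120 MB/s, so copying 256 GB of RAM takes on the order of 256*1024/120 ≈ 2200 seconds, and if the guest dirties memory faster than ~120 MB/s the remaining dirty set never shrinks and the migration cannot converge. The knob that trades convergence against guest pause time is the monitor's downtime setting; a minimal sketch, with the 2-second value chosen only as an example:

(qemu) migrate_set_downtime 2   # allow up to ~2 s of guest pause so migration can converge;
                                # a longer downtime increases the risk of soft lockups in the guest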
The fix for bugzilla 752138 should help this use case. What that fix does is improve the time accounting while we are down, and improve the bandwidth & downtime calculation. My measurements on smaller hardware show that this problem is gone. The complete fix is too intrusive for 6.3 (it would be fixed in 7.0, and we will see if something easier comes for 6.4).
Test with:
------------
host: 80 cores / 512 GB
connection speed: 1 Gbit/s
migration_down_time: default
migration_speed: default

Migration of a RHEL6.3 guest with "-M rhel6.3.0 -smp 64 -m 256G":
-------------------------------------------------------------
qemu-kvm-0.12.1.2-2.233.el6.x86_64
1. Migration can finish correctly.
2. CPU soft lockup occurs sometimes.
3. Pinging the guest shows no packet loss.
4. The guest hangs while dealing with blank pages (8G to 257G); CPU usage is about 200%-300%, and the max ping response time is more than 100000 ms.

qemu-kvm-0.12.1.2-2.241.el6.x86_64
1. Migration can finish correctly.
2. No CPU soft lockup.
3. Pinging the guest shows no packet loss.
4. The guest does not hang while dealing with blank pages (8G to 257G); it just feels not smooth, like under heavy I/O, which is totally acceptable. CPU usage is about 100%, the max ping response time is 8410 ms, average 238 ms.

Conclusion:
-------------------
The patches for 752138 improve the guest experience greatly when there is a large amount of blank pages. I think even without further improvement this is acceptable for most cases.
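For reference, a minimal sketch of how the ping figures above can be collected (the guest address is a placeholder):

# run from another host for the whole duration of the migration
ping -i 1 <guest-ip> | tee ping.log
# the final summary reports packet loss and rtt min/avg/max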
I'll mark it as a duplicate of 752138 although it is not exactly the same. If there is some other issue, please open a new bug. Nice testing! *** This bug has been marked as a duplicate of bug 752138 ***