Description of problem: Live migration is successfully finished. but there is calltrace in output of dmesg CallTrace in 5u3 x86_64 server VM: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0] CPU 0: Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api lp floppy joydev parport_pc ide_cd 8139too parport 8139cp serio_raw cdrom i2c_piix4 mii i2c_core pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.18-128.el5 #1 RIP: 0010:[<ffffffff80064c08>] [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9 RSP: 0018:ffffffff80425cc0 EFLAGS: 00000246 RAX: 000000000000045d RBX: ffffffff804cb918 RCX: 00000000cc64a0b0 RDX: 0000000000024158 RSI: 0000000000000246 RDI: ffffffff803d3600 RBP: ffffffff80425c40 R08: 0000000000001388 R09: 0000000000000000 R10: ffff810139d9f080 R11: ffffffff800076dc R12: ffffffff8005dc8e R13: 0000000000000246 R14: ffffffff800774da R15: ffffffff80425c40 FS: 00002b042112efb0(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002b294a7754d8 CR3: 000000012935b000 CR4: 00000000000006e0 Call Trace: <IRQ> [<ffffffff882325a7>] :ide_cd:cdrom_start_packet_command+0x149/0x15b [<ffffffff8000eec4>] ide_do_request+0x5ab/0x78d [<ffffffff88233103>] :ide_cd:cdrom_decode_status+0x31c/0x347 [<ffffffff882336fe>] :ide_cd:cdrom_pc_intr+0x27/0x21c [<ffffffff8000d4f5>] ide_intr+0x1af/0x1df [<ffffffff80010a46>] handle_IRQ_event+0x51/0xa6 [<ffffffff800b7ade>] __do_IRQ+0xa4/0x103 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5 [<ffffffff8005d615>] ret_from_intr+0x0/0xa [<ffffffff80064b47>] _write_lock+0x0/0xf [<ffffffff80221ff5>] neigh_periodic_timer+0xa0/0x14d [<ffffffff80221f55>] neigh_periodic_timer+0x0/0x14d [<ffffffff80094dbb>] run_timer_softirq+0x133/0x1af [<ffffffff80011fbc>] __do_softirq+0x89/0x133 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 [<ffffffff8006cada>] do_softirq+0x2c/0x85 [<ffffffff8006b287>] default_idle+0x0/0x50 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c <EOI> [<ffffffff8006b2b0>] default_idle+0x29/0x50 [<ffffffff80048d19>] cpu_idle+0x95/0xb8 [<ffffffff803e7801>] start_kernel+0x220/0x225 [<ffffffff803e722f>] _sinittext+0x22f/0x236 usb 1-2: reset full speed USB device using uhci_hcd and address 2 CallTrace in rhel5u4 x86_64 server beta: CPU 0: Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api lp floppy joydev parport_pc ide_cd parport i2c_piix4 cdrom 8139too serio_raw i2c_core 8139cp mii pcspkr virtio_net virtio_pci virtio_ring dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.18-157.el5 #1 RIP: 0010:[<ffffffff8004fd3b>] [<ffffffff8004fd3b>] hrtimer_run_queues+0x128/0x197 RSP: 0018:ffffffff8043def0 EFLAGS: 00000286 RAX: ffffffff8005e615 RBX: ffff81013823fee8 RCX: 0000000000000000 RDX: ffff810135c3f2f0 RSI: ffff8100010064e8 RDI: ffff81013823fee8 RBP: ffffffff8043de70 R08: 00000000ffffffff R09: 0000000000000038 R10: ffff81013fcd4158 R11: 0000000000000282 R12: ffffffff8005ec8e R13: ffffffff800a25ba R14: ffffffff800786e4 R15: ffffffff8043de70 FS: 00000000427ad940(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007ffff1025d48 CR3: 000000012aa55000 CR4: 00000000000006e0 Call Trace: <IRQ> [<ffffffff80097717>] run_timer_softirq+0x21/0x1af [<ffffffff800129dc>] __do_softirq+0x51/0x133 [<ffffffff80012a14>] __do_softirq+0x89/0x133 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28 [<ffffffff8006db03>] do_softirq+0x2c/0x85 [<ffffffff8006c2b0>] default_idle+0x0/0x50 [<ffffffff8005ec8e>] apic_timer_interrupt+0x66/0x6c <EOI> [<ffffffff8006c2d9>] default_idle+0x29/0x50 [<ffffffff800499cc>] cpu_idle+0x95/0xb8 [<ffffffff803fd801>] start_kernel+0x220/0x225 [<ffffffff803fd22f>] _sinittext+0x22f/0x236 usb 1-2: reset full speed USB device using uhci_hcd and address 2 Version-Release number of selected component (if applicable): [root@amd-8356-32-4 ~]# rpm -q kvm kvm-83-90.el5 [root@amd-8356-32-4 ~]# rpm -q kernel kernel-2.6.18-158.el5 [root@amd-8356-32-4 ~]# cat /etc/redhat-release Red Hat Enterprise Virtualization Hypervisor release 5.4-2.0.99 (11) How reproducible: 100% Steps to Reproduce: 1. /usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -drive file=/data/images/images/RHEL-Server-5.4-64.qcow2,media=disk,if=ide,cache=off,index=0 -smp 8 -m 4096 -cpu qemu64,+sse2 -vnc :11 -monitor stdio -net nic,vlan=0,macaddr=00:21:02:e1:87:3d,model=rtl8139 -net tap,vlan=0,script=/etc/qemu-ifup -boot c -name Test -incoming tcp:0:5800 2. 3. Actual results: Expected results: Additional info: Can not reproduce the problem with rhel3u9 rhel4u8 Can not reproduce the problem with rhel5 32 bit Can not reproduce the problem with *-smp 2* Host Info : processor : 15 vendor_id : AuthenticAMD cpu family : 16 model : 2 model name : Quad-Core AMD Opteron(tm) Processor 8356 stepping : 3 cpu MHz : 1200.000 cache size : 512 KB physical id : 12 siblings : 4 core id : 3 cpu cores : 4 apicid : 51 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw bogomips : 4600.69 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 48 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate [8] [root@amd-8356-32-4 ~]# free -m total used free shared buffers cached Mem: 129073 23004 106069 0 188 18354 -/+ buffers/cache: 4461 124612 Swap: 3999 0 3999 [root@amd-8356-32-4 ~]#
Does the host has >= 8 cpus (it is a must)? In addition, I suspect it might be timing issue caused by tsc source clock in the guest. What's `cat /sys/devices/system/clocksource/clocksource0/current_clocksource` ?
(In reply to comment #1) > Does the host has >= 8 cpus (it is a must)? yes . Host has 16 cpu ( refer to the Host cpu info ). and Additional info: Can not reproduce the problem with rhel3u9 rhel4u8 Can not reproduce the problem with rhel5 32 bit Can not reproduce the problem with *-smp 2* also the calltrace is not shown before migration. > > In addition, I suspect it might be timing issue caused by tsc source clock in > the guest. What's `cat > /sys/devices/system/clocksource/clocksource0/current_clocksource` ? jiffies ( same as the host )
lihuang, Is the guest VM otherwise operating correctly on the migration destination ?
(In reply to comment #4) > lihuang, > > Is the guest VM otherwise operating correctly on the migration destination ? Yes. (but after migration, I only checked the network and output of dmesg. then I quit the vm. so maybe it is unlikely to find other issue..., anyway,network is ok, and basic operation on guest desktop is ok )
OK, so its probably just a spurious warning. This problem should be fixed by Zach's timer work.
(In reply to comment #6) > OK, so its probably just a spurious warning. This problem should be fixed > by Zach's timer work. good to know , the kvm is more and more stronger :) BTW ,it is a existed bug,right ? what's the id ? Thanks .
lihuang, I remember it was mentioned on other BZs, but i'm unable to find it.
Maybe it is not that harmless. There is the possibility that one of the vcpus is in fact, not running. lihuang, here's an easy way to confirm that: run this command in the guest prior to migration: $ for i in $(seq 0 7); do (taskset -c $i yes >/dev/null &) ; done after the migration: $ top press 1 (will show all cpus on top) If every cpu is running close to 100 % most of the time, then it is a harmless message. If one or more of them is not getting its share, then we're facing a more serious problem.
(In reply to comment #9) > Maybe it is not that harmless. > > There is the possibility that one of the vcpus is in fact, not running. > > lihuang, here's an easy way to confirm that: > > run this command in the guest prior to migration: > > $ for i in $(seq 0 7); do (taskset -c $i yes >/dev/null &) ; done > > after the migration: > > $ top > press 1 (will show all cpus on top) > > If every cpu is running close to 100 % most of the time, then it is a harmless > message. If one or more of them is not getting its share, then we're facing a > more serious problem. Yes . every cpu is close to 100% (95%~98%)
Since we all think that it's the clock issue. I close it as a dup of the original issue. *** This bug has been marked as a duplicate of bug 481013 ***