Description of problem:
Start a rhel5.4-x86_64 guest with multiple vcpus on an AMD host. On the guest's terminal, first run "echo 0 > cpu1", then "echo 1 > cpu1"; the guest hangs or sometimes quits.
I tried it on an Intel host, and the issue does not occur there.

Result:
On Intel host:
i686      PASS
i686-PAE  PASS
x86_64    PASS
On AMD host:
i686      PASS
i686-PAE  PASS
x86_64    *Failed*

Kernel of *guest*:
[root@intel-5310-32-1 ~]# uname -a
Linux intel-5310-32-1.englab.nay.redhat.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Version-Release number of selected component (if applicable):
[root@amd-4450b-4-2 ~]# uname -a
Linux amd-4450b-4-2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@amd-4450b-4-2 ~]# rpm -qa | grep kvm
kvm-83-105.el5_4.5

How reproducible:
Always

Steps to Reproduce:
1. Launch a rhel5.4-x86_64 guest:
/usr/libexec/qemu-kvm -no-hpet -rtc-td-hack -smp 4 -m 4G -net nic,macaddr=1a:4a:10:20:40:5d,model=virtio,vlan=0 -net tap,vlan=0,script=/etc/qemu-ifup -drive file=/opt/RHEL-Server-5.4-64.raw,media=disk,if=ide,index=0 -monitor stdio -vnc :10 -boot c -cpu qemu64,+sse2
2. On the guest's terminal:
#cd /sys/devices/system/cpu/cpu1
#echo 0 > online
#dmesg
#echo 1 >online

Actual results:
Guest always hangs and sometimes quits.

Expected results:

Additional info:
(on host) #dmesg
kvm: inject_page_fault: double fault 0x80444038

(on host) #top
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
15292 root  15   0 2262m 604m  64m R 99.4  7.6  3:15.92 qemu-kvm
 1514 root  20  -5     0    0    0 R 99.1  0.0  6019:52 kksmd
15240 root  16   0 12740 1152  820 S  0.7  0.0  0:04.69 top
    1 root  15   0 10348  680  576 S  0.0  0.0  0:01.28 init

(on host) #cat /proc/cpuinfo
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) Dual Core Processor 4450B
stepping : 2
cpu MHz : 2300.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy misalignsse
bogomips : 4610.29
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

*****************************
Sometimes the guest quits. The information displayed on the host:
(qemu) kvm_run: failed entry, reason 65535
rax ffffffff80433280 rbx ffffffff8006b2d8 rcx 0000000000000001 rdx ffff810081f4a680
rsi ffff810037c96100 rdi ffff810037c96100 rsp ffff810037cb7ef8 rbp 0000000000001eb8
r8 ffff81007c47e100 r9 0000000000000000 r10 0000000000000001 r11 0000000000000000
r12 0000000000000040 r13 ffff810037cb7f10 r14 0000000000000000 r15 0000000000000000
rip 0000000000000000 rflags 00000002
cs 0600 (00006000/0000ffff p 1 dpl 0 db 0 s 1 type a l 0 g 0 avl 0)
ds 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
es 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
ss 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
fs 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
tr 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 3 l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt ffff810037c97000/ffff
idt ffffffff80444000/ffff
cr0 8005003b cr2 14586000 cr3 201000 cr4 6e0 cr8 0 efer d01
kvm_run returned -8
***************************

After I launch the guest again, dmesg in the guest shows:
#dmesg (guest, after starting the guest again)
BUG: soft lockup - CPU#0 stuck for 12s! [yum-updatesd-he:2807]
CPU 0:
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api lp floppy joydev i2c_piix4 i2c_core serio_raw pcspkr virtio_pci virtio_ring 8139too virtio 8139cp mii parport_pc parport ide_cd cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2807, comm: yum-updatesd-he Not tainted 2.6.18-164.el5 #1
RIP: 0010:[<ffffffff80012322>] [<ffffffff80012322>] __do_softirq+0x51/0x133
RSP: 0000:ffffffff8043bf40 EFLAGS: 00000206
RAX: 0000000000000002 RBX: 0000000000000002 RCX: ffff81005fa33f58
RDX: ffff81005fa33fd8 RSI: ffffffff803d8e80 RDI: 000000000000000b
RBP: ffffffff8043bec0 R08: 0000000000000153 R09: ffff81005fa33f58
R10: 0000000000000000 R11: 0000003ce665e970 R12: ffffffff8005dc8e
R13: 0000000000000046 R14: ffffffff80077717 R15: ffffffff8043bec0
FS: 00002b450a217f90(0000) GS:ffffffff803c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000017cbc068 CR3: 00000000621c8000 CR4: 00000000000006e0

Call Trace:
<IRQ> [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
[<ffffffff8006cb14>] do_softirq+0x2c/0x85
[<ffffffff8006c99c>] do_IRQ+0xec/0xf5
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI>
(In reply to comment #0)
> Description of problem:
> Start a rhel5.4-x86_64 guest on an AMD host with multi-vcpu, on the terminal of
> guest, first "echo 0 > cpu1" then "echo 1 > cpu1", the guest hang or sometimes
> quit.

Sorry, that should be "echo 0 > /sys/devices/system/cpu/cpu1/online" then "echo 1 > /sys/devices/system/cpu/cpu1/online".
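For repeated runs inside the guest, the corrected offline/online sequence can be wrapped in a small helper. This is only a sketch and not part of the original report: the function name toggle_cpu and the CPU_SYSFS_ROOT override are hypothetical, the latter added so the logic can be tried against a scratch directory instead of the real sysfs tree. On a real guest it has to be run as root.

```shell
# Take one CPU offline and bring it back online through its sysfs
# "online" file, then print the final state (1 = online).
# CPU_SYSFS_ROOT is a hypothetical override for dry runs; by default
# the real sysfs path from the report is used.
toggle_cpu() {
    cpu="$1"
    root="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"
    online="$root/cpu$cpu/online"

    # cpu0 usually has no "online" file, and non-root users cannot write it.
    if [ ! -w "$online" ]; then
        echo "cannot write $online" >&2
        return 1
    fi

    echo 0 > "$online" || return 1   # offline the CPU
    echo 1 > "$online" || return 1   # bring it back; the reported hang occurs here
    cat "$online"
}
```

In the guest this would be called as "toggle_cpu 1", followed by "dmesg" to inspect the kernel log.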
Could you send the sosreport output from the host machine? I have tried to reproduce this on an AMD host with an x86_64 guest, using exactly the same kernel and kvm versions and the same qemu-kvm command line, and I couldn't reproduce it.
How many CPUs does the host have? Could you send the sosreport output for the host where this was reproduced?
This bug can be reproduced on host amd-4450b-4-2.englab.nay.redhat.com. This host has 2 CPUs; see the attachment for the sosreport.
Created attachment 364204
The sosreport of amd-4450b-4-2
The host has 2 CPUs but you are running a 4-vcpu guest. It is not recommended to run a guest with more vcpus than the number of available CPUs on the host. That shouldn't cause the "failed entry" error on the host, but it explains the "CPU stuck" message on the guest. Is the bug reproducible if you limit the number of guest vcpus to 2? (or use a host that has enough CPUs)
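The comparison between the guest's -smp value and the host CPU count can be checked before launching the guest. The following is a rough sketch, not something from this thread: host_cpu_count and check_smp are hypothetical helper names, and the optional file argument exists only so the check can be tried against a saved copy of /proc/cpuinfo.

```shell
# Count host CPUs by counting "processor" entries in a cpuinfo-style
# file (defaults to /proc/cpuinfo when no file is given).
host_cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Return 0 when the requested vcpu count fits on the host, 1 otherwise.
# $1 = value intended for -smp, $2 = optional cpuinfo file.
check_smp() {
    vcpus="$1"
    host=$(host_cpu_count "$2")
    if [ "$vcpus" -gt "$host" ]; then
        echo "warning: -smp $vcpus exceeds $host host CPUs" >&2
        return 1
    fi
    echo "ok: -smp $vcpus fits on $host host CPUs"
}
```

On the 2-CPU host from this report, "check_smp 4" would print the warning and "check_smp 2" would pass.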
Yes, the bug can be reproduced when the number of guest vcpus is 2 on an AMD host with 2 CPUs.
Status: I am debugging the issue on the machine where it can be reproduced. Booting of the CPU is failing. I didn't see the problem when I used -no-kvm-irqchip.
Created attachment 365890
experimental patch to the issue

Attached is an experimental fix for the issue. I need to test it with other guests, and on a host running the latest upstream KVM, before submitting it upstream.
Fix submitted and applied upstream: http://article.gmane.org/gmane.comp.emulators.kvm.devel/42168
Verified this bug in kvm-83-131.el5, the issue does not exist.

AMD:
x86_64 guest    -- Passed
i386 guest      -- Passed
i386-PAE guest  -- Passed

Intel:
i386-PAE guest  -- Passed
i386 guest      -- Passed
x86_64 guest    -- Passed
Verified in kvm-83-140.el5, this issue does not exist.

host kernel: 2.6.18-182.el5

AMD:
x86_64    passed
i386      passed
i386-PAE  passed

Intel:
x86_64    passed
i386      passed
i386-PAE  passed

AMD host cpuinfo:
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) Dual Core Processor 5400B
stepping : 2
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy misalignsse
bogomips : 2004.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0271.html