Bug 525699 - x86_64 guest hang when set guest's cpu1 online on AMD host
Summary: x86_64 guest hang when set guest's cpu1 online on AMD host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eduardo Habkost
QA Contact: Lawrence Lim
URL:
Whiteboard:
Depends On:
Blocks: 554506
 
Reported: 2009-09-25 10:06 UTC by Qunfang Zhang
Modified: 2014-03-26 01:02 UTC (History)

Fixed In Version: kvm-83-132.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 554506
Environment:
Last Closed: 2010-03-30 07:55:47 UTC


Attachments
The sosreport of amd-4450b-4-2 (2.53 MB, application/x-bzip2)
2009-10-09 06:34 UTC, Mark Xie
experimental patch to the issue (1.63 KB, patch)
2009-10-23 20:40 UTC, Eduardo Habkost


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0271 normal SHIPPED_LIVE Important: kvm security, bug fix and enhancement update 2010-03-29 13:19:48 UTC

Description Qunfang Zhang 2009-09-25 10:06:54 UTC
Description of problem:
Start a RHEL 5.4 x86_64 guest with multiple vcpus on an AMD host. On the guest's terminal, first run "echo 0 > cpu1", then "echo 1 > cpu1"; the guest hangs or sometimes quits.
I tried the same on an Intel host, and the issue does not occur there.
Result:
On Intel host:
i686      PASS
i686-PAE  PASS
x86_64    PASS

On AMD host:
i686      PASS
i686-PAE  PASS
x86_64    *Failed*

kernel of *guest*:
[root@intel-5310-32-1 ~]# uname -a
Linux intel-5310-32-1.englab.nay.redhat.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Version-Release number of selected component (if applicable):
[root@amd-4450b-4-2 ~]# uname -a
Linux amd-4450b-4-2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@amd-4450b-4-2 ~]# rpm -qa | grep kvm
kvm-83-105.el5_4.5

How reproducible:
Always

Steps to Reproduce:
1. Launch a RHEL 5.4 x86_64 guest:
  /usr/libexec/qemu-kvm -no-hpet -rtc-td-hack -smp 4 -m 4G -net nic,macaddr=1a:4a:10:20:40:5d,model=virtio,vlan=0 -net tap,vlan=0,script=/etc/qemu-ifup -drive file=/opt/RHEL-Server-5.4-64.raw,media=disk,if=ide,index=0 -monitor stdio -vnc :10 -boot c -cpu qemu64,+sse2

2. On the guest's terminal:
  # cd /sys/devices/system/cpu/cpu1
  # echo 0 > online
  # dmesg
  # echo 1 > online
  

Actual results:
The guest hangs, or sometimes quits.

Expected results:


Additional info:
(on host) #dmesg
kvm: inject_page_fault: double fault 0x80444038

(on host) #top
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
15292 root      15   0 2262m 604m  64m R 99.4  7.6   3:15.92 qemu-kvm 
1514  root      20  -5     0    0    0 R 99.1  0.0   6019:52 kksmd              
15240 root      16   0 12740 1152  820 S  0.7  0.0   0:04.69 top                
    1 root      15   0 10348  680  576 S  0.0  0.0   0:01.28 init  

(on host) #cat /proc/cpuinfo
processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 107
model name	: AMD Athlon(tm) Dual Core Processor 4450B
stepping	: 2
cpu MHz		: 2300.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy misalignsse
bogomips	: 4610.29
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
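
Since the failure shows up only on the AMD host, it may help to record which hardware virtualization flag a host exposes when comparing results. The following is a minimal sketch, not part of the original report: the function name virt_flag and its optional file argument are hypothetical, and it only inspects the first "flags" line of a cpuinfo-style file (svm is listed above for the AMD host; vmx is the corresponding Intel flag).

```shell
#!/bin/sh
# Sketch: report which hardware virtualization flag a host advertises.
# virt_flag and its optional file argument are illustrative names only.

virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line; all cores normally list the same flags.
    flags=$(sed -n '/^flags/{p;q;}' "$cpuinfo")
    # Pad with spaces so each flag can be matched as a whole word.
    case " $flags " in
        *" svm "*) echo svm ;;   # AMD-V
        *" vmx "*) echo vmx ;;   # Intel VT-x
        *)         echo none ;;
    esac
}
```

Running "virt_flag" with no argument reads /proc/cpuinfo on the current host.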

*****************************
Sometimes the guest quits:

The following information is displayed on the host:

(qemu) kvm_run: failed entry, reason 65535
rax ffffffff80433280 rbx ffffffff8006b2d8 rcx 0000000000000001 rdx ffff810081f4a680
rsi ffff810037c96100 rdi ffff810037c96100 rsp ffff810037cb7ef8 rbp 0000000000001eb8
r8  ffff81007c47e100 r9  0000000000000000 r10 0000000000000001 r11 0000000000000000
r12 0000000000000040 r13 ffff810037cb7f10 r14 0000000000000000 r15 0000000000000000
rip 0000000000000000 rflags 00000002
cs 0600 (00006000/0000ffff p 1 dpl 0 db 0 s 1 type a l 0 g 0 avl 0)
ds 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
es 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
ss 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
fs 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 1 type 2 l 0 g 0 avl 0)
tr 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 3 l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt ffff810037c97000/ffff
idt ffffffff80444000/ffff
cr0 8005003b cr2 14586000 cr3 201000 cr4 6e0 cr8 0 efer d01
kvm_run returned -8

***************************
After I launch the guest again, dmesg in the guest shows:

#dmesg (guest) (- start guest again )
BUG: soft lockup - CPU#0 stuck for 12s! [yum-updatesd-he:2807]
CPU 0:
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api lp floppy joydev i2c_piix4 i2c_core serio_raw pcspkr virtio_pci virtio_ring 8139too virtio 8139cp mii parport_pc parport ide_cd cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2807, comm: yum-updatesd-he Not tainted 2.6.18-164.el5 #1
RIP: 0010:[<ffffffff80012322>]  [<ffffffff80012322>] __do_softirq+0x51/0x133
RSP: 0000:ffffffff8043bf40  EFLAGS: 00000206
RAX: 0000000000000002 RBX: 0000000000000002 RCX: ffff81005fa33f58
RDX: ffff81005fa33fd8 RSI: ffffffff803d8e80 RDI: 000000000000000b
RBP: ffffffff8043bec0 R08: 0000000000000153 R09: ffff81005fa33f58
R10: 0000000000000000 R11: 0000003ce665e970 R12: ffffffff8005dc8e
R13: 0000000000000046 R14: ffffffff80077717 R15: ffffffff8043bec0
FS:  00002b450a217f90(0000) GS:ffffffff803c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000017cbc068 CR3: 00000000621c8000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb14>] do_softirq+0x2c/0x85
 [<ffffffff8006c99c>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>

Comment 1 Qunfang Zhang 2009-09-25 10:15:44 UTC
(In reply to comment #0)
> Description of problem:
> Start a rhel5.4-x86_64 guest on an AMD host with multi-vcpu, on the terminal of
> guest, first "echo 0 > cpu1" then "echo 1 > cpu1", the guest hang or sometimes
> quit.

Sorry, should be "echo 0 > /sys/devices/system/cpu/cpu1/online"
then "echo 1 > /sys/devices/system/cpu/cpu1/online"
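
For repeated runs, the corrected sequence above could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name cycle_cpu and the CPU_SYSFS_ROOT override are hypothetical (the override exists so the function can be tried against a scratch directory instead of the real sysfs tree), and it must be run as root inside the guest to have any effect.

```shell
#!/bin/sh
# Sketch: take one CPU offline and bring it back online through sysfs.
# cycle_cpu and CPU_SYSFS_ROOT are illustrative names, not from the report.

cycle_cpu() {
    cpu="$1"
    root="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"
    online="$root/cpu$cpu/online"

    # Refuse to continue if the control file is missing or not writable.
    if [ ! -w "$online" ]; then
        echo "cannot write $online" >&2
        return 1
    fi

    echo 0 > "$online" || return 1
    echo "cpu$cpu online=$(cat "$online")"
    echo 1 > "$online" || return 1
    echo "cpu$cpu online=$(cat "$online")"
}
```

On an affected guest, "cycle_cpu 1" would be expected to hang at the second write, which is the step this report describes.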

Comment 2 Eduardo Habkost 2009-10-01 21:02:49 UTC
Could you send the sosreport output from the host machine?

I have tried to reproduce this on an AMD host with an x86_64 guest, using exactly the same kernel and kvm versions and the same qemu-kvm command line, and I couldn't reproduce it.

Comment 5 Eduardo Habkost 2009-10-07 16:35:58 UTC
How many CPUs does the host have?

Could you send the sosreport output for the host where this was reproduced?

Comment 6 Mark Xie 2009-10-09 06:30:34 UTC
On host amd-4450b-4-2.englab.nay.redhat.com, this bug can be reproduced.
This host has 2 CPUs; see the attachment for the sosreport.

Comment 7 Mark Xie 2009-10-09 06:34:52 UTC
Created attachment 364204 [details]
The sosreport of amd-4450b-4-2

Comment 8 Eduardo Habkost 2009-10-09 16:12:28 UTC
The host has 2 CPUs but you are running a 4-vcpu guest. It is not recommended to run a guest with more vcpus than the number of available CPUs on the host. That shouldn't cause the "failed entry" error on the host, but it explains the "CPU stuck" message on the guest.

Is the bug reproducible if you limit the number of guest vcpus to 2? (or use a host that has enough CPUs)
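
The vcpu-versus-host-CPU comparison above can be checked before launching a guest. A minimal sketch, assuming the host CPU count is taken from the "processor" lines of /proc/cpuinfo; the function names host_cpus and check_smp and the optional file argument are hypothetical.

```shell
#!/bin/sh
# Sketch: warn when a guest would get more vcpus than the host has CPUs.
# host_cpus and check_smp are illustrative names, not from the report.

host_cpus() {
    # Count "processor" entries in a cpuinfo-style file.
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

check_smp() {
    vcpus="$1"
    cpus=$(host_cpus "${2:-/proc/cpuinfo}")
    if [ "$vcpus" -gt "$cpus" ]; then
        echo "overcommitted: $vcpus vcpus on $cpus host CPUs"
        return 1
    fi
    echo "ok: $vcpus vcpus on $cpus host CPUs"
}
```

For the setup in this report, "check_smp 4" on the 2-CPU host would report overcommitment, while "check_smp 2" would pass.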

Comment 9 Qunfang Zhang 2009-10-10 01:53:16 UTC
Yes, the bug can be reproduced when the number of guest vcpus is 2 on an AMD host with 2 CPUs.

Comment 12 Eduardo Habkost 2009-10-22 17:14:17 UTC
Status: I am debugging the issue on the machine where it can be reproduced. Booting of the CPU is failing. I didn't see the problem when I used -no-kvm-irqchip.

Comment 13 Eduardo Habkost 2009-10-23 20:40:27 UTC
Created attachment 365890 [details]
experimental patch to the issue

Attached experimental fix to the issue. I need to test it with other guests and on a host running latest upstream KVM, before submitting it upstream.

Comment 14 Eduardo Habkost 2009-10-25 19:35:13 UTC
Fix submitted and applied upstream: http://article.gmane.org/gmane.comp.emulators.kvm.devel/42168

Comment 18 Qunfang Zhang 2009-10-28 05:13:25 UTC
Verified this bug in kvm-83-131.el5; the issue no longer occurs.
AMD:
x86_64 guest --Passed
i386 guest --Passed
i386-PAE guest --Passed

Intel:
i386-PAE guest -- Passed
i386 guest: -- Passed
x86_64 guest: -- Passed

Comment 21 Qunfang Zhang 2009-12-24 05:18:39 UTC
Verified in kvm-83-140.el5; this issue no longer occurs.
host kernel: 2.6.18-182.el5
AMD:
x86_64   passed
i386     passed
i386-PAE passed

Intel:
x86_64   passed
i386     passed
i386-PAE passed

AMD host cpuinfo:
processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 107
model name	: AMD Athlon(tm) Dual Core Processor 5400B
stepping	: 2
cpu MHz		: 1000.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy misalignsse
bogomips	: 2004.17
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

Comment 24 errata-xmlrpc 2010-03-30 07:55:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html

