Bug 1169577

Summary: Redhat-6.4_64bit-guest kernel panic with cpu-passthrough and guest numa
Product: Red Hat Enterprise Linux 6
Reporter: Wang Xin <wangxinxin.wang>
Component: qemu-kvm
Assignee: Eduardo Habkost <ehabkost>
Status: CLOSED NOTABUG
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.4
CC: benoit.canet, bsarathy, chayang, drjones, hhuang, hrgstephen, juzhang, lwoodman, mkenneth, pbonzini, rbalakri, virt-maint, wangxinxin.wang
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1184125 (view as bug list)
Environment:
Last Closed: 2015-01-20 15:42:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1184125

Description Wang Xin 2014-12-02 02:47:37 UTC
Description of problem:
Running a RHEL 6.4 64-bit guest (kernel 2.6.32-358.el6.x86_64) or older on
qemu-2.1, with KVM enabled, -cpu host, a non-default CPU topology, and guest
NUMA, I see a reliable kernel panic in the guest shortly after boot. It
happens in find_busiest_group().


Version-Release number of selected component (if applicable):
qemu-kvm-2.1.0 (git bisect shows the problem started with commit
787aaf5703a702094f395db6795e74230282cd62.)

How reproducible:
100%

Steps to Reproduce:
1. Configure the VM with -cpu host, a CPU topology, and NUMA nodes. The full qemu command line:
qemu-system-x86_64 -machine pc-i440fx-2.1,accel=kvm,usb=off \
-cpu host -m 16384 \
-smp 16,sockets=2,cores=4,threads=2 \
-object memory-backend-ram,size=8192M,id=ram-node0 \
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
-object memory-backend-ram,size=8192M,id=ram-node1 \
-numa node,nodeid=1,cpus=8-15,memdev=ram-node1 \
-boot c -drive file=/image/dir/redhat_6.4_64 \
-vnc 0.0.0.0:0 -device cirrus-vga,id=video0,vgamem_mb=8,bus=pci.0,addr=0x1.0x4 \
-msg timestamp=on


Actual results:
The guest kernel panics with a divide error.

Expected results:
The VM starts with the configured NUMA topology.

Additional info:

(1) The guest kernel messages:

divide error: 0000 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:

Pid: 1, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:[<ffffffff81059a9c>]  [<ffffffff81059a9c>] find_busiest_group+0x55c/0x9f0
RSP: 0018:ffff88023c85f9e0  EFLAGS: 00010046
RAX: 0000000000100000 RBX: ffff88023c85fbdc RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000010
RBP: ffff88023c85fb50 R08: ffff88023ca16c10 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffff01
R13: 0000000000016700 R14: ffffffffffffffff R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001a85000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88023c85e000, task ffff88043d27c040)
Stack:
 ffff88023c85faf0 ffff88023c85fa60 ffff88023c85fbc8 0000000200000000
<d> 0000000100000000 ffff880028210b60 0000000100000001 0000000000000008
<d> 0000000000016700 0000000000016700 ffff88023ca16c00 0000000000016700
Call Trace:
 [<ffffffff8150da2a>] thread_return+0x398/0x76e
 [<ffffffff8150e555>] schedule_timeout+0x215/0x2e0
 [<ffffffff81065905>] ? enqueue_entity+0x125/0x410
 [<ffffffff8150e1d3>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20
 [<ffffffff81096a89>] kthread_create+0x99/0x120
 [<ffffffff81090950>] ? worker_thread+0x0/0x2a0
 [<ffffffff81167769>] ? alternate_node_alloc+0xc9/0xe0
 [<ffffffff810908d9>] create_workqueue_thread+0x59/0xd0
 [<ffffffff8150ebce>] ? mutex_lock+0x1e/0x50
 [<ffffffff810911bd>] __create_workqueue_key+0x14d/0x200
 [<ffffffff81c47233>] init_workqueues+0x9f/0xb1
 [<ffffffff81c2788c>] kernel_init+0x25e/0x2fe
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81c2762e>] ? kernel_init+0x0/0x2fe
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 8b b5 b0 fe ff ff 48 8b bd b8 fe ff ff e8 9d 85 ff ff 0f 1f 44 00 00 48 8b 95 e0
fe ff ff 48 8b 45 a8 8b 4a 08 48 c1 e0 0a 31 d2 <48> f7 f1 48 8b 4d b0 48 89 45 a0
31 c0 48 85 c9 74 0c 48 8b 45
RIP  [<ffffffff81059a9c>] find_busiest_group+0x55c/0x9f0
 RSP <ffff88023c85f9e0>
divide error: 0000 [#2]
---[ end trace d7d20afc6dd05e71 ]---
Kernel panic - not syncing: Fatal exception
Pid: 1, comm: swapper Tainted: G      D    ---------------    2.6.32-358.el6.x86_64 #1
Call Trace:
 [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
 [<ffffffff815111f4>] ? oops_end+0xe4/0x100
 [<ffffffff8100f19b>] ? die+0x5b/0x90
 [<ffffffff81510a34>] ? do_trap+0xc4/0x160
 [<ffffffff8100cf7f>] ? do_divide_error+0x8f/0xb0
 [<ffffffff81059a9c>] ? find_busiest_group+0x55c/0x9f0
 [<ffffffff8113b3a9>] ? zone_statistics+0x99/0xc0
 [<ffffffff8100bdfb>] ? divide_error+0x1b/0x20
 [<ffffffff81059a9c>] ? find_busiest_group+0x55c/0x9f0
 [<ffffffff8150da2a>] ? thread_return+0x398/0x76e
 [<ffffffff8150e555>] ? schedule_timeout+0x215/0x2e0
 [<ffffffff81065905>] ? enqueue_entity+0x125/0x410
 [<ffffffff8150e1d3>] ? wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150e2ed>] ? wait_for_completion+0x1d/0x20
 [<ffffffff81096a89>] ? kthread_create+0x99/0x120
 [<ffffffff81090950>] ? worker_thread+0x0/0x2a0
 [<ffffffff81167769>] ? alternate_node_alloc+0xc9/0xe0
 [<ffffffff810908d9>] ? create_workqueue_thread+0x59/0xd0
 [<ffffffff8150ebce>] ? mutex_lock+0x1e/0x50
 [<ffffffff810911bd>] ? __create_workqueue_key+0x14d/0x200
 [<ffffffff81c47233>] ? init_workqueues+0x9f/0xb1
 [<ffffffff81c2788c>] ? kernel_init+0x25e/0x2fe
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81c2762e>] ? kernel_init+0x0/0x2fe
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

-- The line that triggers the divide error:
"sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;"
in update_sg_lb_stats(), file sched.c, line 4094.
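
As a rough illustration (a simplified sketch, not the real kernel code; only the quoted division is taken from sched.c, and SCHED_LOAD_SCALE is simplified to its 2.6.32 value):

/* Simplified sketch of the division in update_sg_lb_stats(). */
#define SCHED_LOAD_SCALE (1UL << 10)

struct sg_lb_stats_sketch {
    unsigned long group_load;
    unsigned long avg_load;
};

static void compute_avg_load(struct sg_lb_stats_sketch *sgs,
                             unsigned long group_cpu_power)
{
    /* When the mismatched cache/NUMA topology leaves a scheduler group's
     * cpu_power at its initial value 0, this unsigned integer division
     * raises #DE, which the guest reports as "divide error: 0000". */
    sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group_cpu_power;
}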

(2) Host info

/proc/cpuinfo on the host has 16 entries like this one:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
stepping        : 7
microcode       : 1803
cpu MHz         : 3301.000
cache size      : 10240 KB
physical id     : 1
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 39
initial apicid  : 39
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts
dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 6599.83
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


Host NUMA topology:

node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 40936 MB
node 0 free: 39625 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 40960 MB
node 1 free: 39876 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

(3) With "sched_debug loglevel=8" kernel parameter command line,
you can see follow error log(those "ERROR"s):

 CPU0 attaching sched-domain:
  domain 0: span 0-15 level MC
   groups: 0 (cpu_power = 1023) 1 2 3 4 5 6 7 8 9 10 (cpu_power = 1023) 11 12
 13 14 15
 ERROR: parent span is not a superset of domain->span
   domain 1: span 0-7 level CPU
 ERROR: domain->groups does not contain CPU0
    groups: 8-15 (cpu_power = 16382)
 ERROR: groups don't span domain->span
    domain 2: span 0-15 level NODE
     groups:
 ERROR: domain->cpu_power not set

Comment 1 Wang Xin 2014-12-02 02:56:10 UTC
We found that after QEMU commit 787aaf57 (target-i386: forward CPUID cache
leaves when -cpu host is used), the guest gets its CPU cache information from
the host when -cpu host is used. But if we configure guest NUMA as:
   node 0: cpus 0-7
   node 1: cpus 8-15
then both NUMA nodes lie within the same host CPU cache (covering cpus 0-15).
When the guest OS boots and calculates group->cpu_power, it finds that the
two different nodes share the same cache, so node 1's group->cpu_power is
never assigned and stays at its initial value of 0. When a vcpu is then
scheduled, the division by 0 causes the kernel panic.

Comment 3 Andrew Jones 2014-12-02 09:31:52 UTC
This should be fixed since kernel-2.6.32-395.el6 with

commit 08d7ef55afc468ed6cb29d892b53063dc382c9fa
Author: Radim Krcmar <rkrcmar>
Date:   Wed Jun 5 10:19:02 2013 -0400

    [kernel] sched: make weird topologies bootable

Please update your guest kernel.

*** This bug has been marked as a duplicate of bug 892677 ***

Comment 4 Wang Xin 2014-12-03 06:02:01 UTC
Thanks, Andrew.

Yes, that patch avoids the guest kernel panic.
But I think there is another problem: QEMU, given correct arguments:

" -cpu host -m 16384 \
-smp 16,sockets=2,cores=4,threads=2 \
-object memory-backend-ram,size=8192M,id=ram-node0 \
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
-object memory-backend-ram,size=8192M,id=ram-node1 \
-numa node,nodeid=1,cpus=8-15,memdev=ram-node1 "

still emulates a weird topology for the VM.

It seems QEMU cannot emulate a sane topology when both '-cpu host' and
guest NUMA are used.

In my example, QEMU emulates the CPU topology and the guest NUMA nodes
exactly as the user configured them. But with '-cpu host', QEMU passes the
host CPU cache information through directly instead of emulating it, so the
CPU cache information and the emulated NUMA topology no longer match.
In any case, QEMU should ensure that vcpus sharing a cache end up in the
same guest NUMA node. Could QEMU enforce some rules to avoid creating such
weird topologies for the guest?

Comment 5 Wang Xin 2014-12-03 07:16:20 UTC
Furthermore:

The Linux kernel assumes that CPUs sharing a last-level cache belong to the same NUMA node, but current QEMU cannot guarantee that. As a result, some guests panic at boot, and newer guests such as RHEL 7 (Linux 3.10) and Linux 3.17 warn about it, like this:
[    0.139016] ------------[ cut here ]------------
[    0.139016] WARNING: at arch/x86/kernel/smpboot.c:326 topology_sane.isra.1+0x6f/0x80()
[    0.139016] sched: CPU #8's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.139016] Modules linked in:
[    0.139016] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 3.10.0-123.el7.x86_64 #1
[    0.139016] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.139016]  ffff88003e2ffe38 0f82a03969bb4fc8 ffff88003e2ffdf0 ffffffff815e19ba
[    0.139016]  ffff88003e2ffe28 ffffffff8105dee1 0000000000000001 0000000000013c80
[    0.139016]  0000000000000008 0000000000000001 0000000000000000 ffff88003e2ffe90
[    0.139016] Call Trace:
[    0.139016]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[    0.139016]  [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[    0.139016]  [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
[    0.139016]  [<ffffffff8102ee8d>] ? __mcheck_cpu_init_timer+0x4d/0x60
[    0.139016]  [<ffffffff815cf731>] topology_sane.isra.1+0x6f/0x80
[    0.139016]  [<ffffffff815cfa35>] set_cpu_sibling_map+0x2b9/0x500
[    0.139016]  [<ffffffff815cfe17>] start_secondary+0x19b/0x27b
[    0.139016] ---[ end trace 5508d90aed792a9b ]---

The relevant code in Linux kernel 3.17 is:
1.	topology_sane() checks whether two CPUs are in the same NUMA node and prints a warning if they are not;
2.	match_llc() invokes topology_sane() if two CPUs have the same cpu_llc_id (last-level cache ID);
3.	in init_intel_cacheinfo(), l2_id or l3_id (which later becomes this CPU's cpu_llc_id) is derived from the CPU's APIC ID and num_threads_sharing (the number of CPUs sharing that L2 or L3 cache). num_threads_sharing comes from CPUID leaf 04H in the guest, which is generated by QEMU.

So with the CPU model host-passthrough and Benoît's patch (http://git.qemu.org/?p=qemu.git;a=commit;h=787aaf5703a702094f395db6795e74230282cd62), CPUID leaf 04H is passed through from the host to the guest. On our host (dual Xeon E5620), num_threads_sharing is 32, host socket 0's APIC IDs run from 0 to 31, and host socket 1's from 32 to 63, so on the host this is consistent. But for an 8-vcpu guest, for example, QEMU and SeaBIOS assign guest APIC IDs 0 to 7, so the guest concludes that all vcpus share one L3 cache. If guest NUMA then places vcpu0-vcpu3 in node 0 and the remaining vcpus in node 1, the guest sees a weird topology: CPUs sharing a last-level cache sit in different NUMA nodes. The only exception is a vcpu topology where every vcpu is in its own socket.
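
As a rough sketch of the cpu_llc_id derivation described in point 3 above (the helpers below are simplified stand-ins for get_count_order() and the init_intel_cacheinfo() logic, not the exact kernel code):

#include <stdio.h>

/* Smallest p such that 2^p >= n, mirroring get_count_order(). */
static unsigned int count_order(unsigned int n)
{
    unsigned int p = 0;
    while ((1u << p) < n)
        p++;
    return p;
}

/* LLC ID = APIC ID with the "threads sharing this cache" bits masked off. */
static unsigned int llc_id(unsigned int apicid, unsigned int threads_sharing)
{
    unsigned int index_msb = count_order(threads_sharing);
    return apicid & ~((1u << index_msb) - 1);
}

int main(void)
{
    /* The host leaf 04H reports 32 threads sharing the L3, but QEMU/SeaBIOS
     * give the 8 vcpus APIC IDs 0..7 (per the example above). */
    for (unsigned int apic = 0; apic < 8; apic++)
        printf("vcpu apicid %u -> llc_id %u\n", apic, llc_id(apic, 32));
    /* Every vcpu gets llc_id 0, i.e. one shared L3, even though guest NUMA
     * splits the vcpus across two nodes -- the "weird topology" the kernel
     * warns about (or divides by zero on, in the RHEL 6.4 case). */
    return 0;
}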

Even without Benoît's patch, or with the host-model CPU model, QEMU tells the guest that all hyper-threads share the same L2 cache (and that there is no L3 cache). If we then place vcpus from the same hyper-thread group in different NUMA nodes (e.g. a vcpu topology of 1 socket, 1 core, 2 threads with vcpu0 in NUMA node 0 and vcpu1 in NUMA node 1), that is also a weird topology.

So I think:
1.	CPUID leaf 4 should not be passed through to the guest directly.
2.	QEMU should check whether vcpus in the same hyper-thread group (or sharing a last-level cache) end up in different NUMA nodes, and if they do, refuse to start (see the sketch after this list).
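
A hypothetical sketch of the check in point 2 above (the structure and function names are illustrative, not existing QEMU code; a real check would use QEMU's own vcpu and NUMA data structures):

#include <stdio.h>
#include <stdlib.h>

struct vcpu_layout {
    unsigned int apicid;
    unsigned int llc_id;     /* as derived from the CPUID cache leaves */
    unsigned int numa_node;  /* from the -numa node,cpus=... mapping */
};

/* Refuse to start if two vcpus that share a last-level cache are placed in
 * different guest NUMA nodes. */
static void check_cache_vs_numa(const struct vcpu_layout *v, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (v[i].llc_id == v[j].llc_id &&
                v[i].numa_node != v[j].numa_node) {
                fprintf(stderr,
                        "vcpus %u and %u share a last-level cache but are in "
                        "different NUMA nodes (%u vs %u); refusing to start\n",
                        v[i].apicid, v[j].apicid,
                        v[i].numa_node, v[j].numa_node);
                exit(EXIT_FAILURE);
            }
        }
    }
}

int main(void)
{
    /* The 8-vcpu example from this comment: one shared LLC, two NUMA nodes. */
    struct vcpu_layout v[8];
    for (unsigned int i = 0; i < 8; i++)
        v[i] = (struct vcpu_layout){ .apicid = i, .llc_id = 0,
                                     .numa_node = i < 4 ? 0 : 1 };
    check_cache_vs_numa(v, 8);
    return 0;
}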

Benoît and Paolo, what do you think about reverting Benoît's patch for CPUID leaf 4? Or do you have a better idea?

Thanks.

Comment 6 Andrew Jones 2014-12-03 12:06:13 UTC
Let's see what Benoît and Paolo say, but in my opinion, if you ask for -cpu host, then you should expect to see what the host sees. Now, a patch to qemu that complains and fails to start the guest when a user requests -cpu host and also some numa topology that doesn't exactly match the host, does seem reasonable.

Another thing to ask is, why is '-cpu host' necessary in your config? Maybe we should be looking at what features we're missing from the cpu models instead. Then, if we add those, it'll allow the config to stay emulated.

Comment 7 Paolo Bonzini 2014-12-04 22:39:52 UTC
I think it makes sense to remove the automatic "-cpu host" -> "pass through the cache info" behavior, and instead add a property like "host_cache_info" that always defaults to false and can also be used with models other than "-cpu host".
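
For illustration, a minimal sketch of what such a property could look like, assuming QEMU's qdev property macros and the cache_info_passthrough field mentioned in comment 9 (this is a sketch of the suggestion, not a committed change):

/* Would sit among the existing X86CPU property definitions in
 * target-i386/cpu.c. Default is off, so "-cpu host" alone no longer implies
 * cache-info passthrough; any CPU model could opt in explicitly. */
static Property x86_cpu_properties[] = {
    DEFINE_PROP_BOOL("host-cache-info", X86CPU, cache_info_passthrough, false),
    DEFINE_PROP_END_OF_LIST(),
};

With such a property, passthrough could then be requested explicitly, e.g. "-cpu host,host-cache-info=on" (hypothetical syntax, following QEMU's usual -cpu property handling).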

Comment 8 Benoît Canet 2014-12-04 23:24:12 UTC
Hello,

A bit about the use case which pushed me to write the patch.

Some CPU-intensive applications (3DS Simulia) running in the guest use the CPUID leaves to make a best guess about the CPU topology and autotune themselves based on the results.

So there is a real business use case for this patch, and Red Hat probably has users running similar compute-intensive workloads on KVM, so I think the option to pass this leaf through should be kept in one way or another.

Best regards

Benoît

Comment 9 Wang Xin 2014-12-05 08:12:07 UTC
Thanks, Paolo.
That is a good idea: we just need to initialize x86_cpu_def.cache_info_passthrough to false and turn it on via an option such as "-host_cache_info=[on/off]".

But there are still problems:
1) The cache info we pass through to the guest is read from the host CPUID when the vcpus are initialized. If the vcpus are not pinned to pcpus, the guest will see wrong cache info whenever a vcpu is scheduled onto a different physical core or node.

2) QEMU passes the host's CPUID leaf 04H through to the guest while emulating the APIC IDs itself. Without the host CPU APIC IDs, the guest cannot correctly interpret the Deterministic Cache Parameters from CPUID 04H (see the decode sketch at the end of this comment).

Unless we solve these two problems, passing the cache info through to the guest does not make sense, because the information the guest sees is not correct. Do you think the APIC ID generation code in QEMU/SeaBIOS needs to be modified?

Benoît, do you know exactly which information from CPUID 04H the guest application needs?

3) Also, even with a custom or host-model CPU model, QEMU should check whether the threads of one vcpu core are in the same guest NUMA node; otherwise the same problem exists. Do you think this is OK?
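
To make problem (2) above concrete, here is a sketch of how a guest decodes one CPUID.04H subleaf (bit layout per the Intel SDM; the subleaf index and example numbers follow the E5620 case from comment 5, and the code is an illustration, not guest kernel source):

#include <stdio.h>

static void cpuid4(unsigned int subleaf, unsigned int *eax, unsigned int *ebx,
                   unsigned int *ecx, unsigned int *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(4u), "c"(subleaf));
}

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    cpuid4(3, &eax, &ebx, &ecx, &edx);          /* subleaf 3 is usually L3 */

    unsigned int level      = (eax >> 5) & 0x7;
    unsigned int sharing    = ((eax >> 14) & 0xfff) + 1;  /* threads sharing */
    unsigned int line_size  = (ebx & 0xfff) + 1;
    unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
    unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
    unsigned int sets       = ecx + 1;

    unsigned long size_kb = (unsigned long)ways * partitions * line_size * sets / 1024;
    printf("L%u cache: %lu KB, shared by up to %u threads\n",
           level, size_kb, sharing);
    /* With the leaf passed through from the host, "sharing" reflects the host
     * package (32 in the E5620 example), while the APIC IDs the guest combines
     * it with are QEMU's emulated ones -- hence the inconsistent topology. */
    return 0;
}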

Comment 10 Benoît Canet 2014-12-05 08:28:52 UTC
> Benoît, do you know exactly which information from CPUID 04H the guest application needs?

I just asked the user; now I need to wait for a response.

Best regards

Benoît

Comment 11 Benoît Canet 2014-12-05 15:21:41 UTC
Hello,

If my memory and my customer's are correct, this particular application needs to determine the L3 cache topology (mainly its size).

Best regards

Benoît

Comment 12 Wang Xin 2014-12-08 07:11:25 UTC
(In reply to Benoît Canet from comment #11)
> Hello,
> 
> If my memory and my customer's are correct, this particular application
> needs to determine the L3 cache topology (mainly its size).
> 

Hi, Benoît.
Why not emulate the L3 cache info according to the host cpu?  Have you ever tried it?

> Best regards
> 
> Benoît

Comment 13 Eduardo Habkost 2015-01-20 15:42:13 UTC
Bug cloned for RHEL-7: bug 1184125.

On RHEL-6, the workaround is to not use "-cpu host" and instead use an equivalent CPU model name that matches the host.

Comment 14 Eduardo Habkost 2015-08-19 16:56:35 UTC
(In reply to Wang Xin from comment #0)
> Version-Release number of selected component (if applicable):
> qemu-kvm-2.1.0 (git bisect shows the problem started with commit
> 787aaf5703a702094f395db6795e74230282cd62.)

This commit is not present in RHEL-6 qemu-kvm, so it is not a RHEL-6 qemu-kvm bug.