Bug 1169577
Summary: Redhat-6.4_64bit-guest kernel panic with cpu-passthrough and guest numa
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: Wang Xin <wangxinxin.wang>
Assignee: Eduardo Habkost <ehabkost>
QA Contact: Virtualization Bugs <virt-bugs>
CC: benoit.canet, bsarathy, chayang, drjones, hhuang, hrgstephen, juzhang, lwoodman, mkenneth, pbonzini, rbalakri, virt-maint, wangxinxin.wang
Target Milestone: rc
Keywords: Reopened
Doc Type: Bug Fix
Clones: 1184125 (view as bug list)
Bug Blocks: 1184125
Type: Bug
Last Closed: 2015-01-20 15:42:13 UTC
Description (Wang Xin, 2014-12-02 02:47:37 UTC)
We found that after QEMU commit 787aaf57 (target-i386: forward CPUID cache leaves when -cpu host is used), the guest gets its CPU cache information from the host when -cpu host is used. But if we configure guest NUMA as:

    node 0: cpus 0-7
    node 1: cpus 8-15

then both NUMA nodes lie inside the same host CPU cache (cpus 0-15). When the guest OS boots it calculates group->cpu_power, but the guest finds that those two different nodes own the same cache, so node 1's group->cpu_power is never assigned and keeps its initial value 0. When a vcpu is then scheduled, the division by 0 causes a kernel panic.

This should be fixed since kernel-2.6.32-395.el6 with commit 08d7ef55afc468ed6cb29d892b53063dc382c9fa:

    Author: Radim Krcmar <rkrcmar>
    Date: Wed Jun 5 10:19:02 2013 -0400

    [kernel] sched: make weird topologies bootable

Please update your guest kernel.

*** This bug has been marked as a duplicate of bug 892677 ***

Thanks, Andrew. Yes, the patch avoids the guest kernel panic problem. But I think the other problem is that QEMU, even with the right arguments:

    -cpu host -m 16384 \
    -smp 16,sockets=2,cores=4,threads=2 \
    -object memory-backend-ram,size=8192M,id=ram-node0 \
    -numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
    -object memory-backend-ram,size=8192M,id=ram-node1 \
    -numa node,nodeid=1,cpus=8-15,memdev=ram-node1

still emulates a weird topology for the VM. It seems QEMU cannot emulate a sane topology when '-cpu host' and guest NUMA are used together. In my example, QEMU emulates the right CPU topology and guest NUMA nodes according to the user's config, but with '-cpu host' QEMU uses the host CPU cache info directly instead of emulating it, which makes the CPU cache info and the emulated NUMA topology mismatch. In any case, QEMU should ensure that vcpus sharing a cache sit in the same guest NUMA node. Can QEMU enforce some rules to avoid creating weird topologies for the guest?
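To make the failure mode concrete, here is a minimal illustrative model (not kernel code; all names are hypothetical) of the RHEL 6 scheduler behavior described above: when two guest NUMA nodes appear to lie inside one shared cache domain, the second node's scheduling group never receives a cpu_power value, and a later load-balancing division by group->cpu_power divides by zero.

```python
# Toy model of the broken topology walk: a node whose cpus all fall
# inside a cache domain already claimed by another node is skipped,
# so its cpu_power stays at the initial value 0.

def build_group_power(numa_nodes, cache_domains):
    power = {}
    claimed = set()
    for node, cpus in numa_nodes.items():
        # Find the cache domain containing this node's cpus.
        domain = next(frozenset(d) for d in cache_domains
                      if set(cpus) <= set(d))
        if domain in claimed:
            power[node] = 0          # never initialized in the buggy walk
        else:
            claimed.add(domain)
            power[node] = len(cpus) * 1024  # SCHED_POWER_SCALE per cpu
    return power

# Guest config from the report: node 0 = cpus 0-7, node 1 = cpus 8-15,
# but "-cpu host" makes all 16 vcpus share one host cache domain.
nodes = {0: range(0, 8), 1: range(8, 16)}
power = build_group_power(nodes, [range(0, 16)])
print(power)  # node 1 ends up with cpu_power 0
# load balancing then computes load / power[1] -> division by zero,
# i.e. the guest kernel panic.
```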
Furthermore, the Linux kernel assumes that CPUs sharing a last-level cache belong to the same NUMA node. Current QEMU cannot guarantee that, so some guests panic at boot, and newer guests such as RHEL 7 (Linux 3.10) and Linux 3.17 warn about it, like this:

    [    0.139016] ------------[ cut here ]------------
    [    0.139016] WARNING: at arch/x86/kernel/smpboot.c:326 topology_sane.isra.1+0x6f/0x80()
    [    0.139016] sched: CPU #8's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
    [    0.139016] Modules linked in:
    [    0.139016] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 3.10.0-123.el7.x86_64 #1
    [    0.139016] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [    0.139016]  ffff88003e2ffe38 0f82a03969bb4fc8 ffff88003e2ffdf0 ffffffff815e19ba
    [    0.139016]  ffff88003e2ffe28 ffffffff8105dee1 0000000000000001 0000000000013c80
    [    0.139016]  0000000000000008 0000000000000001 0000000000000000 ffff88003e2ffe90
    [    0.139016] Call Trace:
    [    0.139016]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
    [    0.139016]  [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
    [    0.139016]  [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
    [    0.139016]  [<ffffffff8102ee8d>] ? __mcheck_cpu_init_timer+0x4d/0x60
    [    0.139016]  [<ffffffff815cf731>] topology_sane.isra.1+0x6f/0x80
    [    0.139016]  [<ffffffff815cfa35>] set_cpu_sibling_map+0x2b9/0x500
    [    0.139016]  [<ffffffff815cfe17>] start_secondary+0x19b/0x27b
    [    0.139016] ---[ end trace 5508d90aed792a9b ]---

The relevant code in Linux kernel 3.17 is:

1. topology_sane() checks whether two CPUs are in the same NUMA node, and if not, it prints a warning;
2. match_llc() invokes topology_sane() if two CPUs' cpu_llc_id (last-level cache ID) is the same;
3. in init_intel_cacheinfo(), l2_id or l3_id (which later becomes the CPU's cpu_llc_id) is calculated as the CPU's apicid divided by num_threads_sharing (the number of CPUs sharing that L2 or L3 cache). num_threads_sharing comes from CPUID leaf 04H in the guest, which is generated by QEMU.
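The three steps above can be sketched as follows. This is a simplified model of the kernel logic (not the actual C code from intel_cacheinfo.c/smpboot.c): the LLC id is derived from the APIC id and num_threads_sharing, and a warning is emitted for CPU pairs that share an LLC id but sit in different NUMA nodes.

```python
# Sketch of the kernel-side derivation: cpu_llc_id is the apicid
# shifted down by the index width of num_threads_sharing; for
# power-of-two values this is integer division.

def llc_id(apicid, num_threads_sharing):
    shift = (num_threads_sharing - 1).bit_length()
    return apicid >> shift

def topology_sane(cpu_to_node, apicids, num_threads_sharing):
    """Return warnings for CPU pairs that share an LLC but not a node."""
    warnings = []
    llc = {c: llc_id(a, num_threads_sharing) for c, a in apicids.items()}
    cpus = sorted(llc)
    for i in cpus:
        for j in cpus:
            if i < j and llc[i] == llc[j] and cpu_to_node[i] != cpu_to_node[j]:
                warnings.append((i, j))
    return warnings

# 16 vcpus with APIC ids 0-15; the passed-through host leaf 04H says
# 32 threads share the LLC, so every vcpu gets llc_id 0 -- but the
# guest NUMA config splits them across two nodes, triggering the
# smpboot.c warning shown above.
node_of = {c: (0 if c < 8 else 1) for c in range(16)}
apic = {c: c for c in range(16)}
print(topology_sane(node_of, apic, 32)[:3])  # -> [(0, 8), (0, 9), (0, 10)]
```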
So with CPU model host-passthrough, with Benoît's patch (http://git.qemu.org/?p=qemu.git;a=commit;h=787aaf5703a702094f395db6795e74230282cd62), CPUID leaf 04H is passed through from host to guest. On our host (dual Xeon E5620), num_threads_sharing is 32, host CPU socket 0's apicids run from 0 to 31, and socket 1's from 32 to 63, so on the host this is correct. But for, say, an 8-vcpu guest, QEMU and SeaBIOS give the guest CPU apicids from 0 to 7, so the guest concludes that all vcpus share the L3 cache. If guest NUMA then puts vcpu0-vcpu3 in node 0 and the other vcpus in node 1, this confuses the guest: it sees a weird topology in which CPUs sharing a last-level cache sit in different NUMA nodes, unless the vcpu topology places every vcpu in its own socket.

Even without Benoît's patch, or if we use CPU model host-model, QEMU presents all hyper-threads to the guest as sharing the same L2 cache (and no L3 cache). If we configure vcpus from the same hyper-thread group in different NUMA nodes (i.e. a vcpu topology of 1 socket, 1 core, 2 threads, with vcpu0 in node 0 and vcpu1 in node 1), that is also a weird topology.

So I think:

1. CPUID leaf 4 should not be passed through to the guest directly.
2. QEMU should check whether vcpus in the same hyper-thread group (or sharing a last-level cache) are in different NUMA nodes; if they are, we should refuse to boot.

Benoît and Paolo, what do you think about reverting Benoît's patch for CPUID leaf 4? Or do you have a better idea? Thanks.

Let's see what Benoît and Paolo say, but in my opinion, if you ask for -cpu host, then you should expect to see what the host sees. Now, a QEMU patch that complains and fails to start the guest when a user requests -cpu host together with a NUMA topology that doesn't exactly match the host does seem reasonable. Another thing to ask is: why is '-cpu host' necessary in your config?
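The boot-time check proposed in point 2 above could look roughly like the following. This is a minimal sketch of the idea, not QEMU code; all names are illustrative.

```python
# Hypothetical validation: refuse to start when vcpus that share a
# last-level cache (or hyper-thread group) span guest NUMA nodes.

def validate_numa_vs_cache(numa_cpus, sharing_groups):
    """numa_cpus: node id -> set of vcpus.
    sharing_groups: list of vcpu sets sharing an LLC or thread group.
    Raises ValueError on a mismatched (weird) topology."""
    for group in sharing_groups:
        nodes = {n for n, cpus in numa_cpus.items() if cpus & group}
        if len(nodes) > 1:
            raise ValueError(
                "vcpus %s share a cache but span NUMA nodes %s"
                % (sorted(group), sorted(nodes)))

# The thread-pair example from the comment: 1 socket, 1 core, 2 threads,
# with vcpu0 in node 0 and vcpu1 in node 1 -> rejected.
try:
    validate_numa_vs_cache({0: {0}, 1: {1}}, [{0, 1}])
except ValueError as e:
    print("refusing to boot:", e)
```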
Maybe we should be looking at what features are missing from the CPU models instead. Then, if we add those, the config can stay fully emulated.

I think it makes sense to remove the automatic "-cpu host" -> pass-through-the-info behavior, and instead add a property like "host_cache_info" that always defaults to false and that can also be used with models other than "-cpu host".

Hello,

A bit about the use case that pushed me to write the patch: some CPU-intensive applications (3DS Simulia) running in the guest use the CPUID leaves to make a best guess about the CPU topology and autotune themselves from the CPUID results. So there is a real business use case for this patch, and Red Hat probably has users doing similarly compute-intensive workloads in KVM, so I think the option to pass this leaf through should be kept in one way or another.

Best regards,
Benoît

Thanks, Paolo. It's a good idea; we just need to initialize x86_cpu_def.cache_info_passthrough to false and turn it on via an option such as "-host_cache_info=[on/off]". But there are still problems:

1) The cache info we pass through to the guest is read via host CPUID at vcpu initialization. If vcpus are not pinned to pcpus, the guest will get wrong cache info whenever a vcpu is scheduled to another physical core or physical node.

2) QEMU passes host CPUID leaf 04H through to the guest, while the apic_id is emulated by QEMU itself. Without the host CPU APIC ID, the guest cannot correctly parse the Deterministic Cache Parameters from CPUID 04H. Unless we solve these two problems, passing the cache info through does not make sense and does not give correct information. Do you think the APIC ID generation code in QEMU/SeaBIOS needs to be modified? Benoît, do you know exactly which information from CPUID 04H the guest application needs?

3) Also, if the CPU model is custom or host-model, QEMU should check whether the threads of the vcpu topology are in the same guest NUMA node, or the same problem exists.

Do you think this is OK?
> Benoît, do you know exactly which information from CPUID 04H the guest application needs?

I just asked the user; now I need to wait for a response.

Best regards,
Benoît

Hello,

If my memory and that of my customer are correct, this particular application needs to guess the L3 cache topology (mainly its size).

Best regards,
Benoît

(In reply to Benoît Canet from comment #11)
> If my memory and that of my customer are correct, this particular
> application needs to guess the L3 cache topology (mainly its size).

Hi, Benoît. Why not emulate the L3 cache info according to the host CPU? Have you ever tried it?

Bug cloned for RHEL-7: bug 1184125. On RHEL-6, the workaround is to not use "-cpu host" and instead use an equivalent CPU model name that matches the host.

(In reply to Wang Xin from comment #0)
> Version-Release number of selected component (if applicable):
> qemu-kvm-2.1.0 (We found it happened since commit
> 787aaf5703a702094f395db6795e74230282cd62 by git bisect.)

This commit is not present in RHEL-6 qemu-kvm, so it is not a RHEL-6 qemu-kvm bug.