Bug 1728166
| Summary: | QEMU provides invalid local APIC ID and extended topology CPUID 0x8000001e leafs when using EPYC cpu model |
|---|---|
| Product: | Red Hat Enterprise Linux 9 |
| Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm |
| Assignee: | Igor Mammedov <imammedo> |
| qemu-kvm sub component: | CPU Models |
| QA Contact: | liunana <nanliu> |
| Status: | CLOSED WONTFIX |
| Docs Contact: | |
| Severity: | unspecified |
| Priority: | low |
| CC: | ailan, babu.moger, chayang, ctatman, dgilbert, ehabkost, imammedo, jinzhao, juzhang, nanliu, virt-maint, yuhuang |
| Version: | 9.0 |
| Keywords: | Reopened, Triaged |
| Target Milestone: | rc |
| Flags: | knoel: mirror+ |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| : | 1873417 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-09-15 08:26:13 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |
Created attachment 1588638 [details]
auto-test-log
Found that the guest hits the call trace before migration, so feel free to change the assignee, thanks.

To confirm how to reproduce: booting the guest with the following options hits the guest call trace; without them, the issue is not hit:

-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Host NUMA info:

[root@hp-dl385g10-13 latest]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7251 8-Core Processor
Stepping:              2
CPU MHz:               2327.725
CPU max MHz:           2100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4192.17
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              4096K
NUMA node0 CPU(s):     0,1,16,17
NUMA node1 CPU(s):     2,3,18,19
NUMA node2 CPU(s):     4,5,20,21
NUMA node3 CPU(s):     6,7,22,23
NUMA node4 CPU(s):     8,9,24,25
NUMA node5 CPU(s):     10,11,26,27
NUMA node6 CPU(s):     12,13,28,29
NUMA node7 CPU(s):     14,15,30,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

[root@hp-dl385g10-13 qemu-kvm]# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 16 17
node 0 size: 15764 MB
node 0 free: 15327 MB
node 1 cpus: 2 3 18 19
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 4 5 20 21
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7 22 23
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 8 9 24 25
node 4 size: 16079 MB
node 4 free: 14815 MB
node 5 cpus: 10 11 26 27
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 12 13 28 29
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 14 15 30 31
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

Notes: the guest hits the call trace before migration, so this issue is unrelated to migration.

I can reproduce this issue on rhel8.0-av: host (kernel-4.18.0-80.el8.x86_64 & qemu-img-3.1.0-20.module+el8.0.0+3273+6bc1ee54.1.x86_64) and guest (kernel-4.18.0-80.el8.x86_64).

Li Xiaohui,

pls provide access to the host where it reproduces.

(In reply to Igor Mammedov from comment #9)
> Li Xiaohui,
>
> pls provide access to the host where it reproduces.

Hi Igor,
This bz can be reproduced on any EPYC host when the guest is booted with "-cpu EPYC". On an Intel host, or with "-cpu host" or "-cpu Opteron_G3" on an EPYC host, the issue is not hit. So any EPYC host can reproduce it.
If you have no EPYC host to use, contact me again.

From what I see it's a CPUID problem, so it's not related to any memory options.
Simpler steps to reproduce:
1. EPYC host
2. QEMU CLI:
-enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
-kernel kernel_that_supports_epyc
Compared to an EPYC host, the AMD implementation in QEMU produces bogus values in 0x8000001e.ebx|ecx
and uses an invalid physical APIC ID (not as described in the EPYC spec).
QEMU's EPYC topology is built backwards: each vCPU tries to make up the topology information
for 0x8000001e based on -smp alone and ignores/doesn't validate the APIC ID set by the machine,
whereas I think it should provide routines for calculating the correct APIC ID to the board
(based on the -smp and -numa options) and, in the CPU code, calculate 0x8000001e from the
APIC ID, with correctness checks at cpu.realize() time.
Given it has never worked properly, it's too late for 8.1,
so moving it to 8.2 with condNAK PartnerUpstream.
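A minimal sketch of that direction, assuming the Family 17h APIC ID layout quoted in comment #21 below; the type and helper names are hypothetical illustrations, not QEMU's actual API. The board encodes the physical APIC ID from topology coordinates, and the CPUID code only decodes it back, so 0x8000001e can never disagree with the APIC ID:

    /* Sketch only: encode/decode per the Family 17h layout quoted in
     * comment #21. Names are hypothetical, not QEMU's real API. */
    #include <stdint.h>

    typedef struct {
        unsigned socket_id, node_id, ccx_id, core_id, smt_id;
    } EpycTopoIds;

    /* Board side: build the physical APIC ID from topology coordinates
     * (derived from the -smp and -numa options). */
    static uint32_t epyc_apicid_from_topo_ids(const EpycTopoIds *t,
                                              unsigned nr_threads)
    {
        uint32_t low = nr_threads > 1 ? (t->core_id << 1) | t->smt_id
                                      : t->core_id;
        return (t->socket_id << 6) | (t->node_id << 4) |
               (t->ccx_id << 3) | low;
    }

    /* CPU side: recover the fields from the APIC ID when filling CPUID
     * 0x8000001e, instead of re-deriving them from -smp alone. */
    static void epyc_topo_ids_from_apicid(uint32_t apicid,
                                          unsigned nr_threads,
                                          EpycTopoIds *t)
    {
        t->socket_id = (apicid >> 6) & 0x1;
        t->node_id   = (apicid >> 4) & 0x3;
        t->ccx_id    = (apicid >> 3) & 0x1;
        if (nr_threads > 1) {
            t->core_id = (apicid >> 1) & 0x3;
            t->smt_id  = apicid & 0x1;
        } else {
            t->core_id = apicid & 0x3;
            t->smt_id  = 0;
        }
    }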
Adding the email discussion to bugzilla.
==========================================

On Thu, 18 Jul 2019 14:19:07 +0000 "Moger, Babu" <Babu.Moger> wrote:
> Hi Igor,
>
> On 7/18/19 3:35 AM, Igor Mammedov wrote:
>> On Wed, 17 Jul 2019 21:14:29 +0000 "Moger, Babu" <Babu.Moger> wrote:
>>> Hi,
>>> I am looking at this. I have recreated the problem. Noticed that APIC ID
>>> enumeration (0x8000001e eax) in qemu does not match 0x8000001e EBX/ECX
>>> values. Also we make some assumptions on number of nodes based on
>> You can avoid assumption and use 'node-id' property that is assigned
>> to the vCPU instance (which originates from -numa options)
>
> ok. Will look into it closely. Also, we need to know total number of numa
> nodes user wants to create. Not sure if that information is already
> available somewhere. Will investigate.

it's available, but let's keep technical discussion to Bugzilla or the qemu-devel mail-list for history reasons.

>>
>> (Also looking at guest side, kernel seems to assume that node ids
>> in CPUID match system wide node ids and that matches what I've observed
>> on EPYC hardware)
>>
>>> nr_cores. That is also not matching here. Will investigate further and let
>>> you know. Let me know if I missed anything here.
>>
>> you also need to make physical APIC ID match topology
>> (as specified in spec and as it is on real hardware)
>> in QEMU it is generated by machine see pc_possible_cpu_arch_ids().
>> Maybe it would be better to move some EPYC topology routines to
>> ./include/hw/i386/topology.h from cpu.c and use it in
>> pc_possible_cpu_arch_ids() for calculating valid APIC ID.
>>
>>> Thanks
>>> Babu
>>>
>>> On 7/16/19 5:44 PM, Grimm, Jon wrote:
>>>> Hi Eduardo,
>>>>
>>>> +Babu for looking into
>>>>
>>>> Best Regards, Jon
>>>>
>>>>> -----Original Message-----
>>>>> From: virt-amd-list-bounces <virt-amd-list-bounces>
>>>>> On Behalf Of Eduardo Habkost
>>>>> Sent: Tuesday, July 16, 2019 4:42 PM
>>>>> To: virt-amd-list
>>>>> Cc: Igor Mammedov <imammedo>
>>>>> Subject: [virt-amd-list] [Bug 1728166] QEMU provides invalid local APIC ID and
>>>>> extended topology CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can anybody from AMD take a look at this bug? It is probably reproducible
>>>>> upstream too.
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>> Bug 1728166 - QEMU provides invalid local APIC ID and extended topology
>>>>> CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> -----------
>>>>> Comment 11 Igor Mammedov 2019-07-16 12:40:46 UTC
>>>>>
>>>>> From what I see it's CPUID problem, so it's not related to any memory related
>>>>> options.
>>>>> Simpler steps to reproduce:
>>>>>
>>>>> 1. EPYC host
>>>>> 2. QEMU CLI:
>>>>> -enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
>>>>> -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3
>>>>> -kernel kernel_that_supports_epyc
>>>>>
>>>>> Comparing to an EPYC host, AMD implementation in QEMU produces bogus
>>>>> values in 0x8000001e.ebx|ecx and uses invalid physical APIC ID (not as
>>>>> described in EPYC spec).
>>>>>
>>>>> QEMU's EPYC topology is made backwards, where each vcpu tries to make up
>>>>> topology information for 0x8000001e based on -smp and ignores/doesn't
>>>>> validate APIC value set by machine, where as I think it should provide routines
>>>>> for calculating correct APIC ID to board (based on -smp and -numa options) and
>>>>> in CPU code calculate 0x8000001e based on APIC ID + correctness checks at
>>>>> cpu.realize() time.
>>>>>
>>>>> Given it's never worked properly it's too late for 8.1, so moving it to 8.2 with
>>>>> condNAK PartnerUpstream.
>>>>>
>>>>> --
>>>>> Eduardo

Hi Eduardo and Igor,
Changing the arch_id (or apic_id) involves quite a few changes (in hw/i386/pc.c). Right now the apic id is decoded based on sockets, cores and threads. This appears to work for most standard configurations for AMD and other vendors, but the decoding does not follow AMD's APIC ID enumeration, and changing it will affect other vendors. I am trying to see if there is a way to handle this without affecting other vendors. Any thoughts?
Thanks
Babu

I'm still trying to understand why fixing this bug would require changing the APIC ID assignment logic we have today. Do we have a quick explanation why?

We could introduce a new AMD-specific APIC ID assignment API upstream (as opposed to the existing i386/topology.h code) and choose the APIC ID assignment implementation depending on the CPU vendor, but I'm not sure this is really necessary.

(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today. Do we have a quick explanation
> why?
>
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

Here is the kernel message when booting the kernel:

[    0.020817] ------------[ cut here ]------------
[    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377 topology_sane.isra.5+0x63/0x70

The kernel is trying to validate the topology here. CPU 0 and CPU 2 are on two different numa nodes (-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3). Based on the EPYC topology, the kernel shifts the apic id to get the llc_id (per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in arch/x86/kernel/cpu/cacheinfo.c). It finds that these two cpus have the same llc_id but are on different numa nodes, which is an inconsistency.

(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today. Do we have a quick explanation
> why?
>
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

I think we should do exactly this, i.e. have a default APIC ID generation routine (the current algorithm) but allow the machine to use a different algorithm keyed off the cpu type in use. (The EPYC spec has quite specific guidance on how to encode the APIC ID, which we can follow, taking as input the data provided by -smp and -numa.)
The EPYC implementation attempt tries to infer the CPUID extended APIC leaf from -smp alone, from inside the CPU code only, leaving the physical APIC ID out of sync, and ends up with IDs that the guest OS does not expect (the IDs match neither what happens on real machines nor the spec).
Looking from the other side, a properly encoded physical APIC ID (arch_id in QEMU) has all the data necessary to encode the CPUID 0x8000001e leafs. That's why I'd suggest to rework pc_possible_cpu_arch_ids()'s

    ms->possible_cpus->cpus[i].arch_id =
        x86_cpu_apic_id_from_index(pcms, i);
    x86_topo_ids_from_apicid(ms->possible_cpus->cpus[i].arch_id,
                             pcms->smp_dies, ms->smp.cores,
                             ms->smp.threads, &topo);

logic, which encodes the APIC ID and then splits it back into topology socket/core/thread parts, and override it per CPU type. (I think that could be the easiest way, and it would free the CPU code from making up data out of nowhere.)

(In reply to Igor Mammedov from comment #18)
> I think we should do exactly this, i.e have a default APIC ID generation routine
> (current algorithm) but allow to machine to use a different algorithm keying off
> used cpu type. (EPYC spec has quite specific guidance on how to encode APIC ID
> which we can do taking as input data provided by -smp and -numa).
> EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
> from inside of CPU code only and leaving physical APIC ID out of sync
> and end up with IDs that guest OS does not expect (as IDs do not match what
> is happening on real machines nor match spec).

Do the EPYC APIC ID requirements conflict with the requirements for other processors (the ones currently encoded in i386/topology.h), or are they a superset of the existing requirements?

I will look for the docs, but I would appreciate it if anybody has pointers to the specific documentation sections on how APIC IDs should be encoded for EPYC.

(In reply to Babu Moger from comment #17)
> Here is the kernel message when booting the kernel.
> [    0.020817] ------------[ cut here ]------------
> [    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node!
> [node: 1 != 0]. Ignoring dependency.
> [    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377
> topology_sane.isra.5+0x63/0x70
>
> Kernel is trying to validate the topology here. The cpu 0 and cpu 2 are on
> two different numa nodes (-numa node,nodeid=0,cpus=0-1 -numa
> node,nodeid=1,cpus=2-3).
> Based on the EPYC topology, kernel shifts the apic id to get the
> llc_id (per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in
> arch/x86/kernel/cpu/cacheinfo.c).
> It finds that both these cpus are same llc_id but they are on different numa
> nodes. There is inconsistency.

What should be the correct llc_id for each of the 4 VCPUs in the example command line above? (-smp 4,maxcpus=4,cores=2,threads=1,sockets=2)
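A worked example of the mismatch for that command line (assuming QEMU's default APIC ID packing, which for cores=2,threads=1 reduces to apicid = (socket_id << 1) | core_id, combined with the llc_id derivation quoted above):

    vCPU 0: apicid 0 -> llc_id = 0 >> 3 = 0, numa node 0
    vCPU 1: apicid 1 -> llc_id = 0,          numa node 0
    vCPU 2: apicid 2 -> llc_id = 0,          numa node 1
    vCPU 3: apicid 3 -> llc_id = 0,          numa node 1

All four vCPUs share llc_id 0 while spanning two nodes, which is exactly what topology_sane() warns about. With the EPYC encoding (socket ID in ApicId[6], per the PPR excerpt below), vCPUs 2-3 would get APIC IDs 0x40/0x41 and hence llc_id 8, and the check would pass.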
(In reply to Eduardo Habkost from comment #19)
> Do the EPYC APIC ID requirements conflict with the requirements for other
> processors (the ones currently encoded in i386/topology.h), or are they a
> superset of the existing requirements?
>
> I will look for the docs, but I would appreciate if anybody has pointers to
> the specific documentation sections on how APIC IDs should be encoded for
> EPYC.

Answering my own question:

Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors:

"""
2.1.10.2.1.3 ApicId Enumeration Requirements

Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
significant bits in the Initial APIC ID that indicate core ID within a processor, in constructing per-core CPUID
masks. Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum number of cores (MNC) that the
processor could theoretically support, not the actual number of cores that are actually implemented or enabled on
the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
• ApicId[6] = Socket ID.
• ApicId[5:4] = Node ID.
• ApicId[3] = Logical CCX L3 complex ID.
• ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0], ThreadId} : {1'b0, LogicalCoreID[1:0]}.
"""

This imposes the following restrictions:
1) only threads=1 and threads=2 will work
2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
3) apicid_pkg_offset() must be >= 3 (preferably it should be 6)

Enforcing #1 is doable. #2 is already true.

#3 conflicts with the existing code: cores=2,threads=2 makes apicid_pkg_offset() return 2.

We can try to make i386/topology.h support socket/node/core-complex IDs, or make a separate APIC ID encoding implementation for EPYC. Instead of duplicating code, I'd prefer to just add a mechanism that lets the EPYC CPU code force a minimum apicid_pkg_offset() value.

(In reply to Eduardo Habkost from comment #21)
> Answering my own question:
>
> Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision
> B1 Processors:
>
> """
> 2.1.10.2.1.3 ApicId Enumeration Requirements
> Operating systems are expected to use
> Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
> significant bits in the Initial APIC ID that indicate core ID within a
> processor, in constructing per-core CPUID masks.
> Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum
> number of cores (MNC) that the processor could theoretically support,
> not the actual number of cores that are actually implemented or enabled on
> the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> • ApicId[6] = Socket ID.
> • ApicId[5:4] = Node ID.
> • ApicId[3] = Logical CCX L3 complex ID.
> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0], ThreadId} :
> {1'b0, LogicalCoreID[1:0]}.
> """
>
> This imposes the following restrictions:
> 1) only threads=1 and threads=2 will work
> 2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
> 3) apicid_pkg_offset() must be >= 3 (preferably it should be 6)
>
> Enforcing #1 is doable. #2 is already true.
>
> #3 conflicts with the existing code: cores=2,threads=2 makes
> apicid_pkg_offset() return 2.
>
> We can try to make i386/topology.h support socket/node/core-complex IDs, or
> make a separate APIC ID encoding implementation for EPYC. Instead of
> duplicating code, I'd prefer to just add a mechanism that lets the EPYC CPU
> code force a minimum apicid_pkg_offset() value.

Here is the logic I am thinking of:

    if (cs->nr_threads - 1) {
        *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
               (topo.ccx_id << 3) | (topo.core_id << 1) | topo.smt_id;
    } else {
        *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
               (topo.ccx_id << 3) | topo.core_id;
    }

We need to know the number of numa nodes, the number of sockets, nr_cores and nr_threads. The APIC id could change for future models, so I think it is better to have a separate apic id handler for the EPYC models; if it changes for new models, we can easily add a new handler.

(In reply to Babu Moger from comment #22)
> Here is the logic I am thinking of:
>
>     if (cs->nr_threads - 1) {
>         *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
>                (topo.ccx_id << 3) | (topo.core_id << 1) | topo.smt_id;
>     } else {
>         *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
>                (topo.ccx_id << 3) | topo.core_id;
>     }

You can implement it this way, but you need to make sure cpu->apic_id is consistent with the value you are calculating. You will probably need to change x86_cpu_apic_id_from_index() and x86_topo_ids_from_apicid() (or their callers) as suggested by Igor in comment #18.

Created attachment 1594432 [details]
Patches to Implement apic id for AMD EPYC models
First draft patch 1
Created attachment 1594433 [details]
Patches to Implement apic id for AMD EPYC models 2/4
First draft 2/4
Created attachment 1594434 [details]
Patches to Implement apic id for AMD EPYC models 3/4
Created attachment 1594435 [details]
Patches to Implement apic id for AMD EPYC models 4/4
Hi Eduardo/Igor, I have attached the first draft patches for review. Please take a look.

(In reply to Babu Moger from comment #28)
> Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> take a look.

Thank you. Can you submit them to qemu-devel for review?

(In reply to Eduardo Habkost from comment #29)
> (In reply to Babu Moger from comment #28)
> > Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> > take a look.
>
> Thank you. Can you submit them to qemu-devel for review?

Yes. Working on it.

Hi,

I hit the same call trace with an Intel cpu model, so I wonder whether this relates only to EPYC?

QEMU cli:
# /usr/libexec/qemu-kvm -cpu Haswell-noTSX \
-smp 12,sockets=3,cores=2,threads=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
-numa node,nodeid=4 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=1,socket-id=1,core-id=0 \
-numa cpu,node-id=2,socket-id=1,core-id=1 \
-numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0 \
-numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1 \
-numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0 \
-numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1 \
-monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait

Inside Guest:
# numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 3
node 0 size: 1604 MB
node 0 free: 840 MB
node 1 cpus: 4 5
node 1 size: 1389 MB
node 1 free: 1105 MB
node 2 cpus: 6 7
node 2 size: 1613 MB
node 2 free: 1327 MB
node 3 cpus: 8 9
node 3 size: 1587 MB
node 3 free: 1335 MB
node 4 cpus: 10 11
node 4 size: 1614 MB
node 4 free: 1368 MB
node distances:
node   0   1   2   3   4
  0:  10  20  20  20  20
  1:  20  10  20  20  20
  2:  20  20  10  20  20
  3:  20  20  20  10  20
  4:  20  20  20  20  10

Call trace in guest dmesg:
[    0.001000] ------------[ cut here ]------------
[    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node! [node: 2 != 1]. Ignoring dependency.
[    0.001000] WARNING: CPU: 6 PID: 0 at arch/x86/kernel/smpboot.c:381 topology_sane.isra.6+0x62/0x70
[    0.001000] Modules linked in:
[    0.001000] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.18.0-141.el8.x86_64 #1
[    0.001000] Hardware name: Red Hat KVM, BIOS 1.12.0-5.module+el8.1.0+4022+29a53beb 04/01/2014
[    0.001000] RIP: 0010:topology_sane.isra.6+0x62/0x70
[    0.001000] Code: 0f 94 c0 c3 80 3d de 3e 37 01 00 75 ef 89 f1 41 89 d9 89 fe 41 89 e8 48 c7 c7 68 7c e7 ac c6 05 c4 3e 37 01 01 e8 d8 3c 06 00 <0f> 0b eb ce 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f b6 05
[    0.001000] RSP: 0000:ffffa76340ce7ed0 EFLAGS: 00010086
[    0.001000] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffffad05a2c8
[    0.001000] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000046
[    0.001000] RBP: 0000000000000002 R08: 00000000000000f4 R09: 0000000000000004
[    0.001000] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000f040
[    0.001000] R13: 0000000000000004 R14: 0000000000000006 R15: 0000000000000006
[    0.001000] FS:  0000000000000000(0000) GS:ffff98f7b1000000(0000) knlGS:0000000000000000
[    0.001000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.001000] CR2: 0000000000000000 CR3: 00000001b880a001 CR4: 00000000001606e0
[    0.001000] Call Trace:
[    0.001000]  set_cpu_sibling_map+0x149/0x4f0
[    0.001000]  start_secondary+0xaf/0x200
[    0.001000]  secondary_startup_64+0xb7/0xc0
[    0.001000] ---[ end trace 0afccdf8a7afbd98 ]---

Adding more info for comment 31.
qemu-kvm-4.1.0-7.module+el8.1.0+4177+896cb282
kernel-4.18.0-141.el8.x86_64 (both guest and host)

Host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1839.985
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4799.75
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

Posted second RFC patches:
https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/

(In reply to Yumei Huang from comment #31)
> Hi,
>
> I hit same call trace info with Intel cpu model, wondering it might not only
> relate to EPYC ?
>
> QEMU cli:
> # /usr/libexec/qemu-kvm -cpu Haswell-noTSX \
> -smp 12,sockets=3,cores=2,threads=2 \
> -numa node,nodeid=0 \
> -numa node,nodeid=1 \
> -numa node,nodeid=2 \
> -numa node,nodeid=3 \
> -numa node,nodeid=4 \
> -numa cpu,node-id=0,socket-id=0 \
> -numa cpu,node-id=1,socket-id=1,core-id=0 \
> -numa cpu,node-id=2,socket-id=1,core-id=1 \
> -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0 \
> -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1 \
> -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0 \
> -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1 \
> -monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait
>
> Call trace in guest dmesg:
> [    0.001000] ------------[ cut here ]------------
> [    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node!
> [node: 2 != 1]. Ignoring dependency.

The guest doesn't expect cores from the same die (which share the L3 cache) to appear in separate numa nodes, so the warning in this case is expected. It is a bug only if the warning is being triggered for valid NUMA configurations like the one in comment #0 (where all cores in each socket are in the same node).

(In reply to Babu Moger from comment #33)
> Posted second RFC patches:
> https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/

Eduardo,
Waiting for your feedback on this series in general.
Also waiting for feedback on setting the new epyc feature bit; the apic id calculation is done much before x86_cpu_load_model is called.
Thanks

Posted the v3 version, please review:
https://lore.kernel.org/qemu-devel/157541968844.46157.17994918142533791313.stgit@naples-babu.amd.com/

QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

All the changes required to fix this bug are committed to master:
https://git.qemu.org/?p=qemu.git;a=commit;h=7b225762c8c05fd31d4c2be116aedfbc00383f8b
Thanks

The needinfo from comment 54 was answered by Igor in comment 55.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

QEMU-5.2 returns the expected (per spec) apic-id, but only for an -smp that is valid for EPYC, so one won't see a warning if -smp resembles a real EPYC cpu. For all other combinations the current kernel may get confused, since for EPYC it uses the package-id+die-id part of the apic-id as the llc_id.
So to test, you need a command line that matches a real EPYC cpu, for example:

EPYC 7232P | 2 × CCD | 4 cores × 2 threads

where a CCD is configured with the help of 'dies' in QEMU; in the EPYC case the number of 'dies' should be equal to the number of numa nodes, so the CLI for the above should look like:

-cpu EPYC \
-smp 16,sockets=1,dies=2,cores=4,threads=2 \
-numa cpu,node-id=0,socket-id=0,die-id=0 \
-numa cpu,node-id=1,socket-id=0,die-id=1

I used https://en.wikipedia.org/wiki/Epyc as a base to play with.

Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release. Reset the ITR='---' as part of the change.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Can you retest it with qemu-kvm-6.2, taking into account comment 70?

(In reply to Igor Mammedov from comment #76)
> Can you retest it with qemu-kvm-6.2?
> Taking in account comment 70?

Hi,

Sorry for the late reply.

I can't reproduce this bug when booting the guest with the special smp configuration.

Guest kernel: 4.18.0-372.2.1.el8.x86_64

Host info:
5.14.0-70.2.1.el9_0.x86_64
qemu-kvm-6.2.0-11.el9_0.2.x86_64
Model name: AMD EPYC 7302 16-Core Processor
dell-per7525-02.lab.eng.pek2.redhat.com
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 31556 MB
node 0 free: 23831 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 32239 MB
node 1 free: 5504 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

I didn't find the EPYC 7302 in the wiki doc, only the EPYC 7302P:
EPYC 7302P | 4 × CCD | 8 cores × 2 threads

Test results:
The guest works well, with no call trace, when booted with the following qemu cli:

-smp 64,sockets=1,cores=8,threads=2,dies=4 \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-31,memdev=ram-node0 \
-numa node,nodeid=1,cpus=32-63,memdev=ram-node1 \

Besides, I also tested these smp combinations:
----------------------------------------------------------------------------------------------------------
-smp 32,sockets=2,cores=8,threads=2              pass
-smp 64,sockets=4,cores=8,threads=2              pass
-smp 64,sockets=2,cores=8,threads=2,dies=2       pass
-smp 16,sockets=1,cores=8,threads=2,dies=1       pass
----------------------------------------------------------------------------------------------------------

But if I change the combination of "cores=8,threads=2", I hit this issue. I'm a little confused about the smp configuration for a specific EPYC machine with numa: is cores*threads a fixed configuration for each EPYC model, and will we improve this in the future for all other combinations? Please help to check this, thanks.

Best regards
Liu Nana

(In reply to liunana from comment #77)
> I didn't find the EPYC 7302 in the wiki doc, only the EPYC 7302P:
> EPYC 7302P | 4 × CCD | 8 cores × 2 threads

The following CLI would correspond to a 7302P:

-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
-smp 32,sockets=1,cores=4,threads=2,dies=4 \
-numa cpu,socket-id=0,die-id=0,node-id=0 \
-numa cpu,socket-id=0,die-id=1,node-id=1 \
-numa cpu,socket-id=0,die-id=2,node-id=2 \
-numa cpu,socket-id=0,die-id=3,node-id=3

where the CCDs (chiplets) map to 'dies' and cores equals the cores per CCD (per the wiki, the Cores/CCD column).

> Test results:
> The guest works well, with no call trace, when booted with the following qemu cli:
> -smp 64,sockets=1,cores=8,threads=2,dies=4 \
> -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
> -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
> -numa node,nodeid=0,cpus=0-31,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=32-63,memdev=ram-node1 \

This is a little bit wrong: for EPYC you have to have the number of numa nodes equal to the number of dies. Also use -numa cpu,socket-id=x,die-id=y,node-id=z to assign CPUs to nodes, unless you know exactly where each cpu-index belongs.

> Besides, I also tested these smp combinations:
> -smp 32,sockets=2,cores=8,threads=2              pass
> -smp 64,sockets=4,cores=8,threads=2              pass
> -smp 64,sockets=2,cores=8,threads=2,dies=2       pass
> -smp 16,sockets=1,cores=8,threads=2,dies=1       pass
>
> But if I change the combination of "cores=8,threads=2", I hit this issue.

The above example, -smp 32,sockets=1,cores=4,threads=2,dies=4, works for me if you assign the numa nodes correctly.

> I'm a little confused about the smp configuration for a specific EPYC machine
> with numa: is cores*threads a fixed configuration for each EPYC model,
> and will we improve this in the future for all other combinations?

As far as I recall, we tried to introduce checks upstream, but the eventual consensus was 'not to bother' with them (and we reverted the relevant commits) in QEMU and to let the user configure whatever they like. If the topology matches real EPYC CPUs there won't be a warning; if the configured topology is fictional, the guest may complain and print a warning. No other changes to improve this are planned.

> Please help to check this, thanks.
>
> Best regards
> Liu Nana
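For re-verification, a small guest-side spot check can help compare what QEMU exposes against the layout discussed above. A sketch, assuming GCC's <cpuid.h> helper and the 0x8000001e field meanings taken from the PPR excerpt in comment #21; run it pinned to each vCPU in turn:

    /* Print CPUID leaf 0x8000001e for the current vCPU. Field decoding
       below is assumed from the Family 17h PPR; build with gcc and run
       pinned to one vCPU at a time, e.g. "taskset -c 2 ./a.out". */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x8000001e not available\n");
            return 1;
        }
        /* Assumed decoding: EAX = extended (initial) APIC ID;
           EBX[7:0] = core id, EBX[15:8] = threads per core - 1;
           ECX[7:0] = node id. */
        printf("ExtApicId=0x%02x CoreId=%u ThreadsPerCore=%u NodeId=%u\n",
               eax, ebx & 0xff, ((ebx >> 8) & 0xff) + 1, ecx & 0xff);
        return 0;
    }

On a correctly configured EPYC-like topology, the extended APIC ID should follow the ApicId[6]/[5:4]/[3]/[2:0] layout, and the node id should agree with the vCPU's -numa assignment.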
Description of problem:
After doing postcopy migration of a guest with pinned NUMA nodes and pinned hugepage file-backed memory, migration succeeds, but checking dmesg shows that the guest on the dst host hit a call trace.

Version-Release number of selected component (if applicable):
src and dst host info: kernel-4.18.0-109.el8.x86_64 & qemu-img-4.0.0-4.module+el8.1.0+3523+b348b848.x86_64
After executing the following steps, src & dst host memory info:
[root@hp-dl385g10-13 ipa]# free -g
              total        used        free      shared  buff/cache   available
Mem:             31           4          25           0           0          25
Swap:            15           0          15
guest info: kernel-4.18.0-109.el8.x86_64

How reproducible:
10/10 in the auto test
1/1 in the manual test

Steps to Reproduce:
1. Run the migration auto case:
[root@hp-dl385g10-13 ipa]# python3 Start2Run.py --test_requirement=rhel7_49060_x86_q35_blockdev --test_cases=rhel7_93722 --src_host_ip=10.73.130.67 --dst_host_ip=10.73.130.69 --share_images_dir=/mnt/nfs --sys_image_name=rhel810-64-virtio-scsi.qcow2 --repeat_times=10
Notes: for the auto test, check each rhel7_93722-2019-07-09-03*_logs/long_debug.log for the auto test steps and results; the logs are in the attachment.

Or do the manual test steps:
1. Configure hugepages on the src & dst hosts:
[root@host ~]# cat /proc/meminfo | grep -i hugepagesize | awk '{print $2}'
2048
[root@host ~]# echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
[root@host ~]# cat /proc/meminfo | grep -i hugepages_total | awk '{print $2}'
2048
[root@host ~]# mkdir /mnt/kvm_hugepage
[root@host ~]# mount -t hugetlbfs none /mnt/kvm_hugepage/
2. Boot the guest with hugepage memdevs and numa on the src host:
/usr/libexec/qemu-kvm \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
3. Boot the guest with "-incoming ..." on the dst host:
/usr/libexec/qemu-kvm \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
-incoming tcp:0:4000 \
4. On the src host and dst host, enable postcopy and do the migration:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.73.130.69:4000
(qemu) migrate_start_postcopy
5. After migration, check whether the guest works well on the dst host.

Actual results:
The guest hit a call trace after migration:
# dmesg
2019-07-09-03:21:04: [    0.001000] Call Trace:
2019-07-09-03:21:04: [    0.001000]  set_cpu_sibling_map+0x149/0x4f0
2019-07-09-03:21:04: [    0.001000]  start_secondary+0xaf/0x200
2019-07-09-03:21:04: [    0.001000]  secondary_startup_64+0xb7/0xc0
2019-07-09-03:21:04: [    0.001000] ---[ end trace 717102d13b5f6712 ]---

Expected results:
No call trace in the guest dmesg.

Additional info: