Bug 1728166
| Summary: | QEMU provides invalid local APIC ID and extended topology CPUID 0x8000001e leafs when using EPYC cpu model |
|---|---|
| Product: | Red Hat Enterprise Linux 9 |
| Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm |
| Assignee: | Igor Mammedov <imammedo> |
| qemu-kvm sub component: | CPU Models |
| QA Contact: | liunana <nanliu> |
| Status: | CLOSED WONTFIX |
| Docs Contact: | |
| Severity: | unspecified |
| Priority: | low |
| CC: | ailan, babu.moger, chayang, ctatman, dgilbert, ehabkost, imammedo, jinzhao, juzhang, nanliu, virt-maint, yuhuang |
| Version: | 9.0 |
| Keywords: | Reopened, Triaged |
| Target Milestone: | rc |
| Flags: | knoel: mirror+ |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| : | 1873417 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-09-15 08:26:13 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |
Created attachment 1588638 [details]
auto-test-log
Found that the guest hits the call trace before migration, so feel free to change the assignee, thanks.

To confirm how to reproduce: booting the guest with the following options hits the guest call trace; without them, the issue is not hit:

-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Host NUMA info:

[root@hp-dl385g10-13 latest]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7251 8-Core Processor
Stepping:              2
CPU MHz:               2327.725
CPU max MHz:           2100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4192.17
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              4096K
NUMA node0 CPU(s):     0,1,16,17
NUMA node1 CPU(s):     2,3,18,19
NUMA node2 CPU(s):     4,5,20,21
NUMA node3 CPU(s):     6,7,22,23
NUMA node4 CPU(s):     8,9,24,25
NUMA node5 CPU(s):     10,11,26,27
NUMA node6 CPU(s):     12,13,28,29
NUMA node7 CPU(s):     14,15,30,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

[root@hp-dl385g10-13 qemu-kvm]# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 16 17
node 0 size: 15764 MB
node 0 free: 15327 MB
node 1 cpus: 2 3 18 19
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 4 5 20 21
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7 22 23
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 8 9 24 25
node 4 size: 16079 MB
node 4 free: 14815 MB
node 5 cpus: 10 11 26 27
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 12 13 28 29
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 14 15 30 31
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

Notes: the guest hits the call trace before migration, so this issue is unrelated to migration.

I can reproduce this issue on rhel8.0-av: host (kernel-4.18.0-80.el8.x86_64 & qemu-img-3.1.0-20.module+el8.0.0+3273+6bc1ee54.1.x86_64) and guest (kernel-4.18.0-80.el8.x86_64).

Li Xiaohui,

pls provide access to the host where it reproduces.

(In reply to Igor Mammedov from comment #9)
> Li Xiaohui,
>
> pls provide access to the host where it reproduces.

Hi Igor,
This bz can be reproduced on any EPYC host when the guest is booted with "-cpu EPYC". On an Intel host, or with "-cpu host" or "-cpu Opteron_G3" on an EPYC host, the issue is not hit. So any EPYC host can reproduce it.
If you have no EPYC host to use, contact me again.

From what I see it's a CPUID problem, so it's not related to any memory options.
Simpler steps to reproduce:
1. EPYC host
2. QEMU CLI:
-enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
-kernel kernel_that_supports_epyc
Compared to an EPYC host, the AMD implementation in QEMU produces bogus values in 0x8000001e.ebx|ecx
and uses an invalid physical APIC ID (not as described in the EPYC spec).
QEMU's EPYC topology is built backwards: each vCPU tries to make up the topology information
for 0x8000001e based on -smp alone and ignores/doesn't validate the APIC ID set by the machine,
whereas I think it should provide routines for calculating the correct APIC ID to the board
(based on the -smp and -numa options) and, in the CPU code, calculate 0x8000001e from the
APIC ID, with correctness checks at cpu.realize() time.
Given it has never worked properly, it's too late for 8.1,
so moving it to 8.2 with condNAK PartnerUpstream.
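A minimal sketch of that direction, assuming the Family 17h APIC ID layout quoted in comment #21 below; the type and helper names are hypothetical illustrations, not QEMU's actual API. The board encodes the physical APIC ID from topology coordinates, and the CPUID code only decodes it back, so 0x8000001e can never disagree with the APIC ID:

    /* Sketch only: encode/decode per the Family 17h layout quoted in
     * comment #21. Names are hypothetical, not QEMU's real API. */
    #include <stdint.h>

    typedef struct {
        unsigned socket_id, node_id, ccx_id, core_id, smt_id;
    } EpycTopoIds;

    /* Board side: build the physical APIC ID from topology coordinates
     * (derived from the -smp and -numa options). */
    static uint32_t epyc_apicid_from_topo_ids(const EpycTopoIds *t,
                                              unsigned nr_threads)
    {
        uint32_t low = nr_threads > 1 ? (t->core_id << 1) | t->smt_id
                                      : t->core_id;
        return (t->socket_id << 6) | (t->node_id << 4) |
               (t->ccx_id << 3) | low;
    }

    /* CPU side: recover the fields from the APIC ID when filling CPUID
     * 0x8000001e, instead of re-deriving them from -smp alone. */
    static void epyc_topo_ids_from_apicid(uint32_t apicid,
                                          unsigned nr_threads,
                                          EpycTopoIds *t)
    {
        t->socket_id = (apicid >> 6) & 0x1;
        t->node_id   = (apicid >> 4) & 0x3;
        t->ccx_id    = (apicid >> 3) & 0x1;
        if (nr_threads > 1) {
            t->core_id = (apicid >> 1) & 0x3;
            t->smt_id  = apicid & 0x1;
        } else {
            t->core_id = apicid & 0x3;
            t->smt_id  = 0;
        }
    }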
Adding the email discussion to bugzilla.
==========================================

On Thu, 18 Jul 2019 14:19:07 +0000 "Moger, Babu" <Babu.Moger> wrote:
> Hi Igor,
>
> On 7/18/19 3:35 AM, Igor Mammedov wrote:
>> On Wed, 17 Jul 2019 21:14:29 +0000 "Moger, Babu" <Babu.Moger> wrote:
>>> Hi,
>>> I am looking at this. I have recreated the problem. Noticed that APIC ID
>>> enumeration (0x8000001e eax) in qemu does not match 0x8000001e EBX/ECX
>>> values. Also we make some assumptions on number of nodes based on
>> You can avoid assumption and use 'node-id' property that is assigned
>> to the vCPU instance (which originates from -numa options)
>
> ok. Will look into it closely. Also, we need to know total number of numa
> nodes user wants to create. Not sure if that information is already
> available somewhere. Will investigate.

it's available, but let's keep technical discussion to Bugzilla or the qemu-devel mail-list for history reasons.

>>
>> (Also looking at guest side, kernel seems to assume that node ids
>> in CPUID match system wide node ids and that matches what I've observed
>> on EPYC hardware)
>>
>>> nr_cores. That is also not matching here. Will investigate further and let
>>> you know. Let me know if I missed anything here.
>>
>> you also need to make physical APIC ID match topology
>> (as specified in spec and as it is on real hardware)
>> in QEMU it is generated by machine see pc_possible_cpu_arch_ids().
>> Maybe it would be better to move some EPYC topology routines to
>> ./include/hw/i386/topology.h from cpu.c and use it in
>> pc_possible_cpu_arch_ids() for calculating valid APIC ID.
>>
>>> Thanks
>>> Babu
>>>
>>> On 7/16/19 5:44 PM, Grimm, Jon wrote:
>>>> Hi Eduardo,
>>>>
>>>> +Babu for looking into
>>>>
>>>> Best Regards, Jon
>>>>
>>>>> -----Original Message-----
>>>>> From: virt-amd-list-bounces <virt-amd-list-bounces>
>>>>> On Behalf Of Eduardo Habkost
>>>>> Sent: Tuesday, July 16, 2019 4:42 PM
>>>>> To: virt-amd-list
>>>>> Cc: Igor Mammedov <imammedo>
>>>>> Subject: [virt-amd-list] [Bug 1728166] QEMU provides invalid local APIC ID and
>>>>> extended topology CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can anybody from AMD take a look at this bug? It is probably reproducible
>>>>> upstream too.
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>> Bug 1728166 - QEMU provides invalid local APIC ID and extended topology
>>>>> CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> -----------
>>>>> Comment 11 Igor Mammedov 2019-07-16 12:40:46 UTC
>>>>>
>>>>> From what I see it's CPUID problem, so it's not related to any memory related
>>>>> options.
>>>>> Simpler steps to reproduce:
>>>>>
>>>>> 1. EPYC host
>>>>> 2. QEMU CLI:
>>>>> -enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
>>>>> -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3
>>>>> -kernel kernel_that_supports_epyc
>>>>>
>>>>> Comparing to an EPYC host, AMD implementation in QEMU produces bogus
>>>>> values in 0x8000001e.ebx|ecx and uses invalid physical APIC ID (not as
>>>>> described in EPYC spec).
>>>>>
>>>>> QEMU's EPYC topology is made backwards, where each vcpu tries to make up
>>>>> topology information for 0x8000001e based on -smp and ignores/doesn't
>>>>> validate APIC value set by machine, where as I think it should provide routines
>>>>> for calculating correct APIC ID to board (based on -smp and -numa options) and
>>>>> in CPU code calculate 0x8000001e based on APIC ID + correctness checks at
>>>>> cpu.realize() time.
>>>>>
>>>>> Given it's never worked properly it's too late for 8.1, so moving it to 8.2 with
>>>>> condNAK PartnerUpstream.
>>>>>
>>>>> --
>>>>> Eduardo

Hi Eduardo and Igor,
Changing the arch_id (or apic_id) involves quite a few changes (in hw/i386/pc.c). Right now the apic id is decoded based on sockets, cores and threads. This appears to work for most standard configurations for AMD and other vendors, but the decoding does not follow AMD's APIC ID enumeration, and changing it will affect other vendors. I am trying to see if there is a way to handle this without affecting other vendors. Any thoughts?
Thanks
Babu

I'm still trying to understand why fixing this bug would require changing the APIC ID assignment logic we have today. Do we have a quick explanation why?

We could introduce a new AMD-specific APIC ID assignment API upstream (as opposed to the existing i386/topology.h code) and choose the APIC ID assignment implementation depending on the CPU vendor, but I'm not sure this is really necessary.

(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today. Do we have a quick explanation
> why?
>
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

Here is the kernel message when booting the kernel:

[    0.020817] ------------[ cut here ]------------
[    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377 topology_sane.isra.5+0x63/0x70

The kernel is trying to validate the topology here. CPU 0 and CPU 2 are on two different numa nodes (-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3). Based on the EPYC topology, the kernel shifts the apic id to get the llc_id (per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in arch/x86/kernel/cpu/cacheinfo.c). It finds that these two cpus have the same llc_id but are on different numa nodes, which is an inconsistency.

(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today. Do we have a quick explanation
> why?
>
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

I think we should do exactly this, i.e. have a default APIC ID generation routine (the current algorithm) but allow the machine to use a different algorithm keyed off the cpu type in use. (The EPYC spec has quite specific guidance on how to encode the APIC ID, which we can follow, taking as input the data provided by -smp and -numa.)
The EPYC implementation attempt tries to infer the CPUID extended APIC leaf from -smp alone, from inside the CPU code only, leaving the physical APIC ID out of sync, and ends up with IDs that the guest OS does not expect (the IDs match neither what happens on real machines nor the spec).
Looking from the other side, a properly encoded physical APIC ID (arch_id in QEMU) has all the data necessary to encode the CPUID 0x8000001e leafs. That's why I'd suggest to rework pc_possible_cpu_arch_ids()'s

    ms->possible_cpus->cpus[i].arch_id =
        x86_cpu_apic_id_from_index(pcms, i);
    x86_topo_ids_from_apicid(ms->possible_cpus->cpus[i].arch_id,
                             pcms->smp_dies, ms->smp.cores,
                             ms->smp.threads, &topo);

logic, which encodes the APIC ID and then splits it back into topology socket/core/thread parts, and override it per CPU type. (I think that could be the easiest way, and it would free the CPU code from making up data out of nowhere.)

(In reply to Igor Mammedov from comment #18)
> I think we should do exactly this, i.e have a default APIC ID generation routine
> (current algorithm) but allow to machine to use a different algorithm keying off
> used cpu type. (EPYC spec has quite specific guidance on how to encode APIC ID
> which we can do taking as input data provided by -smp and -numa).
> EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
> from inside of CPU code only and leaving physical APIC ID out of sync
> and end up with IDs that guest OS does not expect (as IDs do not match what
> is happening on real machines nor match spec).

Do the EPYC APIC ID requirements conflict with the requirements for other processors (the ones currently encoded in i386/topology.h), or are they a superset of the existing requirements?

I will look for the docs, but I would appreciate it if anybody has pointers to the specific documentation sections on how APIC IDs should be encoded for EPYC.

(In reply to Babu Moger from comment #17)
> Here is the kernel message when booting the kernel.
> [    0.020817] ------------[ cut here ]------------
> [    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node!
> [node: 1 != 0]. Ignoring dependency.
> [    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377
> topology_sane.isra.5+0x63/0x70
>
> Kernel is trying to validate the topology here. The cpu 0 and cpu 2 are on
> two different numa nodes (-numa node,nodeid=0,cpus=0-1 -numa
> node,nodeid=1,cpus=2-3).
> Based on the EPYC topology, kernel shifts the apic id to get the
> llc_id (per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in
> arch/x86/kernel/cpu/cacheinfo.c).
> It finds that both these cpus are same llc_id but they are on different numa
> nodes. There is inconsistency.

What should be the correct llc_id for each of the 4 VCPUs in the example command line above? (-smp 4,maxcpus=4,cores=2,threads=1,sockets=2)
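A worked example of the mismatch for that command line (assuming QEMU's default APIC ID packing, which for cores=2,threads=1 reduces to apicid = (socket_id << 1) | core_id, combined with the llc_id derivation quoted above):

    vCPU 0: apicid 0 -> llc_id = 0 >> 3 = 0, numa node 0
    vCPU 1: apicid 1 -> llc_id = 0,          numa node 0
    vCPU 2: apicid 2 -> llc_id = 0,          numa node 1
    vCPU 3: apicid 3 -> llc_id = 0,          numa node 1

All four vCPUs share llc_id 0 while spanning two nodes, which is exactly what topology_sane() warns about. With the EPYC encoding (socket ID in ApicId[6], per the PPR excerpt below), vCPUs 2-3 would get APIC IDs 0x40/0x41 and hence llc_id 8, and the check would pass.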
(In reply to Eduardo Habkost from comment #19)
> Do the EPYC APIC ID requirements conflict with the requirements for other
> processors (the ones currently encoded in i386/topology.h), or are they a
> superset of the existing requirements?
>
> I will look for the docs, but I would appreciate if anybody has pointers to
> the specific documentation sections on how APIC IDs should be encoded for
> EPYC.

Answering my own question:

Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors:

"""
2.1.10.2.1.3 ApicId Enumeration Requirements

Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
significant bits in the Initial APIC ID that indicate core ID within a processor, in constructing per-core CPUID
masks. Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum number of cores (MNC) that the
processor could theoretically support, not the actual number of cores that are actually implemented or enabled on
the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
• ApicId[6] = Socket ID.
• ApicId[5:4] = Node ID.
• ApicId[3] = Logical CCX L3 complex ID.
• ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0], ThreadId} : {1'b0, LogicalCoreID[1:0]}.
"""

This imposes the following restrictions:
1) only threads=1 and threads=2 will work
2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
3) apicid_pkg_offset() must be >= 3 (preferably it should be 6)

Enforcing #1 is doable. #2 is already true.

#3 conflicts with the existing code: cores=2,threads=2 makes apicid_pkg_offset() return 2.

We can try to make i386/topology.h support socket/node/core-complex IDs, or make a separate APIC ID encoding implementation for EPYC. Instead of duplicating code, I'd prefer to just add a mechanism that lets the EPYC CPU code force a minimum apicid_pkg_offset() value.

(In reply to Eduardo Habkost from comment #21)
> Answering my own question:
>
> Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision
> B1 Processors:
>
> """
> 2.1.10.2.1.3 ApicId Enumeration Requirements
> Operating systems are expected to use
> Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
> significant bits in the Initial APIC ID that indicate core ID within a
> processor, in constructing per-core CPUID masks.
> Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum
> number of cores (MNC) that the processor could theoretically support,
> not the actual number of cores that are actually implemented or enabled on
> the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> • ApicId[6] = Socket ID.
> • ApicId[5:4] = Node ID.
> • ApicId[3] = Logical CCX L3 complex ID.
> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0], ThreadId} :
> {1'b0, LogicalCoreID[1:0]}.
> """
>
> This imposes the following restrictions:
> 1) only threads=1 and threads=2 will work
> 2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
> 3) apicid_pkg_offset() must be >= 3 (preferably it should be 6)
>
> Enforcing #1 is doable. #2 is already true.
>
> #3 conflicts with the existing code: cores=2,threads=2 makes
> apicid_pkg_offset() return 2.
>
> We can try to make i386/topology.h support socket/node/core-complex IDs, or
> make a separate APIC ID encoding implementation for EPYC. Instead of
> duplicating code, I'd prefer to just add a mechanism that lets the EPYC CPU
> code force a minimum apicid_pkg_offset() value.

Here is the logic I am thinking of:

    if (cs->nr_threads - 1) {
        *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
               (topo.ccx_id << 3) | (topo.core_id << 1) | topo.smt_id;
    } else {
        *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
               (topo.ccx_id << 3) | topo.core_id;
    }

We need to know the number of numa nodes, the number of sockets, nr_cores and nr_threads. The APIC id could change for future models, so I think it is better to have a separate apic id handler for the EPYC models; if it changes for new models, we can easily add a new handler.

(In reply to Babu Moger from comment #22)
> Here is the logic I am thinking of:
>
>     if (cs->nr_threads - 1) {
>         *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
>                (topo.ccx_id << 3) | (topo.core_id << 1) | topo.smt_id;
>     } else {
>         *eax = (topo.socket_id << 6) | (topo.node_id << 4) |
>                (topo.ccx_id << 3) | topo.core_id;
>     }

You can implement it this way, but you need to make sure cpu->apic_id is consistent with the value you are calculating. You will probably need to change x86_cpu_apic_id_from_index() and x86_topo_ids_from_apicid() (or their callers) as suggested by Igor in comment #18.

Created attachment 1594432 [details]
Patches to Implement apic id for AMD EPYC models
First draft patch 1
Created attachment 1594433 [details]
Patches to Implement apic id for AMD EPYC models 2/4
First draft 2/4
Created attachment 1594434 [details]
Patches to Implement apic id for AMD EPYC models 3/4
Created attachment 1594435 [details]
Patches to Implement apic id for AMD EPYC models 4/4
Hi Eduardo/Igor, I have attached the first draft patches for review. Please take a look.

(In reply to Babu Moger from comment #28)
> Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> take a look.

Thank you. Can you submit them to qemu-devel for review?

(In reply to Eduardo Habkost from comment #29)
> (In reply to Babu Moger from comment #28)
> > Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> > take a look.
>
> Thank you. Can you submit them to qemu-devel for review?

Yes. Working on it.

Hi,

I hit the same call trace with an Intel cpu model, so I wonder whether this relates only to EPYC?

QEMU cli:
# /usr/libexec/qemu-kvm -cpu Haswell-noTSX \
-smp 12,sockets=3,cores=2,threads=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
-numa node,nodeid=4 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=1,socket-id=1,core-id=0 \
-numa cpu,node-id=2,socket-id=1,core-id=1 \
-numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0 \
-numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1 \
-numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0 \
-numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1 \
-monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait

Inside Guest:
# numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 3
node 0 size: 1604 MB
node 0 free: 840 MB
node 1 cpus: 4 5
node 1 size: 1389 MB
node 1 free: 1105 MB
node 2 cpus: 6 7
node 2 size: 1613 MB
node 2 free: 1327 MB
node 3 cpus: 8 9
node 3 size: 1587 MB
node 3 free: 1335 MB
node 4 cpus: 10 11
node 4 size: 1614 MB
node 4 free: 1368 MB
node distances:
node   0   1   2   3   4
  0:  10  20  20  20  20
  1:  20  10  20  20  20
  2:  20  20  10  20  20
  3:  20  20  20  10  20
  4:  20  20  20  20  10

Call trace in guest dmesg:
[    0.001000] ------------[ cut here ]------------
[    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node! [node: 2 != 1]. Ignoring dependency.
[    0.001000] WARNING: CPU: 6 PID: 0 at arch/x86/kernel/smpboot.c:381 topology_sane.isra.6+0x62/0x70
[    0.001000] Modules linked in:
[    0.001000] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.18.0-141.el8.x86_64 #1
[    0.001000] Hardware name: Red Hat KVM, BIOS 1.12.0-5.module+el8.1.0+4022+29a53beb 04/01/2014
[    0.001000] RIP: 0010:topology_sane.isra.6+0x62/0x70
[    0.001000] Code: 0f 94 c0 c3 80 3d de 3e 37 01 00 75 ef 89 f1 41 89 d9 89 fe 41 89 e8 48 c7 c7 68 7c e7 ac c6 05 c4 3e 37 01 01 e8 d8 3c 06 00 <0f> 0b eb ce 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f b6 05
[    0.001000] RSP: 0000:ffffa76340ce7ed0 EFLAGS: 00010086
[    0.001000] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffffad05a2c8
[    0.001000] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000046
[    0.001000] RBP: 0000000000000002 R08: 00000000000000f4 R09: 0000000000000004
[    0.001000] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000f040
[    0.001000] R13: 0000000000000004 R14: 0000000000000006 R15: 0000000000000006
[    0.001000] FS:  0000000000000000(0000) GS:ffff98f7b1000000(0000) knlGS:0000000000000000
[    0.001000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.001000] CR2: 0000000000000000 CR3: 00000001b880a001 CR4: 00000000001606e0
[    0.001000] Call Trace:
[    0.001000]  set_cpu_sibling_map+0x149/0x4f0
[    0.001000]  start_secondary+0xaf/0x200
[    0.001000]  secondary_startup_64+0xb7/0xc0
[    0.001000] ---[ end trace 0afccdf8a7afbd98 ]---

Adding more info for comment 31.
qemu-kvm-4.1.0-7.module+el8.1.0+4177+896cb282
kernel-4.18.0-141.el8.x86_64 (both guest and host)

Host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1839.985
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4799.75
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

Posted second RFC patches:
https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/

(In reply to Yumei Huang from comment #31)
> Hi,
>
> I hit same call trace info with Intel cpu model, wondering it might not only
> relate to EPYC ?
>
> QEMU cli:
> # /usr/libexec/qemu-kvm -cpu Haswell-noTSX \
> -smp 12,sockets=3,cores=2,threads=2 \
> -numa node,nodeid=0 \
> -numa node,nodeid=1 \
> -numa node,nodeid=2 \
> -numa node,nodeid=3 \
> -numa node,nodeid=4 \
> -numa cpu,node-id=0,socket-id=0 \
> -numa cpu,node-id=1,socket-id=1,core-id=0 \
> -numa cpu,node-id=2,socket-id=1,core-id=1 \
> -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0 \
> -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1 \
> -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0 \
> -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1 \
> -monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait
>
> Call trace in guest dmesg:
> [    0.001000] ------------[ cut here ]------------
> [    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node!
> [node: 2 != 1]. Ignoring dependency.

The guest doesn't expect cores from the same die (which share the L3 cache) to appear in separate numa nodes, so the warning in this case is expected. It is a bug only if the warning is being triggered for valid NUMA configurations like the one in comment #0 (where all cores in each socket are in the same node).

(In reply to Babu Moger from comment #33)
> Posted second RFC patches:
> https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/

Eduardo,
Waiting for your feedback on this series in general.
Also waiting for feedback on setting the new epyc feature bit; the apic id calculation is done much before x86_cpu_load_model is called.
Thanks

Posted the v3 version, please review:
https://lore.kernel.org/qemu-devel/157541968844.46157.17994918142533791313.stgit@naples-babu.amd.com/

QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

All the changes required to fix this bug are committed to master:
https://git.qemu.org/?p=qemu.git;a=commit;h=7b225762c8c05fd31d4c2be116aedfbc00383f8b
Thanks

The needinfo from comment 54 was answered by Igor in comment 55.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

QEMU-5.2 returns the expected (per spec) apic-id, but only for an -smp that is valid for EPYC, so one won't see a warning if -smp resembles a real EPYC cpu. For all other combinations the current kernel may get confused, since for EPYC it uses the package-id+die-id part of the apic-id as the llc_id.
So to test, you need a command line that matches a real EPYC cpu, for example:

EPYC 7232P | 2 × CCD | 4 cores × 2 threads

where a CCD is configured with the help of 'dies' in QEMU; in the EPYC case the number of 'dies' should be equal to the number of numa nodes, so the CLI for the above should look like:

-cpu EPYC \
-smp 16,sockets=1,dies=2,cores=4,threads=2 \
-numa cpu,node-id=0,socket-id=0,die-id=0 \
-numa cpu,node-id=1,socket-id=0,die-id=1

I used https://en.wikipedia.org/wiki/Epyc as a base to play with.

Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release. Reset the ITR='---' as part of the change.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Can you retest it with qemu-kvm-6.2, taking into account comment 70?

(In reply to Igor Mammedov from comment #76)
> Can you retest it with qemu-kvm-6.2?
> Taking in account comment 70?

Hi,

Sorry for the late reply.

I can't reproduce this bug when booting the guest with the special smp configuration.

Guest kernel: 4.18.0-372.2.1.el8.x86_64

Host info:
5.14.0-70.2.1.el9_0.x86_64
qemu-kvm-6.2.0-11.el9_0.2.x86_64
Model name: AMD EPYC 7302 16-Core Processor
dell-per7525-02.lab.eng.pek2.redhat.com
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 31556 MB
node 0 free: 23831 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 32239 MB
node 1 free: 5504 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

I didn't find the EPYC 7302 in the wiki doc, only the EPYC 7302P:
EPYC 7302P | 4 × CCD | 8 cores × 2 threads

Test results:
The guest works well, with no call trace, when booted with the following qemu cli:

-smp 64,sockets=1,cores=8,threads=2,dies=4 \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-31,memdev=ram-node0 \
-numa node,nodeid=1,cpus=32-63,memdev=ram-node1 \

Besides, I also tested these smp combinations:
----------------------------------------------------------------------------------------------------------
-smp 32,sockets=2,cores=8,threads=2              pass
-smp 64,sockets=4,cores=8,threads=2              pass
-smp 64,sockets=2,cores=8,threads=2,dies=2       pass
-smp 16,sockets=1,cores=8,threads=2,dies=1       pass
----------------------------------------------------------------------------------------------------------

But if I change the combination of "cores=8,threads=2", I hit this issue. I'm a little confused about the smp configuration for a specific EPYC machine with numa: is cores*threads a fixed configuration for each EPYC model, and will we improve this in the future for all other combinations? Please help to check this, thanks.

Best regards
Liu Nana

(In reply to liunana from comment #77)
> I didn't find the EPYC 7302 in the wiki doc, only the EPYC 7302P:
> EPYC 7302P | 4 × CCD | 8 cores × 2 threads

The following CLI would correspond to a 7302P:

-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
-smp 32,sockets=1,cores=4,threads=2,dies=4 \
-numa cpu,socket-id=0,die-id=0,node-id=0 \
-numa cpu,socket-id=0,die-id=1,node-id=1 \
-numa cpu,socket-id=0,die-id=2,node-id=2 \
-numa cpu,socket-id=0,die-id=3,node-id=3

where the CCDs (chiplets) map to 'dies' and cores equals the cores per CCD (per the wiki, the Cores/CCD column).

> Test results:
> The guest works well, with no call trace, when booted with the following qemu cli:
> -smp 64,sockets=1,cores=8,threads=2,dies=4 \
> -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
> -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
> -numa node,nodeid=0,cpus=0-31,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=32-63,memdev=ram-node1 \

This is a little bit wrong: for EPYC you have to have the number of numa nodes equal to the number of dies. Also use -numa cpu,socket-id=x,die-id=y,node-id=z to assign CPUs to nodes, unless you know exactly where each cpu-index belongs.

> Besides, I also tested these smp combinations:
> -smp 32,sockets=2,cores=8,threads=2              pass
> -smp 64,sockets=4,cores=8,threads=2              pass
> -smp 64,sockets=2,cores=8,threads=2,dies=2       pass
> -smp 16,sockets=1,cores=8,threads=2,dies=1       pass
>
> But if I change the combination of "cores=8,threads=2", I hit this issue.

The above example, -smp 32,sockets=1,cores=4,threads=2,dies=4, works for me if you assign the numa nodes correctly.

> I'm a little confused about the smp configuration for a specific EPYC machine
> with numa: is cores*threads a fixed configuration for each EPYC model,
> and will we improve this in the future for all other combinations?

As far as I recall, we tried to introduce checks upstream, but the eventual consensus was 'not to bother' with them (and we reverted the relevant commits) in QEMU and to let the user configure whatever they like. If the topology matches real EPYC CPUs there won't be a warning; if the configured topology is fictional, the guest may complain and print a warning. No other changes to improve this are planned.

> Please help to check this, thanks.
>
> Best regards
> Liu Nana
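For re-verification, a small guest-side spot check can help compare what QEMU exposes against the layout discussed above. A sketch, assuming GCC's <cpuid.h> helper and the 0x8000001e field meanings taken from the PPR excerpt in comment #21; run it pinned to each vCPU in turn:

    /* Print CPUID leaf 0x8000001e for the current vCPU. Field decoding
       below is assumed from the Family 17h PPR; build with gcc and run
       pinned to one vCPU at a time, e.g. "taskset -c 2 ./a.out". */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x8000001e not available\n");
            return 1;
        }
        /* Assumed decoding: EAX = extended (initial) APIC ID;
           EBX[7:0] = core id, EBX[15:8] = threads per core - 1;
           ECX[7:0] = node id. */
        printf("ExtApicId=0x%02x CoreId=%u ThreadsPerCore=%u NodeId=%u\n",
               eax, ebx & 0xff, ((ebx >> 8) & 0xff) + 1, ecx & 0xff);
        return 0;
    }

On a correctly configured EPYC-like topology, the extended APIC ID should follow the ApicId[6]/[5:4]/[3]/[2:0] layout, and the node id should agree with the vCPU's -numa assignment.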
Description of problem:
After doing postcopy migration of a guest with pinned NUMA nodes and pinned hugepage file-backed memory, migration succeeds, but checking dmesg shows that the guest on the dst host hit a call trace.

Version-Release number of selected component (if applicable):
src and dst host info: kernel-4.18.0-109.el8.x86_64 & qemu-img-4.0.0-4.module+el8.1.0+3523+b348b848.x86_64
After executing the following steps, src & dst host memory info:
[root@hp-dl385g10-13 ipa]# free -g
              total        used        free      shared  buff/cache   available
Mem:             31           4          25           0           0          25
Swap:            15           0          15
guest info: kernel-4.18.0-109.el8.x86_64

How reproducible:
10/10 in the auto test
1/1 in the manual test

Steps to Reproduce:
1. Run the migration auto case:
[root@hp-dl385g10-13 ipa]# python3 Start2Run.py --test_requirement=rhel7_49060_x86_q35_blockdev --test_cases=rhel7_93722 --src_host_ip=10.73.130.67 --dst_host_ip=10.73.130.69 --share_images_dir=/mnt/nfs --sys_image_name=rhel810-64-virtio-scsi.qcow2 --repeat_times=10
Notes: for the auto test, check each rhel7_93722-2019-07-09-03*_logs/long_debug.log for the auto test steps and results; the logs are in the attachment.

Or do the manual test steps:
1. Configure hugepages on the src & dst hosts:
[root@host ~]# cat /proc/meminfo | grep -i hugepagesize | awk '{print $2}'
2048
[root@host ~]# echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
[root@host ~]# cat /proc/meminfo | grep -i hugepages_total | awk '{print $2}'
2048
[root@host ~]# mkdir /mnt/kvm_hugepage
[root@host ~]# mount -t hugetlbfs none /mnt/kvm_hugepage/
2. Boot the guest with hugepage memdevs and numa on the src host:
/usr/libexec/qemu-kvm \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
3. Boot the guest with "-incoming ..." on the dst host:
/usr/libexec/qemu-kvm \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
-incoming tcp:0:4000 \
4. On the src host and dst host, enable postcopy and do the migration:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.73.130.69:4000
(qemu) migrate_start_postcopy
5. After migration, check whether the guest works well on the dst host.

Actual results:
The guest hit a call trace after migration:
# dmesg
2019-07-09-03:21:04: [    0.001000] Call Trace:
2019-07-09-03:21:04: [    0.001000]  set_cpu_sibling_map+0x149/0x4f0
2019-07-09-03:21:04: [    0.001000]  start_secondary+0xaf/0x200
2019-07-09-03:21:04: [    0.001000]  secondary_startup_64+0xb7/0xc0
2019-07-09-03:21:04: [    0.001000] ---[ end trace 717102d13b5f6712 ]---

Expected results:
No call trace in the guest dmesg.

Additional info: