Bug 1728166 - QEMU provides invalid local APIC ID and extended topology CPUID 0x8000001e leafs when using EPYC cpu model
Summary: QEMU provides invalid local APIC ID and extended topology CPUID 0x8000001e l...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: Unspecified
OS: Unspecified
low
unspecified
Target Milestone: rc
: ---
Assignee: Igor Mammedov
QA Contact: Yumei Huang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-07-09 08:32 UTC by Li Xiaohui
Modified: 2020-04-21 06:02 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
auto-test-log (169 bytes, application/gzip)
2019-07-09 08:35 UTC, Li Xiaohui
no flags Details
Patches to Implement apic id for AMD EPYC models (8.01 KB, application/mbox)
2019-07-29 23:20 UTC, Babu Moger
no flags Details
Patches to Implement apic id for AMD EPYC models 2/4 (7.49 KB, application/mbox)
2019-07-29 23:22 UTC, Babu Moger
no flags Details
Patches to Implement apic id for AMD EPYC models 3/4 (7.08 KB, application/mbox)
2019-07-29 23:23 UTC, Babu Moger
no flags Details
Patches to Implement apic id for AMD EPYC models 4/4 (2.34 KB, application/mbox)
2019-07-29 23:24 UTC, Babu Moger
no flags Details

Description Li Xiaohui 2019-07-09 08:32:38 UTC
Description of problem:
after do Postcopy migration with Numa pinned and Hugepage pinned guest--file backend, migration succeed, check dmesg info, found guest on dst host hit call trace


Version-Release number of selected component (if applicable):
src and dst host info: kernel-4.18.0-109.el8.x86_64 & qemu-img-4.0.0-4.module+el8.1.0+3523+b348b848.x86_64

after execute following steps, src&dst host memory info:
[root@hp-dl385g10-13 ipa]# free -g
              total        used        free      shared  buff/cache   available
Mem:             31           4          25           0           0          25
Swap:            15           0          15

guest info:
kernel-4.18.0-109.el8.x86_64


How reproducible:
10/10 in auto test
1/1 in manual test

Steps to Reproduce:
1.run migration auto case:
[root@hp-dl385g10-13 ipa]# python3 Start2Run.py --test_requirement=rhel7_49060_x86_q35_blockdev --test_cases=rhel7_93722 --src_host_ip=10.73.130.67 --dst_host_ip=10.73.130.69 --share_images_dir=/mnt/nfs --sys_image_name=rhel810-64-virtio-scsi.qcow2 --repeat_times=10

Notes:for auto test, can check each rhel7_93722-2019-07-09-03*_logs/long_debug.log for auto test steps and result. logs are in attachment


or do manual test steps:
1.config hugepages on src&dst host:
[root@host ~]# cat /proc/meminfo  | grep -i hugepagesize | awk '{print $2}'
2048
[root@host ~]# echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
[root@host ~]# cat /proc/meminfo  | grep -i hugepages_total | awk '{print $2}'
2048
[root@host ~]# mkdir /mnt/kvm_hugepage
[root@host ~]# mount -t hugetlbfs none /mnt/kvm_hugepage/
2.boot guest with hugepage memdev and numa on src host:
/usr/libexec/qemu-kvm  \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults  \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm  \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
3.boot guest with "-incoming ..." on dst host
/usr/libexec/qemu-kvm  \
-name "mouse-vm" \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults  \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=root0,slot=0,addr=0x3 \
-device pcie-root-port,id=root1,slot=1,addr=0x4 \
-device pcie-root-port,id=root2,slot=2,addr=0x5 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel810-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm  \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
-incoming tcp:0:4000 \
4.On src host and dst host, enable postcopy and do migration 
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.73.130.69:4000
(qemu) migrate_start_postcopy
5.after migration, check guest whether work well on dst host


Actual results:
found guest hit call trace after migration:
# dmesg
2019-07-09-03:21:04: [    0.001000] Call Trace:
2019-07-09-03:21:04: [    0.001000]  set_cpu_sibling_map+0x149/0x4f0
2019-07-09-03:21:04: [    0.001000]  start_secondary+0xaf/0x200
2019-07-09-03:21:04: [    0.001000]  secondary_startup_64+0xb7/0xc0
2019-07-09-03:21:04: [    0.001000] ---[ end trace 717102d13b5f6712 ]---


Expected results:
no call trace in guest dmesg info 


Additional info:

Comment 1 Li Xiaohui 2019-07-09 08:35:28 UTC
Created attachment 1588638 [details]
auto-test-log

Comment 2 Li Xiaohui 2019-07-09 08:47:43 UTC
found guest hit call trace before migration, so free to change assignee, thanks

Comment 3 Li Xiaohui 2019-07-09 10:58:03 UTC
confirm how to reproduce:
when boot guest with following commands, will hit guest call trace, or won't hit this issue:
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/mnt/kvm_hugepage,size=2147483648,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Comment 4 Li Xiaohui 2019-07-10 08:29:10 UTC
Add host numa info:
[root@hp-dl385g10-13 latest]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7251 8-Core Processor
Stepping:            2
CPU MHz:             2327.725
CPU max MHz:         2100.0000
CPU min MHz:         1200.0000
BogoMIPS:            4192.17
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            4096K
NUMA node0 CPU(s):   0,1,16,17
NUMA node1 CPU(s):   2,3,18,19
NUMA node2 CPU(s):   4,5,20,21
NUMA node3 CPU(s):   6,7,22,23
NUMA node4 CPU(s):   8,9,24,25
NUMA node5 CPU(s):   10,11,26,27
NUMA node6 CPU(s):   12,13,28,29
NUMA node7 CPU(s):   14,15,30,31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

Comment 5 Li Xiaohui 2019-07-10 09:25:36 UTC
[root@hp-dl385g10-13 qemu-kvm]# numactl -H 
available: 8 nodes (0-7)
node 0 cpus: 0 1 16 17
node 0 size: 15764 MB
node 0 free: 15327 MB
node 1 cpus: 2 3 18 19
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 4 5 20 21
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7 22 23
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 8 9 24 25
node 4 size: 16079 MB
node 4 free: 14815 MB
node 5 cpus: 10 11 26 27
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 12 13 28 29
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 14 15 30 31
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  16  32  32  32  32 
  1:  16  10  16  16  32  32  32  32 
  2:  16  16  10  16  32  32  32  32 
  3:  16  16  16  10  32  32  32  32 
  4:  32  32  32  32  10  16  16  16 
  5:  32  32  32  32  16  10  16  16 
  6:  32  32  32  32  16  16  10  16 
  7:  32  32  32  32  16  16  16  10 

Notes: before migration, guest hit call trace, so this issue has no relation with migration

Comment 8 Li Xiaohui 2019-07-12 05:54:23 UTC
I can reproduce this issue on rhel8.0-av:
host(kernel-4.18.0-80.el8.x86_64 & qemu-img-3.1.0-20.module+el8.0.0+3273+6bc1ee54.1.x86_64) and guest(kernel-4.18.0-80.el8.x86_64)

Comment 9 Igor Mammedov 2019-07-12 11:40:34 UTC
 Li Xiaohui,

pls provide access to the host where it reproduces.

Comment 10 Li Xiaohui 2019-07-15 02:43:13 UTC
(In reply to Igor Mammedov from comment #9)
>  Li Xiaohui,
> 
> pls provide access to the host where it reproduces.

Hi Igor,
Could reproduce this bz on EPYC host, and boot guest with "-cpu EPYC".
On intel host or boot guest with "-cpu host/Opteron_G3" on EPYC host, wouldn't hit this issue.

So from above, any EPYC host could reproduce it. If you have no EPYC host to use, could contact me again.

Comment 11 Igor Mammedov 2019-07-16 12:40:46 UTC
From what I see it's CPUID problem, so it's not related to any memory related options.
Simpler steps to reproduce:

 1. EPYC host
 2. QEMU CLI:
     -enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
     -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3
     -kernel kernel_that_supports_epyc

Comparing to an EPYC host, AMD implementation in QEMU produces bogus values in 0x8000001e.ebx|ecx
and uses invalid physical APIC ID (not as described in EPYC spec).

QEMU's EPYC topology, is made backwards, where each vcpu tries to make up topology information
for 0x8000001e based on -smp and ignores/doesn't validate APIC value set by machine,
where as I think it should provide routines for calculating correct APIC ID to board
(based on -sm and -numa options) and in CPU code calculate 0x8000001e based on
APIC ID + correctness checks at cpu.realize() time.

Given it's never worked properly it's too late for 8.1.,
so moving it to 8.2 with condNAK PartnerUpstream.

Comment 14 Babu Moger 2019-07-18 20:03:38 UTC
Adding the email discussion to bugzilla.
==========================================

On Thu, 18 Jul 2019 14:19:07 +0000
"Moger, Babu" <Babu.Moger@amd.com> wrote:

> Hi Igor,
>
> On 7/18/19 3:35 AM, Igor Mammedov wrote:
>> On Wed, 17 Jul 2019 21:14:29 +0000
>> "Moger, Babu" <Babu.Moger@amd.com> wrote:
>>   
>>> Hi,
>>>  I am looking at this. I have recreated the problem. Noticed that APIC ID
>>> enumeration(0x8000001e eax) in qemu does not match 0x8000001e EBX/ECX
>>> values. Also we make some assumptions on number of nodes based on  
>> You can avoid assumption and use 'node-id' property that is assigned
>> to the vCPU instance (which originates from -numa options)  
>
> ok. Will look into it closely. Also, we need to know total number of numa
> nodes user wants to create. Not sure if that information is already
> available somewhere. Will investigate.
it's available, but lets keep technical discussion to Bugzill or qemu-devel
mail-list for history reasons.

>>
>> (Also looking at guest side, kernel seems to assume that node ids
>> in CPUID match system wide node ids and that matches what I've observed
>> on EPYC hardware)
>>   
>>> nr_cores. That is also not matching here. Will investigate further and let
>>> you know. Let me know if I missed anything here.  
>>
>> you also need to make physical APIC ID match topology
>> (as specified in spec and as it is on real hardware)
>> in QEMU it is generated by machine see pc_possible_cpu_arch_ids().
>> Maybe it would be better to move some EPYC topology routines to
>> ./include/hw/i386/topology.h from cpu.c and use it in
>> pc_possible_cpu_arch_ids() for calculating valid APIC ID.
>>   
>>>
>>> Thanks
>>> Babu
>>>
>>>
>>> On 7/16/19 5:44 PM, Grimm, Jon wrote:  
>>>> Hi Eduardo,
>>>>
>>>> +Babu for looking into 
>>>>
>>>> Best Regards, Jon
>>>>     
>>>>> -----Original Message-----
>>>>> From: virt-amd-list-bounces@redhat.com <virt-amd-list-bounces@redhat.com>
>>>>> On Behalf Of Eduardo Habkost
>>>>> Sent: Tuesday, July 16, 2019 4:42 PM
>>>>> To: virt-amd-list@redhat.com
>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>>> Subject: [virt-amd-list] [Bug 1728166] QEMU provides invalid local APIC ID and
>>>>> extended topology CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> Hi,APIC ID
>>>>>
>>>>> Can anybody from AMD take a look at this bug?  It is probably reproducible
>>>>> upstream too.
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>> Bug 1728166 - QEMU provides invalid local APIC ID and extended topology
>>>>> CPUID 0x8000001e leafs when using EPYC cpu model
>>>>>
>>>>> -----------
>>>>> Comment 11 Igor Mammedov 2019-07-16 12:40:46 UTC
>>>>>    
>>>>> >From what I see it's CPUID problem, so it's not related to any memory related    
>>>>> options.
>>>>> Simpler steps to reproduce:
>>>>>
>>>>>  1. EPYC host
>>>>>  2. QEMU CLI:
>>>>>      -enable-kvm -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
>>>>>      -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3
>>>>>      -kernel kernel_that_supports_epyc
>>>>>
>>>>> Comparing to an EPYC host, AMD implementation in QEMU produces bogus
>>>>> values in 0x8000001e.ebx|ecx and uses invalid physical APIC ID (not as
>>>>> described in EPYC spec).
>>>>>
>>>>> QEMU's EPYC topology, is made backwards, where each vcpu tries to make up
>>>>> topology information for 0x8000001e based on -smp and ignores/doesn't
>>>>> validate APIC value set by machine, where as I think it should provide routines
>>>>> for calculating correct APIC ID to board (based on -sm and -numa options) and
>>>>> in CPU code calculate 0x8000001e based on APIC ID + correctness checks at
>>>>> cpu.realize() time.
>>>>>
>>>>> Given it's never worked properly it's too late for 8.1., so moving it to 8.2 with
>>>>> condNAK PartnerUpstream.
>>>>>
>>>>> --
>>>>> Eduardo

Comment 15 Babu Moger 2019-07-24 14:34:59 UTC
Hi Eduardo and Igor,
  Changing the arch_id(or apic_id) involves quite a bit of changes(in file hw/i386/pc.c). Right now apic id is decoded based on sockets, cores, threads. This appears to work for most standard configurations for AMD and other vendors. But this decoding does not follow AMD's APIC ID enumeration. Changing will effect other vendors. I am trying to see if there is a way to handle this without affection other vendors.  Any thoughts?
Thanks
Babu

Comment 16 Eduardo Habkost 2019-07-24 15:07:13 UTC
I'm still trying to understand why fixing this bug would require changing the APIC ID assignment logic we have today.  Do we have a quick explanation why?

We could introduce a new AMD-specific APIC ID assignment API upstream (as opposed to existing i386/topology.h code) and choose the APIC ID assignment implementation depending on the CPU vendor, but I'm not sure this is really necessary.

Comment 17 Babu Moger 2019-07-24 15:27:10 UTC
(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today.  Do we have a quick explanation
> why?
> 
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

Here is the kernel message when booting the kernel. 
[    0.020817] ------------[ cut here ]------------
[    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377 topology_sane.isra.5+0x63/0x70

Kernel is trying to validate the topology here. The cpu 0 and cpu 2 are on two different numa nodes(-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3).
Based on the EPYC topology, kernel shifts the apic id to get the llc_id(per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in arch/x86/kernel/cpu/cacheinfo.c).
It finds that both these cpus are same llc_id but they are on different numa nodes. There is inconsistency.

Comment 18 Igor Mammedov 2019-07-24 17:12:47 UTC
(In reply to Eduardo Habkost from comment #16)
> I'm still trying to understand why fixing this bug would require changing
> the APIC ID assignment logic we have today.  Do we have a quick explanation
> why?
> 
> We could introduce a new AMD-specific APIC ID assignment API upstream (as
> opposed to existing i386/topology.h code) and choose the APIC ID assignment
> implementation depending on the CPU vendor, but I'm not sure this is really
> necessary.

I think we should do exactly this, i.e have a default APIC ID generation routine
(current algorithm) but allow to machine to use a different algorithm keying off
used cpu type. (EPYC spec has quite specific guidance on how to encode APIC ID
which we can do taking as input data provided by -smp and -numa).
EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
from inside of CPU code only and leaving physical APIC ID out of sync
and end up with IDs that guest OS does not expect (as IDs do not match what
is happening on real machines nor match spec).

Looking from other side, a properly encoded physical APIC ID (arch_id in QEMU)
has all necessary data to encode CPUID 0x8000001e leafs.
That's why I'd suggest to rework 
 pc_possible_cpu_arch_ids()
    ...
        ms->possible_cpus->cpus[i].arch_id = x86_cpu_apic_id_from_index(pcms, i);
        x86_topo_ids_from_apicid(ms->possible_cpus->cpus[i].arch_id,
                                 pcms->smp_dies, ms->smp.cores,
                                 ms->smp.threads, &topo);
    ...
logic that encodes APIC ID and then splits it back into topology socket/core/thread
parts and override it by CPU type.
(I think that could be easiest way and would free CPU code from making up data out of nowhere)

Comment 19 Eduardo Habkost 2019-07-24 17:24:49 UTC
(In reply to Igor Mammedov from comment #18)
> I think we should do exactly this, i.e have a default APIC ID generation
> routine
> (current algorithm) but allow to machine to use a different algorithm keying
> off
> used cpu type. (EPYC spec has quite specific guidance on how to encode APIC
> ID
> which we can do taking as input data provided by -smp and -numa).
> EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
> from inside of CPU code only and leaving physical APIC ID out of sync
> and end up with IDs that guest OS does not expect (as IDs do not match what
> is happening on real machines nor match spec).

Do the EPYC APIC ID requirements conflict with the requirements for other processors (the ones currently encoded in i386/topology.h), or are they a superset of the existing requirements?

I will look for the docs, but I would appreciate if anybody has pointers to the specific documentation sections on how APIC IDs should be encoded for EPYC.

Comment 20 Eduardo Habkost 2019-07-24 17:32:37 UTC
(In reply to Babu Moger from comment #17)
> Here is the kernel message when booting the kernel. 
> [    0.020817] ------------[ cut here ]------------
> [    0.020817] sched: CPU #2's llc-sibling CPU #0 is not on the same node!
> [node: 1 != 0]. Ignoring dependency.
> [    0.020817] WARNING: CPU: 2 PID: 0 at arch/x86/kernel/smpboot.c:377
> topology_sane.isra.5+0x63/0x70
> 
> Kernel is trying to validate the topology here. The cpu 0 and cpu 2 are on
> two different numa nodes(-numa node,nodeid=0,cpus=0-1 -numa
> node,nodeid=1,cpus=2-3).
> Based on the EPYC topology, kernel shifts the apic id to get the
> llc_id(per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; in
> arch/x86/kernel/cpu/cacheinfo.c).
> It finds that both these cpus are same llc_id but they are on different numa
> nodes. There is inconsistency.

What should be the correct llc_id for each of the 4 VCPUs in the example command line above? (-smp 4,maxcpus=4,cores=2,threads=1,sockets=2)

Comment 21 Eduardo Habkost 2019-07-24 18:00:56 UTC
(In reply to Eduardo Habkost from comment #19)
> (In reply to Igor Mammedov from comment #18)
> > I think we should do exactly this, i.e have a default APIC ID generation
> > routine
> > (current algorithm) but allow to machine to use a different algorithm keying
> > off
> > used cpu type. (EPYC spec has quite specific guidance on how to encode APIC
> > ID
> > which we can do taking as input data provided by -smp and -numa).
> > EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
> > from inside of CPU code only and leaving physical APIC ID out of sync
> > and end up with IDs that guest OS does not expect (as IDs do not match what
> > is happening on real machines nor match spec).
> 
> Do the EPYC APIC ID requirements conflict with the requirements for other
> processors (the ones currently encoded in i386/topology.h), or are they a
> superset of the existing requirements?
> 
> I will look for the docs, but I would appreciate if anybody has pointers to
> the specific documentation sections on how APIC IDs should be encoded for
> EPYC.

Answering my own question:

Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors:

"""
2.1.10.2.1.3
ApicId Enumeration Requirements
Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
significant bits in the Initial APIC ID that indicate core ID within a processor, in constructing per-core CPUID
masks. Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum number of cores (MNC) that the
processor could theoretically support, not the actual number of cores that are actually implemented or enabled on
the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
• ApicId[6] = Socket ID.
• ApicId[5:4] = Node ID.
• ApicId[3] = Logical CCX L3 complex ID
• ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}.
"""

This imposes the following restrictions:
1) only threads=1 and threads=2 will work
2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
4) apicid_pkg_offset() must be >= 3 (preferably it should be 6)

Enforcing #1 is doable.  #2 is already true.

#3 conflicts with the existing code.  cores=2,threads=2 makes apicid_pkg_offset() return 2.

We can try to make i386/topology.h support socket/node/core-complex IDs, or make a separate APIC ID encoding implementation for EPYC.  Instead of duplicating code, I'd prefer to just add a mechanism that let the EPYC CPU code force a minimum apicid_pkg_offset() value.

Comment 22 Babu Moger 2019-07-24 18:40:17 UTC
(In reply to Eduardo Habkost from comment #21)
> (In reply to Eduardo Habkost from comment #19)
> > (In reply to Igor Mammedov from comment #18)
> > > I think we should do exactly this, i.e have a default APIC ID generation
> > > routine
> > > (current algorithm) but allow to machine to use a different algorithm keying
> > > off
> > > used cpu type. (EPYC spec has quite specific guidance on how to encode APIC
> > > ID
> > > which we can do taking as input data provided by -smp and -numa).
> > > EPYC impl. attempt is trying to infer CPUID extended APIC leaf only on -smp
> > > from inside of CPU code only and leaving physical APIC ID out of sync
> > > and end up with IDs that guest OS does not expect (as IDs do not match what
> > > is happening on real machines nor match spec).
> > 
> > Do the EPYC APIC ID requirements conflict with the requirements for other
> > processors (the ones currently encoded in i386/topology.h), or are they a
> > superset of the existing requirements?
> > 
> > I will look for the docs, but I would appreciate if anybody has pointers to
> > the specific documentation sections on how APIC IDs should be encoded for
> > EPYC.
> 
> Answering my own question:
> 
> Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision
> B1 Processors:
> 
> """
> 2.1.10.2.1.3
> ApicId Enumeration Requirements
> Operating systems are expected to use
> Core::X86::Cpuid::SizeId[ApicIdCoreIdSize], the number of least
> significant bits in the Initial APIC ID that indicate core ID within a
> processor, in constructing per-core CPUID
> masks. Core::X86::Cpuid::SizeId[ApicIdCoreIdSize] determines the maximum
> number of cores (MNC) that the
> processor could theoretically support, not the actual number of cores that
> are actually implemented or enabled on
> the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> • ApicId[6] = Socket ID.
> • ApicId[5:4] = Node ID.
> • ApicId[3] = Logical CCX L3 complex ID
> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} :
> {1'b0,LogicalCoreID[1:0]}.
> """
> 
> This imposes the following restrictions:
> 1) only threads=1 and threads=2 will work
> 2) apicid_core_offset() must be 0 (if threads=1) or 1 (if threads=2)
> 4) apicid_pkg_offset() must be >= 3 (preferably it should be 6)
> 
> Enforcing #1 is doable.  #2 is already true.
> 
> #3 conflicts with the existing code.  cores=2,threads=2 makes
> apicid_pkg_offset() return 2.
> 
> We can try to make i386/topology.h support socket/node/core-complex IDs, or
> make a separate APIC ID encoding implementation for EPYC.  Instead of
> duplicating code, I'd prefer to just add a mechanism that let the EPYC CPU
> code force a minimum apicid_pkg_offset() value.

Here is the logic I am thinking of..

if (cs->nr_threads - 1) {
       *eax = (topo.socket_id << 6) | (topo.node_id << 4) | (topo.ccx_id << 3)
               | (topo.core_id << 1) | topo.smt_id;
    } else {
        *eax = (topo.socket_id << 6) | (topo.node_id << 4) | (topo.ccx_id << 3)
               | (topo.core_id);
}

We need to know number of numa nodes, number of sockets, nr_cores and nr_threads.

APIC id could change for future models. So I think it is better we have a separate apic id handler for EPYC models. If it changes for new models, then we can easily add a new handler.

Comment 23 Eduardo Habkost 2019-07-24 18:56:31 UTC
(In reply to Babu Moger from comment #22)
> Here is the logic I am thinking of..
> 
> if (cs->nr_threads - 1) {
>        *eax = (topo.socket_id << 6) | (topo.node_id << 4) | (topo.ccx_id <<
> 3)
>                | (topo.core_id << 1) | topo.smt_id;
>     } else {
>         *eax = (topo.socket_id << 6) | (topo.node_id << 4) | (topo.ccx_id <<
> 3)
>                | (topo.core_id);
> }

You can implement it this way, but you need to make sure cpu->apic_id is consistent with the value you are calculating.

You will probably need to change x86_cpu_apic_id_from_index() and x86_topo_ids_from_apicid() (or their callers) as suggested by Igor in comment #18.

Comment 24 Babu Moger 2019-07-29 23:20:55 UTC
Created attachment 1594432 [details]
Patches to Implement apic id for AMD EPYC models

First draft patch 1

Comment 25 Babu Moger 2019-07-29 23:22:58 UTC
Created attachment 1594433 [details]
Patches to Implement apic id for AMD EPYC models 2/4

First draft 2/4

Comment 26 Babu Moger 2019-07-29 23:23:45 UTC
Created attachment 1594434 [details]
Patches to Implement apic id for AMD EPYC models 3/4

Comment 27 Babu Moger 2019-07-29 23:24:20 UTC
Created attachment 1594435 [details]
Patches to Implement apic id for AMD EPYC models 4/4

Comment 28 Babu Moger 2019-07-29 23:25:58 UTC
Hi Eduardo/Igor, I have attached the first draft patches for review. Please take a look.

Comment 29 Eduardo Habkost 2019-07-30 02:13:51 UTC
(In reply to Babu Moger from comment #28)
> Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> take a look.

Thank you.  Can you submit them to qemu-devel for review?

Comment 30 Babu Moger 2019-07-30 14:23:23 UTC
(In reply to Eduardo Habkost from comment #29)
> (In reply to Babu Moger from comment #28)
> > Hi Eduardo/Igor, I have attached the first draft patches for review. Please
> > take a look.
> 
> Thank you.  Can you submit them to qemu-devel for review?

Yes.  Working on it.

Comment 31 Yumei Huang 2019-09-09 07:01:18 UTC
Hi,

I hit same call trace info with Intel cpu model, wondering it might not only relate to EPYC ?

QEMU cli:
# /usr/libexec/qemu-kvm  -cpu Haswell-noTSX  \
 -smp 12,sockets=3,cores=2,threads=2 \
 -numa node,nodeid=0  \
 -numa node,nodeid=1 \
 -numa node,nodeid=2 \
 -numa node,nodeid=3 \
 -numa node,nodeid=4  \
 -numa cpu,node-id=0,socket-id=0  \
 -numa cpu,node-id=1,socket-id=1,core-id=0  \
 -numa cpu,node-id=2,socket-id=1,core-id=1  \
 -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0  \
 -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1  \
 -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0  \
 -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1  \
 -monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait 

Inside Guest:
# numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 3
node 0 size: 1604 MB
node 0 free: 840 MB
node 1 cpus: 4 5
node 1 size: 1389 MB
node 1 free: 1105 MB
node 2 cpus: 6 7
node 2 size: 1613 MB
node 2 free: 1327 MB
node 3 cpus: 8 9
node 3 size: 1587 MB
node 3 free: 1335 MB
node 4 cpus: 10 11
node 4 size: 1614 MB
node 4 free: 1368 MB
node distances:
node   0   1   2   3   4 
  0:  10  20  20  20  20 
  1:  20  10  20  20  20 
  2:  20  20  10  20  20 
  3:  20  20  20  10  20 
  4:  20  20  20  20  10 


Call trace in guest dmesg:
[    0.001000] ------------[ cut here ]------------
[    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node! [node: 2 != 1]. Ignoring dependency.
[    0.001000] WARNING: CPU: 6 PID: 0 at arch/x86/kernel/smpboot.c:381 topology_sane.isra.6+0x62/0x70
[    0.001000] Modules linked in:
[    0.001000] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.18.0-141.el8.x86_64 #1
[    0.001000] Hardware name: Red Hat KVM, BIOS 1.12.0-5.module+el8.1.0+4022+29a53beb 04/01/2014
[    0.001000] RIP: 0010:topology_sane.isra.6+0x62/0x70
[    0.001000] Code: 0f 94 c0 c3 80 3d de 3e 37 01 00 75 ef 89 f1 41 89 d9 89 fe 41 89 e8 48 c7 c7 68 7c e7 ac c6 05 c4 3e 37 01 01 e8 d8 3c 06 00 <0f> 0b eb ce 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f b6 05
[    0.001000] RSP: 0000:ffffa76340ce7ed0 EFLAGS: 00010086
[    0.001000] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffffad05a2c8
[    0.001000] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000046
[    0.001000] RBP: 0000000000000002 R08: 00000000000000f4 R09: 0000000000000004
[    0.001000] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000f040
[    0.001000] R13: 0000000000000004 R14: 0000000000000006 R15: 0000000000000006
[    0.001000] FS:  0000000000000000(0000) GS:ffff98f7b1000000(0000) knlGS:0000000000000000
[    0.001000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.001000] CR2: 0000000000000000 CR3: 00000001b880a001 CR4: 00000000001606e0
[    0.001000] Call Trace:
[    0.001000]  set_cpu_sibling_map+0x149/0x4f0
[    0.001000]  start_secondary+0xaf/0x200
[    0.001000]  secondary_startup_64+0xb7/0xc0
[    0.001000] ---[ end trace 0afccdf8a7afbd98 ]---

Comment 32 Yumei Huang 2019-09-09 07:03:38 UTC
Adding more info for comment 31.

qemu-kvm-4.1.0-7.module+el8.1.0+4177+896cb282
kernel-4.18.0-141.el8.x86_64 (both guest and host)

Host:
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             1839.985
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000
BogoMIPS:            4799.75
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

Comment 34 Eduardo Habkost 2019-09-16 15:26:20 UTC
(In reply to Yumei Huang from comment #31)
> Hi,
> 
> I hit same call trace info with Intel cpu model, wondering it might not only
> relate to EPYC ?
> 
> QEMU cli:
> # /usr/libexec/qemu-kvm  -cpu Haswell-noTSX  \
>  -smp 12,sockets=3,cores=2,threads=2 \
>  -numa node,nodeid=0  \
>  -numa node,nodeid=1 \
>  -numa node,nodeid=2 \
>  -numa node,nodeid=3 \
>  -numa node,nodeid=4  \
>  -numa cpu,node-id=0,socket-id=0  \
>  -numa cpu,node-id=1,socket-id=1,core-id=0  \
>  -numa cpu,node-id=2,socket-id=1,core-id=1  \
>  -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=0  \
>  -numa cpu,node-id=3,socket-id=2,core-id=0,thread-id=1  \
>  -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=0  \
>  -numa cpu,node-id=4,socket-id=2,core-id=1,thread-id=1  \
>  -monitor stdio -vnc :0 -qmp tcp:0:4444,server,nowait 
> 
> Inside Guest:
> # numactl -H
> available: 5 nodes (0-4)
> node 0 cpus: 0 1 2 3
> node 0 size: 1604 MB
> node 0 free: 840 MB
> node 1 cpus: 4 5
> node 1 size: 1389 MB
> node 1 free: 1105 MB
> node 2 cpus: 6 7
> node 2 size: 1613 MB
> node 2 free: 1327 MB
> node 3 cpus: 8 9
> node 3 size: 1587 MB
> node 3 free: 1335 MB
> node 4 cpus: 10 11
> node 4 size: 1614 MB
> node 4 free: 1368 MB
> node distances:
> node   0   1   2   3   4 
>   0:  10  20  20  20  20 
>   1:  20  10  20  20  20 
>   2:  20  20  10  20  20 
>   3:  20  20  20  10  20 
>   4:  20  20  20  20  10 
> 
> 
> Call trace in guest dmesg:
> [    0.001000] ------------[ cut here ]------------
> [    0.001000] sched: CPU #6's llc-sibling CPU #4 is not on the same node!
> [node: 2 != 1]. Ignoring dependency.


The guest doesn't expect cores from the same die (which share the L3 cache) to appear in separate numa nodes, so the warning in this case is expected.  It is a bug only if the warning is being triggered for valid NUMA configurations like the one in comment #0 (where all cores in each socket are in the same node).

Comment 35 Babu Moger 2019-09-16 20:06:50 UTC
(In reply to Babu Moger from comment #33)
> Posted second RFC patches.
> https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.
> stgit@localhost.localdomain/

Eduardo, Waiting for your feedback on this series in general. Also feedback on setting the new epyc feature bit. The apic id calculation is done much before x86_cpu_load_model is called.
Thanks

Comment 39 Ademar Reis 2020-02-05 23:00:27 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 44 Babu Moger 2020-04-15 14:08:54 UTC
All the changes required to fix this bug are committed to the master.
https://git.qemu.org/?p=qemu.git;a=commit;h=7b225762c8c05fd31d4c2be116aedfbc00383f8b
Thanks


Note You need to log in before you can comment on or make changes to this bug.