Bug 1834200 - cpu_x86_cpuid: Assertion `cpu->core_id <= 255' failed
Summary: cpu_x86_cpuid: Assertion `cpu->core_id <= 255' failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eduardo Habkost
QA Contact: Yumei Huang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-11 09:55 UTC by Yumei Huang
Modified: 2021-05-25 06:42 UTC (History)

Fixed In Version: qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 06:42:08 UTC
Type: Bug
Target Upstream Version: commit in BZ
Embargoed:


Description Yumei Huang 2020-05-11 09:55:09 UTC
Description of problem:
Boot a guest with the EPYC CPU model, then hotplug a vCPU with core-id > 255; QEMU aborts and dumps core.

Version-Release number of selected component (if applicable):
qemu-kvm-4.2.0-20.module+el8.2.1+6467+49dc3278
kernel-4.18.0-193.1.2.el8_2.x86_64

How reproducible:
always

Steps to Reproduce:
1. Boot guest with EPYC cpu model
# /usr/libexec/qemu-kvm \
  -smp 1,maxcpus=384,cores=384,threads=1,sockets=1 \
  -cpu EPYC \
  -monitor stdio \
  -M q35,kernel-irqchip=split \
  -device intel-iommu,intremap=on,eim=on

2. Hotplug vcpu with core-id=256
  (qemu) device_add EPYC-x86_64-cpu,id=cpu0,core-id=256,socket-id=0,thread-id=0


Actual results:
(qemu) device_add EPYC-x86_64-cpu,id=cpu0,core-id=256,socket-id=0,thread-id=0
qemu-kvm: /builddir/build/BUILD/qemu-4.2.0/target/i386/cpu.c:5717: cpu_x86_cpuid: Assertion `cpu->core_id <= 255' failed.
Aborted (core dumped)

Expected results:
No core dump.

Additional info:
1. It works fine with other AMD CPU models, e.g. Opteron_G5.

2. Can't reproduce with QEMU 5.0, which hits bug 1828750 instead.

3. (gdb)  bt full
#0  0x00007fdabdee670f in raise () at /lib64/libc.so.6
#1  0x00007fdabded0b25 in abort () at /lib64/libc.so.6
#2  0x00007fdabded09f9 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007fdabdedecc6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000556502cee1e0 in cpu_x86_cpuid (env=env@entry=0x5565040ee550, index=<optimized out>, 
    index@entry=2147483678, count=count@entry=0, eax=eax@entry=0x7fdaae9fdc14, ebx=ebx@entry=0x7fdaae9fdc18, ecx=ecx@entry=0x7fdaae9fdc1c, edx=0x7fdaae9fdc20)
    at /usr/src/debug/qemu-kvm-4.2.0-20.module+el8.2.1+6467+49dc3278.x86_64/target/i386/cpu.c:5717
        cpu = 0x5565040e5d00
        cs = 0x5565040e5d00
        die_offset = <optimized out>
        limit = <optimized out>
        __PRETTY_FUNCTION__ = "cpu_x86_cpuid"
#5  0x0000556502d3d74b in kvm_arch_init_vcpu (cs=0x5565040e5d00)
    at /usr/src/debug/qemu-kvm-4.2.0-20.module+el8.2.1+6467+49dc3278.x86_64/target/i386/kvm.c:1704
        cpuid_data = 
                {cpuid = {nent = 0, padding = 0, entries = 0x7fdaae9fd708}, entries = {{function = 1073741824, index = 0, flags = 0, eax = 1073741825, ebx = 1263359563, ecx = 1447775574, edx = 77, padding = {0, 0, 0}}, {function = 1073741825, index = 0, flags = 0, eax = 16777467, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 0, index = 0, flags = 0, eax = 13, ebx = 1752462657, ecx = 1145913699, edx = 1769238117, padding = {0, 0, 0}}, {function = 1, index = 0, flags = 0, eax = 8392466, ebx = 25167872, ecx = 4143460867, edx = 395049983, padding = {0, 0, 0}}, {function = 2, index = 0, flags = 6, eax = 1, ebx = 0, ecx = 75, edx = 2948992, padding = {0, 0, 0}}, {function = 4, index = 0, flags = 1, eax = 4227858721, ebx = 29360191, ecx = 63, edx = 1, padding = {0, 0, 0}}, {function = 4, index = 1, flags = 1, eax = 4227858722, ebx = 12582975, ecx = 255, edx = 1, padding = {0, 0, 0}}, {function = 4, index = 2, flags = 1, eax = 4227858499, ebx = 29360191, ecx = 1023, edx = 0, padding = {0, 0, 0}}, {function = 4, index = 3, flags = 1, eax = 4236231011, ebx = 62914623, ecx = 8191, edx = 6, padding = {0, 0, 0}}, {function = 4, index = 4, flags = 1, eax = 0, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 5, index = 0, flags = 0, eax = 0, ebx = 0, ecx = 3, edx = 0, padding = {0, 0, 0}}, {function = 6, index = 0, flags = 0, eax = 4, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 7, index = 0, flags = 1, eax = 0, ebx = 547094953, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 11, index = 0, flags = 1, eax = 0, ebx = 1, ecx = 256, edx = 256, padding = {0, 0, 0}}, {function = 11, index = 1, flags = 1, eax = 9, ebx = 384, ecx = 513, edx = 256, padding = {0, 0, 0}}, {function = 11, index = 2, flags = 1, eax = 0, ebx = 0, ecx = 2, edx = 256, padding = {0, 0, 0}}, {function = 13, index = 0, flags = 1, eax = 7, ebx = 832, ecx = 832, edx = 0, padding = {0, 0, 0}}, {function = 13, index = 1, flags = 1, eax = 7, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 13, index = 2, flags = 1, eax = 256, ebx = 576, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 13, index = 63, flags = 1, eax = 0, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 2147483648, index = 0, flags = 0, eax = 2147483678, ebx = 1752462657, ecx = 1145913699, edx = 1769238117, padding = {0, 0, 0}}, {function = 2147483649, index = 0, flags = 0, eax = 8392466, ebx = 0, ecx = 4195315, edx = 802421759, padding = {0, 0, 0}}, {function = 2147483650, index = 0, flags = 0, eax = 541347137, ebx = 1129926725, ecx = 1869762592, edx = 1936942435, padding = {0, 0, 0}}, {function = 2147483651, index = 0, flags = 0, eax = 29295, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 2147483653, index = 0, flags = 0, eax = 33489407, ebx = 33489407, ecx = 537395520, edx = 1074004288, padding = {0, 0, 0}}, {function = 2147483654, index = 0, flags = 0, eax = 0, ebx = 1107313152, ecx = 33579328, edx = 4227392, padding = {0, 0, 0}}, {function = 2147483656, index = 0, flags = 0, eax = 12336, ebx = 0, ecx = 383, edx = 0, padding = {0, 0, 0}}, {function = 2147483677, index = 0, flags = 1, eax = 289, ebx = 29360191, ecx = 63, edx = 1, padding = {0, 0, 0}}, {function = 2147483677, index = 1, flags = 1, eax = 290, ebx = 12582975, ecx = 255, edx = 1, padding = {0, 0, 0}}, {function = 2147483677, index = 2, flags = 1, eax = 67, ebx = 29360191, ecx = 1023, edx = 0, padding = {0, 0, 0}}, {function = 2147483677, index = 3, flags = 1, eax = 49507, ebx = 62914623, ecx = 8191, edx = 6, padding = {0, 0, 0}}, {function = 2147483677, index = 4, flags = 1, eax = 0, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 2147483678, index = 0, flags = 0, eax = 0, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}}, {function = 0, index = 0, flags = 0, eax = 0, ebx = 0, ecx = 0, edx = 0, padding = {0, 0, 0}} <repeats 67 times>}}
        cpu = 0x5565040e5d00
        __func__ = "kvm_arch_init_vcpu"
        env = 0x5565040ee550
        limit = 2147483678
        i = 2147483678
        j = <optimized out>
        cpuid_i = 33
        unused = 1145913699
        c = 0x7fdaae9fdc08
        kvm_base = 1073741824
        max_nested_state_len = <optimized out>
        r = <optimized out>
        local_err = 0x0
        __PRETTY_FUNCTION__ = "kvm_arch_init_vcpu"
#6  0x0000556502c4d2a7 in qemu_kvm_cpu_thread_fn (arg=0x5565040e5d00)
    at /usr/src/debug/qemu-kvm-4.2.0-20.module+el8.2.1+6467+49dc3278.x86_64/cpus.c:1303
        cpu = 0x5565040e5d00
        r = <optimized out>
#7  0x0000556502f77114 in qemu_thread_start (args=0x55650495bd30) at util/qemu-thread-posix.c:519
        __clframe = 
          {__cancel_routine = <optimized out>, __cancel_arg = 0x0, __do_it = 1, __cancel_type = <optimized out>}
        qemu_thread_args = 0x55650495bd30
        start_routine = 0x556502c4d250 <qemu_kvm_cpu_thread_fn>
        arg = 0x5565040e5d00
        r = <optimized out>
#8  0x00007fdabe2792de in start_thread () at /lib64/libpthread.so.0
#9  0x00007fdabdfaae83 in clone () at /lib64/libc.so.6

Comment 1 John Ferlan 2020-05-15 16:14:26 UTC
Eduardo - setting the ITR == 8.2.1 - feel free to reset to '---'... Although seems bug 1828750 may be related.

Seems commit ed78467a2 is where the assert was first added (and it's been there a while too since 3.0)

Comment 3 Eduardo Habkost 2020-06-02 16:45:33 UTC
Dave, do you think we can get an AMD engineer to fix this upstream?

Comment 4 Dr. David Alan Gilbert 2020-06-02 16:51:27 UTC
Yeah, let's add a needinfo on Wei.
I wonder what the right fix is here - just limit max_cpus to 255?

Comment 5 Eduardo Habkost 2020-06-02 17:39:30 UTC
(In reply to Dr. David Alan Gilbert from comment #4)
> Yeah, let's add a needinfo on Wei.
> I wonder what the right fix is here - just limit max_cpus to 255?

nr_cores, more specifically.  I see two possibilities:
* declaring nr_cores > 256 as never supported (or deprecated); or
* omitting the CPUID[8000001E] node if nr_cores is too large.

I'm sure there are other CPUID leaves that break unpredictably if nr_cores or nr_threads is too large, but we never noticed because they don't have any asserts.  It would be nice to fix all of them.
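
To illustrate how such a leaf can break silently, here is a minimal C sketch (hypothetical helper names, not QEMU code) that packs an oversized count into CPUID[1].EBX without masking. The field layout is the architectural one: bits 23:16 hold the logical processor count and bits 31:24 the initial APIC ID. With 384 logical CPUs and APIC ID 256 the result is 25167872 (0x01800800), which happens to match the ebx value of the `function = 1` entry in the backtrace in the description. That QEMU builds the value this way is an inference, not something confirmed in this bug.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: hypothetical helpers, not QEMU functions.
 * CPUID[1].EBX layout: [7:0] brand index, [15:8] CLFLUSH line size in
 * 8-byte units, [23:16] logical processor count, [31:24] initial APIC ID.
 */
uint32_t leaf1_ebx_unmasked(uint32_t apic_id, uint32_t logical_cpus)
{
    /* No masking: values wider than 8 bits spill into the next field
     * or are shifted out of the register entirely. */
    return (apic_id << 24) | (logical_cpus << 16) | (8u << 8);
}

uint32_t leaf1_count_field(uint32_t ebx)
{
    return (ebx >> 16) & 0xffu;
}

uint32_t leaf1_apic_id_field(uint32_t ebx)
{
    return ebx >> 24;
}

void leaf1_overflow_example(void)
{
    /* 384 logical CPUs, APIC ID 256: neither value fits in 8 bits. */
    uint32_t ebx = leaf1_ebx_unmasked(256, 384);

    assert(ebx == 0x01800800u);             /* 25167872 */
    assert(leaf1_count_field(ebx) == 128);  /* 384 truncated to 8 bits */
    assert(leaf1_apic_id_field(ebx) == 1);  /* bit 8 of the count spilled here */
}
```

No assert fires in this case, so the guest simply sees a count of 128 and an APIC ID of 1.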

Comment 6 Eduardo Habkost 2020-06-02 17:41:41 UTC
Simple way to reproduce the bug upstream without using the monitor or the EPYC CPU model:

$ qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -device intel-iommu,intremap=on,eim=on -smp 1,maxcpus=258,cores=258,threads=1,sockets=1 -cpu qemu64,xlevel=0x8000001e -device qemu64-x86_64-cpu,apic-id=257
qemu-system-x86_64: warning: Number of hotpluggable cpus requested (258) exceeds the recommended cpus supported by KVM (240)
qemu-system-x86_64: /home/ehabkost/rh/proj/virt/qemu/target/i386/cpu.c:5888: cpu_x86_cpuid: Assertion `cpu->core_id <= 255' failed.
Aborted (core dumped)
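
For reference, a minimal C sketch of why apic-id=257 presumably ends up as core_id 257 with this -smp line. It assumes a plain socket/core/thread APIC ID layout (no die level) in which each level takes just enough bits for its count. All names are hypothetical and not QEMU API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct topo_ids_sketch {
    unsigned pkg_id;
    unsigned core_id;
    unsigned smt_id;
};

/* Number of bits needed to represent the IDs 0..count-1. */
unsigned bitwidth_for_count_sketch(unsigned count)
{
    unsigned width = 0;

    assert(count >= 1);
    while (width < 31 && (1u << width) < count) {
        width++;
    }
    return width;
}

/* Split an APIC ID into thread, core and package IDs (assumed layout). */
struct topo_ids_sketch decode_apic_id_sketch(uint32_t apic_id,
                                             unsigned cores, unsigned threads)
{
    unsigned smt_width = bitwidth_for_count_sketch(threads);
    unsigned core_width = bitwidth_for_count_sketch(cores);
    struct topo_ids_sketch ids;

    assert(smt_width + core_width < 32);
    ids.smt_id = apic_id & ((1u << smt_width) - 1);
    ids.core_id = (apic_id >> smt_width) & ((1u << core_width) - 1);
    ids.pkg_id = apic_id >> (smt_width + core_width);
    return ids;
}
```

Under these assumptions, `decode_apic_id_sketch(257, 258, 1)` yields smt_id 0, core_id 257 and pkg_id 0: threads=1 needs zero bits, cores=258 needs nine, so the whole APIC ID lands in the core field and exceeds 255.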

Comment 8 Yumei Huang 2020-06-18 02:17:28 UTC
Hit the same issue on the RHEL 8.3 slow train.

qemu-kvm-4.2.0-25.module+el8.3.0+6986+29a4dcd7
kernel-4.18.0-215.el8.x86_64

Comment 9 WEI HUANG 2020-06-18 05:20:37 UTC
I can reproduce it with RHEL 8.3. Let me ask Babu whether he has worked on it. Otherwise I will take a look myself.

Comment 10 Babu Moger 2020-06-18 14:45:46 UTC
Looking at the code again, I think the best way to handle this is to omit CPUID[8000001E] if nr_cores is too large. We can't build CPUID "8000001E" with a core_id greater than 255. To support more than 255 cores we need x2apic support. In that case the topology comes from CPUID 0xB, which appears to work fine.

Eduardo, can I add a check in cpu_x86_cpuid under case 0x8000001E: to return all zeros if core_id > 255?  Or let me know where to add this check (or checks).
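
A rough C sketch of the kind of check proposed here, for illustration only. It is not the posted patch and not QEMU code: the struct, the function name and the simplified encoding (EAX = extended APIC ID, EBX[7:0] = core ID, EBX[15:8] = threads per core minus one, node fields left out) are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register container; not a QEMU type. */
struct cpuid_regs_sketch {
    uint32_t eax, ebx, ecx, edx;
};

/* The core ID field in EBX[7:0] is 8 bits wide. */
#define CORE_ID_FIELD_MAX_SKETCH 255u

/*
 * Fill the registers for leaf 0x8000001E. If core_id does not fit into
 * the 8-bit field, report the leaf as all zeros instead of asserting.
 * Returns true if the leaf was populated.
 */
bool encode_leaf_8000001e_sketch(uint32_t apic_id, uint32_t core_id,
                                 uint32_t threads_per_core,
                                 struct cpuid_regs_sketch *out)
{
    assert(out);
    out->eax = out->ebx = out->ecx = out->edx = 0;

    if (core_id > CORE_ID_FIELD_MAX_SKETCH ||
        threads_per_core == 0 || threads_per_core > 256) {
        return false; /* leaf left as all zeros */
    }

    out->eax = apic_id;
    out->ebx = ((threads_per_core - 1) << 8) | core_id;
    return true;
}
```

With core_id = 256 the function returns false and every register stays zero, so a guest would have to rely on another leaf such as CPUID 0xB for topology.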

Comment 11 Eduardo Habkost 2020-06-18 23:30:06 UTC
(In reply to Babu Moger from comment #10)
> Looking at the code again, I think the best way to handle this is to omit
> CPUID[8000001E] if nr_cores is too large. We can't build CPUID
> "8000001E" with a core_id greater than 255. To support more than 255 cores we
> need x2apic support. In that case the topology comes from CPUID 0xB, which
> appears to work fine.
> 
> Eduardo, can I add a check in cpu_x86_cpuid under case 0x8000001E: to return
> all zeros if core_id > 255?  Or let me know where to add this check (or
> checks).

This sounds like the simplest solution, especially if we want a quick and safe bug fix that can be backported to downstream releases.  Supporting larger core_id sizes and refactoring the CPUID[0x8000001E] code can be implemented later.

Comment 12 Babu Moger 2020-06-19 13:58:13 UTC
Posted the patch https://lore.kernel.org/qemu-devel/159257395689.52908.4409314503988289481.stgit@naples-babu.amd.com/
Please review. thanks

Comment 14 WEI HUANG 2020-10-22 02:37:31 UTC
Hi Eduardo,

What is the plan for this BZ? I did a backport test with Babu's patch from upstream and it did fix the problem (both virt-rhel and virt-av). I can submit the backport patch if needed, possibly for both rhel-av-8.3.1 and rhel-8.4.0? I think it might be too late for rhel-8.3.0?

Thanks,
-Wei

Comment 15 Eduardo Habkost 2020-11-10 19:10:20 UTC
(In reply to WEI HUANG from comment #14)
> Hi Eduardo,
> 
> What is the plan for this BZ? I did a backport test with Babu's patch from
> upstream and it did fix the problem (both virt-rhel and virt-av). I can
> submit the backport patch if needed, possibly for both rhel-av-8.3.1 and
> rhel-8.4.0? I think it might be too late for rhel-8.3.0?
> 
> Thanks,
> -Wei

Sorry for taking so long to reply.  It was already too late for 8.3.0, but we can target this for 8.4.0 because the fix will be included via rebase.

Comment 16 Eduardo Habkost 2020-11-10 19:12:04 UTC
Fixed upstream by:

commit 35ac5dfbcaa4b31470b4e201d26143b8b9a0a1e7
Author: Babu Moger <babu.moger>
Date:   Mon Sep 21 17:47:28 2020 -0500

    target/i386: Remove core_id assert check in CPUID 0x8000001E
    
    With x2apic enabled, configurations can have more that 255 cores.
    Noticed the device add test is hitting an assert when during cpu
    hotplug with core_id > 255. This is due to assert check in the
    CPUID 0x8000001E.
    
    Remove the assert check and fix the problem.
    
    Fixes the bug:
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1834200
    
    Signed-off-by: Babu Moger <babu.moger>
    Message-Id: <160072824160.9666.8890355282135970684.stgit.com>
    Signed-off-by: Eduardo Habkost <ehabkost>

Comment 19 Yumei Huang 2021-01-04 11:29:53 UTC
Verify:
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f
host kernel: 4.18.0-268.el8.x86_64
guest kernel: 4.18.0-269.el8.x86_64

The issue is gone, guest works well.

Comment 21 errata-xmlrpc 2021-05-25 06:42:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

