1420726 – [downstream clone - 4.0.7] Out of range CPU APIC ID

Summary: [downstream clone - 4.0.7] Out of range CPU APIC ID

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.0.4
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.0.7
Target Release:	---
Assignee:	Milan Zamazal
QA Contact:	Israel Pinto
Docs Contact:
URL:
Whiteboard:
Depends On:	1406243
Blocks:

Reported:	2017-02-09 12:09 UTC by rhev-integ
Modified:	2020-05-14 15:41 UTC
CC List:	16 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	With this update, the issue where an incorrect calculation meant that virtual machines with an unsupported number of vCPUs attempted to start and failed has been fixed. The maximum number of allowed vCPUs per virtual machine formula was adjusted to take into account the limitation of APIC ID. For more information see https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
Clone Of:	1406243
Environment:
Last Closed:	2017-03-16 15:33:22 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2820831	None	None	None	2017-02-09 12:11:28 UTC
Red Hat Product Errata	RHBA-2017:0542	normal	SHIPPED_LIVE	Red Hat Virtualization Manager 4.0.7	2017-03-16 19:25:04 UTC
oVirt gerrit	71282	None	None	None	2017-02-09 12:11:28 UTC

Description rhev-integ 2017-02-09 12:09:25 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1406243 +++
======================================================================

Description of problem:

The following VM CPU config currently fails to start:

      Total Virtual CPUs: 24
         Virtual Sockets:  1
Cores per Virtual Socket: 12
        Threads per Core:  2

The engine sends this out:

org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand 
...
smpThreadsPerCore=2
maxVCpus=240
smp=24
smpCoresPerSocket=12
emulatedMachine=pc-i440fx-rhel7.2.0
...

Which translates to this on qemu:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

And the VM fails to start with this:

2016-12-20T04:13:02.733683Z qemu-kvm: max_cpus is too large. APIC ID of last CPU is 311

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-engine-4.0.4.4-0.1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. See VM Cpu configuration above

Actual results:
The VM fails to start, and the user has no clue about what is wrong (we copy the last qemu line to the event log, but that is not the error in this case).

Expected results:
Automatically lower maxVCpus for this VM and run it.

Additional information:
https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration

(Originally by Germano Veit Michel)

Comment 1 rhev-integ 2017-02-09 12:09:35 UTC

Apparently this needs to be revised again:

    a21d1388bf5cab3d8db95a6ceda9b5a4a8b5bdc3 (master, between RHV 3.5 and 3.6)

    core: maxVCpus revised
    
    we need to make sure must be result of cores * socket otherwise the VM
    will fail to load by libvirt    
  
    ....

maxVCpus = vm.getCpuPerSocket() * (Math.min(maxSockets, maxVCpus / vm.getCpuPerSocket()));

For the VM of the BZ:

Engine-Config:
 MaxNumOfVmCpus: 240 version: 4.0
 MaxNumOfVmSockets: 16 version: 4.0

VM:
 smpThreadsPerCore=2
 maxVCpus=240
 smpCoresPerSocket=12

maxVCpu = 24 * (min(16,240/24)) = 240

Then engine sends maxVCpus=240 on VDSCreate and the VM fails to start.

Am I mistaken, or does this formula look wrong (missing smpThreadsPerCore)?

(Originally by Germano Veit Michel)

Comment 3 rhev-integ 2017-02-09 12:09:40 UTC

Ok, the formula in 4.0 is actually this one:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

12*2*min(16,240/(12*2)) = 24*10 = 240

Something is still wrong.

(Originally by Germano Veit Michel)
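
As a rough check of the arithmetic above, the 4.0 formula can be evaluated in a few lines of Python. This is only a sketch: the function name engine_max_vcpus and its parameter names are made up for illustration and are not the engine's actual code, and integer division is assumed to mirror Java's int arithmetic.

```python
def engine_max_vcpus(max_vcpus, max_sockets, cores_per_socket, threads_per_core):
    """Hypothetical re-statement of the 4.0 formula quoted in this comment."""
    cpus_per_socket = cores_per_socket * threads_per_core
    # Integer division, as Java would do for int operands (assumption).
    return cpus_per_socket * min(max_sockets, max_vcpus // cpus_per_socket)


# Values from this bug: MaxNumOfVmCpus=240, MaxNumOfVmSockets=16,
# 12 cores per socket, 2 threads per core.
result = engine_max_vcpus(240, 16, 12, 2)
sockets = result // (12 * 2)
print(result, sockets)  # 240 10
```

The result stays at 240, which corresponds to the sockets=10 seen on the qemu command line, so the formula never lowers maxVCpus for this topology.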

Comment 4 rhev-integ 2017-02-09 12:09:46 UTC

Values look ok to me. What's the reason to use threads in the guest? It was (re)introduced for ppc64 and for testing purposes, not generally recommended on x86

(Originally by michal.skrivanek)

Comment 5 rhev-integ 2017-02-09 12:09:53 UTC

(In reply to Michal Skrivanek from comment #3)
> Values look ok to me. What's the reason to use threads in the guest?

Hey Michal, thanks for taking a look at this.

Which value looks ok? An ID of 311 is clearly wrong.

They want it as close as possible to bare metal for performance reasons. That is the config of their physical HW, and they will pin the VM. Makes sense to me.

> It was (re)introduced for ppc64 and for testing purposes, not generally
> recommended on x86

Well, if it was for testing purposes it should have been hidden or enabled via engine-config. Now it needs to be fixed.

What is the problem with setting this up on x86? Isn't it still a bunch of vCPU tasks, just with a different form of presentation (package/cores/threads)? Especially in the case that it's all pinned to the correct physical CPUs (SMT), what's the problem? If there is a real problem with this config on x86, shouldn't it be blocked, or shouldn't the user be warned?

Thanks!

(Originally by Germano Veit Michel)

Comment 6 rhev-integ 2017-02-09 12:09:59 UTC

(In reply to Germano Veit Michel from comment #4)
> (In reply to Michal Skrivanek from comment #3)
> > Values look ok to me. What's the reason to use threads in the guest?
> 
> Hey Michal, thanks for taking a look at this.
> 
> Which value looks ok? An ID of 311 is clearly wrong.

Yes. I meant the formula only:)

> 
> They want it as close as possible to baremetal for performance reasons. That
> is  the config of their Physical HW and they will pin the VM. Makes sense to
> me.

It won't give the best performance. If they care about CPU performance, then CPU pinning can help, and disabling HT on the host is worth a try (then it may make sense to use threads in the guest). Either way, without pinning the host topology is irrelevant.


> > It was (re)introduced for ppc64 and for testing purposes, not generally
> > recommended on x86
> 
> Well, if it was for testing purposes it should have been hidden or enabled
> via engine-config. Now it needs to be fixed.

For testing threads topology in the guest, that is. It's OK to configure it, just not too useful.

> What is the problem with setting this up on x86? Isn't it still a bunch of vCPU
> tasks, just with a different form of presentation (package/cores/threads)?

Not the guest thread - that's not a host qemu thread

> Especially in the case that it's all pinned to the correct physical CPUs
> (SMT), what's the problem? If there is a real problem with this config on
> x86, shouldn't it be blocked, or shouldn't the user be warned?
> 
> Thanks!

There certainly is a problem with the ID. I would need to defer to the QEMU team to answer. Karen, can you please check the topology limitations?

(Originally by michal.skrivanek)

Comment 7 rhev-integ 2017-02-09 12:10:06 UTC

(In reply to Michal Skrivanek from comment #5)
> Yes. I meant the formula only:)

Ohh, right! Yes, it also looks ok to me, but as we can see it's wrong :(

> It won't give best performance. If they care about that CPU performance then
> CPU pinning can help, and disabling HT on host is worth a try (then it may
> make sense to use threads in guest). Either way without pinning the host
> topology is irrelevant

It's pinned. But I don't understand the relation between "disable HT" and "threads=2" that you are suggesting. My understanding is that threads=2 should be used with SMT enabled. Do you have any docs to refer to?

> For testing Threads topology in guest, that is. It's ok to configure it,
> just not too useful.

OK. Our Documentation "Virtual Machine Management Guide" does recommend 1 even for x86 SMT.
Perhaps we should add a more detailed note.

> There certainly is a problem with the id. I would need to defer to QEMU team
> to answer. Karen, can you please check the topology limitations?

The APIC ID is 8-bit. With that MaxVCPUs the last CPU would have ID 311, which won't fit. I think x2APIC would be different, but for now RHV should lower maxVCPUs on such configurations. This looks like a RHV problem to me: it is failing to work around the 8-bit limitation. But it is indeed better to have platform input here.

Thanks!

(Originally by Germano Veit Michel)

Comment 8 rhev-integ 2017-02-09 12:10:12 UTC

-smp [cpus=]n[,cores=cores][,threads=threads][,sockets=sockets][,maxcpus=maxcpus]

Simulate an SMP system with n CPUs. On the PC target, up to 255 CPUs are supported. On Sparc32 target, Linux limits the number of usable CPUs to 4. For the PC target, the number of cores per socket, the number of threads per core and the total number of sockets can be specified. Missing values will be computed. If any of the three values is given, the total number of CPUs n can be omitted. maxcpus specifies the maximum number of hotpluggable CPUs.

So indeed, 2 threads per core. And with HT enabled, a core supposedly "handles" 2 threads.

So why is HT enabled with threads=2 bad, what am I missing? Is it because of the nature of the SMT implementation of Intel HT, like shared resources?

(Originally by Germano Veit Michel)

Comment 9 rhev-integ 2017-02-09 12:10:18 UTC

(In reply to Germano Veit Michel from comment #6)

> APIC is 8-bit. With that MaxVCPUs the last CPU would have ID 311, which

So the formula results in 240, which seems ok to me. Not sure where 311 is coming from

(Originally by michal.skrivanek)

Comment 11 rhev-integ 2017-02-09 12:10:31 UTC

(In reply to Michal Skrivanek from comment #8)
> (In reply to Germano Veit Michel from comment #6)
> 
> > APIC is 8-bit. With that MaxVCPUs the last CPU would have ID 311, which
> 
> So the formula results in 240, which seems ok to me. Not sure where 311 is
> coming from

No, it's NOT ok to result in 240, it's wrong. 311 comes from that wrong 240, which causes 10 sockets in this config, resulting in APIC IDs as high as 311. 

The formula should have adjusted MaxVCPUs to a lower number, resulting in fewer sockets and therefore not going past the 8-bit limit. There are basically 3 sub-fields in that 8-bit ID, and the config RHV is generating simply won't fit. MaxVCPUs needs to be lowered, at least until we switch to x2APIC.

If you check the Intel link I added in comment #0, it explains how the 8-bit value is constructed and you will see why this is a bug in RHV.

Hopefully platform (Eduardo) could help us out to figure out what is the correct formula to determine MaxVCPUs in RHV.

(Originally by Germano Veit Michel)

Comment 13 rhev-integ 2017-02-09 12:10:43 UTC

Eduardo, we need to confirm (or deny) that it's RHV's job to lower maxVCPUs here:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

Because this config won't fit in an 8-bit APIC ID (the APIC ID of the last CPU is 311).

Currently RHV uses the formula below to recalculate maxcpus (and lower it if required). From my point of view this formula is the problem of this BZ:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

Considering that the default config in RHV is
MaxNumOfVmCpus: 240
MaxNumOfVmSockets: 16

Moving on, from the above qemu command line:

12*2*min(16,240/(12*2)) = 24*10 = 240

I believe the formula is wrong, it should have resulted in a lower maxVCpus.

Could you please share your thoughts?

(Originally by Germano Veit Michel)

Comment 14 rhev-integ 2017-02-09 12:10:49 UTC

The rules for calculating the required APIC ID size are based on the specification at:

https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/

The relevant section in that document is: "Sub ID Extraction Parameters for Initial APIC ID".

The main problem with your formula is that it can't calculate the actual field widths using simple multiplication and division alone. Unfortunately, it needs to round the core/thread counts to the nearest power of 2 at some point.

The limit you are hitting is this: the APIC ID of the last VCPU should be < 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

The formulas on the Intel document are:
CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width
CoreOnly_Mask_Width = Log2(1 + (CPUID.(EAX=4, ECX=0):EAX[31:26] ))
SMT_Mask_Width = Log2(RoundToNearestPof2(CPUID.1:EBX[23:16]) /
                      ((CPUID.(EAX=4, ECX=0):EAX[31:26]) + 1))

Note that the Intel formula tells you how to get the field widths based on CPUID data, not how QEMU actually chose the field widths on CPUID. We choose the field widths using the nearest power of 2.

The formulas we use in QEMU can be seen at include/hw/i386/topology.h, but the summary is:

last_apic_id = apic_id_for_cpu(last_pkg_id(), last_core_id(), last_smt_id());

last_pkg_id = last_core_index() / cores_per_socket;
last_core_id = last_core_index() % cores_per_socket;
last_smt_id = last_cpu_index() % threads_per_core;

last_core_index() = last_cpu_index() / threads_per_core;
last_cpu_index() = (max_cpus - 1);

apic_id_for_cpu(pkg_id, core_id, smt_id) = (pkg_id  << pkg_offset()) |
                                           (core_id << core_offset()) |
                                           smt_id;

pkg_offset() = core_offset() + core_width();

core_offset() = smt_width();

core_width() = bitwidth_for_count(cores_per_socket);

smt_width() = bitwidth_for_count(threads_per_core);

bitwidth_for_count(c) = Log2(RoundToNearestPowerOf2(c));


(This is a manual translation from the original C code, so apologies in advance for any typos.)

(Originally by Eduardo Habkost)
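
The summary above can be tried out with a short Python sketch. It follows the names used in the summary, but it is not the QEMU code itself. One assumption is labeled in the code: bitwidth_for_count is taken to round up, i.e. to return the number of bits needed to encode the IDs 0..count-1, which is what makes 12 cores occupy 4 bits.

```python
def bitwidth_for_count(count):
    # Assumption: round up to the next power of 2, i.e. the number of bits
    # needed to encode IDs 0..count-1 (ceil(log2(count))).
    return (count - 1).bit_length()


def last_apic_id(max_cpus, cores_per_socket, threads_per_core):
    """APIC ID of the last VCPU, following the summary in this comment."""
    smt_width = bitwidth_for_count(threads_per_core)
    core_width = bitwidth_for_count(cores_per_socket)
    core_offset = smt_width
    pkg_offset = core_offset + core_width

    last_cpu_index = max_cpus - 1
    last_core_index = last_cpu_index // threads_per_core
    pkg_id = last_core_index // cores_per_socket
    core_id = last_core_index % cores_per_socket
    smt_id = last_cpu_index % threads_per_core

    return (pkg_id << pkg_offset) | (core_id << core_offset) | smt_id


# -smp 24,maxcpus=240,sockets=10,cores=12,threads=2 from this bug
print(last_apic_id(240, 12, 2))  # 311, matches the qemu-kvm error message
# The same topology with maxcpus lowered to 8 sockets' worth of VCPUs
print(last_apic_id(192, 12, 2))  # 247
```

With 12 cores and 2 threads, 5 of the 8 bits are consumed by the core and thread fields, which leaves 3 bits for the package ID. That is why 10 sockets overflow while 8 sockets fit.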

Comment 15 rhev-integ 2017-02-09 12:10:56 UTC

Thanks for the explanation, and for the manual translation :) We should be able to update/correct our formula then.

Germano, the workaround would be to set the supported maximums via engine-config so that they fit the constraints. The relevant parameters are
MaxNumOfVmCpus (240)
MaxNumOfCpuPerSocket (16)
MaxNumOfVmSockets (16)
MaxNumOfThreadsPerCpu (8)

(Originally by michal.skrivanek)

Comment 16 rhev-integ 2017-02-09 12:11:02 UTC

(In reply to Eduardo Habkost from comment #13)

Thank you for a very nice summary! Just one question, to be sure I don't miss some detail:

> The limit you are hitting is this: the APIC ID of the last VCPU should be <
> 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

Did you really mean "the last VCPU should be < 255", or actually "the last VCPU should be < 256"?

(Originally by Milan Zamazal)

Comment 17 rhev-integ 2017-02-09 12:11:08 UTC

Assuming the maximum APIC ID is 255 and I performed the formula transformation correctly, the corrected formula should be

  MaxNumOfVmCpus = pow(2, 8 - (bitwidth(NumOfCores) + bitwidth(NumOfThreads))) * NumOfCores * NumOfThreads

It results in 192 for the reported numbers (12 cores, 2 threads) and in 256 for the maximum numbers (16 cores, 8 threads).

(Originally by Milan Zamazal)
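
The proposed formula can be checked with a small Python sketch. The function names below are illustrative only and are not taken from the engine source; bitwidth is assumed to round up in the same way as bitwidth_for_count in the QEMU summary (bits needed to encode 0..count-1).

```python
def bitwidth(count):
    # Assumption: bits needed to encode IDs 0..count-1, as in the QEMU summary.
    return (count - 1).bit_length()


def max_vcpus_for_topology(cores, threads, apic_id_bits=8):
    """Hypothetical helper evaluating the corrected formula from this comment."""
    used = bitwidth(cores) + bitwidth(threads)
    if used > apic_id_bits:
        raise ValueError("cores/threads alone do not fit in the APIC ID")
    # Bits left for the socket field give the number of sockets that fit.
    return 2 ** (apic_id_bits - used) * cores * threads


print(max_vcpus_for_topology(12, 2))  # 192
print(max_vcpus_for_topology(16, 8))  # 256
```

For the topology of this bug the limit drops from 240 to 192, that is 8 sockets instead of 10.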

Comment 18 rhev-integ 2017-02-09 12:11:13 UTC

(In reply to Milan Zamazal from comment #15)
> (In reply to Eduardo Habkost from comment #13)
> 
> Thank you for a very nice summary! Just one question, to be sure I don't
> miss some detail:
> 
> > The limit you are hitting is this: the APIC ID of the last VCPU should be <
> > 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.
> 
> Did you really mean "the last VCPU should be < 255", or actually "the last
> VCPU should be < 256"?

"It's complicated". :)

QEMU only enforces apic_id < 256, but you are still likely to have trouble if you have a CPU with APIC ID = 255 and your guest OS doesn't use x2apic. The problem here is that apic_id = 255 _can_ work, but it is likely to cause problems in some scenarios. I wouldn't allow it unless it was carefully tested.

So I guess the answer depends on the role of your component:

* If you are not implementing policy but just a mechanism to prevent setups that would never work (in the current QEMU version) due to hard limits (as opposed to "may or may not work" or "not supported by Red Hat"), you can just check if apic_id < 256.
* If you want to implement policy to prevent the user from running unsupported setups, I recommend ensuring apic_id < 255.

Note that there is ongoing work to support more than 256 VCPUs, so whatever is the limit you choose to enforce, it is likely to change in the future.

Sorry for the leaky abstraction. Maybe we could work with the libvirt folks to try to expose those limits through an API somehow.

(Originally by Eduardo Habkost)

Comment 19 rhev-integ 2017-02-09 12:11:19 UTC

I see, thank you for the clarification. I think we should prevent running unsupported setups. It's not a big problem, since we supported a maximum of 240 VCPUs until recently and, as explained above, the overall limit is going to change in the future.

So the formula above must be amended a bit. The maximum apic_id of 255 (i.e. 256 VCPUs) is reachable only when the 8-bit capacity is fully utilized. That can happen only when all the particular values (threads, cores) are powers of 2 and the resulting maximum VCPU count is 256. Thus if the resulting value is 256, we must reduce it to 255. All other cases should be safe.

(Originally by Milan Zamazal)
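
A minimal Python sketch of the amended rule, under the same assumptions as before: the helper names are hypothetical, and the bit widths round up as in the QEMU summary. It is meant only to illustrate the reasoning in this comment, not to reproduce the actual engine patch.

```python
def bitwidth(count):
    # Assumption: bits needed to encode IDs 0..count-1.
    return (count - 1).bit_length()


def last_apic_id(max_cpus, cores, threads):
    """APIC ID of the last VCPU, per the QEMU summary in this thread."""
    last_cpu = max_cpus - 1
    last_core = last_cpu // threads
    pkg_id, core_id, smt_id = last_core // cores, last_core % cores, last_cpu % threads
    core_offset = bitwidth(threads)
    pkg_offset = core_offset + bitwidth(cores)
    return (pkg_id << pkg_offset) | (core_id << core_offset) | smt_id


def supported_max_vcpus(cores, threads):
    """Corrected formula, reduced to 255 when it would otherwise yield 256."""
    limit = 2 ** (8 - (bitwidth(cores) + bitwidth(threads))) * cores * threads
    return 255 if limit == 256 else limit


print(supported_max_vcpus(12, 2))                       # 192
print(supported_max_vcpus(16, 8))                       # 255
print(last_apic_id(supported_max_vcpus(16, 8), 16, 8))  # 254
```

The reduction applies only when both counts are powers of 2, because only then are the core and thread fields fully used and the last APIC ID can reach 255.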

Comment 22 Israel Pinto 2017-02-20 14:23:32 UTC

Verified with:
Red Hat Virtualization Manager Version: 4.0.7.1-0.1.el7ev

Steps:
Run a VM with 1 socket, 6 cores and 2 threads per core.

Results:
VM is up.

Comment 24 errata-xmlrpc 2017-03-16 15:33:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html
