Description of problem:

The following VM CPU configuration currently fails to start:

Total Virtual CPUs: 24
Virtual Sockets: 1
Cores per Virtual Socket: 12
Threads per Core: 2

The engine sends this out:

org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand ... smpThreadsPerCore=2 maxVCpus=240 smp=24 smpCoresPerSocket=12 emulatedMachine=pc-i440fx-rhel7.2.0 ...

Which translates to this on qemu:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

And the VM fails to start with this:

2016-12-20T04:13:02.733683Z qemu-kvm: max_cpus is too large. APIC ID of last CPU is 311

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-engine-4.0.4.4-0.1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. See VM CPU configuration above

Actual results:
VM fails to start; the user does not have any clue about what is wrong (we copy the last qemu line to the event log, but that is not the error in this case).

Expected results:
Automatically lower maxVCpus for this VM and run it.

Additional information:
https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
Apparently this needs to be revised again: a21d1388bf5cab3d8db95a6ceda9b5a4a8b5bdc3 (master, between RHV 3.5 and 3.6)

    core: maxVCpus revised
    we need to make sure must be result of cores * socket
    otherwise the VM will fail to load by libvirt
    ....
    maxVCpus = vm.getCpuPerSocket() * (Math.min(maxSockets, maxVCpus / vm.getCpuPerSocket()));

For the VM of the BZ:

Engine-Config:
MaxNumOfVmCpus: 240 version: 4.0
MaxNumOfVmSockets: 16 version: 4.0

VM:
smpThreadsPerCore=2 maxVCpus=240 smpCoresPerSocket=12

maxVCpu = 24 * (min(16, 240/24)) = 240

Then the engine sends maxVCpus=240 on VDSCreate and the VM fails to start.

Am I mistaken, or does this formula look wrong (missing smpThreadsPerCore)?
Ok, the formula in 4.0 is actually this one:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

12*2*min(16, 240/(12*2)) = 24*10 = 240

Something is still wrong.
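For reference, this is a minimal sketch of that calculation (written in Python for illustration; the variable names mirror the Java code), plugging in the values from this BZ:

```python
# Engine defaults and the VM's topology from this BZ
cpu_per_socket = 12    # smpCoresPerSocket
threads_per_core = 2   # smpThreadsPerCore
max_sockets = 16       # MaxNumOfVmSockets
max_vcpus = 240        # MaxNumOfVmCpus

# The 4.0 formula, translated literally
effective_max = cpu_per_socket * threads_per_core * min(
    max_sockets, max_vcpus // (cpu_per_socket * threads_per_core))

print(effective_max)  # 24 * min(16, 10) = 240
```

So the formula really does hand 240 to VDSCreate for this topology.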
Values look ok to me. What's the reason to use threads in the guest? It was (re)introduced for ppc64 and for testing purposes, not generally recommended on x86.
(In reply to Michal Skrivanek from comment #3)
> Values look ok to me. What's the reason to use threads in the guest?

Hey Michal, thanks for taking a look at this.

Which value looks ok? An ID of 311 is clearly wrong.

They want it as close as possible to bare metal for performance reasons. That is the config of their physical HW and they will pin the VM. Makes sense to me.

> It was (re)introduced for ppc64 and for testing purposes, not generally
> recommended on x86

Well, if it was for testing purposes it should have been hidden or enabled via engine-config. Now it needs to be fixed.

What's the problem with setting this up on x86? Isn't it still a bunch of vCPU tasks, just with a different form of presentation (package/cores/threads)? Especially in the case that it's all pinned to the correct physical CPUs (SMT), what's the problem? If there is a real problem with this config on x86, shouldn't it block/warn the user?

Thanks!
(In reply to Germano Veit Michel from comment #4)
> Hey Michal, thanks for taking a look at this.
>
> Which value looks ok? An ID of 311 is clearly wrong.

Yes. I meant the formula only :)

> They want it as close as possible to bare metal for performance reasons. That
> is the config of their physical HW and they will pin the VM. Makes sense to
> me.

It won't give the best performance. If they care about CPU performance then CPU pinning can help, and disabling HT on the host is worth a try (then it may make sense to use threads in the guest). Either way, without pinning, the host topology is irrelevant.

> > It was (re)introduced for ppc64 and for testing purposes, not generally
> > recommended on x86
>
> Well, if it was for testing purposes it should have been hidden or enabled
> via engine-config. Now it needs to be fixed.

For testing the threads topology in the guest, that is. It's ok to configure it, just not too useful.

> What's the problem with setting this up on x86? Isn't it still a bunch of
> vCPU tasks, just with a different form of presentation
> (package/cores/threads)?

Not the guest thread - that's not a host qemu thread.

> Especially in the case that it's all pinned to the correct physical CPUs
> (SMT), what's the problem? If there is a real problem with this config on
> x86, shouldn't it block/warn the user?

There certainly is a problem with the ID. I would need to defer to the QEMU team to answer. Karen, can you please check the topology limitations?
(In reply to Michal Skrivanek from comment #5)
> Yes. I meant the formula only :)

Ohh, right! Yes, it also looks ok to me, but as we can see it's wrong :(

> It won't give the best performance. If they care about CPU performance then
> CPU pinning can help, and disabling HT on the host is worth a try (then it
> may make sense to use threads in the guest). Either way, without pinning,
> the host topology is irrelevant.

It's pinned. But I am not understanding the relation of "disable HT" with "threads=2" that you are suggesting. My understanding is that threads=2 should be used with SMT enabled. Do you have any docs to refer to?

> For testing the threads topology in the guest, that is. It's ok to configure
> it, just not too useful.

OK. Our documentation, the "Virtual Machine Management Guide", does recommend 1 even for x86 SMT. Perhaps we should add a more detailed note.

> There certainly is a problem with the ID. I would need to defer to the QEMU
> team to answer. Karen, can you please check the topology limitations?

The APIC ID is 8-bit. With that maxVCpus the last CPU would have ID 311, which won't fit. I think x2APIC would be different, but for now RHV should lower maxVCpus on such configurations. This looks like an RHV problem to me, which is failing to work around the 8-bit limitation. But better have platform input here indeed. Thanks!
From the qemu documentation:

-smp [cpus=]n[,cores=cores][,threads=threads][,sockets=sockets][,maxcpus=maxcpus]

    Simulate an SMP system with n CPUs. On the PC target, up to 255 CPUs are
    supported. On the Sparc32 target, Linux limits the number of usable CPUs
    to 4. For the PC target, the number of cores per socket, the number of
    threads per core and the total number of sockets can be specified.
    Missing values will be computed. If any of the three values is given, the
    total number of CPUs n can be omitted. maxcpus specifies the maximum
    number of hotpluggable CPUs.

So indeed, 2 threads per core. And with HT enabled, a core supposedly "handles" 2 threads. So why is HT enabled with threads=2 bad? What am I missing? Is it because of the nature of the SMT implementation of Intel HT, like shared resources?
(In reply to Germano Veit Michel from comment #6)
> APIC is 8-bit. With that MaxVCPUs the last CPU would have ID 311, which

So the formula results in 240, which seems ok to me. Not sure where 311 is coming from.
(In reply to Michal Skrivanek from comment #8)
> So the formula results in 240, which seems ok to me. Not sure where 311 is
> coming from.

No, it's NOT ok to result in 240; it's wrong. 311 comes from that wrong 240, which produces 10 sockets in this config, resulting in APIC IDs as high as 311. The formula should have adjusted maxVCpus to a lower number, resulting in fewer sockets and therefore not going past the 8-bit limit.

There are basically 3 sub-fields in that 8-bit ID, and the config RHV is generating simply won't fit. maxVCpus needs to be lowered, at least until we switch to x2APIC.

If you check the Intel link I added in comment #0, it explains how the 8-bit value is constructed and you will see why this is a bug in RHV. Hopefully platform (Eduardo) can help us figure out the correct formula to determine maxVCpus in RHV.
Eduardo, we need to confirm (or deny) that it's RHV's job to lower maxVCpus here:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

because this config won't fit in the 8-bit APIC CPU ID (last CPU is 311).

Currently RHV uses the formula below to recalculate maxcpus (and lower it if required). From my point of view this formula is the problem of this BZ:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

Considering that the default config in RHV is:

MaxNumOfVmCpus: 240
MaxNumOfVmSockets: 16

Moving on, from the above qemu command line:

12*2*min(16, 240/(12*2)) = 24*10 = 240

I believe the formula is wrong; it should have resulted in a lower maxVCpus. Could you please share your thoughts?
The rules for calculating the required APIC ID size are based on the specification at:

https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/

The relevant section in that document is: "Sub ID Extraction Parameters for Initial APIC ID".

The main problem with your formula is that it can't calculate the actual field widths using simple multiplication and division alone. Unfortunately, it needs to round the core/thread counts to the nearest power of 2 at some point.

The limit you are hitting is this: the APIC ID of the last VCPU should be < 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

The formulas in the Intel document are:

CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width
CoreOnly_Mask_Width = Log2(1 + (CPUID.(EAX=4, ECX=0):EAX[31:26]))
SMT_Mask_Width = Log2(RoundToNearestPof2(CPUID.1:EBX[23:16]) / ((CPUID.(EAX=4, ECX=0):EAX[31:26]) + 1))

Note that the Intel formula tells you how to derive the field widths from CPUID data, not how QEMU actually chooses the field widths it exposes on CPUID. We choose the field widths using the nearest power of 2. The formulas we use in QEMU can be seen at include/hw/i386/topology.h, but the summary is:

last_apic_id = apic_id_for_cpu(last_pkg_id(), last_core_id(), last_smt_id());

last_pkg_id = last_core_index() / cores_per_socket;
last_core_id = last_core_index() % cores_per_socket;
last_smt_id = last_cpu_index() % threads_per_core;
last_core_index() = last_cpu_index() / threads_per_core;
last_cpu_index() = (max_cpus - 1);

apic_id_for_cpu(pkg_id, core_id, smt_id) =
    (pkg_id << pkg_offset()) | (core_id << core_offset()) | smt_id;

pkg_offset() = core_offset() + core_width();
core_offset() = smt_width();
core_width() = bitwidth_for_count(cores_per_socket);
smt_width() = bitwidth_for_count(threads_per_core);
bitwidth_for_count(c) = Log2(RoundToNearestPowerOf2(c));

(This is a manual translation from the original C code, so apologies in advance for any typos.)
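For what it's worth, that summary can be turned into a small runnable sketch (Python here purely for illustration; the authoritative logic is the C code in QEMU's include/hw/i386/topology.h). Plugging in this BZ's topology reproduces the 311 from the error message:

```python
def bitwidth_for_count(c):
    # Log2(RoundToNearestPowerOf2(c)): bits needed to hold IDs 0..c-1
    return (c - 1).bit_length()

def last_apic_id(max_cpus, cores_per_socket, threads_per_core):
    smt_width = bitwidth_for_count(threads_per_core)
    core_width = bitwidth_for_count(cores_per_socket)
    core_offset = smt_width
    pkg_offset = core_offset + core_width

    last_cpu_index = max_cpus - 1
    last_core_index = last_cpu_index // threads_per_core
    pkg_id = last_core_index // cores_per_socket
    core_id = last_core_index % cores_per_socket
    smt_id = last_cpu_index % threads_per_core
    return (pkg_id << pkg_offset) | (core_id << core_offset) | smt_id

# The BZ's config: -smp 24,maxcpus=240,sockets=10,cores=12,threads=2
# 12 cores round up to 16 (4 bits), 2 threads need 1 bit, so the
# package ID starts at bit 5: (9 << 5) | (11 << 1) | 1 = 311
print(last_apic_id(240, 12, 2))  # 311 - does not fit in 8 bits
```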
Thanks for the explanation! And the manual translation :) We should be able to update/correct our formula then.

Germano, the workaround would be to set the supported maximums to fit the constraints via engine-config. The relevant parameters are:

MaxNumOfVmCpus (240)
MaxNumOfCpuPerSocket (16)
MaxNumOfVmSockets (16)
MaxNumOfThreadsPerCpu (8)
(In reply to Eduardo Habkost from comment #13)

Thank you for a very nice summary! Just one question, to be sure I don't miss some detail:

> The limit you are hitting is this: the APIC ID of the last VCPU should be <
> 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

Did you really mean "the last VCPU should be < 255", or actually "the last VCPU should be < 256"?
Assuming the maximum APIC ID is 255 and I performed the formula transformation correctly, the corrected formula should be:

MaxNumOfVmCpus = pow(2, 8 - (bitwidth(NumOfCores) + bitwidth(NumOfThreads))) * NumOfCores * NumOfThreads

It results in 192 for the reported numbers (12 cores, 8 threads) and in 256 for the maximum numbers (16 cores, 8 threads).
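Spelling that out with the numbers above (a quick sanity check in Python; bitwidth here means Log2 of the count rounded up to a power of 2, as in the QEMU summary):

```python
def bitwidth(c):
    # Log2(RoundUpToNearestPowerOf2(c))
    return (c - 1).bit_length()

def max_num_of_vm_cpus(cores, threads):
    # Remaining APIC ID bits go to the package (socket) field
    return 2 ** (8 - (bitwidth(cores) + bitwidth(threads))) * cores * threads

print(max_num_of_vm_cpus(12, 8))  # 192: 12 cores use 4 bits, 8 threads use 3
print(max_num_of_vm_cpus(16, 8))  # 256: the 8-bit APIC ID is fully utilized
```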
(In reply to Milan Zamazal from comment #15)
> Did you really mean "the last VCPU should be < 255", or actually "the last
> VCPU should be < 256"?

"It's complicated". :)

QEMU only enforces apic_id < 256, but you are still likely to have trouble if you have a CPU with APIC ID = 255 and your guest OS doesn't use x2apic. The problem here is that apic_id = 255 _can_ work, but it is likely to cause problems in some scenarios. I wouldn't allow it unless it was carefully tested.

So I guess the answer depends on the role of your component:

* If you are not implementing policy, but just a mechanism to prevent setups that would never work (in the current QEMU version) due to hard limits (as opposed to "may or may not work" or "not supported by Red Hat"), you can just check if apic_id < 256.

* If you want to implement policy to prevent the user from running unsupported setups, I recommend ensuring apic_id < 255.

Note that there is ongoing work to support more than 256 VCPUs, so whatever limit you choose to enforce, it is likely to change in the future. Sorry for the leaky abstraction. Maybe we could work with the libvirt folks to try to expose those limits through an API somehow.
I see, thank you for the clarification. I think we should prevent running unsupported setups; it's not a big problem, since we supported a maximum of 240 VCPUs until recently, and as explained above the overall limit is going to change in the future.

So the formula above must be amended a bit. The maximum apic_id of 255 (i.e. 256 VCPUs) is reachable only when the 8-bit capacity is fully utilized, which can happen only when all the particular values (threads, cores) are powers of 2 and the resulting maximum VCPU count is 256. Thus if the resulting value is 256, we must reduce it to 255. All other cases are safe.
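A sketch of the amended calculation under the assumptions above (Python for illustration only; the names are mine, not the engine's):

```python
def bitwidth(c):
    # Log2 of c rounded up to the nearest power of 2
    return (c - 1).bit_length()

def max_num_of_vm_cpus(cores, threads):
    limit = 2 ** (8 - (bitwidth(cores) + bitwidth(threads))) * cores * threads
    # 256 is reachable only when cores and threads are powers of 2;
    # cap at 255 so the last APIC ID stays < 255 (policy per comment #16)
    return min(limit, 255)

print(max_num_of_vm_cpus(12, 8))  # 192
print(max_num_of_vm_cpus(16, 8))  # 255 (capped from 256)
```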
Verified with:
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

Steps:
1. Run a VM with 1 socket, 6 cores and 2 threads per core
2. Run a VM with 240 CPUs; it should fail and give an explanation of why it failed

Results:
1. VM is up.
2. VM failed to run with the error message:

Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details: The host host_mixed_2 did not satisfy internal filter CPU because it does not have enough cores to run the VM. The host host_mixed_1 did not satisfy internal filter CPU because it does not have enough cores to run the VM.