Description of problem:

The following VM CPU configuration currently fails to start:

Total Virtual CPUs: 24
Virtual Sockets: 1
Cores per Virtual Socket: 12
Threads per Core: 2

The engine sends this out:

org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand ... smpThreadsPerCore=2 maxVCpus=240 smp=24 smpCoresPerSocket=12 emulatedMachine=pc-i440fx-rhel7.2.0 ...

Which translates to this on qemu:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

And the VM fails to start with this:

2016-12-20T04:13:02.733683Z qemu-kvm: max_cpus is too large. APIC ID of last CPU is 311

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-engine-4.0.4.4-0.1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. See VM CPU configuration above

Actual results:
VM fails to start; the user does not have any clue about what is wrong (we copy the last qemu line to the event log, but that is not the error in this case).

Expected results:
Automatically lower maxVCpus for this VM and run it.

Additional information:
https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
Apparently this needs to be revised again: a21d1388bf5cab3d8db95a6ceda9b5a4a8b5bdc3 (master, between RHV 3.5 and 3.6)

    core: maxVCpus revised
    we need to make sure must be result of cores * socket
    otherwise the VM will fail to load by libvirt
    ....
    maxVCpus = vm.getCpuPerSocket() * (Math.min(maxSockets, maxVCpus / vm.getCpuPerSocket()));

For the VM of the BZ:

Engine-Config:
MaxNumOfVmCpus: 240 version: 4.0
MaxNumOfVmSockets: 16 version: 4.0

VM:
smpThreadsPerCore=2 maxVCpus=240 smpCoresPerSocket=12

maxVCpu = 24 * (min(16, 240/24)) = 240

Then the engine sends maxVCpus=240 on VDSCreate and the VM fails to start.

Am I mistaken, or does this formula look wrong (missing smpThreadsPerCore)?
Ok, the formula in 4.0 is actually this one:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

12*2*min(16, 240/(12*2)) = 24*10 = 240

Something is still wrong.
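For reference, this is a minimal sketch of that calculation (written in Python for illustration; the variable names mirror the Java code), plugging in the values from this BZ:

```python
# Engine defaults and the VM's topology from this BZ
cpu_per_socket = 12    # smpCoresPerSocket
threads_per_core = 2   # smpThreadsPerCore
max_sockets = 16       # MaxNumOfVmSockets
max_vcpus = 240        # MaxNumOfVmCpus

# The 4.0 formula, translated literally
effective_max = cpu_per_socket * threads_per_core * min(
    max_sockets, max_vcpus // (cpu_per_socket * threads_per_core))

print(effective_max)  # 24 * min(16, 10) = 240
```

So the formula really does hand 240 to VDSCreate for this topology.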
Values look ok to me. What's the reason to use threads in the guest? It was (re)introduced for ppc64 and for testing purposes, not generally recommended on x86.
(In reply to Michal Skrivanek from comment #3)
> Values look ok to me. What's the reason to use threads in the guest?

Hey Michal, thanks for taking a look at this.

Which value looks ok? An ID of 311 is clearly wrong.

They want it as close as possible to bare metal for performance reasons. That is the config of their physical HW and they will pin the VM. Makes sense to me.

> It was (re)introduced for ppc64 and for testing purposes, not generally
> recommended on x86

Well, if it was for testing purposes it should have been hidden or enabled via engine-config. Now it needs to be fixed.

What's the problem with setting this up on x86? Isn't it still a bunch of vCPU tasks, just with a different form of presentation (package/cores/threads)? Especially in the case that it's all pinned to the correct physical CPUs (SMT), what's the problem? If there is a real problem with this config on x86, shouldn't it block/warn the user?

Thanks!
(In reply to Germano Veit Michel from comment #4)
> Hey Michal, thanks for taking a look at this.
>
> Which value looks ok? An ID of 311 is clearly wrong.

Yes. I meant the formula only :)

> They want it as close as possible to bare metal for performance reasons. That
> is the config of their physical HW and they will pin the VM. Makes sense to
> me.

It won't give the best performance. If they care about CPU performance then CPU pinning can help, and disabling HT on the host is worth a try (then it may make sense to use threads in the guest). Either way, without pinning, the host topology is irrelevant.

> > It was (re)introduced for ppc64 and for testing purposes, not generally
> > recommended on x86
>
> Well, if it was for testing purposes it should have been hidden or enabled
> via engine-config. Now it needs to be fixed.

For testing the threads topology in the guest, that is. It's ok to configure it, just not too useful.

> What's the problem with setting this up on x86? Isn't it still a bunch of
> vCPU tasks, just with a different form of presentation
> (package/cores/threads)?

Not the guest thread - that's not a host qemu thread.

> Especially in the case that it's all pinned to the correct physical CPUs
> (SMT), what's the problem? If there is a real problem with this config on
> x86, shouldn't it block/warn the user?

There certainly is a problem with the ID. I would need to defer to the QEMU team to answer. Karen, can you please check the topology limitations?
(In reply to Michal Skrivanek from comment #5)
> Yes. I meant the formula only :)

Ohh, right! Yes, it also looks ok to me, but as we can see it's wrong :(

> It won't give the best performance. If they care about CPU performance then
> CPU pinning can help, and disabling HT on the host is worth a try (then it
> may make sense to use threads in the guest). Either way, without pinning,
> the host topology is irrelevant.

It's pinned. But I am not understanding the relation of "disable HT" with "threads=2" that you are suggesting. My understanding is that threads=2 should be used with SMT enabled. Do you have any docs to refer to?

> For testing the threads topology in the guest, that is. It's ok to configure
> it, just not too useful.

OK. Our documentation, the "Virtual Machine Management Guide", does recommend 1 even for x86 SMT. Perhaps we should add a more detailed note.

> There certainly is a problem with the ID. I would need to defer to the QEMU
> team to answer. Karen, can you please check the topology limitations?

The APIC ID is 8-bit. With that maxVCpus the last CPU would have ID 311, which won't fit. I think x2APIC would be different, but for now RHV should lower maxVCpus on such configurations. This looks like an RHV problem to me, which is failing to work around the 8-bit limitation. But better have platform input here indeed. Thanks!
From the qemu documentation:

-smp [cpus=]n[,cores=cores][,threads=threads][,sockets=sockets][,maxcpus=maxcpus]

    Simulate an SMP system with n CPUs. On the PC target, up to 255 CPUs are
    supported. On the Sparc32 target, Linux limits the number of usable CPUs
    to 4. For the PC target, the number of cores per socket, the number of
    threads per core and the total number of sockets can be specified.
    Missing values will be computed. If any of the three values is given, the
    total number of CPUs n can be omitted. maxcpus specifies the maximum
    number of hotpluggable CPUs.

So indeed, 2 threads per core. And with HT enabled, a core supposedly "handles" 2 threads. So why is HT enabled with threads=2 bad? What am I missing? Is it because of the nature of the SMT implementation of Intel HT, like shared resources?
(In reply to Germano Veit Michel from comment #6)
> APIC is 8-bit. With that MaxVCPUs the last CPU would have ID 311, which

So the formula results in 240, which seems ok to me. Not sure where 311 is coming from.
(In reply to Michal Skrivanek from comment #8)
> So the formula results in 240, which seems ok to me. Not sure where 311 is
> coming from.

No, it's NOT ok to result in 240; it's wrong. 311 comes from that wrong 240, which produces 10 sockets in this config, resulting in APIC IDs as high as 311. The formula should have adjusted maxVCpus to a lower number, resulting in fewer sockets and therefore not going past the 8-bit limit.

There are basically 3 sub-fields in that 8-bit ID, and the config RHV is generating simply won't fit. maxVCpus needs to be lowered, at least until we switch to x2APIC.

If you check the Intel link I added in comment #0, it explains how the 8-bit value is constructed and you will see why this is a bug in RHV. Hopefully platform (Eduardo) can help us figure out the correct formula to determine maxVCpus in RHV.
Eduardo, we need to confirm (or deny) that it's RHV's job to lower maxVCpus here:

-smp 24,maxcpus=240,sockets=10,cores=12,threads=2

because this config won't fit in the 8-bit APIC CPU ID (last CPU is 311).

Currently RHV uses the formula below to recalculate maxcpus (and lower it if required). From my point of view this formula is the problem of this BZ:

maxVCpus = cpuPerSocket * threadsPerCore * Math.min(maxSockets, maxVCpus / (cpuPerSocket * threadsPerCore));

Considering that the default config in RHV is:

MaxNumOfVmCpus: 240
MaxNumOfVmSockets: 16

Moving on, from the above qemu command line:

12*2*min(16, 240/(12*2)) = 24*10 = 240

I believe the formula is wrong; it should have resulted in a lower maxVCpus. Could you please share your thoughts?
The rules for calculating the required APIC ID size are based on the specification at:

https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/

The relevant section in that document is: "Sub ID Extraction Parameters for Initial APIC ID".

The main problem with your formula is that it can't calculate the actual field widths using simple multiplication and division alone. Unfortunately, it needs to round the core/thread counts to the nearest power of 2 at some point.

The limit you are hitting is this: the APIC ID of the last VCPU should be < 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

The formulas in the Intel document are:

CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width
CoreOnly_Mask_Width = Log2(1 + (CPUID.(EAX=4, ECX=0):EAX[31:26]))
SMT_Mask_Width = Log2(RoundToNearestPof2(CPUID.1:EBX[23:16]) / ((CPUID.(EAX=4, ECX=0):EAX[31:26]) + 1))

Note that the Intel formula tells you how to derive the field widths from CPUID data, not how QEMU actually chooses the field widths it exposes on CPUID. We choose the field widths using the nearest power of 2. The formulas we use in QEMU can be seen at include/hw/i386/topology.h, but the summary is:

last_apic_id = apic_id_for_cpu(last_pkg_id(), last_core_id(), last_smt_id());

last_pkg_id = last_core_index() / cores_per_socket;
last_core_id = last_core_index() % cores_per_socket;
last_smt_id = last_cpu_index() % threads_per_core;
last_core_index() = last_cpu_index() / threads_per_core;
last_cpu_index() = (max_cpus - 1);

apic_id_for_cpu(pkg_id, core_id, smt_id) =
    (pkg_id << pkg_offset()) | (core_id << core_offset()) | smt_id;

pkg_offset() = core_offset() + core_width();
core_offset() = smt_width();
core_width() = bitwidth_for_count(cores_per_socket);
smt_width() = bitwidth_for_count(threads_per_core);
bitwidth_for_count(c) = Log2(RoundToNearestPowerOf2(c));

(This is a manual translation from the original C code, so apologies in advance for any typos.)
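For what it's worth, that summary can be turned into a small runnable sketch (Python here purely for illustration; the authoritative logic is the C code in QEMU's include/hw/i386/topology.h). Plugging in this BZ's topology reproduces the 311 from the error message:

```python
def bitwidth_for_count(c):
    # Log2(RoundToNearestPowerOf2(c)): bits needed to hold IDs 0..c-1
    return (c - 1).bit_length()

def last_apic_id(max_cpus, cores_per_socket, threads_per_core):
    smt_width = bitwidth_for_count(threads_per_core)
    core_width = bitwidth_for_count(cores_per_socket)
    core_offset = smt_width
    pkg_offset = core_offset + core_width

    last_cpu_index = max_cpus - 1
    last_core_index = last_cpu_index // threads_per_core
    pkg_id = last_core_index // cores_per_socket
    core_id = last_core_index % cores_per_socket
    smt_id = last_cpu_index % threads_per_core
    return (pkg_id << pkg_offset) | (core_id << core_offset) | smt_id

# The BZ's config: -smp 24,maxcpus=240,sockets=10,cores=12,threads=2
# 12 cores round up to 16 (4 bits), 2 threads need 1 bit, so the
# package ID starts at bit 5: (9 << 5) | (11 << 1) | 1 = 311
print(last_apic_id(240, 12, 2))  # 311 - does not fit in 8 bits
```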
Thanks for the explanation! And the manual translation :) We should be able to update/correct our formula then.

Germano, the workaround would be to set the supported maximums to fit the constraints via engine-config. The relevant parameters are:

MaxNumOfVmCpus (240)
MaxNumOfCpuPerSocket (16)
MaxNumOfVmSockets (16)
MaxNumOfThreadsPerCpu (8)
(In reply to Eduardo Habkost from comment #13)

Thank you for a very nice summary! Just one question, to be sure I don't miss some detail:

> The limit you are hitting is this: the APIC ID of the last VCPU should be <
> 255, otherwise the APIC ID of the CPU won't fit in the 8-bit APIC ID.

Did you really mean "the last VCPU should be < 255", or actually "the last VCPU should be < 256"?
Assuming the maximum APIC ID is 255 and I performed the formula transformation correctly, the corrected formula should be:

MaxNumOfVmCpus = pow(2, 8 - (bitwidth(NumOfCores) + bitwidth(NumOfThreads))) * NumOfCores * NumOfThreads

It results in 192 for the reported numbers (12 cores, 8 threads) and in 256 for the maximum numbers (16 cores, 8 threads).
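Spelling that out with the numbers above (a quick sanity check in Python; bitwidth here means Log2 of the count rounded up to a power of 2, as in the QEMU summary):

```python
def bitwidth(c):
    # Log2(RoundUpToNearestPowerOf2(c))
    return (c - 1).bit_length()

def max_num_of_vm_cpus(cores, threads):
    # Remaining APIC ID bits go to the package (socket) field
    return 2 ** (8 - (bitwidth(cores) + bitwidth(threads))) * cores * threads

print(max_num_of_vm_cpus(12, 8))  # 192: 12 cores use 4 bits, 8 threads use 3
print(max_num_of_vm_cpus(16, 8))  # 256: the 8-bit APIC ID is fully utilized
```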
(In reply to Milan Zamazal from comment #15)
> Did you really mean "the last VCPU should be < 255", or actually "the last
> VCPU should be < 256"?

"It's complicated". :)

QEMU only enforces apic_id < 256, but you are still likely to have trouble if you have a CPU with APIC ID = 255 and your guest OS doesn't use x2apic. The problem here is that apic_id = 255 _can_ work, but it is likely to cause problems in some scenarios. I wouldn't allow it unless it was carefully tested.

So I guess the answer depends on the role of your component:

* If you are not implementing policy, but just a mechanism to prevent setups that would never work (in the current QEMU version) due to hard limits (as opposed to "may or may not work" or "not supported by Red Hat"), you can just check if apic_id < 256.

* If you want to implement policy to prevent the user from running unsupported setups, I recommend ensuring apic_id < 255.

Note that there is ongoing work to support more than 256 VCPUs, so whatever limit you choose to enforce, it is likely to change in the future. Sorry for the leaky abstraction. Maybe we could work with the libvirt folks to try to expose those limits through an API somehow.
I see, thank you for the clarification. I think we should prevent running unsupported setups; it's not a big problem, since we supported a maximum of 240 VCPUs until recently, and as explained above the overall limit is going to change in the future.

So the formula above must be amended a bit. The maximum apic_id of 255 (i.e. 256 VCPUs) is reachable only when the 8-bit capacity is fully utilized, which can happen only when all the particular values (threads, cores) are powers of 2 and the resulting maximum VCPU count is 256. Thus if the resulting value is 256, we must reduce it to 255. All other cases are safe.
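A sketch of the amended calculation under the assumptions above (Python for illustration only; the names are mine, not the engine's):

```python
def bitwidth(c):
    # Log2 of c rounded up to the nearest power of 2
    return (c - 1).bit_length()

def max_num_of_vm_cpus(cores, threads):
    limit = 2 ** (8 - (bitwidth(cores) + bitwidth(threads))) * cores * threads
    # 256 is reachable only when cores and threads are powers of 2;
    # cap at 255 so the last APIC ID stays < 255 (policy per comment #16)
    return min(limit, 255)

print(max_num_of_vm_cpus(12, 8))  # 192
print(max_num_of_vm_cpus(16, 8))  # 255 (capped from 256)
```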
Verified with:
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

Steps:
1. Run a VM with 1 socket, 6 cores and 2 threads per core
2. Run a VM with 240 CPUs; it should fail and give an explanation of why it failed

Results:
1. VM is up.
2. VM failed to run with the error message:

Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details: The host host_mixed_2 did not satisfy internal filter CPU because it does not have enough cores to run the VM. The host host_mixed_1 did not satisfy internal filter CPU because it does not have enough cores to run the VM.