Description of problem:
- When running a RHV VM with Q35/UEFI and more than 48 CPUs (the RHV host has 64 CPUs), there is no VM console (black screen and no TianoCore screen when booting) and the VM gets no IP address.
- When running the same VM with CPUs reduced to 48, the VM runs properly with console, network address etc.
- Running the VM with Q35/BIOS and 64 CPUs also works properly, with console and network address.
- This issue was also reproduced on a different setup with a different host; there it was encountered with more than 28 CPUs (RHV host with 48 CPUs).

QEMU bug related to this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2074149

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0-0.237.el8ev
vdsm-4.50.0.10-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Using a host with 64 CPUs (2 virtual sockets, 16 cores per socket and 2 threads per core), try to run a VM with 64 CPUs and the Q35 chipset with UEFI.

Actual results:
VM is running, but there is no boot screen (TianoCore), the console is black and there are no IP addresses.

Expected results:
VM should boot with console, network etc.

Additional info:
vdsm.log attached (VM with 48 CPUs run at 2022-04-11 10:51:42,151-0400, VM with 64 CPUs run at 2022-04-11 10:58:01,881-0400).
engine.log, libvirt/qemu.log and the VMs' domain XML attached.
VM names: rhel8_VM_48_CPUs and rhel8_VM_64_CPUs
Platform discussion happens in BZ 2074149. There is also a nice explanatory summary in https://bugzilla.redhat.com/show_bug.cgi?id=1469338#c30.

What we need to do in oVirt is:

- Not set the maximum number of vCPUs unnecessarily high. Let's make it at most some configurable multiple of the initial vCPU number (somewhat similar to how we limit maximum memory).
- Add

    <features>
      <smm state='on'>
        <tseg unit='MiB'>48</tseg>
      </smm>
    </features>

  to the domain XML when the maximum number of vCPUs exceeds a certain limit (e.g. 255?) and UEFI is used (see also the `smm' doc in https://libvirt.org/formatdomain.html#hypervisor-features).

We may hit a similar problem with large RAM. This is sort of guesswork, but hopefully it should cover most of the typical cases.
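For illustration, the two proposals above could be sketched roughly like this. This is hypothetical pseudologic, not actual oVirt code; the helper names, the 16x multiplier and the 255-vCPU threshold are assumptions taken from the wording of this comment, not decided values:

```python
# Hypothetical sketch of the two proposals above; not actual oVirt code.
# MAX_VCPU_MULTIPLIER and TSEG_VCPU_THRESHOLD are assumed example values.

MAX_VCPU_MULTIPLIER = 16   # configurable multiple of the initial vCPU count
TSEG_VCPU_THRESHOLD = 255  # add an explicit TSEG above this many max vCPUs
TSEG_MIB = 48              # example TSEG size from this comment

def max_vcpus(initial_vcpus: int, platform_limit: int) -> int:
    """Cap the hotpluggable maximum instead of always using the platform max."""
    return min(initial_vcpus * MAX_VCPU_MULTIPLIER, platform_limit)

def smm_features_xml(max_vcpus_count: int, uefi: bool) -> str:
    """Return the <features> fragment to add to the domain XML, or ''."""
    if uefi and max_vcpus_count > TSEG_VCPU_THRESHOLD:
        return ("<features><smm state='on'>"
                f"<tseg unit='MiB'>{TSEG_MIB}</tseg></smm></features>")
    return ""
```

With these example values, a VM started with 4 vCPUs would get a maximum of 64 rather than the platform limit, and the SMM/TSEG fragment would only be emitted for UEFI VMs whose maximum exceeds 255 vCPUs.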
*** Bug 2074149 has been marked as a duplicate of this bug. ***
It doesn't sound like too much of an overhead; why not just set it high enough to cover our maximums?
You mean TSEG? Let's not be nasty to small VMs. If I understand the libvirt documentation correctly, this value is taken from the RAM available to the guest, and there are other things that eat some memory from the guest memory here and there. There is no need to waste it unnecessarily for small VMs.
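The point about small VMs could be illustrated with a size-dependent choice rather than a fixed worst-case reservation. The thresholds and sizes below are purely illustrative assumptions, not QEMU's or libvirt's real sizing rules (those are discussed in the BZs linked above):

```python
# Illustrative only: pick a TSEG size that grows with the guest
# configuration instead of always reserving the worst-case amount.
# The thresholds and sizes here are assumptions, not real QEMU rules.
from typing import Optional

def tseg_mib(max_vcpus: int, max_mem_gib: int) -> Optional[int]:
    """Return an explicit TSEG size in MiB, or None to keep QEMU's default."""
    if max_vcpus <= 255 and max_mem_gib <= 1024:
        return None   # small/medium VM: the default TSEG is enough
    if max_vcpus <= 1024 and max_mem_gib <= 4096:
        return 48     # large VM: modest explicit reservation
    return 64         # very large VM
```

The idea is simply that small VMs keep the hypervisor default and lose nothing, while only large configurations pay for a bigger carve-out.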
Verified:
ovirt-engine-4.5.0.5-0.7.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64

Verification scenario:
1. Run the following VMs with 64 virtual CPUs (2 virtual sockets, 16 cores per virtual socket and 2 threads per core):
- RHEL8 VM Q35/SecureBoot
- RHEL8 VM Q35/UEFI
- RHEL8 VM Q35/BIOS
- RHEL9 VM Q35/SecureBoot
- RHEL9 VM Q35/UEFI
- RHEL9 VM Q35/BIOS
- RHEL8 VM I440FX/BIOS
- Windows VM Q35/SecureBoot
- Windows VM Q35/UEFI
- Windows VM Q35/BIOS
- Windows VM I440FX/BIOS
2. For each running VM, verify that the console shows the boot screen and, after boot, the VM OS; that the mouse and keyboard are functioning; and, inside the VM, that the correct CPU topology (64 CPUs, 2/16/2) is set.