Description of problem:
Icelake Intel CPUs are detected as Broadwell-IBRS.

Version-Release number of selected component (if applicable):
8.2 EUS

How reproducible:
Always

Steps to Reproduce:
1. Add new compute nodes with 10nm Icelake CPUs
2.
3.

Actual results:
Live migration fails with:
2022-10-27 12:28:10.620 7 ERROR oslo_messaging.rpc.server nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.

virsh capabilities:
<model>Broadwell-IBRS</model>
<vendor>Intel</vendor>

Expected results:
The CPU should be detected as Icelake.

Additional info:
What is the exact version of libvirt? The OSP release would help as well. Could you please attach the output of the "virsh domcapabilities" command? And ideally also the output of the script from https://gitlab.com/libvirt/libvirt/-/blob/master/tests/cputestdata/cpu-data.py run with no parameters (running it as a normal user is fine as long as it can use /dev/kvm)? In addition to Python, it only needs the "cpuid" package to be installed.
The provided domcapabilities outputs contain

  <mode name='host-model' supported='yes'>
    <model fallback='forbid'>Icelake-Server</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='avx512ifma'/>
    <feature policy='require' name='sha-ni'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='xsaves'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
    <feature policy='require' name='pschange-mc-no'/>
    <feature policy='require' name='tsx-ctrl'/>
    <feature policy='require' name='taa-no'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='disable' name='intel-pt'/>
  </mode>

and

  <mode name='host-model' supported='yes'>
    <model fallback='forbid'>Cascadelake-Server</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='xsaves'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
    <feature policy='require' name='pschange-mc-no'/>
    <feature policy='require' name='tsx-ctrl'/>
  </mode>

In other words, the CPU is not recognized as Broadwell-IBRS. The CPU model advertised in the capabilities XML is not used for anything libvirt does. The model from domain capabilities is what matters. For backward compatibility, the //host/cpu element in the capabilities XML cannot express missing features, so an older CPU model without the missing features is shown there instead. But as I said, this is not an issue.

Would you be so kind as to explain what actual issue you're seeing? The error message from Nova is not enough, as it doesn't contain any actions or data leading to this error. We'd need logs (debug logs, ideally) from both Nova and libvirt.

Anyway, I was informed that OSP still uses the wrong libvirt API for comparing CPUs, which is most likely the cause of this issue. Moving to OSP for further investigation. Please feel free to move the bug back in case a libvirt bug is identified.
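To make the point about domain capabilities concrete, here is a minimal sketch of pulling the host-model CPU definition out of "virsh domcapabilities" output. The sample XML is abbreviated from the Icelake output quoted above, and the helper name host_model_cpu is hypothetical, not part of libvirt or Nova.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modelled on the Icelake domcapabilities output quoted
# in this report; a real document contains many more elements and features.
SAMPLE_DOMCAPS = """
<domainCapabilities>
  <cpu>
    <mode name='host-passthrough' supported='yes'/>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Icelake-Server</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='ss'/>
      <feature policy='require' name='taa-no'/>
      <feature policy='disable' name='mpx'/>
      <feature policy='disable' name='intel-pt'/>
    </mode>
  </cpu>
</domainCapabilities>
"""


def host_model_cpu(domcaps_xml):
    """Return the host-model CPU as a dict, or None if it is not supported.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(domcaps_xml)
    mode = root.find("./cpu/mode[@name='host-model']")
    if mode is None or mode.get("supported") != "yes":
        return None
    # Group feature names by their policy attribute (require, disable, ...).
    features = {}
    for feat in mode.findall("feature"):
        features.setdefault(feat.get("policy"), []).append(feat.get("name"))
    return {
        "model": mode.findtext("model"),
        "vendor": mode.findtext("vendor"),
        "features": features,
    }


if __name__ == "__main__":
    print(host_model_cpu(SAMPLE_DOMCAPS))
```

Unlike //host/cpu in the capabilities XML, this definition can carry policy='disable' entries, which is why it can name the newer model.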
Customer can't live migrate between icelake / cascade lake because of this and this:
~~~
[root@overcloud-compute-1 /]# virsh capabilities
<capabilities>
  <host>
    <uuid>4c4c4544-0038-4810-8036-c4c04f475233</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>Broadwell-noTSX-IBRS</model> <======
      <vendor>Intel</vendor>
      <microcode version='218104675'/>
      <counter name='tsc' frequency='1995312000' scaling='yes'/>
      <topology sockets='1' dies='1' cores='32' threads='2'/>
      <feature name='vme'/>
      <feature name='ds'/>
      <feature name='acpi'/>
      <feature name='ss'/>
      <feature name='ht'/>
~~~
As I explained above, the CPU model in virsh capabilities is *irrelevant* for libvirt during migration. The error comes from the CPU compatibility check in Nova, which is wrong because it uses the wrong libvirt API to compare a guest CPU to a host. Not to mention that Nova has no reason to do the comparison itself: libvirt will properly check CPU compatibility during migration and fail appropriately. So this still looks like an OpenStack issue.
It's likely an operator configuration issue, not an OpenStack one. We would need to know how they configured the CPU model in the nova config file. Our default is host-model, which we effectively delegate to libvirt to implement. We have seen libvirt incorrectly enable AMD feature flags on Intel chips recently when we use host-model. I say libvirt enabled them because the feature elements were not in the domain we provided to libvirt; they were added by it.

So we need a few things, all of which would be included in an sos report.
1st, we need the nova debug log so we can see the domain we provided to libvirt when the VM was created.
2nd, we need the nova config file to know if the operator chose a specific CPU model or used host-model/host-passthrough.

Setting a custom model is more or less required to allow live migration bidirectionally if you have a mix of Icelake/Cascadelake in the deployment. Due to changes Intel made in microcode there is also the complication of TSX: if the VM was created with TSX enabled, even if it's unused, it won't be able to live migrate to a host with it disabled.

So before we can determine whether there is a nova issue here or not, we need more information about the VMs (the domain XML for the VM in question would help), the hosts, and how they are currently configured.
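As a rough illustration of the config check described above, the sketch below reads the CPU settings from a nova.conf-style file. It assumes the [libvirt] options cpu_mode and cpu_models as named in upstream Nova (older releases used a singular cpu_model), and it uses Python's configparser rather than oslo.config, so it may not handle every real nova.conf. The sample values and the function name libvirt_cpu_settings are hypothetical.

```python
import configparser

# Hypothetical nova.conf fragment pinning guests to a model that both
# Icelake and Cascadelake hosts should be able to provide.
SAMPLE_NOVA_CONF = """
[libvirt]
virt_type = kvm
cpu_mode = custom
cpu_models = Cascadelake-Server
"""


def libvirt_cpu_settings(conf_text):
    """Return (cpu_mode, [cpu_models]) from a nova.conf-style string.

    cpu_mode is None when the option is unset, in which case Nova falls
    back to its own default.
    """
    # strict=False tolerates repeated options; interpolation=None keeps
    # literal '%' characters from being treated as substitutions.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    mode = parser.get("libvirt", "cpu_mode", fallback=None)
    raw_models = parser.get("libvirt", "cpu_models", fallback="")
    models = [m.strip() for m in raw_models.split(",") if m.strip()]
    return mode, models


if __name__ == "__main__":
    print(libvirt_cpu_settings(SAMPLE_NOVA_CONF))
```

If the mode comes back as host-model or host-passthrough on a mixed Icelake/Cascadelake deployment, that alone could explain migrations failing in one direction.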
(In reply to David Sedgmen from comment #13)
> So all it is doing on start up is checking the nova config for the cpu model
> that is going to be selected for creating instances.
> Then comparing it to the capabilities.
>
> From what I can see on the bug is that `virsh capabilities` is returning the
> wrong model.

As I tried to explain, virsh capabilities does not and cannot provide the right model, for several reasons. Most importantly, the CPU which can be created on a given host depends on the QEMU binary, which the capabilities XML cannot express. Because of this, we added a host-model CPU definition to domain capabilities (virsh domcapabilities) to show what CPU model and features a specific QEMU binary can provide on the host. We also added new CPU-related APIs that use this CPU model for comparisons rather than the one from host capabilities. So instead of using compareCPU() you should be using compareHypervisorCPU().
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: Red Hat OpenStack Platform 16.2 (openstack-nova) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:1948
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days