Bug 2138381 - [OSP 16.2] Unacceptable CPU info: CPU doesn't have compatibility
Summary: [OSP 16.2] Unacceptable CPU info: CPU doesn't have compatibility
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.2 (Train)
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z5
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Kashyap Chamarthy
QA Contact: James Parker
URL:
Whiteboard:
Depends On:
Blocks: 2180872
Reported: 2022-10-28 13:35 UTC by David Hill
Modified: 2024-02-22 04:25 UTC (History)

Fixed In Version: openstack-nova-20.6.2-2.20230308185148.fc01371.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2180872 (view as bug list)
Environment:
Last Closed: 2023-04-26 12:18:09 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1978064 0 None None None 2022-10-28 13:36:32 UTC
OpenStack gerrit 869536 0 None ABANDONED Nova: Add workaround to mask mpx on compareCPU() 2023-05-30 12:31:43 UTC
OpenStack gerrit 869587 0 None NEW libvirt: Remove compareCPU() check in _check_cpu_compatibility() 2023-05-30 12:31:43 UTC
OpenStack gerrit 869950 0 None MERGED libvirt: Replace usage of compareCPU() with compareHypervisorCPU() 2023-05-30 12:31:42 UTC
OpenStack gerrit 870794 0 None MERGED libvirt: At start-up rework compareCPU() usage with a workaround 2023-05-30 12:31:41 UTC
Red Hat Issue Tracker OSP-20016 0 None None None 2022-11-09 10:08:33 UTC
Red Hat Knowledge Base (Solution) 6982430 0 None None None 2022-10-28 13:48:27 UTC
Red Hat Product Errata RHSA-2023:1948 0 None None None 2023-04-26 12:18:57 UTC

Internal Links: 2180872

Description David Hill 2022-10-28 13:35:53 UTC
Description of problem:
Intel Icelake CPUs are detected as Broadwell-IBRS.

Version-Release number of selected component (if applicable):
8.2 EUS

How reproducible:
always

Steps to Reproduce:
1. Add new compute nodes with 10nm Icelake CPUs

Actual results:

live migration fails with:
2022-10-27 12:28:10.620 7 ERROR oslo_messaging.rpc.server nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.

virsh capabilities:
      <model>Broadwell-IBRS</model>
      <vendor>Intel</vendor>

Expected results:
The CPU should be detected as Icelake.

Additional info:

Comment 1 Jiri Denemark 2022-10-31 10:19:46 UTC
What is the exact version of libvirt? OSP release would help as well.

Could you please attach the output of "virsh domcapabilities" command?

And ideally also the output of the script from
https://gitlab.com/libvirt/libvirt/-/blob/master/tests/cputestdata/cpu-data.py
run with no parameters (running it as normal user is fine as long as it can
use /dev/kvm)? In addition to python it only needs "cpuid" package to be
installed.

Comment 5 Jiri Denemark 2022-11-09 10:00:22 UTC
The provided domcapabilities output contains

    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Icelake-Server</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='ss'/>
      <feature policy='require' name='vmx'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='tsc_adjust'/>
      <feature policy='require' name='avx512ifma'/>
      <feature policy='require' name='sha-ni'/>
      <feature policy='require' name='md-clear'/>
      <feature policy='require' name='stibp'/>
      <feature policy='require' name='arch-capabilities'/>
      <feature policy='require' name='xsaves'/>
      <feature policy='require' name='invtsc'/>
      <feature policy='require' name='ibpb'/>
      <feature policy='require' name='amd-ssbd'/>
      <feature policy='require' name='rdctl-no'/>
      <feature policy='require' name='ibrs-all'/>
      <feature policy='require' name='skip-l1dfl-vmentry'/>
      <feature policy='require' name='mds-no'/>
      <feature policy='require' name='pschange-mc-no'/>
      <feature policy='require' name='tsx-ctrl'/>
      <feature policy='require' name='taa-no'/>
      <feature policy='disable' name='mpx'/>
      <feature policy='disable' name='intel-pt'/>
    </mode>

and

    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Cascadelake-Server</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='ss'/>
      <feature policy='require' name='vmx'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='tsc_adjust'/>
      <feature policy='require' name='umip'/>
      <feature policy='require' name='pku'/>
      <feature policy='require' name='md-clear'/>
      <feature policy='require' name='stibp'/>
      <feature policy='require' name='arch-capabilities'/>
      <feature policy='require' name='xsaves'/>
      <feature policy='require' name='invtsc'/>
      <feature policy='require' name='ibpb'/>
      <feature policy='require' name='amd-ssbd'/>
      <feature policy='require' name='rdctl-no'/>
      <feature policy='require' name='ibrs-all'/>
      <feature policy='require' name='skip-l1dfl-vmentry'/>
      <feature policy='require' name='mds-no'/>
      <feature policy='require' name='pschange-mc-no'/>
      <feature policy='require' name='tsx-ctrl'/>
    </mode>

In other words, the CPU is not recognized as Broadwell-IBRS. The CPU model
advertised in capabilities XML is not used for anything libvirt does. The
model from domain capabilities is what matters. The //host/cpu element in
capabilities XML cannot express missing features due to backward compatibility
so an older CPU model without the missing features will be shown there
instead. But as I said, this is not an issue.
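To illustrate which part of the domain capabilities matters, here is a minimal Python sketch that extracts the host-model CPU definition from `virsh domcapabilities` output. The function name `host_model_from_domcaps` and the trimmed sample XML are illustrative assumptions, not something taken from libvirt or Nova; only the element layout follows the excerpts above.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of `virsh domcapabilities` output, based on the excerpt
# above (most features omitted for brevity).
DOMCAPS_XML = """
<domainCapabilities>
  <cpu>
    <mode name='host-passthrough' supported='yes'/>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Icelake-Server</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='avx512ifma'/>
      <feature policy='require' name='sha-ni'/>
      <feature policy='disable' name='mpx'/>
      <feature policy='disable' name='intel-pt'/>
    </mode>
  </cpu>
</domainCapabilities>
"""


def host_model_from_domcaps(domcaps_xml):
    """Return the host-model CPU description, or None if it is unavailable.

    Hypothetical helper: it reads the <mode name='host-model'> element,
    not the //host/cpu element of the capabilities XML.
    """
    root = ET.fromstring(domcaps_xml)
    mode = root.find("./cpu/mode[@name='host-model']")
    if mode is None or mode.get("supported") != "yes":
        return None

    # Group feature names by their policy ('require', 'disable', ...).
    features = {}
    for feat in mode.findall("feature"):
        features.setdefault(feat.get("policy"), []).append(feat.get("name"))

    return {
        "model": mode.findtext("model"),
        "vendor": mode.findtext("vendor"),
        "features": features,
    }


if __name__ == "__main__":
    print(host_model_from_domcaps(DOMCAPS_XML))
```

On a real host the XML would come from `virsh domcapabilities` rather than from the embedded sample.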

Would you be so kind and explain what actual issue you're seeing? The error
message from Nova is not enough as it doesn't contain any actions or data
leading to this error. We'd need logs (debug logs, ideally) from both Nova and
libvirt.

Anyway, I was informed that OSP still uses the wrong libvirt API for comparing
CPUs, which is most likely the cause of this issue. Moving to OSP for further
investigation. Please, feel free to move the bug back in case a libvirt bug is
identified.

Comment 6 David Hill 2022-11-09 13:16:01 UTC
The customer can't live migrate between Icelake and Cascadelake hosts because of this:
~~~
[root@overcloud-compute-1 /]# virsh capabilities
<capabilities>

  <host>
    <uuid>4c4c4544-0038-4810-8036-c4c04f475233</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>Broadwell-noTSX-IBRS</model>                           <======
      <vendor>Intel</vendor>
      <microcode version='218104675'/>
      <counter name='tsc' frequency='1995312000' scaling='yes'/>
      <topology sockets='1' dies='1' cores='32' threads='2'/>
      <feature name='vme'/>
      <feature name='ds'/>
      <feature name='acpi'/>
      <feature name='ss'/>
      <feature name='ht'/>
~~~

Comment 7 Jiri Denemark 2022-11-09 13:38:52 UTC
As I explained above, the CPU model in virsh capabilities is *irrelevant* for
libvirt during migration.

The error comes from the CPU compatibility check in Nova, which is wrong as it
uses the wrong libvirt API to compare a guest CPU to a host. Not to mention
that Nova has no reason for doing the comparison itself; libvirt will properly
check CPU compatibility during migration and fail appropriately.

So this still looks like an OpenStack issue.

Comment 8 smooney 2022-11-09 14:04:35 UTC
It's likely an operator configuration issue, not an OpenStack one.

We would need to know how they configured the CPU model in the Nova config file.

Our default is host-model, which we effectively delegate to libvirt to implement.

We have seen libvirt incorrectly enable AMD feature flags on Intel chips recently when we use host-model.
I say libvirt enabled them because the feature elements were not in the domain we provided to libvirt; they were added by it.

So we need a few things, all of which would be included in an sos report.
1st, we need the Nova debug log so we can see the domain we provided to libvirt when the VM was created.
2nd, we need the Nova config file to know whether the operator chose a specific CPU model or used host-model/host-passthrough.
Setting a custom model is more or less required to allow bidirectional live migration if you have a mix of Icelake/Cascadelake in the deployment.

Due to changes Intel made in microcode, there is also the complication of TSX.

If the VM was created with TSX enabled, even if it's unused, it won't be able to live migrate to a host with it disabled.

So before we can determine whether there is a Nova issue here or not, we need more information about the VMs (the domain XML for the VM in question would help)
and about the hosts and how they are currently configured.
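
As a rough illustration of the second item, the following Python sketch reads the CPU settings from the `[libvirt]` section of a Nova config file. The helper name `libvirt_cpu_settings` and the sample config text are hypothetical; `cpu_mode` and `cpu_models` are the Nova options assumed to be relevant here, and the sample values do not come from the customer's deployment.

```python
import configparser

# Hypothetical nova.conf fragment, for illustration only.
SAMPLE_NOVA_CONF = """
[libvirt]
cpu_mode = custom
cpu_models = Cascadelake-Server
"""


def libvirt_cpu_settings(conf_text):
    """Return (cpu_mode, [cpu_models]) from nova.conf text.

    cpu_mode is None when the option is unset, in which case Nova
    applies its own default.
    """
    # Disable interpolation so '%' characters in other options are harmless,
    # and tolerate duplicate keys, which real config files sometimes contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    mode = parser.get("libvirt", "cpu_mode", fallback=None)
    raw_models = parser.get("libvirt", "cpu_models", fallback="")
    models = [m.strip() for m in raw_models.split(",") if m.strip()]
    return mode, models


if __name__ == "__main__":
    print(libvirt_cpu_settings(SAMPLE_NOVA_CONF))
```

A result of `custom` with an explicit model list would indicate that the operator pinned a CPU model; anything else means the guest CPU is derived from the host.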

Comment 22 Jiri Denemark 2022-11-28 12:07:41 UTC
(In reply to David Sedgmen from comment #13)
> So all it is doing on start up is checking the nova config for the cpu model
> that is going to be selected for creating instances. 
> Then comparing it to the capabilities. 
> 
> From what I can see on the bug is that `virsh capabilities` is returning the
> wrong model. 

As I tried to explain, virsh capabilities does not and cannot provide the
right model for several reasons. Most importantly the CPU which can be created
on a given host depends on the QEMU binary, which capabilities XML cannot
express. Because of this, we added host-model CPU definition to domain
capabilities (virsh domcapabilities) to show what CPU model and features a
specific QEMU binary can provide on the host. And we also added new CPU
related APIs to use this CPU model for comparisons rather than the one from
host capabilities. So instead of using compareCPU() you should be using
compareHypervisorCPU().
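
A minimal Python sketch of such a check is shown below. It assumes the libvirt-python connection object exposes `compareHypervisorCPU(emulator, arch, machine, virttype, xmlCPU, flags)` and that passing `None` for the optional arguments lets libvirt pick its defaults. The wrapper name `guest_cpu_fits_host` is hypothetical, and this is not the code that was merged into Nova.

```python
# Result codes mirrored from libvirt's virCPUCompareResult enum, so the
# sketch can be read and exercised without importing the libvirt module.
VIR_CPU_COMPARE_ERROR = -1
VIR_CPU_COMPARE_INCOMPATIBLE = 0
VIR_CPU_COMPARE_IDENTICAL = 1
VIR_CPU_COMPARE_SUPERSET = 2


def guest_cpu_fits_host(conn, cpu_xml, emulator=None, arch=None,
                        machine=None, virttype=None):
    """Return True if the hypervisor on this host can provide cpu_xml.

    `conn` is expected to be a libvirt connection (for example the result
    of libvirt.open("qemu:///system")). Unlike compareCPU(), this call
    compares against the CPU the hypervisor can actually provide rather
    than the host CPU reported in the capabilities XML.
    """
    result = conn.compareHypervisorCPU(
        emulator, arch, machine, virttype, cpu_xml, 0)
    # IDENTICAL or SUPERSET means the host side covers the requested CPU.
    return result in (VIR_CPU_COMPARE_IDENTICAL, VIR_CPU_COMPARE_SUPERSET)
```

The `cpu_xml` argument would be a `<cpu>` element such as the guest CPU definition from a domain XML.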

Comment 69 errata-xmlrpc 2023-04-26 12:18:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: Red Hat OpenStack Platform 16.2 (openstack-nova) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1948

Comment 75 Red Hat Bugzilla 2024-02-22 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

