Bug 1867128 - [OSP-16] [Downstream-Only] Don't provide 'arch-facilities' CPU feature to migration XML, to avoid live migration breakage from EL7 to EL8
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Lee Yarwood
QA Contact: Archit Modi
URL:
Whiteboard:
Duplicates: 1867127
Depends On:
Blocks: 1900425
 
Reported: 2020-08-07 12:13 UTC by Kashyap Chamarthy
Modified: 2020-11-22 22:21 UTC (History)

Fixed In Version: openstack-nova-20.3.1-0.20200626213435.38ee1f3.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1900425
Environment:
Last Closed: 2020-08-27 15:21:40 UTC
Target Upstream Version:
Embargoed:


Attachments
libvirt error in the 'nova-compute.log' upon live migration (6.58 KB, text/plain)
2020-08-07 12:13 UTC, Kashyap Chamarthy


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2020:3572  0        None      None    None     2020-08-27 15:21:43 UTC

Description Kashyap Chamarthy 2020-08-07 12:13:40 UTC
Created attachment 1710796 [details]
libvirt error in the 'nova-compute.log' upon live migration

(NOTE: This issue affects OSP-only; hence the "Downstream-Only" tag in the bug
title.)

Description of problem
----------------------


(Thanks: Lukas Bezdicka for first noticing it in his OSP-13 to OSP-16
upgrades environment.)

Migrating a Nova instance from a RHEL-7 host that reports the CPU
feature 'arch-facilities' in its host capabilities (as seen in the
output of `virsh capabilities`) to a RHEL-8 host fails with:

    libvirt.libvirtError: internal error: Unknown CPU feature arch-facilities

One of the reasons is that the 'arch-facilities' CPU feature was
RHEL7-only; it was later replaced by the differently named
'arch-capabilities'.  (For reasons, see "Why was 'arch-facilities' CPU
feature RHEL7-only?" in comment 1.)

(Fuller failure in attachment.)


Version
-------

RHEL-7
  - qemu-kvm-rhev-2.12.0-44.el7_8.1.x86_64
  - libvirt-daemon-kvm-4.5.0-33.el7.x86_64

RHEL-8
  - qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
  - libvirt-daemon-kvm-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64


How reproducible: Consistently

Steps to Reproduce
------------------

(Writing instructions broadly enough that QE can set it up; this requires
an OSP-13 and an OSP-16 environment.)

1. Have an OSP-13 Compute node (whether a VM or bare metal) that reports
   'arch_capabilities' in /proc/cpuinfo.  Ensure that the feature shows up
   in the output of `virsh capabilities | grep arch-facilities`.  (Yes,
   libvirt reports it as "arch-facilities".)

2. Start an instance running on the above OSP-13 Compute node

3. Migrate the above instance to OSP-16 environment
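
The check in step 1 can also be scripted.  The following is only a rough
sketch: the function name and the sample XML are made up for illustration
(a real host would supply the output of `virsh capabilities`), and it
assumes the usual <capabilities>/<host>/<cpu>/<feature> layout.

```python
import xml.etree.ElementTree as ET


def host_cpu_features(capabilities_xml: str) -> set:
    """Return the feature names listed under <host><cpu> in capabilities XML."""
    root = ET.fromstring(capabilities_xml)
    return {f.get("name") for f in root.findall("./host/cpu/feature")}


# Hand-written sample, NOT captured from a real host.
SAMPLE_CAPS = """
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>Skylake-Server-IBRS</model>
      <feature name='arch-facilities'/>
      <feature name='pdpe1gb'/>
    </cpu>
  </host>
</capabilities>
"""

if __name__ == "__main__":
    # A source host is affected if this prints True.
    print("arch-facilities" in host_cpu_features(SAMPLE_CAPS))
```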


Actual results
--------------

Live migration fails with:

[...]
 _compare_cpu /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8505
2020-08-07 08:46:59.946 8 ERROR nova.virt.libvirt.driver [req-97ec675a-2190-43c7-9fb7-dcb9321ebce5 b7f72df5d01c44bba503ecd629831e86 2e3990b4d77b4d95a275d32b6d63c743 - default default] CPU doesn't have compatibility.

internal error: Unknown CPU feature arch-facilities

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult: libvirt.libvirtError: internal error: Unknown CPU feature arch-facilities
2020-08-07 08:46:59.996 8 ERROR oslo_messaging.rpc.server [req-97ec675a-2190-43c7-9fb7-dcb9321ebce5 b7f72df5d01c44bba503ecd629831e86 2e3990b4d77b4d95a275d32b6d63c743 - default default] Exception during message handling: nova.exception.MigrationPreCheckError: Migration pre-check error: CPU doesn't have compatibility.

internal error: Unknown CPU feature arch-facilities
[...]


Expected results
----------------

Live migration from OSP-13 (on a Compute node with 'arch-facilities') to 
OSP-16 succeeds.

Comment 1 Kashyap Chamarthy 2020-08-07 12:15:52 UTC
Fix suggested by the libvirt folks
-----------------------------------

Just don't send the 'arch-facilities' CPU feature to the XML you pass to
libvirt's migration API; and let libvirt handle it internally.
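
To make the idea concrete, here is a minimal sketch of dropping the
feature from a CPU XML fragment before it is handed to libvirt.  This is
not the actual downstream patch; the function name, the constant, and the
sample XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Features that only RHEL-7 libvirt knows about (per this bug report).
RHEL7_ONLY_FEATURES = frozenset({"arch-facilities"})


def strip_features(cpu_xml: str, names=RHEL7_ONLY_FEATURES) -> str:
    """Return cpu_xml with every <feature name=...> listed in names removed."""
    cpu = ET.fromstring(cpu_xml)
    # Copy the list first so removal does not disturb the iteration.
    for feature in list(cpu.findall("feature")):
        if feature.get("name") in names:
            cpu.remove(feature)
    return ET.tostring(cpu, encoding="unicode")


# Hand-written sample, NOT taken from a real host.
SAMPLE_CPU = """<cpu>
  <arch>x86_64</arch>
  <model>Skylake-Server-IBRS</model>
  <feature name='arch-facilities'/>
  <feature name='pdpe1gb'/>
</cpu>"""

if __name__ == "__main__":
    print(strip_features(SAMPLE_CPU))
```

All other features and the CPU model are left untouched, so the rest of
the definition still reaches libvirt as before.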

                - - -

Why was 'arch-facilities' CPU feature RHEL7-only?
------------------------------------------------

The 'arch-facilities' feature was first pushed downstream in RHEL-7
libvirt as part of Spectre/Meltdown fixes.  And later when upstream
patches were made, this feature was not included, as it was not
necessary.

However, about a year (or more) later upstream finally introduced this
feature, but called it 'arch-capabilities'.

And since 'arch-facilities' was a RHEL7-only name and it wasn't
migratable anyway, libvirt didn't bother adding compatibility hacks to
RHEL8.

(The above is based on a chat with Jiri Denemark — thank you! — from
libvirt.)


References
----------

    https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=511df17aec — 
    cpu_map: Add support for arch-capabilities feature

    https://bugzilla.redhat.com/show_bug.cgi?id=1658406 — 
    mode="host-model" VMs include broken "arch-facilities" flag name
    [libvirt]

Comment 2 Kashyap Chamarthy 2020-08-07 12:35:36 UTC
(In reply to Kashyap Chamarthy from comment #0)

> Steps to Reproduce
> ------------------
> 
> (Writing instructions broadly enough that QE can set it up; this requires
> an OSP-13 and an OSP-16 environment.)
> 
> 1. Have an OSP-13 Compute node (whether a VM or bare metal) that reports
>    'arch_capabilities' in /proc/cpuinfo.  Ensure that the feature shows up
>    in the output of `virsh capabilities | grep arch-facilities`.  (Yes,
>    libvirt reports it as "arch-facilities".)
> 
> 2. Start an instance running on the above OSP-13 Compute node
> 
> 3. Migrate the above instance to OSP-16 environment


IMPORTANT CORRECTION on the OSP versions
-----------------------------------------

In the above environment, both source AND destination are running OSP-16.1.

The problem is caused by RHEL version difference:

Source: RHEL-7 (OSP-16)
Dest  : RHEL-8 (OSP-16)


[...]

Comment 6 Jesse Pretorius 2020-08-07 12:49:10 UTC
*** Bug 1867127 has been marked as a duplicate of this bug. ***

Comment 9 Jiri Denemark 2020-08-07 13:48:49 UTC
The arch-facilities feature is not enabled even for host-model. It would only
be enabled if the user explicitly asked for it. Which is not the case here.
But even so, the domain would fail to migrate anyway because enabling the
feature breaks migration (the support for migrating VMs with arch-capabilities
enabled was added to QEMU not so long ago).

That said, libvirt RHEL-8 does not need any backward compatibility hacks for
arch-facilities as the source QEMU would refuse to migrate a VM with
arch-facilities anyway. And the migration would fail even if the target host
was RHEL-7.

Anyway, the issue here is not that a domain with arch-facilities enabled
cannot be migrated from RHEL-7 to RHEL-8. The problem we're facing here is
that Nova does not even try to migrate a domain with host-model CPU (and
arch-facilities disabled) even though libvirt would successfully migrate such
domain. The only place where arch-facilities is visible is in virsh
capabilities.

The migration is not even attempted because CPU comparison check done by Nova
before starting a migration fails. However, libvirt already checks CPU
compatibility during migration so the Openstack code is quite redundant.
If Openstack really needs to mimic this compatibility check, it should do so
correctly. I can imagine such code being useful when selecting a suitable
migration target, but running it when migrating to a specific host is quite
pointless.

AFAIK, currently Nova takes the CPU definition from host capabilities and
passes it to virConnectCompareCPU on the destination host. This is not
actually checking whether a given domain can be migrated to the destination
host. It just checks compatibility of the two host CPUs, which is quite
different, because it may compare irrelevant features that are not enabled by
QEMU anyway (this is the case of arch-facilities) or it is not checking all
features because QEMU can enable some features even though the host does not
support them.
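
A toy illustration of the difference, using plain sets.  The feature
sets below are invented for the example and the real libvirt comparison
is far more involved (CPU models, policies, and so on); the only point is
that a host-vs-host check can trip over a feature the guest never uses.

```python
def unknown_to_destination(required: set, dest_known: set) -> set:
    """Features in `required` that the destination does not recognise."""
    return required - dest_known


# Hypothetical data, loosely modelled on this bug.
source_host_features = {"pdpe1gb", "spec-ctrl", "arch-facilities"}
guest_enabled_features = {"pdpe1gb", "spec-ctrl"}  # arch-facilities is off
dest_known_features = {"pdpe1gb", "spec-ctrl", "arch-capabilities"}

if __name__ == "__main__":
    # Host-vs-host: rejected because of a feature the guest does not have.
    print(unknown_to_destination(source_host_features, dest_known_features))
    # Guest-vs-destination: nothing is missing, migration could proceed.
    print(unknown_to_destination(guest_enabled_features, dest_known_features))
```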

So the correct behaviour would be either one of the following two options:
1) get the CPU definition from the XML of the domain being migrated and pass
   it to virConnectHypervisorCPU (notice the different API),
2) do nothing and just let libvirt check the compatibility.
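
For option 1, the first half (pulling the guest CPU definition out of
the domain XML) could look roughly like the sketch below.  The function
name and the sample XML are assumptions for illustration; in practice the
XML would come from libvirt itself (for example via the domain's
XMLDesc() in the Python bindings), and the resulting <cpu> element is
what would be handed to the comparison API on the destination.

```python
import xml.etree.ElementTree as ET


def guest_cpu_xml(domain_xml: str) -> str:
    """Return the <cpu> element of a domain XML document as a string."""
    cpu = ET.fromstring(domain_xml).find("cpu")
    if cpu is None:
        raise ValueError("domain XML has no <cpu> element")
    # strip() drops the whitespace that followed the element in the source.
    return ET.tostring(cpu, encoding="unicode").strip()


# Hand-written sample, NOT dumped from a real instance.
SAMPLE_DOMAIN = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>Skylake-Server-IBRS</model>
    <feature policy='require' name='pdpe1gb'/>
  </cpu>
</domain>
"""

if __name__ == "__main__":
    print(guest_cpu_xml(SAMPLE_DOMAIN))
```

Because the definition comes from the guest rather than from the host
capabilities, a host-only feature such as arch-facilities never enters
the comparison.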

The correct fix (which should also go upstream) is to change the way Nova
checks CPU compatibility before migration. The suggested downstream change is
just a quick workaround for the currently broken behavior of Nova until it is
fixed properly.

Imagine we would add a compatibility hack to libvirt in RHEL-8 and renamed
arch-facilities to arch-capabilities before processing the CPU definition.
Migration would still fail in case the destination host would not support this
feature (which I would say is true for the majority of the CPUs currently in
use) even though the feature is not enabled in the VM.

Comment 10 Lee Yarwood 2020-08-07 23:14:46 UTC
(In reply to Jiri Denemark from comment #9)
> The arch-facilities feature is not enabled even for host-model. It would only
> be enabled if the user explicitly asked for it. Which is not the case here.
> But even so, the domain would fail to migrate anyway because enabling the
> feature breaks migration (the support for migrating VMs with
> arch-capabilities
> enabled was added to QEMU not so long ago).
> 
> That said, libvirt RHEL-8 does not need any backward compatibility hacks for
> arch-facilities as the source QEMU would refuse to migrate a VM with
> arch-facilities anyway. And the migration would fail even if the target host
> was RHEL-7.
> 
> Anyway, the issue here is not that a domain with arch-facilities enabled
> cannot be migrated from RHEL-7 to RHEL-8. The problem we're facing here is
> that Nova does not even try to migrate a domain with host-model CPU (and
> arch-facilities disabled) even though libvirt would successfully migrate such
> domain. The only place where arch-facilities is visible is in virsh
> capabilities.
> 
> The migration is not even attempted because CPU comparison check done by Nova
> before starting a migration fails. However, libvirt already checks CPU
> compatibility during migration so the Openstack code is quite redundant.
> If Openstack really needs to mimic this compatibility check, it should do so
> correctly. I can imagine such code being useful when selecting a suitable
> migration target, but running it when migrating to a specific host is quite
> pointless.
> 
> AFAIK, currently Nova takes the CPU definition from host capabilities and
> passes it to virConnectCompareCPU on the destination host. This is not
> actually checking whether a given domain can be migrated to the destination
> host. It just checks compatibility of the two host CPUs, which is quite
> different, because it may compare irrelevant features that are not enabled by
> QEMU anyway (this is the case of arch-facilities) or it is not checking all
> features because QEMU can enable some features even though the host does not
> support them.
> 
> So the correct behaviour would be either one of the following two options:
> 1) get the CPU definition from the XML of the domain being migrated and pass
>    it to virConnectHypervisorCPU (notice the different API),
> 2) do nothing and just let libvirt check the compatibility.

I've posted the following WIP for option #2 below:

WIP libvirt: Remove host CPU checks during check_can_live_migrate_destination
https://review.opendev.org/#/c/745431/

> The correct fix (which should also go upstream) is to change the way Nova
> checks CPU compatibility before migration. The suggested downstream change is
> just a quick workaround for the currently broken behavior of Nova until it is
> fixed properly.

Right, and to be clear, we only went with the workaround given some pressing downstream deadlines.

When the eventual full fix lands the workaround will end up being reverted and replaced.

> Imagine we would add a compatibility hack to libvirt in RHEL-8 and renamed
> arch-facilities to arch-capabilities before processing the CPU definition.
> Migration would still fail in case the destination host would not support
> this
> feature (which I would say is true for the majority of the CPUs currently in
> use) even though the feature is not enabled in the VM.

Understood, thanks for clearing that up.

Comment 18 errata-xmlrpc 2020-08-27 15:21:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openstack-nova bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3572

