Created attachment 1710796
libvirt error in the 'nova-compute.log' upon live migration

(NOTE: This issue affects OSP only; hence the "Downstream-Only" tag in
the bug title.)

Description of problem
----------------------

(Thanks: Lukas Bezdicka for first noticing it in his OSP-13 to OSP-16
upgrades environment.)

Migrating a Nova instance from a RHEL-7 host that reports the CPU
feature 'arch-facilities' in its host capabilities (as seen in the
output of `virsh capabilities`) to a RHEL-8 host fails with:

    libvirt.libvirtError: internal error: Unknown CPU feature arch-facilities

One of the reasons here is that the 'arch-facilities' CPU feature was
a RHEL7-only thing, which was later replaced by the differently named
'arch-capabilities'. (For reasons, see "Why was 'arch-facilities' CPU
feature RHEL7-only?")

(Fuller failure in attachment.)

Version
-------

RHEL-7
  - qemu-kvm-rhev-2.12.0-44.el7_8.1.x86_64
  - libvirt-daemon-kvm-4.5.0-33.el7.x86_64

RHEL-8
  - qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
  - libvirt-daemon-kvm-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64

How reproducible: Consistently

Steps to Reproduce
------------------

(Writing instructions broadly enough that QE can set it up; this requires
an OSP-13 and an OSP-16 environment.)

1. Have an OSP-13 Compute node (whether it be a VM or bare metal) that
   reports 'arch_capabilities' in /proc/cpuinfo. Ensure that the feature
   shows up in the output of `virsh capabilities | grep arch-facilities`.
   (Yes, libvirt reports it as "arch-facilities".)

2. Start an instance running on the above OSP-13 Compute node

3. Migrate the above instance to the OSP-16 environment

Actual results
--------------

Live migration fails with:

    [...]
    _compare_cpu /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8505
    2020-08-07 08:46:59.946 8 ERROR nova.virt.libvirt.driver [req-97ec675a-2190-43c7-9fb7-dcb9321ebce5 b7f72df5d01c44bba503ecd629831e86 2e3990b4d77b4d95a275d32b6d63c743 - default default] CPU doesn't have compatibility.
    internal error: Unknown CPU feature arch-facilities
    Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult: libvirt.libvirtError: internal error: Unknown CPU feature arch-facilities
    2020-08-07 08:46:59.996 8 ERROR oslo_messaging.rpc.server [req-97ec675a-2190-43c7-9fb7-dcb9321ebce5 b7f72df5d01c44bba503ecd629831e86 2e3990b4d77b4d95a275d32b6d63c743 - default default] Exception during message handling: nova.exception.MigrationPreCheckError: Migration pre-check error: CPU doesn't have compatibility.
    internal error: Unknown CPU feature arch-facilities
    [...]

Expected results
----------------

Live migration from OSP-13 (on a Compute node with 'arch-facilities') to
OSP-16 succeeds.
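For anyone setting up the reproducer, the `virsh capabilities | grep arch-facilities` check from step 1 can also be scripted. The following is only a minimal sketch: the function names and the sample XML are made up for illustration, and it assumes the feature appears as a <feature name='...'/> element under <host><cpu> in the capabilities XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down capabilities XML; a real host lists many
# more features and other elements.
SAMPLE_CAPS = """
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>Skylake-Client-IBRS</model>
      <feature name='ss'/>
      <feature name='arch-facilities'/>
    </cpu>
  </host>
</capabilities>
"""


def host_cpu_features(capabilities_xml):
    """Return the set of feature names listed under <host><cpu>."""
    root = ET.fromstring(capabilities_xml)
    return {f.get("name") for f in root.findall("./host/cpu/feature")}


def has_arch_facilities(capabilities_xml):
    """True if the host capabilities report the RHEL-7-only feature name."""
    return "arch-facilities" in host_cpu_features(capabilities_xml)


if __name__ == "__main__":
    print(has_arch_facilities(SAMPLE_CAPS))
```

On a real Compute node the XML would come from `virsh capabilities` (or `getCapabilities()` on a libvirt-python connection) instead of the sample string.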
Fix suggested by the libvirt folks
-----------------------------------

Just don't include the 'arch-facilities' CPU feature in the XML you pass
to libvirt's migration API, and let libvirt handle it internally.

- - -

Why was 'arch-facilities' CPU feature RHEL7-only?
-------------------------------------------------

The 'arch-facilities' feature was first pushed downstream in RHEL-7
libvirt as part of the Spectre/Meltdown fixes. Later, when the upstream
patches were made, this feature was not included, as it was not
necessary. However, about a year (or more) later, upstream finally
introduced this feature, but called it 'arch-capabilities'. And since
'arch-facilities' was a RHEL7-only name and it wasn't migratable anyway,
libvirt didn't bother adding compatibility hacks to RHEL8.

(The above is based on a chat with Jiri Denemark from libvirt; thank you!)

References
----------

https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=511df17aec
    cpu_map: Add support for arch-capabilities feature

https://bugzilla.redhat.com/show_bug.cgi?id=1658406
    mode="host-model" VMs include broken "arch-facilities" flag name [libvirt]
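To make the suggested workaround concrete, here is a rough sketch of dropping 'arch-facilities' from a CPU definition before handing it to libvirt. This is not the actual Nova patch; `strip_features`, `RHEL7_ONLY_FEATURES` and the sample XML are hypothetical names used only for illustration, and the sketch assumes the features are direct <feature> children of <cpu>.

```python
import xml.etree.ElementTree as ET

# Assumption: the only name that needs filtering is the RHEL-7-only one.
RHEL7_ONLY_FEATURES = frozenset({"arch-facilities"})

# Hypothetical host CPU definition, as it might be cut out of the
# capabilities XML.
SAMPLE_CPU = """
<cpu>
  <arch>x86_64</arch>
  <model>Skylake-Client-IBRS</model>
  <feature name='ss'/>
  <feature name='arch-facilities'/>
</cpu>
"""


def strip_features(cpu_xml, names=RHEL7_ONLY_FEATURES):
    """Return the <cpu> XML with the named <feature> children removed."""
    cpu = ET.fromstring(cpu_xml)
    # findall() returns a list, so removing while looping is safe.
    for feature in cpu.findall("feature"):
        if feature.get("name") in names:
            cpu.remove(feature)
    return ET.tostring(cpu, encoding="unicode")


if __name__ == "__main__":
    print(strip_features(SAMPLE_CPU))
```

The filtered XML is what would then be passed on to the CPU comparison; everything else in the definition is left untouched.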
(In reply to Kashyap Chamarthy from comment #0)
>
> Description of problem
> ----------------------
>
>
> (Thanks: Lukas Bezdicka for first noticing it in his OSP-13 to OSP-16
> upgrades environment.)
>
> Migrating a Nova instance from a RHEL-7 host that reports the CPU
> feature 'arch-facilities' in its host capabilities (as seen in the
> output of `virsh capabilities`) to a RHEL-8 host fails with:
>
>     libvirt.libvirtError: internal error: Unknown CPU feature arch-facilities
>
> One of the reasons here is that the 'arch-facilities' CPU feature was
> a RHEL7-only thing, which was later replaced by the differently named
> 'arch-capabilities'. (For reasons, see "Why was 'arch-facilities' CPU
> feature RHEL7-only?")
>
> (Fuller failure in attachment.)
>
>
> Version
> -------
>
> RHEL-7
>   - qemu-kvm-rhev-2.12.0-44.el7_8.1.x86_64
>   - libvirt-daemon-kvm-4.5.0-33.el7.x86_64
>
> RHEL-8
>   - qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
>   - libvirt-daemon-kvm-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
>
>
> How reproducible: Consistently
>
> Steps to Reproduce
> ------------------
>
> (Writing instructions broadly enough that QE can set it up; this requires
> an OSP-13 and an OSP-16 environment.)
>
> 1. Have an OSP-13 Compute node (whether it be a VM or bare metal) that
>    reports 'arch_capabilities' in /proc/cpuinfo. Ensure that the feature
>    shows up in the output of `virsh capabilities | grep arch-facilities`.
>    (Yes, libvirt reports it as "arch-facilities".)
>
> 2. Start an instance running on the above OSP-13 Compute node
>
> 3. Migrate the above instance to the OSP-16 environment

IMPORTANT CORRECTION on the OSP versions
-----------------------------------------

In the above environment, both source AND destination are running
OSP-16.1. The problem is caused by the RHEL version difference:

    Source: RHEL-7 (OSP-16)
    Dest  : RHEL-8 (OSP-16)

[...]
*** Bug 1867127 has been marked as a duplicate of this bug. ***
The arch-facilities feature is not enabled even for host-model. It would only
be enabled if the user explicitly asked for it, which is not the case here.
But even so, the domain would fail to migrate anyway because enabling the
feature breaks migration (the support for migrating VMs with arch-capabilities
enabled was added to QEMU not so long ago).

That said, libvirt RHEL-8 does not need any backward compatibility hacks for
arch-facilities as the source QEMU would refuse to migrate a VM with
arch-facilities anyway. And the migration would fail even if the target host
was RHEL-7.

Anyway, the issue here is not that a domain with arch-facilities enabled
cannot be migrated from RHEL-7 to RHEL-8. The problem we're facing here is
that Nova does not even try to migrate a domain with host-model CPU (and
arch-facilities disabled) even though libvirt would successfully migrate such
a domain. The only place where arch-facilities is visible is in virsh
capabilities.

The migration is not even attempted because the CPU comparison check done by
Nova before starting a migration fails. However, libvirt already checks CPU
compatibility during migration, so the OpenStack code is quite redundant.
If OpenStack really needs to mimic this compatibility check, it should do so
correctly. I can imagine such code being useful when selecting a suitable
migration target, but running it when migrating to a specific host is quite
pointless.

AFAIK, currently Nova takes the CPU definition from host capabilities and
passes it to virConnectCompareCPU on the destination host. This is not
actually checking whether a given domain can be migrated to the destination
host. It just checks compatibility of the two host CPUs, which is quite
different, because it may compare irrelevant features that are not enabled by
QEMU anyway (this is the case of arch-facilities) or it is not checking all
features because QEMU can enable some features even though the host does not
support them.
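To illustrate why comparing host CPUs is not the same as checking the guest, here is a small sketch that diffs two feature lists. The helper name and both lists are invented for illustration; real lists would come from the host capabilities XML and from the <cpu> element of the running domain's XML.

```python
def compare_feature_sets(host_features, guest_features):
    """Split feature names into those seen only on one side.

    host_only:  reported by the host but not enabled in the guest, so
                irrelevant for migrating that guest.
    guest_only: enabled in the guest but absent from the host list, so
                a host-vs-host comparison never looks at them.
    """
    host, guest = set(host_features), set(guest_features)
    return {
        "host_only": sorted(host - guest),
        "guest_only": sorted(guest - host),
    }


if __name__ == "__main__":
    # Illustrative values only: 'arch-facilities' shows up in the host
    # capabilities but is not enabled in the guest, while 'hypervisor'
    # is a guest-visible bit that a bare-metal host does not report.
    host = ["ss", "pcid", "arch-facilities"]
    guest = ["ss", "pcid", "hypervisor"]
    print(compare_feature_sets(host, guest))
```

With these made-up inputs, 'arch-facilities' lands in host_only, which is the kind of irrelevant feature that makes the pre-migration check fail here.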
So the correct behaviour would be either one of the following two options:

  1) get the CPU definition from the XML of the domain being migrated and
     pass it to virConnectCompareHypervisorCPU (notice the different API),
  2) do nothing and just let libvirt check the compatibility.

The correct fix (which should also go upstream) is to change the way Nova
checks CPU compatibility before migration. The suggested downstream change is
just a quick workaround for the currently broken behavior of Nova until it is
fixed properly.

Imagine we added a compatibility hack to libvirt in RHEL-8 that renamed
arch-facilities to arch-capabilities before processing the CPU definition.
Migration would still fail if the destination host did not support this
feature (which I would say is true for the majority of the CPUs currently in
use) even though the feature is not enabled in the VM.
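Option 1 could look roughly like the sketch below. It is not Nova code: the function names are hypothetical, `dest_conn` is assumed to be a libvirt-python connection to the destination host, and the call assumes the binding exposes compareHypervisorCPU(emulator, arch, machine, virttype, xmlCPU, flags), returning a virCPUCompareResult value where 0 means incompatible and positive values mean identical or superset.

```python
import xml.etree.ElementTree as ET


def guest_cpu_xml(domain_xml):
    """Extract the <cpu> element from a domain XML document as a string."""
    cpu = ET.fromstring(domain_xml).find("cpu")
    if cpu is None:
        raise ValueError("domain XML has no <cpu> element")
    return ET.tostring(cpu, encoding="unicode")


def guest_cpu_compatible(dest_conn, domain_xml):
    """Ask the destination hypervisor whether it can run this guest CPU.

    Passing None for emulator/arch/machine/virttype is assumed to let
    libvirt pick its defaults for the destination host.
    """
    result = dest_conn.compareHypervisorCPU(
        None, None, None, None, guest_cpu_xml(domain_xml), 0)
    # Assumption: VIR_CPU_COMPARE_INCOMPATIBLE == 0, and the identical
    # and superset results are positive.
    return result > 0


if __name__ == "__main__":
    # Hypothetical, trimmed-down XML of a running host-model domain.
    sample = """
    <domain type='kvm'>
      <name>instance-00000001</name>
      <cpu mode='custom' match='exact' check='full'>
        <model fallback='forbid'>Skylake-Client-IBRS</model>
        <feature policy='require' name='ss'/>
      </cpu>
    </domain>
    """
    print(guest_cpu_xml(sample))
```

The domain XML would normally be fetched with `XMLDesc()` on the source domain; since arch-facilities is not enabled in the guest, it never reaches the comparison.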
(In reply to Jiri Denemark from comment #9)
> The arch-facilities feature is not enabled even for host-model. It would only
> be enabled if the user explicitly asked for it, which is not the case here.
> But even so, the domain would fail to migrate anyway because enabling the
> feature breaks migration (the support for migrating VMs with
> arch-capabilities enabled was added to QEMU not so long ago).
>
> That said, libvirt RHEL-8 does not need any backward compatibility hacks for
> arch-facilities as the source QEMU would refuse to migrate a VM with
> arch-facilities anyway. And the migration would fail even if the target host
> was RHEL-7.
>
> Anyway, the issue here is not that a domain with arch-facilities enabled
> cannot be migrated from RHEL-7 to RHEL-8. The problem we're facing here is
> that Nova does not even try to migrate a domain with host-model CPU (and
> arch-facilities disabled) even though libvirt would successfully migrate such
> a domain. The only place where arch-facilities is visible is in virsh
> capabilities.
>
> The migration is not even attempted because the CPU comparison check done by
> Nova before starting a migration fails. However, libvirt already checks CPU
> compatibility during migration, so the OpenStack code is quite redundant.
> If OpenStack really needs to mimic this compatibility check, it should do so
> correctly. I can imagine such code being useful when selecting a suitable
> migration target, but running it when migrating to a specific host is quite
> pointless.
>
> AFAIK, currently Nova takes the CPU definition from host capabilities and
> passes it to virConnectCompareCPU on the destination host. This is not
> actually checking whether a given domain can be migrated to the destination
> host. It just checks compatibility of the two host CPUs, which is quite
> different, because it may compare irrelevant features that are not enabled by
> QEMU anyway (this is the case of arch-facilities) or it is not checking all
> features because QEMU can enable some features even though the host does not
> support them.
>
> So the correct behaviour would be either one of the following two options:
> 1) get the CPU definition from the XML of the domain being migrated and pass
> it to virConnectCompareHypervisorCPU (notice the different API),
> 2) do nothing and just let libvirt check the compatibility.

I've posted the following WIP for option #2:

WIP libvirt: Remove host CPU checks during check_can_live_migrate_destination
https://review.opendev.org/#/c/745431/

> The correct fix (which should also go upstream) is to change the way Nova
> checks CPU compatibility before migration. The suggested downstream change is
> just a quick workaround for the currently broken behavior of Nova until it is
> fixed properly.

Right, and to be clear, we only went with the workaround given some pressing
downstream deadlines. When the eventual full fix lands, the workaround will be
reverted and replaced.

> Imagine we added a compatibility hack to libvirt in RHEL-8 that renamed
> arch-facilities to arch-capabilities before processing the CPU definition.
> Migration would still fail if the destination host did not support this
> feature (which I would say is true for the majority of the CPUs currently in
> use) even though the feature is not enabled in the VM.

Understood, thanks for clearing that up.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (openstack-nova bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3572