Description of problem:

We installed an oVirt hosted engine evaluation setup with three CentOS 7.5 hypervisor hosts and the hosted engine VM deployed to a replicated Gluster volume with a brick on each hypervisor host. We updated the oVirt hosted engine VM with the oVirt 4.2.4.x packages. After updating the first hypervisor host "test-ovirt-1" with the oVirt 4.2.4.x packages as well, the host status switched to "Non Operational".

Version-Release number of selected component (if applicable):

On the oVirt engine VM:
ovirt-engine-4.2.4.5-1.el7.noarch

On hypervisor host test-ovirt-1:

# rpm -qa | grep -i -e ovirt -e vdsm | sort
cockpit-machines-ovirt-169-1.el7.noarch
cockpit-ovirt-dashboard-0.11.28-1.el7.noarch
ovirt-engine-appliance-4.2-20180626.1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-hosted-engine-ha-2.2.14-1.el7.noarch
ovirt-hosted-engine-setup-2.2.22.1-1.el7.noarch
ovirt-imageio-common-1.3.1.2-0.el7.centos.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.11-1.el7.noarch
ovirt-release42-4.2.4-1.el7.noarch
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
vdsm-4.20.32-1.el7.x86_64
vdsm-api-4.20.32-1.el7.noarch
vdsm-client-4.20.32-1.el7.noarch
vdsm-common-4.20.32-1.el7.noarch
vdsm-gluster-4.20.32-1.el7.x86_64
vdsm-hook-ethtool-options-4.20.32-1.el7.noarch
vdsm-hook-fcoe-4.20.32-1.el7.noarch
vdsm-hook-openstacknet-4.20.32-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.32-1.el7.noarch
vdsm-hook-vhostmd-4.20.32-1.el7.noarch
vdsm-hook-vmfex-dev-4.20.32-1.el7.noarch
vdsm-http-4.20.32-1.el7.noarch
vdsm-jsonrpc-4.20.32-1.el7.noarch
vdsm-network-4.20.32-1.el7.x86_64
vdsm-python-4.20.32-1.el7.noarch
vdsm-yajsonrpc-4.20.32-1.el7.noarch

On hypervisor host test-ovirt-2 before yum update:

# rpm -qa | grep -i -e ovirt -e vdsm | sort
cockpit-ovirt-dashboard-0.11.24-1.el7.centos.noarch
ovirt-engine-appliance-4.2-20180504.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-4.2.2-2.el7.centos.x86_64
ovirt-host-dependencies-4.2.2-2.el7.centos.x86_64
ovirt-host-deploy-1.7.3-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
ovirt-imageio-common-1.3.1.2-0.el7.centos.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.10-1.el7.centos.noarch
ovirt-release42-4.2.3.1-1.el7.noarch
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.6-2.el7.centos.x86_64
vdsm-4.20.27.1-1.el7.centos.x86_64
vdsm-api-4.20.27.1-1.el7.centos.noarch
vdsm-client-4.20.27.1-1.el7.centos.noarch
vdsm-common-4.20.27.1-1.el7.centos.noarch
vdsm-gluster-4.20.27.1-1.el7.centos.x86_64
vdsm-hook-ethtool-options-4.20.27.1-1.el7.centos.noarch
vdsm-hook-fcoe-4.20.27.1-1.el7.centos.noarch
vdsm-hook-openstacknet-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vfio-mdev-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vhostmd-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vmfex-dev-4.20.27.1-1.el7.centos.noarch
vdsm-http-4.20.27.1-1.el7.centos.noarch
vdsm-jsonrpc-4.20.27.1-1.el7.centos.noarch
vdsm-network-4.20.27.1-1.el7.centos.x86_64
vdsm-python-4.20.27.1-1.el7.centos.noarch
vdsm-yajsonrpc-4.20.27.1-1.el7.centos.noarch

How reproducible:

Once (did not repeat the setup and update procedure).

Steps to Reproduce:

1. Deploy an oVirt hosted engine setup with three hypervisor hosts using CPU "Intel(R) Xeon(R) CPU E3-1275 v6", based on packages:
   - ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
   - ovirt-engine-appliance-4.2-20180504.1.el7.centos.noarch
   - vdsm-gluster-4.20.27.1-1.el7.centos.x86_64
   The default cluster object created by the oVirt engine will have the CPU family set to "Intel Skylake Client IBRS Family".
2. Upgrade the engine and the first hypervisor host according to:
   https://www.ovirt.org/documentation/how-to/hosted-engine/#upgrade-hosted-engine
   https://www.ovirt.org/documentation/upgrade-guide/chap-Updates_between_Minor_Releases/
   2.a. Switching the first host to maintenance mode did not work while the cluster was in global maintenance mode. We had to disable global maintenance in order to switch host 1 to maintenance mode.
3. After applying the updates to host 1, the host switches to status "Non Operational" in the oVirt engine Admin UI, because it is reported as missing the CPU feature "SPEC_CTRL".

Actual results:

After updating the first hypervisor host "test-ovirt-1" with the oVirt 4.2.4.x packages as well, the host status switched to "Non Operational". In the detailed view of host test-ovirt-1, in the events section, double clicking the matching error event shows:

Host test-ovirt-1 moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl

Expected results:

The hypervisor host should meet all the CPU type requirements it met before updating the oVirt packages, and therefore should not become "Non Operational" after applying oVirt updates.
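The Non Operational transition in step 3 can be pictured as a set comparison between the features the cluster CPU type requires and the features the host reports to the engine. A minimal illustrative sketch, assuming a hypothetical `missing_features` helper and a simplified one-entry requirement set taken from the error event above (this is not oVirt's actual code):

```python
# Hypothetical sketch of the engine's cluster CPU level check: a host goes
# Non Operational when the cluster CPU type requires a feature the host does
# not report. Names are taken from this report, not from oVirt source code.

def missing_features(required, host_flags):
    """Return the required CPU features the host does not report."""
    return sorted(set(required) - set(host_flags))

# "Intel Skylake Client IBRS Family" additionally requires spec_ctrl on top
# of the base Skylake Client feature set (simplified to one entry here).
required_ibrs = {"spec_ctrl"}

# Relevant flags reported by the updated host according to the engine event.
host_flags = {"ibrs", "ibpb"}

print(missing_features(required_ibrs, host_flags))  # ['spec_ctrl']
```

This matches the event text "Missing CPU features : spec_ctrl": the host's kernel reports `ibrs`/`ibpb` flag names, while the cluster CPU type check looks for `spec_ctrl`.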
Additional info:

On host 1 after applying all updates and rebooting the host (since restarting vdsmd and the HA services left the host Non Operational):

# uname -a
Linux test-ovirt-1 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# dmesg -T | grep -i spec
[Tue Jun 26 19:44:15 2018] Speculative Store Bypass: Vulnerable
[Tue Jun 26 19:44:15 2018] FEATURE SPEC_CTRL Not Present
[Tue Jun 26 19:44:15 2018] Spectre V2 : Vulnerable: Retpoline without IBPB
[Tue Jun 26 19:44:17 2018] FEATURE SPEC_CTRL Present
[Tue Jun 26 19:44:17 2018] Spectre V2 : Mitigation: IBRS (kernel)

On host 2 before applying any updates:

# uname -a
Linux test-ovirt-2 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@test-ovirt-2 ~]# dmesg -T | grep -i spec
[Tue Jun 26 00:47:12 2018] Speculative Store Bypass: Vulnerable
[Tue Jun 26 00:47:12 2018] FEATURE SPEC_CTRL Not Present
[Tue Jun 26 00:47:12 2018] Spectre V2 : Vulnerable: Retpoline without IBPB
[Tue Jun 26 00:47:14 2018] FEATURE SPEC_CTRL Present
[Tue Jun 26 00:47:14 2018] Spectre V2 : Mitigation: IBRS (kernel)

lscpu on host 1 (first) after the updates and on host 2 (second) before the updates:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
Stepping:              9
CPU MHz:               4099.658
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7584.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
Stepping:              9
CPU MHz:               3972.326
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7584.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
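Comparing the two Flags lines shows that both hosts report the same CPU feature set, including `ibpb` and `ibrs` but no flag literally named `spec_ctrl`. A quick sketch, with the flag strings abbreviated to the relevant tail of the lscpu output above:

```python
# Compare the CPU flag sets of the two hosts from the lscpu output above.
# Only the tail of each Flags line is reproduced here; the full lines in the
# report are identical for both hosts.
host1_flags = set("xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts".split())
host2_flags = set("xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts".split())

print(host1_flags == host2_flags)   # True: identical feature sets on both hosts
print(sorted(f for f in host1_flags if f.startswith("ib")))  # ['ibpb', 'ibrs']
print("spec_ctrl" in host1_flags)   # False: this kernel names the flags ibrs/ibpb
```

So the updated and the not-yet-updated host expose identical hardware features; the difference that triggers "Non Operational" lies in how the updated oVirt/VDSM stack classifies them, not in the CPUs themselves.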
Changing the cluster CPU type from "Intel Skylake Client IBRS Family" to "Intel Skylake Client Family" allowed me to activate host 1 again.
Changing the cluster CPU type also prevented oVirt from switching the next host (test-ovirt-2) into maintenance mode, since it was unable to migrate the VMs running on test-ovirt-2 to any other host. The reason was that all running VMs were still using CPU type "Intel Skylake Client IBRS Family" and there was no longer any destination host with that CPU type available in the cluster. We had to shut down and start again all VMs that had been started before switching the cluster CPU type.
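The stuck VMs can be identified from their libvirt domain definitions (e.g. via `virsh dumpxml <vm>` on the host): VMs started under the old cluster CPU type keep the IBRS CPU model until they are powered off and started again. A minimal sketch parsing a sample `<cpu>` element; the XML snippet is a hand-written example following the libvirt domain XML format, not captured from this setup:

```python
import xml.etree.ElementTree as ET

# Hand-written sample of the <cpu> element libvirt keeps for a running VM.
# A VM started under the old cluster CPU type carries the IBRS model and so
# cannot migrate to a cluster that no longer offers that CPU type.
domain_xml = """
<domain type='kvm'>
  <name>example-vm</name>
  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>Skylake-Client-IBRS</model>
  </cpu>
</domain>
"""

root = ET.fromstring(domain_xml)
model = root.find("./cpu/model").text
print(model)  # Skylake-Client-IBRS -> needs shutdown + restart to pick up the new type
```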
This could happen with an early 4.2.4 (or late 4.2.3, IIRC) engine. Since you've used the appliance - is it possible it hadn't been updated to the current 4.2.4.x before adding the first host? If that's the case - indeed, just use the workaround you mentioned in comment #1.
Hi Michal, we did not update before adding the first host. As I tried to describe in the "Steps to Reproduce:" section, we set up an oVirt hosted engine environment with three hypervisor hosts and GlusterFS based on the package versions mentioned in that section. The engine was fully provisioned, all three hosts were deployed/integrated, managed GlusterFS volumes were created for data, ISO and export purposes, and virtual machines were provisioned and running before we applied the 4.2.4.x yum updates to the hosted engine VM. After updating the hosted engine VM we applied yum updates to the first hypervisor host in local maintenance mode, and that host was switched to "Non Operational" status by the engine as described in the ticket.
Yeah, that explains the behavior then. You wouldn't have hit that if you had updated the engine before adding the first host. We do not release updates to the appliance that often, relying on yum updates of a single baseline. It is not ideal, but we do not have the capacity to rebuild it completely all the time. I would consider it fixed now in the latest 4.2.4.x, if that's fine with you.
We did deploy the hosted engine VM before the 4.2.4.x updates were available from the oVirt 4.2 repo.

From my point of view, this change in host CPU detection/classification causes a service degradation from a seamless update procedure based on VM live migrations to a pretty disruptive update procedure requiring a change of cluster CPU type and a stop and restart of all VMs running within the cluster. This issue in the update procedure will hit every oVirt setup deployed before the 4.2.4.x updates were released.

Additionally, according to the provided dmesg output, the Skylake CPUs actually provide the feature SPEC_CTRL (after the microcode updates during the CentOS boot procedure), so the CPU is not missing this feature as claimed by the oVirt engine error message. It also seems that our CPUs never provided IBRS, neither before the host update nor afterwards. So the classification as an "IBRS" CPU type seems questionable as well.
(In reply to Linus from comment #6)
> We did deploy the hosted engine VM before the 4.2.4.x updates were
> available from the ovirt 4.2 repo.
>
> From my point of view, this change in host CPU detection/classification
> causes a service degradation from a seamless update procedure based on VM
> live migrations to a pretty disruptive update procedure requiring a change
> of cluster CPU type and a stop and restart of all VMs running within the
> cluster.
>
> This issue in the update procedure will hit every oVirt setup deployed
> before the 4.2.4.x updates were released.

Only those using IBRS CPUs, and it is fixed even for those by bug 1582483. The host should stay Operational if you use an up to date ovirt-engine version. Perhaps a mistake in the update procedure?

> Additionally, according to the provided dmesg output, the Skylake CPUs
> actually provide the feature SPEC_CTRL (after microcode updates during
> the CentOS boot procedure), so the CPU is not missing this feature as
> claimed by the oVirt engine error message.

It's not missing it; it's just a consequence of a change in how flags are reported on the oVirt side.

> Additionally, it seems that our CPUs did not provide IBRS ever, neither
> before the host update, nor afterwards. So the classification as an "IBRS"
> CPU type seems questionable as well.

They do, that's the meaning of the spec_ctrl flag. If you care about security, please use the ssbd types now; if not, you can as well change to the base type and avoid all the issues above.
(In reply to Michal Skrivanek from comment #7)
> (In reply to Linus from comment #6)
> > We did deploy the hosted engine VM before the 4.2.4.x updates were
> > available from the ovirt 4.2 repo.
> >
> > From my point of view, this change in host CPU detection/classification
> > causes a service degradation from a seamless update procedure based on VM
> > live migrations to a pretty disruptive update procedure requiring a change
> > of cluster CPU type and a stop and restart of all VMs running within the
> > cluster.
> >
> > This issue in the update procedure will hit every oVirt setup deployed
> > before the 4.2.4.x updates were released.
>
> Only those using IBRS CPUs, and it is fixed even for those by bug 1582483.
> The host should stay Operational if you use an up to date ovirt-engine
> version. Perhaps a mistake in the update procedure?

As far as I know, all current Intel CPUs are affected by all the Spectre-related bugs. So with "IBRS CPUs" you mean those Intel CPUs for which Intel provides a firmware update adding features that allow controlling the impact of the Spectre bugs? That would probably be all Intel servers bought within the last three to four years? :)

Bug 1582483 describes a workaround to allow scheduling of VMs requiring an IBRS CPU type on hosts that do not provide an IBRS type CPU according to the oVirt/libvirt reporting mechanisms. Should we have been able to find an erratum describing this workaround in the release notes of oVirt 4.2.4?

As I already mentioned, we installed the hypervisor hosts and the hosted engine VM before the oVirt 4.2.4.x updates were available from the oVirt 4.2 repo. The IBRS CPU type was set automatically. We updated the oVirt engine VM before updating the first oVirt host. Is there a way to update an existing setup to oVirt 4.2.4 without having the hypervisor host CPU type changed from IBRS to the base type?
> > Additionally, according to the provided dmesg output, the Skylake CPUs
> > actually provide the feature SPEC_CTRL (after microcode updates during
> > the CentOS boot procedure), so the CPU is not missing this feature as
> > claimed by the oVirt engine error message.
>
> It's not missing it; it's just a consequence of a change in how flags are
> reported on the oVirt side.
>
> > Additionally, it seems that our CPUs did not provide IBRS ever, neither
> > before the host update, nor afterwards. So the classification as an
> > "IBRS" CPU type seems questionable as well.
>
> They do, that's the meaning of the spec_ctrl flag. If you care about
> security, please use the ssbd types now; if not, you can as well change to
> the base type and avoid all the issues above.

Ok, I did some searching and found that "SPEC_CTRL" is the Linux kernel's label for the MSR-based toggles of the Spectre-related CPU features. And IBRS is a CPU mitigation feature that protects privileged code from any speculation influence resulting from user space code, right?