Bug 1595378 - hypervisor host non operational after yum update due to missing CPU feature SPEC_CTRL
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@ovirt.org
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-26 19:05 UTC by Linus
Modified: 2018-11-28 22:13 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-28 22:13:18 UTC
oVirt Team: Virt
Embargoed:



Description Linus 2018-06-26 19:05:46 UTC
Description of problem:

We installed an oVirt hosted-engine evaluation setup with three CentOS 7.5 hypervisor hosts and the hosted engine VM deployed to a replicated Gluster volume with a brick on each hypervisor host. We updated the oVirt hosted engine VM with the oVirt 4.2.4.x packages. After updating the first hypervisor host "test-ovirt-1" with the oVirt 4.2.4.x packages as well, the host status switched to "Non Operational".

Version-Release number of selected component (if applicable):

On the oVirt engine VM:
ovirt-engine-4.2.4.5-1.el7.noarch

On hypervisor host test-ovirt-1:
# rpm -qa | grep -i -e ovirt -e vdsm | sort
cockpit-machines-ovirt-169-1.el7.noarch
cockpit-ovirt-dashboard-0.11.28-1.el7.noarch
ovirt-engine-appliance-4.2-20180626.1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-hosted-engine-ha-2.2.14-1.el7.noarch
ovirt-hosted-engine-setup-2.2.22.1-1.el7.noarch
ovirt-imageio-common-1.3.1.2-0.el7.centos.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.11-1.el7.noarch
ovirt-release42-4.2.4-1.el7.noarch
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
vdsm-4.20.32-1.el7.x86_64
vdsm-api-4.20.32-1.el7.noarch
vdsm-client-4.20.32-1.el7.noarch
vdsm-common-4.20.32-1.el7.noarch
vdsm-gluster-4.20.32-1.el7.x86_64
vdsm-hook-ethtool-options-4.20.32-1.el7.noarch
vdsm-hook-fcoe-4.20.32-1.el7.noarch
vdsm-hook-openstacknet-4.20.32-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.32-1.el7.noarch
vdsm-hook-vhostmd-4.20.32-1.el7.noarch
vdsm-hook-vmfex-dev-4.20.32-1.el7.noarch
vdsm-http-4.20.32-1.el7.noarch
vdsm-jsonrpc-4.20.32-1.el7.noarch
vdsm-network-4.20.32-1.el7.x86_64
vdsm-python-4.20.32-1.el7.noarch
vdsm-yajsonrpc-4.20.32-1.el7.noarch

On the hypervisor host test-ovirt-2 before yum update:
# rpm -qa | grep -i -e ovirt -e vdsm | sort
cockpit-ovirt-dashboard-0.11.24-1.el7.centos.noarch
ovirt-engine-appliance-4.2-20180504.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-4.2.2-2.el7.centos.x86_64
ovirt-host-dependencies-4.2.2-2.el7.centos.x86_64
ovirt-host-deploy-1.7.3-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
ovirt-imageio-common-1.3.1.2-0.el7.centos.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.10-1.el7.centos.noarch
ovirt-release42-4.2.3.1-1.el7.noarch
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.6-2.el7.centos.x86_64
vdsm-4.20.27.1-1.el7.centos.x86_64
vdsm-api-4.20.27.1-1.el7.centos.noarch
vdsm-client-4.20.27.1-1.el7.centos.noarch
vdsm-common-4.20.27.1-1.el7.centos.noarch
vdsm-gluster-4.20.27.1-1.el7.centos.x86_64
vdsm-hook-ethtool-options-4.20.27.1-1.el7.centos.noarch
vdsm-hook-fcoe-4.20.27.1-1.el7.centos.noarch
vdsm-hook-openstacknet-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vfio-mdev-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vhostmd-4.20.27.1-1.el7.centos.noarch
vdsm-hook-vmfex-dev-4.20.27.1-1.el7.centos.noarch
vdsm-http-4.20.27.1-1.el7.centos.noarch
vdsm-jsonrpc-4.20.27.1-1.el7.centos.noarch
vdsm-network-4.20.27.1-1.el7.centos.x86_64
vdsm-python-4.20.27.1-1.el7.centos.noarch
vdsm-yajsonrpc-4.20.27.1-1.el7.centos.noarch


How reproducible:
Once (did not repeat the setup and update procedure).


Steps to Reproduce:

1. Deploy an oVirt hosted-engine setup with three hypervisor hosts using the CPU "Intel(R) Xeon(R) CPU E3-1275 v6", based on these packages:
- ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
- ovirt-engine-appliance-4.2-20180504.1.el7.centos.noarch
- vdsm-gluster-4.20.27.1-1.el7.centos.x86_64
The default cluster object created by the oVirt engine will have its CPU family set to "Intel Skylake Client IBRS Family".
2. Upgrade the engine and first hypervisor host according to:
https://www.ovirt.org/documentation/how-to/hosted-engine/#upgrade-hosted-engine
https://www.ovirt.org/documentation/upgrade-guide/chap-Updates_between_Minor_Releases/
2.a. Switching the first host to maintenance mode did not work while the cluster was in "global maintenance mode"; we had to disable global maintenance in order to switch host 1 to maintenance mode.
3. After applying the updates to host 1, the host will switch to status "Non Operational" in the oVirt engine Admin UI, because the engine reports it as missing the CPU feature "SPEC_CTRL" (see the check below).
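To verify which CPU flags the host actually reports to the engine, vdsm can be queried on the host itself. A minimal check, assuming the vdsm-client package is installed (the exact capability key and output format may vary between vdsm versions):

# vdsm-client Host getCapabilities | tr ',' '\n' | grep -i -e spec_ctrl -e ibrs

If spec_ctrl is absent from the reported cpuFlags, the engine considers the host incompatible with the "IBRS" cluster CPU types.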

Actual results:

After updating the first hypervisor host "test-ovirt-1" with the ovirt 4.2.4.x packages as well, the host status switched to "Non Operational".
In the detailed view of host test-ovirt-1, in the events section, double-clicking the matching error event shows:

Host test-ovirt-1 moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl


Expected results:

The hypervisor host should still meet all the CPU type requirements it met before the oVirt package update and should therefore not become "Non Operational" after applying oVirt updates.


Additional info:

On host 1, after applying all updates and rebooting the host (since restarting vdsmd and the HA services left the host non-operational):
# uname -a
Linux test-ovirt-1 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# dmesg -T | grep -i spec
[Tue Jun 26 19:44:15 2018] Speculative Store Bypass: Vulnerable
[Tue Jun 26 19:44:15 2018] FEATURE SPEC_CTRL Not Present
[Tue Jun 26 19:44:15 2018] Spectre V2 : Vulnerable: Retpoline without IBPB
[Tue Jun 26 19:44:17 2018] FEATURE SPEC_CTRL Present
[Tue Jun 26 19:44:17 2018] Spectre V2 : Mitigation: IBRS (kernel)

On host 2, before applying any updates:
# uname -a
Linux test-ovirt-2 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@test-ovirt-2 ~]# dmesg -T | grep -i spec
[Tue Jun 26 00:47:12 2018] Speculative Store Bypass: Vulnerable
[Tue Jun 26 00:47:12 2018] FEATURE SPEC_CTRL Not Present
[Tue Jun 26 00:47:12 2018] Spectre V2 : Vulnerable: Retpoline without IBPB
[Tue Jun 26 00:47:14 2018] FEATURE SPEC_CTRL Present
[Tue Jun 26 00:47:14 2018] Spectre V2 : Mitigation: IBRS (kernel)
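
The flag and mitigation state can also be cross-checked after boot without grepping dmesg (the sysfs vulnerabilities directory is present on this 3.10.0-862 kernel; note that the RHEL 7 backport exposes ibpb/ibrs rather than a spec_ctrl flag in /proc/cpuinfo):

# grep -owE 'ibrs|ibpb|spec_ctrl' /proc/cpuinfo | sort -u
ibpb
ibrs
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: IBRS (kernel)

The first result matches the lscpu flags below; the second matches the "Spectre V2 : Mitigation: IBRS (kernel)" line in the dmesg output above.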

lscpu on host 1 (first listing, after the updates) and on host 2 (second listing, before the updates):

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
Stepping:              9
CPU MHz:               4099.658
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7584.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp



# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
Stepping:              9
CPU MHz:               3972.326
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7584.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 ibpb ibrs dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Comment 1 Linus 2018-06-26 19:08:46 UTC
Changing the cluster CPU type from "Intel Skylake Client IBRS Family" to "Intel Skylake Client Family" allowed me to activate host 1 again.
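
For reference, the same workaround can be applied through the engine's v4 REST API instead of the Admin UI. A sketch, with a hypothetical engine FQDN and cluster id, and the admin password left as a placeholder:

# curl -s -k -u admin@internal:PASSWORD -X PUT \
    -H 'Content-Type: application/xml' -H 'Accept: application/xml' \
    -d '<cluster><cpu><type>Intel Skylake Client Family</type></cpu></cluster>' \
    'https://engine.example.com/ovirt-engine/api/clusters/CLUSTER_ID'

The Edit Cluster dialog in the Admin UI performs the equivalent update.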

Comment 2 Linus 2018-06-26 19:33:32 UTC
Changing the cluster CPU type also prevented oVirt from switching the next host (test-ovirt-2) into maintenance mode, since it was unable to migrate the VMs running on host test-ovirt-2 to any other host. The reason was that all running VMs were still using CPU type "Intel Skylake Client IBRS Family" and there was no longer any destination host with that CPU type available in the cluster.
We had to shut down and start again all VMs that had been started before switching the cluster CPU type.

Comment 3 Michal Skrivanek 2018-06-27 06:58:47 UTC
This could happen with an early 4.2.4 (or a late 4.2.3, IIRC) engine. Since you've used the appliance - is it possible it hasn't been updated to the current 4.2.4.x before adding the first host?
If that's the case - indeed, just use the workaround you mentioned in comment #1.

Comment 4 Linus 2018-06-27 16:23:38 UTC
Hi Michal,

We did not update before adding the first host.
As I tried to describe in the "Steps to Reproduce" section, we set up an oVirt hosted-engine setup with three hypervisor hosts and GlusterFS based on the package versions mentioned in that section.
The engine was fully provisioned, all three hosts deployed/integrated, managed GlusterFS volumes created for data, ISO and export purposes, and virtual machines provisioned and running before applying the 4.2.4.x yum updates to the hosted engine VM.
After updating the hosted engine VM we applied yum updates to the first hypervisor host in local maintenance mode, and that host was switched to "Non Operational" status by the engine, as described in the ticket.

Comment 5 Michal Skrivanek 2018-06-27 16:30:48 UTC
Yeah, that explains the behavior then. You wouldn't have hit that if you had updated the engine before adding the first host. We do not release updates to the appliance that often, relying on yum updates of a single baseline. It is not ideal, but we do not have the capacity to rebuild it completely all the time.

I would consider it fixed now in the latest 4.2.4.x, if that's fine with you.

Comment 6 Linus 2018-06-27 16:40:33 UTC
We did deploy the hosted engine VM before the 4.2.4.x updates were available from the oVirt 4.2 repo.

From my point of view, this change in host CPU detection/classification causes a service degradation from a seamless update procedure based on VM live migrations to a pretty disruptive update procedure requiring a change of cluster CPU type and a stop and restart of all VMs running within the cluster.

This issue in the update procedure will hit every oVirt setup deployed before the 4.2.4.x updates were released.

Additionally, according to the provided dmesg output, the Skylake CPUs actually provide the feature SPEC_CTRL (after microcode updates during the CentOS boot procedure), so the CPU is not missing this feature as claimed by the oVirt engine error message.

Additionally, it seems that our CPUs never provided IBRS, neither before the host update nor afterwards. So the classification as an "IBRS" CPU type seems questionable as well.

Comment 7 Michal Skrivanek 2018-06-28 05:05:22 UTC
(In reply to Linus from comment #6)
> We did deploy the hosted engine VM before the 4.2.4.x updates were
> available from the oVirt 4.2 repo.
> 
> From my point of view, this change in host CPU detection/classification
> causes a service degradation from a seamless update procedure based on VM
> live migrations to a pretty disruptive update procedure requiring a change
> of cluster CPU type and a stop and restart of all VMs running within the
> cluster.
> 
> This issue in the update procedure will hit every oVirt setup deployed
> before the 4.2.4.x updates were released.

Only those using IBRS CPUs, and it is fixed even for those by bug 1582483. The host should stay Operational if you use the up-to-date ovirt-engine version. Perhaps a mistake in the update procedure?

> 
> Additionally, according to the provided dmesg output, the Skylake CPUs
> actually provide the feature SPEC_CTRL (after microcode updates during the
> CentOS boot procedure), so the CPU is not missing this feature as claimed by
> the oVirt engine error message.

It's not missing it; it's just a consequence of a change in how flags are reported on the oVirt side.

> Additionally, it seems that our CPUs never provided IBRS, neither before
> the host update nor afterwards. So the classification as an "IBRS" CPU type
> seems questionable as well.

They do, that's the meaning of the spec_ctrl flag. If you care about security, please use the SSBD ones now; if not, you can just as well change that to the base type and avoid all the issues above.
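
The CPU type names the engine accepts for a given cluster level can be listed through the v4 API's clusterlevels collection; a quick check against a hypothetical engine FQDN:

# curl -s -k -u admin@internal:PASSWORD \
    'https://engine.example.com/ovirt-engine/api/clusterlevels/4.2' | grep -o '<name>[^<]*</name>'

The base, IBRS and SSBD variants of each CPU family should appear in the list on a 4.2.4 engine.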

Comment 8 Linus 2018-07-04 14:03:38 UTC
(In reply to Michal Skrivanek from comment #7)
> (In reply to Linus from comment #6)
> > We did deploy the hosted engine VM before the 4.2.4.x updates were
> > available from the oVirt 4.2 repo.
> > 
> > From my point of view, this change in host CPU detection/classification
> > causes a service degradation from a seamless update procedure based on VM
> > live migrations to a pretty disruptive update procedure requiring a change
> > of cluster CPU type and a stop and restart of all VMs running within the
> > cluster.
> > 
> > This issue in the update procedure will hit every oVirt setup deployed
> > before the 4.2.4.x updates were released.
> 
> Only those using IBRS CPUs, and it is fixed even for those by bug 1582483.
> The host should stay Operational if you use the up-to-date ovirt-engine
> version. Perhaps a mistake in the update procedure?

As far as I know, all current Intel CPUs are affected by all the Spectre-related bugs, so by "IBRS CPUs" you mean those Intel CPUs for which Intel provides a firmware update adding features that allow controlling the impact of the Spectre bugs? That would probably be all Intel servers bought within the last three to four years? :)

Bug 1582483 describes a workaround to allow scheduling of VMs requiring an IBRS CPU type on hosts that do not provide an IBRS-type CPU according to oVirt/libvirt reporting mechanisms. Should we have been able to find the errata describing this workaround in the release notes of oVirt 4.2.4?

As I already mentioned, we installed the hypervisor hosts and the hosted engine VM before the oVirt 4.2.4.x updates were available from the oVirt 4.2 repo. The IBRS CPU type was set automatically.
We updated the oVirt engine VM before updating the first oVirt host. Is there a way to update an existing setup to oVirt 4.2.4 without having the hypervisor host CPU type changed from IBRS to the base type?


> 
> > 
> > Additionally, according to the provided dmesg output, the Skylake CPUs
> > actually provide the feature SPEC_CTRL (after microcode updates during the
> > CentOS boot procedure), so the CPU is not missing this feature as claimed by
> > the oVirt engine error message.
> 
> It's not missing it; it's just a consequence of a change in how flags are
> reported on the oVirt side.
> 
> > Additionally, it seems that our CPUs never provided IBRS, neither before
> > the host update nor afterwards. So the classification as an "IBRS" CPU type
> > seems questionable as well.
> 
> They do, that's the meaning of the spec_ctrl flag. If you care about
> security, please use the SSBD ones now; if not, you can just as well change
> that to the base type and avoid all the issues above.

OK, I did some searching and found that "SPEC_CTRL" is the Linux kernel's label for using the MSR toggles of the Spectre-related CPU features. IBRS is a CPU mitigation feature that protects privileged code from speculation influence by user-space code, right?
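
As an aside, "SPEC_CTRL" corresponds to the IA32_SPEC_CTRL MSR, number 0x48 (bit 0 = IBRS, bit 1 = STIBP, bit 2 = SSBD). Assuming msr-tools is installed (it does not appear in the package lists above), the MSR can be probed directly; rdmsr prints the current value, or fails if the microcode does not expose the MSR:

# modprobe msr
# rdmsr 0x48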

