Bug 1772614

Summary: After RHV-H update to 4.3.6-20191108, host is marked Non-Operational as host does not meet the cluster's minimum CPU level. Missing CPU features : md_clear
Product: Red Hat Enterprise Virtualization Manager
Reporter: amashah
Component: redhat-virtualization-host
Assignee: Yuval Turgeman <yturgema>
Status: CLOSED NOTABUG
QA Contact: peyu
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.3.6
CC: cshao, esyr, lsvaty, mavital, michal.skrivanek, mkalinin, nlevy, peyu, qiyuan, rbarry, sbonazzo, shlei, weiwang, yaniwang, yturgema
Target Milestone: ---
Flags: lsvaty: testing_plan_complete-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-19 08:52:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Node
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description amashah 2019-11-14 18:45:33 UTC
Description of problem:

After updating RHV-H to redhat-virtualization-host-image-update-4.3.6-20191108.0.el7_7.noarch, the host is in the Non-Operational state because it does not meet the cluster's minimum CPU level. Missing CPU features: md_clear

This update included mitigations for CVE-2019-12207 and CVE-2019-11135.

Cluster CPU Type: Intel SandyBridge IBRS SSBD MDS Family





engine.log
============

~~~
2019-11-14 13:16:46,148Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [2029b11c] EVENT_ID: VDS_SET_NONOPERATIONAL(517), Host <REMOVED> moved to Non-Operational state.
2019-11-14 13:16:46,157Z WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [2029b11c] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host <REMOVED> moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : md_clear
2019-11-14 13:16:46,159Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [2029b11c] EVENT_ID: VDS_DETECTED(13), Status of host <REMOVED> was set to NonOperational.
~~~
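
The missing feature can also be pulled out of engine.log mechanically. The following is a minimal Python sketch, not an engine tool: the message pattern is inferred only from the excerpt above, the function name `missing_cpu_features` is hypothetical, and the assumption that several features would be separated by commas or spaces is not confirmed by this report.

```python
import re

# Pattern for the VDS_CPU_LOWER_THAN_CLUSTER audit message quoted above.
# Assumption: the wording matches this excerpt; other engine versions may differ.
MISSING_RE = re.compile(
    r"EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER\(515\).*"
    r"Missing CPU features\s*:\s*(?P<flags>.+)$"
)


def missing_cpu_features(log_lines):
    """Return the set of CPU features reported missing in the given log lines."""
    missing = set()
    for line in log_lines:
        match = MISSING_RE.search(line.strip())
        if match:
            # Assumption: multiple features are separated by commas and/or spaces.
            parts = re.split(r"[,\s]+", match.group("flags"))
            missing.update(part for part in parts if part)
    return missing


# Abridged copies of two of the log lines above.
sample = [
    "2019-11-14 13:16:46,148Z INFO  [...] EVENT_ID: VDS_SET_NONOPERATIONAL(517), "
    "Host <REMOVED> moved to Non-Operational state.",
    "2019-11-14 13:16:46,157Z WARN  [...] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), "
    "Host <REMOVED> moved to Non-Operational state as host does not meet the "
    "cluster's minimum CPU level. Missing CPU features : md_clear",
]

print(sorted(missing_cpu_features(sample)))
```

Run against the full engine.log, this would list every feature that any host was reported as missing.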


$ cat uname
Linux 3.10.0-1062.4.2.el7.x86_64 #1 SMP Tue Nov 5 11:59:54 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


Nov 14 13:10:08 Installed: redhat-virtualization-host-image-update-4.3.6-20191108.0.el7_7.noarch
Nov 14 13:10:08 Erased: redhat-virtualization-host-image-update-placeholder-4.3.5-3.el7ev.noarch


$ cat sos_commands/processor/lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Stepping:              7
CPU MHz:               2700.000
CPU max MHz:           2700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5399.78
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts spec_ctrl intel_stibp flush_l1d
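
The Flags line above can be compared against the flags a cluster CPU type needs. This is a rough Python sketch under stated assumptions: only md_clear is confirmed as required by this report, the other entries in `REQUIRED_FLAGS` are guesses based on the cluster type name (IBRS SSBD MDS), and both function names are hypothetical.

```python
# Flags assumed to be required for this illustration. Only md_clear is
# confirmed by the report; spec_ctrl and ssbd are guesses from the
# cluster type name "IBRS SSBD MDS".
REQUIRED_FLAGS = {"spec_ctrl", "ssbd", "md_clear"}


def parse_lscpu_flags(lscpu_text):
    """Extract the set of CPU flags from lscpu output (empty set if absent)."""
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()


def missing_flags(lscpu_text, required=REQUIRED_FLAGS):
    """Return the required flags that the host does not advertise, sorted."""
    return sorted(set(required) - parse_lscpu_flags(lscpu_text))


# Abridged version of the lscpu output above (most flags omitted).
sample = """\
Model name:            Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Flags:                 fpu vme ssbd ibrs ibpb stibp spec_ctrl intel_stibp flush_l1d
"""

print(missing_flags(sample))
```

With the host's flags as reported, the only entry left over is md_clear, which matches the engine's message.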




Version-Release number of selected component (if applicable):
4.3.6-20191108

How reproducible:
The update was attempted on only one host.

Steps to Reproduce:
1. Update RHV-H to the latest version.
2. Keep the RHV-M Cluster CPU Type set to 'Intel SandyBridge IBRS SSBD MDS Family'.
3. When the host reboots, it is marked as Non-Operational.

Actual results:
The host is Non-Operational due to the missing md_clear CPU flag. The host's CPU type is no longer shown as MDS.

Expected results:
The host should be operational.

Additional info:
Logs will be attached soon from RHV-M and the host.

Comment 2 Ryan Barry 2019-11-14 21:06:15 UTC
Microcode and other packages look up to date (from rpmdb and cpuinfo). Is it possible to get the same output (or at least cat /proc/cpuinfo) from the previous version of RHVH?

Comment 7 Eugene Syromiatnikov 2019-11-15 13:12:17 UTC
Yep, the microcode 0x718 for Sandy Bridge-E/EN/EP (FF-MM-SS 06-2d-07, CPUID 0x206d7), which contains the MDS mitigations, is disabled by default, since it causes hangs on some systems under undetermined conditions (see [1][2]).  Information about this should have been reported to syslog/kmsg.  One can enable it explicitly by creating the file "/etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07" and triggering initramfs regeneration ("dracut -f --regenerate-all").

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1758382
[2] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15
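
The identifiers in the comment above line up with the lscpu output in the description (CPU family 6, Model 45, Stepping 7). The Python sketch below shows the arithmetic. It is an illustration only: the function names are hypothetical, the CPUID packing covers family 6 only, and building the override file name as `force-intel-FF-MM-SS` generalizes from the single path quoted in the comment, which is an assumption.

```python
from pathlib import PurePosixPath

# Directory taken from the path quoted in the comment above.
CAVEATS_DIR = PurePosixPath("/etc/microcode_ctl/ucode_with_caveats")


def ff_mm_ss(family, model, stepping):
    """Format family/model/stepping as a hex FF-MM-SS string."""
    return f"{family:02x}-{model:02x}-{stepping:02x}"


def cpuid_signature(family, model, stepping):
    """Pack family/model/stepping into the CPUID leaf 1 EAX layout.

    Sketch only: handles family 6, where the high nibble of the model
    goes into the extended-model field (bits 19:16).
    """
    if family != 6:
        raise ValueError("this sketch only covers family 6")
    ext_model, base_model = divmod(model, 16)
    return (ext_model << 16) | (family << 8) | (base_model << 4) | stepping


def force_override_path(family, model, stepping):
    """Build the override file path (name pattern is an assumption)."""
    return CAVEATS_DIR / f"force-intel-{ff_mm_ss(family, model, stepping)}"


# Values from the lscpu output in the description.
print(ff_mm_ss(6, 45, 7))
print(hex(cpuid_signature(6, 45, 7)))
print(force_override_path(6, 45, 7))
```

For the reported host this yields 06-2d-07, 0x206d7, and the same file path as in the comment. Creating that file and regenerating the initramfs remain manual steps, as described there.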

Comment 10 Marina Kalinin 2019-11-15 16:47:57 UTC
Let me see if I understand this correctly:
- md_clear flag is disabled now by default with latest microcode_ctl, meaning that all customers will not have this flag after upgrade.
-- Not all customers, since obviously QE/Eng didn't hit this issue. So is there a list of the CPUs that will be impacted?

- If this flag is not required anymore, why require it in the cluster? This is where I am the most confused. What if the customer does not want to enable it (why would they, actually?), what should they do? Which RHV cluster type should they use?

- If they do enable it right now, what will happen with next upgrade? Would it get disabled again?

Comment 11 Marina Kalinin 2019-11-15 16:51:46 UTC
Also, the mentioned bz#1758382 is only for RHEL6, I didn't find anything for RHEL7 attached to it. 
What does it mean for RHV RHEL7 based hypervisors?

Comment 12 Eugene Syromiatnikov 2019-11-15 17:03:21 UTC
(In reply to Marina Kalinin from comment #10)
> Let me see if I understand this correctly:

> - md_clear flag is disabled now by default with latest microcode_ctl,
> meaning that all customers will not have this flag after upgrade.
> -- Not all customers, since obviously QE/Eng didn't hit this issue. So is
> there a list of the CPUs that will be impacted?

It's limited to SNB-E/EN/EP, as it is the only CPU model family so far that manifested such behaviour with the MDS-mitigation-enabling microcode update.

Approximately (since the Family/Model/Stepping information for each CPU model is not easily available from Intel, and there is also 06-2d-06, which has no reports of the issue so far), this is the list of the impacted CPU models:
https://ark.intel.com/content/www/us/en/ark/products/codename/64276/sandy-bridge-ep.html
https://ark.intel.com/content/www/us/en/ark/products/codename/64275/sandy-bridge-en.html
https://ark.intel.com/content/www/us/en/ark/products/codename/63378/sandy-bridge-e.html

> - If this flag is not required anymore, why require it in the cluster?
> This is where I am the most confused. What if the customer does not want to
> enable it (why would they, actually?), what should they do? Which RHV
> cluster type should they use?

It is required for proper MDS mitigations.

> - If they do enable it right now, what will happen with next upgrade? Would
> it get disabled again?

Since the override configuration file is persistent, it is honored during subsequent microcode_ctl package upgrades.

Comment 13 Eugene Syromiatnikov 2019-11-15 17:05:17 UTC
(In reply to Marina Kalinin from comment #11)
> Also, the mentioned bz#1758382 is only for RHEL6, I didn't find anything for
> RHEL7 attached to it. 

The report [1] also mentions other kernels, and so far it is believed that the issue is microcode-specific and is not tied to a specific kernel version.

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15

Comment 14 Marina Kalinin 2019-11-15 17:10:22 UTC
(In reply to Eugene Syromiatnikov from comment #12)

OK, thanks. I think I am starting to get it:
There is only a list of impacted CPUs that would not have this flag. Everyone else will have it.
And those that are on the list may disable it, but it can result in kernel panic, so at their own risk?

Comment 15 Eugene Syromiatnikov 2019-11-15 19:04:30 UTC
(In reply to Marina Kalinin from comment #14)
> There is only a list of impacted CPUs that would not have this flag.
> Everyone else will have it.

Well, there are also older CPU models (Westmere and older) that have not received any MDS-mitigated microcode updates, but, yes, of those that have gotten this md_clear "feature", only SNB-EP has had it rolled back by default.

> And those that are on the list may disable it, but it can result in kernel
> panic, so at their own risk?

It's not even a kernel panic; it's a system hang (similar to the BDW-EP late microcode update).

Comment 16 Ryan Barry 2019-11-15 19:37:29 UTC
We've also seen similar microcode issues on other systems (some require firmware updates before it's safe to apply). Since not all CPUs are affected, a KCS documenting the override, which links to our page on minimum microcode/firmware guidance, is the recommended way forward. It would not be safe to force-enable it in RHV, but showing customers that their hosts are still vulnerable/non-operational is the right way.

Comment 21 Marina Kalinin 2019-11-18 16:21:06 UTC
I modified the KCS: https://access.redhat.com/solutions/4588001 accordingly.

Comment 22 Sandro Bonazzola 2019-11-19 08:52:51 UTC
Closing as not a bug. A KCS has been provided for handling the microcode update needed to get the host back to an operational state: https://access.redhat.com/solutions/4588001