Bug 1772614
| Summary: | After RHV-H update to 4.3.6-20191108, host is marked Non-Operational as host does not meet the cluster's minimum CPU level. Missing CPU features : md_clear | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | amashah |
| Component: | redhat-virtualization-host | Assignee: | Yuval Turgeman <yturgema> |
| Status: | CLOSED NOTABUG | QA Contact: | peyu |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.3.6 | CC: | cshao, esyr, lsvaty, mavital, michal.skrivanek, mkalinin, nlevy, peyu, qiyuan, rbarry, sbonazzo, shlei, weiwang, yaniwang, yturgema |
| Target Milestone: | --- | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-19 08:52:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Node | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
amashah
2019-11-14 18:45:33 UTC
Microcode and other packages look up to date (from rpmdb and cpuinfo). Is it possible to get the same output (or at least cat /proc/cpuinfo) from the previous version of RHVH?

Yep, the microcode 0x718 for Sandy Bridge-E/EN/EP (FF-MM-SS 06-2d-07, CPUID 0x206d7), which contains the MDS mitigations, is disabled by default, since it causes system hangs on some systems under undetermined conditions (see [1][2]). Information about this should have been reported to syslog/kmsg. One can enable it explicitly by creating the file "/etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07" and triggering initramfs regeneration ("dracut -f --regenerate-all").
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1758382
[2] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15
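
To check whether a given host falls into this case, one can derive the FF-MM-SS signature and look for md_clear among the flags in /proc/cpuinfo. The following is a minimal Python sketch, not part of microcode_ctl or RHV. The function name parse_cpuinfo and the SAMPLE text are hypothetical and for illustration only, and the sketch assumes the x86 /proc/cpuinfo field names ("cpu family", "model", "stepping", "flags").

```python
# Illustrative /proc/cpuinfo excerpt (hypothetical, trimmed); not taken from a real host.
SAMPLE = """processor\t: 0
vendor_id\t: GenuineIntel
cpu family\t: 6
model\t\t: 45
model name\t: Intel(R) Xeon(R) CPU (illustrative)
stepping\t: 7
flags\t\t: fpu vme de pse
"""


def parse_cpuinfo(text):
    """Return (signature, has_md_clear) for the first processor entry.

    signature is the family-model-stepping triple in the FF-MM-SS form,
    for example "06-2d-07" for the Sandy Bridge-E/EN/EP stepping
    discussed in this bug.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    # /proc/cpuinfo reports these three values in decimal; FF-MM-SS is hex.
    signature = "{:02x}-{:02x}-{:02x}".format(
        int(fields["cpu family"]),
        int(fields["model"]),
        int(fields["stepping"]),
    )
    has_md_clear = "md_clear" in fields.get("flags", "").split()
    return signature, has_md_clear
```

On a host, the function would be called with the contents of /proc/cpuinfo, for example parse_cpuinfo(open("/proc/cpuinfo").read()). A result of ("06-2d-07", False) would match the situation described above, where the MDS-mitigating microcode is not loaded unless the override file is created.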

Let me see if I understand this correctly:
- md_clear flag is disabled now by default with latest microcode_ctl, meaning that all customers will not have this flag after upgrade.
-- Not all customers, since obviously QE/Eng didn't hit this issue. So is there a list of the CPUs that will be impacted?
- If this flag is not required anymore, why require it in the cluster? This is where I am the most confused. What if the customer does not want to enable it (why would they, actually?), what should they do? Which RHV cluster type should they use?
- If they do enable it right now, what will happen with next upgrade? Would it get disabled again?

Also, the mentioned bz#1758382 is only for RHEL6, I didn't find anything for RHEL7 attached to it. What does it mean for RHV RHEL7 based hypervisors?

(In reply to Marina Kalinin from comment #10)
> Let me see if I understand this correctly:
> - md_clear flag is disabled now by default with latest microcode_ctl,
> meaning that all customers will not have this flag after upgrade.
> -- Not all customers, since obviously QE/Eng didn't hit this issue. So is
> there a list of the CPUs that will be impacted?

It's limited to SNB-E/EN/EP, as it is the only CPU model family so far that has manifested such behaviour with the MDS-mitigation-enabling microcode update.

Approximately (since the information regarding Family/Model/Stepping for each CPU model is not easily available from Intel, and there's also 06-2d-06, which has no reports of the issue so far), this is the list of the impacted CPU models:
https://ark.intel.com/content/www/us/en/ark/products/codename/64276/sandy-bridge-ep.html
https://ark.intel.com/content/www/us/en/ark/products/codename/64275/sandy-bridge-en.html
https://ark.intel.com/content/www/us/en/ark/products/codename/63378/sandy-bridge-e.html

> - If this flag is not required anymore, why require it in the cluster?
> This is where I am the most confused. What if the customer does not want to
> enable it (why would they, actually?), what should they do? Which RHV
> cluster type should they use?

It is required for proper MDS mitigations.

> - If they do enable it right now, what will happen with next upgrade? Would
> it get disabled again?

Since the override configuration file is persistent in nature, it is honored during subsequent microcode_ctl package upgrades.

(In reply to Marina Kalinin from comment #11)
> Also, the mentioned bz#1758382 is only for RHEL6, I didn't find anything for
> RHEL7 attached to it.

The report [1] also mentions other kernels, and so far it is believed that the issue is microcode-specific and is not tied to a specific kernel version.

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15

(In reply to Eugene Syromiatnikov from comment #12)
> (In reply to Marina Kalinin from comment #10)
> > Let me see if I understand this correctly:
> > - md_clear flag is disabled now by default with latest microcode_ctl,
> > meaning that all customers will not have this flag after upgrade.
> > -- Not all customers, since obviously QE/Eng didn't hit this issue. So is
> > there a list of the CPUs that will be impacted?
>
> It's limited to SNB-E/EN/EP, as it is the only CPU model family so far that
> has manifested such behaviour with the MDS-mitigation-enabling microcode update.
>
> Approximately (since the information regarding Family/Model/Stepping for each
> CPU model is not easily available from Intel, and there's also 06-2d-06,
> which has no reports of the issue so far), this is the list of the impacted
> CPU models:
> https://ark.intel.com/content/www/us/en/ark/products/codename/64276/sandy-bridge-ep.html
> https://ark.intel.com/content/www/us/en/ark/products/codename/64275/sandy-bridge-en.html
> https://ark.intel.com/content/www/us/en/ark/products/codename/63378/sandy-bridge-e.html
>
> > - If this flag is not required anymore, why require it in the cluster?
> > This is where I am the most confused. What if the customer does not want to
> > enable it (why would they, actually?), what should they do? Which RHV
> > cluster type should they use?
>
> It is required for proper MDS mitigations.
>
> > - If they do enable it right now, what will happen with next upgrade? Would
> > it get disabled again?
>
> Since the override configuration file is persistent in nature, it is honored
> during subsequent microcode_ctl package upgrades.

ok, thanks. I think I started getting it:
There is only a list of impacted CPUs that would not have this flag. Everyone else will have it.
And those that are on the list may disable it, but it can result in kernel panic, so at their own risk?

(In reply to Marina Kalinin from comment #14)
> There is only a list of impacted CPUs that would not have this flag.
> Everyone else will have it.

Well, there are also older CPU models (Westmere and older) that have not received any MDS-mitigated microcode updates, but, yes, of those that have gotten this md_clear "feature", only SNB-EP has been rolled back by default.

> And those that are on the list may disable it, but it can result in kernel
> panic, so at their own risk?

It's not even a kernel panic, it's a system hang (similar to the BDW-EP late microcode update).

We've also seen similar microcode issues on other systems (some require firmware updates before it's safe to apply). Since not all CPUs are affected, a KCS documenting the override, which links to our page on minimum microcode/firmware guidance, is the recommended way forward. It would not be safe to force enable it in RHV, but showing customers that their hosts are still vulnerable/non-operational is the right way.

I modified the KCS: https://access.redhat.com/solutions/4588001 accordingly.

Closing as not a bug. A KCS has been provided for handling the microcode update needed to get the host back to an operational state: https://access.redhat.com/solutions/4588001