Bug 1758382

Summary: Dell PowerEdge M820 servers panicking after microcode / firmware upgrades
Product: Red Hat Enterprise Linux 6 Reporter: Joshua Baker <jobaker>
Component: microcode_ctl Assignee: Eugene Syromiatnikov <esyr>
Status: CLOSED ERRATA QA Contact: Jeff Bastian <jbastian>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.10 CC: ionut.jula, ionutjula, jmario, kwalker, sjohnsto, skozina, toneata
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: microcode_ctl-1.17-33.17.el6_10 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-16 08:54:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joshua Baker 2019-10-04 00:56:24 UTC
Description of problem:
* 7 systems freeze completely: black screen, no console or SSH access.
   - Hardware errors (MCE checks) are frequently seen in the logs of systems experiencing the issue.
   - Customer has contacted the hardware vendor, who ran thorough hardware diagnostics with no errors observed.
   - Hardware vendor Dell has opened case 02486054 for this customer's servers and for another PowerEdge M820 from a different customer experiencing the same issue.
* NMI and serial console have not worked *at the time of the issue*.
   - Customer mentioned that NMI works fine while the system is running normally and fails only at the time of the issue.
* Issue started after the kernel update from 2.6.32-754.12.1.el6.x86_64 to 2.6.32-754.17.1.el6.x86_64.
   - microcode_ctl-1.17-33.9.el6_10.x86_64 having firmware revision=0x42d was additionally updated to microcode_ctl-1.17-33.14.el6_10.x86_64 having firmware revision=0x718.
   - Customer is facing the issue with the latest kernel-2.6.32-754.18.2.el6.x86_64 as well.
* Issue is sporadic (there is no pattern to the timing of the issue manifesting).
* Customer believes the issue is related to the microcode introduced with the kernel, although evidence appears to point more towards the firmware being the issue.

Version-Release number of selected component (if applicable):
- kernel-2.6.32-754.15.1+
- microcode_ctl-1.17-33.14.el6_10.x86_64 --> revision=0x718 firmware

How reproducible:
Issue happens randomly on Dell PowerEdge M820 servers with the kernels and firmware listed above. There is no known way to trigger the event.

Steps to reproduce:
1. Install the following on Dell PowerEdge M820:
  - kernel-2.6.32-754.15.1+
  - microcode_ctl-1.17-33.14.el6_10.x86_64 --> revision=0x718 firmware
2. Wait for issue to appear

Additional info:
Observed in 02458913 opened by customer
Observed in 02486054 opened by Dell

Comment 2 Eugene Syromiatnikov 2019-10-04 07:57:40 UTC
> microcode_ctl-1.17-33.9.el6_10.x86_64 having firmware revision=0x42d

Judging by the server model (Dell PowerEdge M820) and microcode_ctl-1.17-33.14.el6_10.x86_64-provided microcode revision being 0x718, I would presume that the CPU model is Intel Xeon E5-46xx (CPUID 0x206d7, FF-MM-SS 06-2d-07, codename Sandy Bridge-EP) and the respective microcode version for it in microcode_ctl-1.17-33.9.el6_10.x86_64 is 0x714.

Otherwise this microcode revision looks like the one for Ivy Bridge-EP (CPUID 0x306e4, FF-MM-SS 06-3e-04), possibly Intel Xeon E5-46xx v2.
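
The mapping between the CPUID signatures and the FF-MM-SS (family-model-stepping) strings quoted above can be checked with a short sketch. It assumes the standard x86 CPUID leaf 1 EAX layout; the function names are hypothetical and chosen only for illustration.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family is 0xF.
    if base_family == 0xF:
        family = base_family + ext_family
    else:
        family = base_family

    # The extended model supplies the high nibble for families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return family, model, stepping


def ff_mm_ss(eax):
    """Format a signature as FF-MM-SS, as used in microcode file names."""
    return "%02x-%02x-%02x" % decode_cpuid_signature(eax)


print(ff_mm_ss(0x206D7))  # prints 06-2d-07 (the Sandy Bridge-EP signature)
print(ff_mm_ss(0x306E4))  # prints 06-3e-04 (the Ivy Bridge-EP signature)
```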

So, considering the above, there are the following questions:
 * Are the issues observed on kernel-2.6.32-754.15.1+ with microcode_ctl-1.17-33.11+ (that's the microcode_ctl RPM release that brings MDS-enabled IVB-EP microcode revision 0x42e)?
 * Considering both the kernel and microcode updates are MDS-related, are the issues observed on SNB-EP machines with kernel-2.6.32-754.15.1+, microcode_ctl-1.17-33.14, and mds=off kernel parameter?
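
To answer these questions for a given machine, it helps to record which microcode revision is actually loaded and whether mds=off is on the kernel command line. The following is a minimal sketch, assuming the running kernel exposes a "microcode" field in /proc/cpuinfo (not verified here for every RHEL 6 kernel); the helper names and the sample text are made up for illustration.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+|\d+)\s*$", line)
        if not match:
            continue
        value = match.group(1)
        # The field is normally hexadecimal with a 0x prefix; accept decimal too.
        base = 16 if value.lower().startswith("0x") else 10
        revisions.add(int(value, base))
    return revisions


def mds_disabled(cmdline_text):
    """True if the literal mds=off parameter is on the kernel command line."""
    return "mds=off" in cmdline_text.split()


# Made-up sample input; on a real system the text would be read from
# /proc/cpuinfo and /proc/cmdline.
SAMPLE_CPUINFO = "processor\t: 0\nmicrocode\t: 0x718\nprocessor\t: 1\nmicrocode\t: 0x718\n"
SAMPLE_CMDLINE = "ro root=/dev/mapper/vg-root mds=off quiet"

print(sorted(hex(rev) for rev in microcode_revisions(SAMPLE_CPUINFO)))
print(mds_disabled(SAMPLE_CMDLINE))
```

More than one revision in the returned set would mean that not all CPUs run the same microcode, which is worth noting in the case.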

Comment 3 Joe Mario 2019-10-04 11:30:27 UTC
After reading through the lengthy customer case, I agree with Eugene's
triage steps posted in the case (appended below).

This feels like the verw instruction's flushing behavior is the trigger.  

Comments from Eugene Syromiatnikov on next steps:

  microcode_ctl-1.17-33.14.el6_10.x86_64 that brings 0x718 microcode
  release.

  So, since both the microcode_ctl and kernel updates are MDS-related, I
  would suggest to check the following cases:
   * Downgraded microcode_ctl package (that makes sense only if
     OS-driven upgrades are actually used and system firmware doesn't have
     0x718 microcode revision already) and kernel-2.6.32-754.15.1+
   * Updated microcode_ctl and pre-2.6.32-754.15.1 kernel.
   * Updated microcode_ctl and 2.6.32-754.15.1+ kernel with mds=off

  If none of these cases leads to hangs, I would suspect issues in the VERW
  instruction implementation on SNB-EP.
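
The three test cases above can be told apart mechanically when collecting results from several machines. The sketch below is illustrative only: the function names and labels are hypothetical, the 0x718 threshold applies to SNB-EP (06-2d-07) only, and the kernel threshold is the 2.6.32-754.15.1 release named in the list.

```python
MDS_KERNEL = (754, 15, 1)  # kernel release named in the triage list
NEW_MICROCODE = 0x718      # SNB-EP revision from microcode_ctl-1.17-33.14


def kernel_release_tuple(uname_r):
    """Turn '2.6.32-754.17.1.el6.x86_64' into (754, 17, 1)."""
    release = uname_r.split("-", 1)[1]
    parts = []
    for field in release.split("."):
        if not field.isdigit():
            break  # stop at the first non-numeric field such as 'el6'
        parts.append(int(field))
    return tuple(parts)


def triage_case(uname_r, microcode_rev, cmdline):
    """Label a configuration according to the triage list above."""
    new_kernel = kernel_release_tuple(uname_r) >= MDS_KERNEL
    new_ucode = microcode_rev >= NEW_MICROCODE
    mds_off = "mds=off" in cmdline.split()

    if new_kernel and not new_ucode:
        return "case 1: downgraded microcode, updated kernel"
    if new_ucode and not new_kernel:
        return "case 2: updated microcode, older kernel"
    if new_kernel and new_ucode and mds_off:
        return "case 3: updated microcode, updated kernel, mds=off"
    if new_kernel and new_ucode:
        return "reported configuration: updated microcode and kernel, MDS mitigation on"
    return "baseline: older microcode, older kernel"


print(triage_case("2.6.32-754.18.2.el6.x86_64", 0x718, "ro quiet mds=off"))
```

If hangs occur only in the "reported configuration" and in none of the three cases, that matches the suspicion stated above.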

Comment 4 Eugene Syromiatnikov 2019-10-04 14:58:13 UTC
See also [1].

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15

Comment 10 errata-xmlrpc 2019-10-16 08:54:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3090

Comment 12 Eugene Syromiatnikov 2019-11-19 16:58:38 UTC
*** Bug 1774134 has been marked as a duplicate of this bug. ***

Comment 14 Red Hat Bugzilla 2023-09-18 00:17:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days