Bug 1758382 - Dell PowerEdge M820 servers panicking after microcode / firmware upgrades
Summary: Dell PowerEdge M820 servers panicking after microcode / firmware upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: microcode_ctl
Version: 6.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eugene Syromiatnikov
QA Contact: Jeff Bastian
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-04 00:56 UTC by Joshua Baker
Modified: 2023-09-18 00:17 UTC
CC List: 7 users

Fixed In Version: microcode_ctl-1.17-33.17.el6_10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 08:54:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4593951 0 None None None 2019-11-18 20:12:09 UTC
Red Hat Product Errata RHEA-2019:3090 0 None None None 2019-10-16 08:54:19 UTC

Description Joshua Baker 2019-10-04 00:56:24 UTC
Description of problem:
* 7 systems are freezing completely: the screen goes black, with no console or SSH access.
   - Hardware errors (MCE checks) are frequently seen in the logs of systems experiencing the issue.
   - Customer has contacted the hardware vendor, who ran thorough hardware diagnostics with no errors observed.
   - Hardware vendor Dell has opened case 02486054 for this customer's servers and for another PowerEdge M820 from a different customer experiencing the same issue.
* NMI and serial console have not worked *at the time of the issue*.
   - Customer mentioned that NMI works fine while the system is running normally and fails only at the time of the issue.
* Issue started after the kernel update from 2.6.32-754.12.1.el6.x86_64 to 2.6.32-754.17.1.el6.x86_64.
   - microcode_ctl-1.17-33.9.el6_10.x86_64 (firmware revision=0x42d) was additionally updated to microcode_ctl-1.17-33.14.el6_10.x86_64 (firmware revision=0x718).
   - Customer is facing the issue with the latest kernel-2.6.32-754.18.2.el6.x86_64 as well.
* Issue is sporadic (there is no pattern to the timing of the issue manifesting).
* Customer believes the issue is related to the microcode introduced with the kernel, although evidence appears to point more towards the firmware being the issue.

Version-Release number of selected component (if applicable):
- kernel-2.6.32-754.15.1+
- microcode_ctl-1.17-33.14.el6_10.x86_64 --> revision=0x718 firmware

How reproducible:
Issue happens randomly on Dell PowerEdge M820 servers with the kernels and firmware listed above. There is no known way to trigger the event.

Steps to reproduce:
1. Install the following on Dell PowerEdge M820:
  - kernel-2.6.32-754.15.1+
  - microcode_ctl-1.17-33.14.el6_10.x86_64 --> revision=0x718 firmware
2. Wait for issue to appear

Additional info:
Observed in 02458913 opened by customer
Observed in 02486054 opened by Dell
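
To confirm which microcode revision a system is actually running (0x718 is the revision reported above), a minimal sketch follows. It assumes the kernel exposes a "microcode : 0x..." line per CPU in /proc/cpuinfo, which may not hold on every kernel build; the helper name microcode_revisions is hypothetical and not part of microcode_ctl.

```python
import re

# Revision reported in this bug for microcode_ctl-1.17-33.14.el6_10.
SUSPECT_REVISION = 0x718


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in
    /proc/cpuinfo-style text. Assumes lines of the form
    'microcode : 0x718'; returns an empty set if none are present."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            found = microcode_revisions(f.read())
    except OSError:
        found = set()
    print("revisions:", sorted(hex(r) for r in found))
    print("suspect revision loaded:", SUSPECT_REVISION in found)
```

An empty result only means the field was not found in the expected format, not that no microcode is loaded.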

Comment 2 Eugene Syromiatnikov 2019-10-04 07:57:40 UTC
> microcode_ctl-1.17-33.9.el6_10.x86_64 having firmware revision=0x42d

Judging by the server model (Dell PowerEdge M820) and microcode_ctl-1.17-33.14.el6_10.x86_64-provided microcode revision being 0x718, I would presume that the CPU model is Intel Xeon E5-46xx (CPUID 0x206d7, FF-MM-SS 06-2d-07, codename Sandy Bridge-EP) and the respective microcode version for it in microcode_ctl-1.17-33.9.el6_10.x86_64 is 0x714.

Otherwise this microcode revision looks like the one for Ivy Bridge-EP (CPUID 0x306e4, FF-MM-SS 06-3e-04), possibly Intel Xeon E5-46xx v2.

So, considering the above, there are the following questions:
 * Are the issues observed on kernel-2.6.32-754.15.1+ with microcode_ctl-1.17-33.11+ (that's the microcode_ctl RPM release that brings MDS-enabled IVB-EP microcode revision 0x42e)?
 * Considering both the kernel and microcode updates are MDS-related, are the issues observed on SNB-EP machines with kernel-2.6.32-754.15.1+, microcode_ctl-1.17-33.14, and mds=off kernel parameter?
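
For reference, the relation between a CPUID signature and the FF-MM-SS (family-model-stepping) notation used above can be sketched as follows. This follows the usual Intel convention for the CPUID leaf 1 EAX value (the extended model is combined with the base model for families 0x6 and 0xF); the function names are illustrative only.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family is 0xF.
    family = (base_family + ext_family) if base_family == 0xF else base_family
    # The extended model supplies the high nibble for families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def ff_mm_ss(eax):
    """Format a CPUID signature as the FF-MM-SS string used in microcode file names."""
    return "%02x-%02x-%02x" % decode_cpuid_signature(eax)


if __name__ == "__main__":
    print(ff_mm_ss(0x206D7))  # Sandy Bridge-EP signature from this comment
    print(ff_mm_ss(0x306E4))  # Ivy Bridge-EP signature from this comment
```

With the two signatures quoted in this comment, the output is 06-2d-07 and 06-3e-04 respectively.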

Comment 3 Joe Mario 2019-10-04 11:30:27 UTC
After reading through the lengthy customer case, I agree with Eugene's
triage steps posted in the case (appended below).

This feels like the verw instruction's flushing behavior is the trigger.  

Comments from  Eugene Syromiatnikov  on next steps:

  microcode_ctl-1.17-33.14.el6_10.x86_64 that brings 0x718 microcode
  release.

  So, since both the microcode_ctl and kernel updates are MDS-related, I
  would suggest to check the following cases:
   * Downgraded microcode_ctl package (that only makes sense if
     OS-driven upgrades are actually used and system firmware doesn't have
     0x718 microcode revision already) and check kernel-2.6.32-754.15.1+
   * Updated microcode_ctl and pre-2.6.32-754.15.1 kernel.
   * Updated microcode_ctl and 2.6.32-754.15.1+ kernel with mds=off

  If all these cases do not lead to hangs, I would suspect issues in the VERW
  instruction implementation on SNB-EP.
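
When running the third case, it helps to verify that the system really booted with mds=off. A minimal sketch, assuming the boot parameters are readable from /proc/cmdline; the helper name mds_option is hypothetical, and the sketch only inspects the command line, not the mitigation state the kernel actually applied.

```python
def mds_option(cmdline):
    """Return the value of the 'mds=' kernel parameter from a kernel
    command line string, or None if it is not present. If the parameter
    is given more than once, the last occurrence is returned."""
    value = None
    for token in cmdline.split():
        if token.startswith("mds="):
            value = token[len("mds="):]
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            option = mds_option(f.read())
    except OSError:
        option = None
    if option is None:
        print("mds= not set on the kernel command line (kernel default applies)")
    else:
        print("booted with mds=%s" % option)
```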

Comment 4 Eugene Syromiatnikov 2019-10-04 14:58:13 UTC
See also [1].

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15

Comment 10 errata-xmlrpc 2019-10-16 08:54:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3090

Comment 12 Eugene Syromiatnikov 2019-11-19 16:58:38 UTC
*** Bug 1774134 has been marked as a duplicate of this bug. ***

Comment 14 Red Hat Bugzilla 2023-09-18 00:17:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

