Bug 1532283

Summary: System hang during boot following update to microcode_ctl-1.17-25.2.el6_9.x86_64
Product: Red Hat Enterprise Linux 6
Reporter: Kyle Walker <kwalker>
Component: microcode_ctl
Assignee: Petr Oros <poros>
Status: CLOSED CURRENTRELEASE
QA Contact: Rachel Sibley <rasibley>
Severity: urgent
Priority: urgent
Version: 6.9
CC: cdonnell, ionut, jbastian, jpriddy, o.freyermuth, poros, riehecky, sfroemer, skozina, tgummels, tomek, vagrawal, wienemann, williamverzal1, woodard
Target Milestone: rc
Target Release: 6.10
Hardware: x86_64
OS: Linux
Fixed In Version: microcode_ctl-1.17-25.3.el6_9.x86_64
Last Closed: 2018-06-21 12:45:11 UTC
Type: Bug
Bug Blocks: 1425544

Description Kyle Walker 2018-01-08 15:02:15 UTC
Description of problem:
  Following an update of the microcode_ctl package, a system hangs during boot with messages related to the microcode load operation. At this time, the hang has been observed only on systems with the following CPU model information:

  $ awk '/model/||/stepping/||/family/||/microcode/ {if(!seen[$0]++) print $0}' proc/cpuinfo
  cpu family	: 6
  model		: 79
  model name	: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
  stepping	: 1
  microcode	: 184549407
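
  Note that the microcode field above is printed in decimal, while the boot messages report the revision in hexadecimal. A minimal sketch of the conversion follows; the helper name microcode_revision is hypothetical and only illustrates that 184549407 and 0xb00001f are the same revision.

```python
def microcode_revision(cpuinfo_value):
    """Return a /proc/cpuinfo microcode value as a hex string.

    base=0 lets int() accept the decimal form shown above as well as a
    0x-prefixed form, in case a kernel prints the field in hex.
    """
    return hex(int(cpuinfo_value.strip(), 0))


# Value taken from the cpuinfo output in this report.
print(microcode_revision("184549407"))
```

  The result can be compared directly with the revision= value in the kernel's microcode messages.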


Version-Release number of selected component (if applicable):
  1:microcode_ctl-1.17-25.2.el6_9.x86_64

How reproducible:
  Easily

Steps to Reproduce:
1. On a system with the above CPU information, install a base RHEL 6.9 deployment 
2. Issue a "yum update"
3. Reboot

Actual results:
  <snip>
  microcode: CPU0 sig=0x406f1, pf=0x1, revision=0xb00001f
  platform microcode: firmware: requesting intel-ucode/06-4f-01
  <snip>
    ^- Hangs here with no further output visible
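
  For reference, the sig value and the requested firmware file name in the messages above correspond to the family, model, and stepping from cpuinfo. The sketch below assumes the standard CPUID signature layout (stepping in bits 0-3, model in bits 4-7, family in bits 8-11, extended model in bits 16-19, extended family in bits 20-27); the function names are hypothetical and are not taken from the kernel or from microcode_ctl.

```python
def decode_signature(sig):
    """Split a CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended model bits only apply to base families 6 and 15.
    if family in (6, 15):
        model |= ext_model << 4
    # The extended family is only added for base family 15.
    if family == 15:
        family += ext_family
    return family, model, stepping


def ucode_filename(sig):
    """Build the intel-ucode file name in family-model-stepping hex form."""
    family, model, stepping = decode_signature(sig)
    return "intel-ucode/{:02x}-{:02x}-{:02x}".format(family, model, stepping)


# Signature taken from the boot message in this report.
print(decode_signature(0x406F1))
print(ucode_filename(0x406F1))
```

  Under these assumptions, sig=0x406f1 decodes to family 6, model 79, stepping 1, which matches the cpuinfo output and the requested file intel-ucode/06-4f-01.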

Expected results:
  A normal boot operation with no hang observed.

Additional info:
  A downgrade of the microcode to the previous revision resolves the hang.

Comment 6 Ion Badulescu 2018-01-11 17:02:07 UTC
Same here after the update. Downgrading the microcode fixes the issue.

Hardware platform is:

Supermicro X10DDW-i with 2x E5-2667v4 CPUs (stepping 1), BIOS 2.0a from 8/17/2016 (the latest available).

Incidentally, whether the hang occurs appears to depend on the kernel, and the behavior is not consistent between machines with otherwise identical hardware.

Case in point:

- on one machine with the above specs, booting kernel-2.6.32-696.16.1.el6.x86_64 causes a hard hang when loading the microcode.
- on the same machine, booting kernel-2.6.32-696.18.7.el6.x86_64 causes a hard reset when loading the microcode.
- on the same machine, booting a custom (locally built) kernel based on 3.10.107 boots up fine.

However:

- on another machine with the above specs, booting the custom 3.10.107-based kernel causes a hard hang.

Downgrading the microcode fixes the problem in all the problem cases encountered.

Comment 12 Oliver Freyermuth 2018-01-12 14:26:47 UTC
We also see this on downstream distros, e.g. CentOS 6, SL 6, CentOS 7 etc. 

Checking the version of 06-4f-01, it seems the revision packaged in RHEL 6 is 0xb000025, while the latest revision officially released by Intel, in the most recent microcode package from 2018-01-08, is 0xb000021.

So I understand some pre-production version has been included and shipped to enterprise customers? 
Debian, Gentoo and others still ship 0xb000021 (as released officially by Intel)...

Comment 15 Ion Badulescu 2018-01-12 18:46:23 UTC
Guidance we've received from Intel directly suggests that they're aware of problems in the 0xb000025 firmware for Broadwell E/EP (as well as in other firmware revisions for other CPUs) and they recommend delaying the deployment of this firmware into production.

Comment 18 Bill Verzal 2018-01-17 15:54:40 UTC
So we applied this on ~1300 servers last week (a mix of physical and virtual).

What is the impact on ESX based OS instances?