Bug 159863

Summary: System crash with microcode update
Product: Red Hat Enterprise Linux 3
Component: kernel-utils
Version: 3.0
Hardware: i386
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Richard Henderson <rth>
Assignee: Geoff Gustafson <grgustaf>
QA Contact: Brian Brock <bbrock>
CC: davej, gordon.jin, nitin.a.kamble, suresh.b.siddha
Fixed In Version: RHBA-2006-0014
Doc Type: Bug Fix
Last Closed: 2006-03-15 15:40:51 UTC
Bug Blocks: 168424

Description Richard Henderson 2005-06-08 18:01:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.4-1.3.1 Firefox/1.0.4

Description of problem:
After upgrading from kernel-utils-2.4-8.37.7, when the microcode update is
applied, the system will panic inside the microcode driver.  This happens
with both kernel-2.4.21-32.0.1.ELsmp (updated at the same time) *and* with 
the previous kernel-2.4.21-27.0.4.ELsmp.

The system is an IBM xSeries with 4 hyperthreaded processors (8 logical CPUs):

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Xeon(TM) CPU 1.60GHz
stepping        : 1


Version-Release number of selected component (if applicable):
kernel-utils-2.4-8.37.12

How reproducible:
Always

Steps to Reproduce:
Allow the microcode_ctl service to run.

Additional info:

Comment 1 Dave Jones 2005-06-09 02:07:23 UTC
Geoff, any ideas ?

Comment 2 Gordon Jin 2005-06-11 03:30:14 UTC
Richard, could you try with UP kernel? Thanks.

Comment 3 Richard Henderson 2005-06-12 19:46:16 UTC
Yes, it works with a UP kernel.  I wrote down (most of) the oops message from 
the -32 smp kernel.  It looked a bit cleaner than the oops from the -27 kernel.

UP ----------------------------------------------------------------------

CPU0 updated from revision 0x5 to 0xa, date 07292003

SMP ---------------------------------------------------------------------

Unable to handle kernel paging request at virtual address ffffff89
*pde = 0
oops : 0
microcode: cpu7 updated ...
microcode: cpu1 updated ...
cpu: 4
eip: 0060:[<f89cf7cf>]
eflags: 00010096
eip is at do_update_one [microcode] 0x5f (2.4.21-32.0.1.ELsmp/i686)
eax: 6   ebx: 1   ecx: f89cf770    edx: 0
esi: f89d4000  edi: 4  ebp: f89d0860  esp: c4c97f44
ds, es, ss: 0068
process swapper (pid 0, stackpage: c4c97000)
stack: 000014f2 c0441a80 00000004 0f000000 00000000 00000086 00000001 c4c96000
       c4c96000 c011ca50 c011ca7f 00000000 00001f7c c03f6caa (got bored here)
call trace: c011ca50 smp_call_function_interrupt
            c011ca7f "
            c0109100 default_idle
            (obviously top-of-stack)
code: 0f 88 86 04 00 00 86 45 14 b9 79 00 00 00 31 d2 83 c0 30 0f
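
The code bytes above appear to include the register setup for an MSR write, which fits the later observation in this thread that the driver writes MSRs directly. As a rough sketch (not a disassembler), the Python below scans the transcribed bytes for the x86 opcode 0xB9 (mov ecx, imm32) and reports the immediate. The helper name is made up for illustration, the bytes were copied by hand and may contain errors, and a plain byte scan can misfire on operand bytes in other code.

```python
# Code bytes as transcribed in the oops above (hand-copied, so possibly
# inexact).
CODE = "0f 88 86 04 00 00 86 45 14 b9 79 00 00 00 31 d2 83 c0 30 0f"


def find_mov_ecx_imm32(code_hex):
    """Return every 32-bit immediate that follows a 0xB9 byte.

    0xB9 is the x86 opcode for `mov ecx, imm32`. wrmsr takes the MSR
    number in ecx, so these immediates are candidate MSR numbers.
    This is a naive scan, not a real instruction decoder.
    """
    data = bytes.fromhex(code_hex.replace(" ", ""))
    found = []
    for i in range(len(data) - 4):
        if data[i] == 0xB9:
            found.append(int.from_bytes(data[i + 1:i + 5], "little"))
    return found


if __name__ == "__main__":
    print([hex(v) for v in find_mov_ecx_imm32(CODE)])
```

On these bytes the scan finds a single value, 0x79, which is the MSR Intel documents as the microcode update trigger (IA32_BIOS_UPDT_TRIG). The bytes that follow (31 d2, 83 c0 30, 0f) decode as xor edx,edx and add eax,0x30, followed by the first byte of a two-byte opcode that is cut off in the dump.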



Comment 4 Richard Henderson 2005-06-12 20:14:33 UTC
One last thing: I reverted to the microcode data file from kernel-utils-2.4-8.37.7
and discovered that the reason it "worked" is that it did nothing.  All cpus
report "no suitable data found".
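
To illustrate why an older data file can "work" by doing nothing, here is a simplified Python model of update selection. It is a sketch under stated assumptions, not the driver's actual code: the Update tuple, the select_update name, and the platform-flag value 0x04 are hypothetical. Only the revisions (0x5 to 0xa), the date 07292003, and the f11/f22 signatures come from this report.

```python
from collections import namedtuple

# Hypothetical, simplified view of one entry in a microcode data file.
# sig: processor signature, pf: platform-flag mask, rev: update revision.
Update = namedtuple("Update", "sig pf rev date")


def select_update(updates, cpu_sig, cpu_pf, cpu_rev):
    """Pick the newest entry that matches this CPU, or None.

    An entry is a candidate only if its signature equals the CPU's,
    its platform flags overlap the CPU's, and its revision is newer
    than the one already loaded.
    """
    best = None
    for u in updates:
        if u.sig != cpu_sig:
            continue
        if not (u.pf & cpu_pf):
            continue
        if u.rev <= cpu_rev:
            continue
        if best is None or u.rev > best.rev:
            best = u
    return best


# Illustrative data files: the "old" one has no entry for signature 0xf11.
old_file = [Update(sig=0xF22, pf=0x04, rev=0x10, date="01012003")]
new_file = old_file + [Update(sig=0xF11, pf=0x04, rev=0xA, date="07292003")]

if __name__ == "__main__":
    print(select_update(old_file, 0xF11, 0x04, 0x5))  # no suitable data
    print(select_update(new_file, 0xF11, 0x04, 0x5))
```

In this model, a file with no matching entry returns None for every CPU. That corresponds to the "no suitable data found" message: no update is attempted, so the crashing code path is never reached.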

Comment 5 Gordon Jin 2005-06-21 01:38:10 UTC
Updating the BIOS should resolve the problem.

Comment 6 Richard Henderson 2005-06-21 04:24:35 UTC
Yes, I expect so, but in a trivial way -- the new BIOS will contain a microcode
update, and so there will be nothing to do once booted.  But the kernel bug
will remain unfixed.

The kernel does not use the BIOS in order to write the microcode, as far as I 
can see.  I see the kernel directly writing to MSRs to perform the update.  So
passing this off to the BIOS shouldn't be considered a viable resolution.  If
we do that, we should simply stop pretending to ship updatable microcode at all.

I suppose I can try a recent 2.6 kernel on this machine and see if things are 
working there, and if so leave it at that.  Not til next week though...

Comment 7 Gordon Jin 2005-06-21 04:43:49 UTC
Yes, I admit updating the BIOS is just a workaround until the bug is fixed.

This problem happens on a specific stepping only.

We're debugging this problem and will update Red Hat once it's fixed.

Until it's fixed, we recommend that customers using the affected stepping
update their BIOS to avoid the problem.


Comment 8 Gordon Jin 2005-08-11 00:48:36 UTC
Workaround microcode update data for EL3-U6 is posted at Intel issue tracker
76581:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=76581

It removes the microcode for steppings f11/f22, so the system hang will not
happen on f11/f22 if EL3-U6 includes that workaround microcode.
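
For readers matching "f11" against the /proc/cpuinfo output in the description (cpu family 15, model 1, stepping 1), the sketch below packs those three fields into a CPUID-style processor signature. It assumes the standard Intel layout of the CPUID leaf 1 EAX value with the processor-type field left at zero. The function name is hypothetical.

```python
def cpuid_signature(family, model, stepping):
    """Pack /proc/cpuinfo-style family/model/stepping into a signature.

    Layout (processor type bits 12-13 assumed zero):
      bits 20-27 extended family, bits 16-19 extended model,
      bits 8-11 base family, bits 4-7 base model, bits 0-3 stepping.
    """
    if family >= 15:
        base_family, ext_family = 15, family - 15
    else:
        base_family, ext_family = family, 0
    base_model = model & 0xF
    ext_model = (model >> 4) & 0xF
    return ((ext_family << 20) | (ext_model << 16) |
            (base_family << 8) | (base_model << 4) | (stepping & 0xF))


if __name__ == "__main__":
    # Values from the reporter's /proc/cpuinfo output.
    print(hex(cpuid_signature(15, 1, 1)))
```

With the reporter's values this yields 0xf11, and family 15, model 2, stepping 2 would yield 0xf22.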

Comment 9 Dave Jones 2005-09-23 21:31:54 UTC
I notice there's a new upstream microcode.dat drop (1.12).
Does that also fix this problem ?


Comment 10 Nitin Kamble 2005-09-23 22:08:12 UTC
The new microcode drop was also intended to fix this issue.

Comment 11 Dave Jones 2005-09-23 22:22:27 UTC
Ok, I'm going to dupe this across to 165987, and kill two birds with one stone
in the next update.


*** This bug has been marked as a duplicate of 165987 ***

Comment 13 Red Hat Bugzilla 2006-03-15 15:40:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0014.html