Bug 159863 - System crash with microcode update
Summary: System crash with microcode update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel-utils
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Geoff Gustafson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168424
Reported: 2005-06-08 18:01 UTC by Richard Henderson
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHBA-2006-0014
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-15 15:40:51 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2006:0014
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: kernel-utils bugfix update
Last Updated: 2006-03-14 05:00:00 UTC

Description Richard Henderson 2005-06-08 18:01:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.4-1.3.1 Firefox/1.0.4

Description of problem:
After upgrading from kernel-utils-2.4-8.37.7, when the microcode update is
applied, the system will panic inside the microcode driver.  This happens
with both kernel-2.4.21-32.0.1.ELsmp (updated at the same time) *and* with 
the previous kernel-2.4.21-27.0.4.ELsmp.

The system is an IBM xSeries with 4 hyperthreaded processors:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Xeon(TM) CPU 1.60GHz
stepping        : 1


Version-Release number of selected component (if applicable):
kernel-utils-2.4-8.37.12

How reproducible:
Always

Steps to Reproduce:
Allow the microcode_ctl service to run.

Additional info:

Comment 1 Dave Jones 2005-06-09 02:07:23 UTC
Geoff, any ideas?

Comment 2 Gordon Jin 2005-06-11 03:30:14 UTC
Richard, could you try with a UP kernel? Thanks.

Comment 3 Richard Henderson 2005-06-12 19:46:16 UTC
Yes, it works with a UP kernel.  I wrote down (most of) the oops message from 
the -32 smp kernel.  It looked a bit cleaner than the oops from the -27 kernel.

UP ----------------------------------------------------------------------

CPU0 updated from revision 0x5 to 0xa, date 07292003

SMP ---------------------------------------------------------------------

Unable to handle kernel paging request at virtual address ffffff89
*pde = 0
oops : 0
microcode: cpu7 updated ...
microcode: cpu1 updated ...
cpu: 4
eip: 0060:[<f89cf7cf>]
eflags: 00010096
eip is at do_update_one [microcode] 0x5f (2.4.21-32.0.1.ELsmp/i686)
eax: 6   ebx: 1   ecx: f89cf770    edx: 0
esi: f89d4000  edi: 4  ebp: f89d0860  esp: c4c97f44
ds, es, ss: 0068
process swapper (pid 0, stackpage: c4c97000)
stack: 000014f2 c0441a80 00000004 0f000000 00000000 00000086 00000001 c4c96000
       c4c96000 c011ca50 c011ca7f 00000000 00001f7c c03f6caa (got bored here)
call trace: c011ca50 smp_call_function_interrupt
            c011ca7f "
            c0109100 default_idle
            (obviously top-of-stack)
code: 0f 88 86 04 00 00 86 45 14 b9 79 00 00 00 31 d2 83 c0 30 0f
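
The "code:" bytes above can be checked for the MSR number being loaded. What follows is only a rough sketch in Python, not a disassembly: it scans for the `b9 imm32` encoding of `mov ecx, imm32` and prints the immediate. The function name is made up for illustration, and the bytes were transcribed by hand, so they may contain errors. The value found, 0x79, is the MSR that Intel documents as the microcode update trigger (IA32_BIOS_UPDT_TRIG), and `wrmsr` takes its MSR index in ecx.

```python
# Bytes from the "code:" line of the oops (hand-transcribed, may contain errors).
CODE_LINE = "0f 88 86 04 00 00 86 45 14 b9 79 00 00 00 31 d2 83 c0 30 0f"

MOV_ECX_IMM32 = 0xB9  # x86 opcode byte for: mov ecx, imm32


def find_mov_ecx_immediates(code_line):
    """Return the 32-bit immediates of every `b9 imm32` pattern found.

    This is a naive byte scan, not a disassembler: a 0xb9 byte that is
    really part of another instruction would be reported as well.
    """
    data = bytes(int(tok, 16) for tok in code_line.split())
    found = []
    for i, byte in enumerate(data):
        # Need four more bytes after the opcode for the little-endian immediate.
        if byte == MOV_ECX_IMM32 and i + 5 <= len(data):
            found.append(int.from_bytes(data[i + 1:i + 5], "little"))
    return found


for imm in find_mov_ecx_immediates(CODE_LINE):
    print(hex(imm))  # prints 0x79
```

The scan says nothing about which instruction actually faulted. It only shows that the code near the reported eip loads ecx with the microcode update MSR number.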



Comment 4 Richard Henderson 2005-06-12 20:14:33 UTC
One last thing: I reverted to the microcode data file from kernel-utils-2.4-8.37.7
and discovered that the reason it "worked" is that it did nothing.  All cpus
report "no suitable data found".

Comment 5 Gordon Jin 2005-06-21 01:38:10 UTC
Updating the BIOS should resolve the problem.

Comment 6 Richard Henderson 2005-06-21 04:24:35 UTC
Yes, I expect so, but in a trivial way -- the new BIOS will contain a microcode
update, and so there will be nothing to do once booted.  But the kernel bug
will remain unfixed.

The kernel does not use the BIOS in order to write the microcode, as far as I 
can see.  I see the kernel directly writing to MSRs to perform the update.  So
passing this off to the BIOS shouldn't be considered a viable resolution.  If
we do that, we should simply stop pretending to ship updatable microcode at all.

I suppose I can try a recent 2.6 kernel on this machine and see if things are 
working there, and if so leave it at that.  Not til next week though...

Comment 7 Gordon Jin 2005-06-21 04:43:49 UTC
Yes, I admit that updating the BIOS is just a workaround until the bug is fixed.

This problem happens only on a specific stepping.

We are debugging this problem and will update Red Hat once it is fixed.

Until then, we recommend that customers using the affected stepping update
their BIOS to avoid the problem.


Comment 8 Gordon Jin 2005-08-11 00:48:36 UTC
Workaround microcode update data for EL3-U6 has been posted to Intel issue
tracker 76581:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=76581

It removes the microcode for stepping f11/f22 so this system hang will not 
happen on f11/f22 if EL3-U6 includes that workaround microcode.
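
The "f11" label appears to correspond to the family/model/stepping values in the reporter's /proc/cpuinfo output (cpu family 15, model 1, stepping 1). A minimal sketch of that mapping in Python follows, assuming the label is the low 12 bits of the CPUID signature and that the extended family and model fields are zero. The function name is hypothetical.

```python
def cpuid_signature(family, model, stepping):
    """Pack base family/model/stepping into the low 12 bits of a CPUID signature.

    Assumption: the extended family and extended model fields are zero, so
    each value must fit in 4 bits.
    """
    for name, value in (("family", family), ("model", model), ("stepping", stepping)):
        if not 0 <= value <= 0xF:
            raise ValueError(f"{name} out of range for this simplified encoding: {value}")
    return (family << 8) | (model << 4) | stepping


# Values from the reporter's /proc/cpuinfo: cpu family 15, model 1, stepping 1.
print(format(cpuid_signature(15, 1, 1), "x"))  # prints f11
```

Under the same assumption, "f22" would be family 15, model 2, stepping 2.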

Comment 9 Dave Jones 2005-09-23 21:31:54 UTC
I notice there's a new upstream microcode.dat drop (1.12).
Does that also fix this problem?


Comment 10 Nitin Kamble 2005-09-23 22:08:12 UTC
The new microcode drop was also intended to fix this issue.

Comment 11 Dave Jones 2005-09-23 22:22:27 UTC
Ok, I'm going to dupe this across to 165987, and kill two birds with one stone
in the next update.


*** This bug has been marked as a duplicate of 165987 ***

Comment 13 Red Hat Bugzilla 2006-03-15 15:40:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0014.html


