Bug 159863
Summary: | System crash with microcode update | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Richard Henderson <rth> |
Component: | kernel-utils | Assignee: | Geoff Gustafson <grgustaf> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Priority: | medium |
Version: | 3.0 | CC: | davej, gordon.jin, nitin.a.kamble, suresh.b.siddha |
Hardware: | i386 | OS: | Linux |
Fixed In Version: | RHBA-2006-0014 | Doc Type: | Bug Fix |
Last Closed: | 2006-03-15 15:40:51 UTC | Bug Blocks: | 168424 |
Description
Richard Henderson
2005-06-08 18:01:21 UTC
Geoff, any ideas?

Richard, could you try with a UP kernel? Thanks.

Yes, it works with a UP kernel. I wrote down (most of) the oops message from the -32 smp kernel. It looked a bit cleaner than the oops from the -27 kernel.

UP
----------------------------------------------------------------------
CPU0 updated from revision 0x5 to 0xa, date 07292003

SMP
----------------------------------------------------------------------
Unable to handle kernel paging request at virtual address ffffff89
*pde = 0
oops : 0
microcode: cpu7 updated ...
microcode: cpu1 updated ...
cpu: 4
eip: 0060:[<f89cf7cf>]
eflags: 00010096
eip is at do_update_one [microcode] 0x5f (2.4.21-32.0.1.ELsmp/i686)
eax: 6 ebx: 1 ecx: f89cf770 edx: 0
esi: f89d4000 edi: 4 ebp: f89d0860 esp: c4c97f44
ds, es, ss: 0068
process swapper (pid 0, stackpage: c4c97000)
stack: 000014f2 c0441a80 00000004 0f000000 00000000 00000086 00000001 c4c96000 c4c96000 c011ca50 c011ca7f 00000000 00001f7c c03f6caa (got bored here)
call trace:
c011ca50 smp_call_function_interrupt
c011ca7f "
c0109100 default_idle (obviously top-of-stack)
code: 0f 88 86 04 00 00 86 45 14 b9 79 00 00 00 31 d2 83 c0 30 0f

One last thing: I reverted to the microcode data file from kernel-utils-2.4-8.37.7 and discovered that the reason it "worked" is that it did nothing. All cpus report "no suitable data found".

Updating the BIOS should resolve the problem.

Yes, I expect so, but in a trivial way -- the new BIOS will contain a microcode update, and so there will be nothing to do once booted. But the kernel bug will remain unfixed.

The kernel does not use the BIOS in order to write the microcode, as far as I can see. I see the kernel directly writing to MSRs to perform the update. So passing this off to the BIOS shouldn't be considered a viable resolution. If we do that, we should simply stop pretending to ship updatable microcode at all.

I suppose I can try a recent 2.6 kernel on this machine and see if things are working there, and if so leave it at that. Not until next week though...

Yes, I admit updating the BIOS is just a workaround until the bug is fixed. This problem happens on a specific stepping only. We are debugging this problem and will update Red Hat once it's fixed. Until then, we recommend that customers with the affected stepping update the BIOS to avoid the problem.

Workaround microcode update data for EL3-U6 is posted at Intel issue tracker 76581:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=76581
It removes the microcode for stepping f11/f22, so this system hang will not happen on f11/f22 if EL3-U6 includes that workaround microcode.

I notice there's a new upstream microcode.dat drop (1.12). Does that also fix this problem?

The new microcode drop was also intended to fix this issue.

Ok, I'm going to dupe this across to 165987, and kill two birds with one stone in the next update.

*** This bug has been marked as a duplicate of 165987 ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0014.html