61187 – Enterprise Kernel - Panic On Boot

Summary: Enterprise Kernel - Panic On Boot

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	i686
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:	www.none.com
Whiteboard:
Depends On:
Blocks:

Reported:	2002-03-15 01:39 UTC by John C. Beima
Modified:	2008-08-01 16:22 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:26 UTC
Embargoed:

Description John C. Beima 2002-03-15 01:39:17 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

Description of problem:
When you boot the enterprise kernel on the hardware configuration listed below 
it causes a kernel panic right after the swap is enabled.

The error I receive is:

CPU 1: Machine Check Exception: 0000000000000007
Bank 0: f620a00022100800 at 7620a00022100800
Kernel panic: CPU context corrupt

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Just install the enterprise kernel on the hardware mentioned below and boot.
Both the 2.4.2-9 and the 2.4.9-31 enterprise kernels panic with the same error 
at the same place.


Additional info:

ALL INTEL HARDWARE

Chassis: SR2200
Motherboard: SCB2 Boxed
Processors: Dual Pentium III 1266 Tualatin 512K Cache
RAM: 6 x 1024 ECC DIMMs
RAID: SRCU31
Floppy: 1.44M Standard
Hard Drives: 4 x 36G Seagate Cheetahs 15K Ultra 160
CD-ROM: IDE ATAPI

Comment 1 Arjan van de Ven 2002-03-15 09:34:58 UTC

This message means that the hardware is telling the kernel that it's defective.
That doesn't 100% mean it is, since some hardware people happen to miswire the
pin on the CPU to the 3.3V line; passing "nomce" on the kernel command line will
disable the machine check tests for such cases.

Comment 2 John C. Beima 2002-03-15 16:53:49 UTC

I tried your suggestion above and it gave me a slightly different error at the
exact same point...

CPU 1: Machine Check Exception: 0000000000000007
Bank 0: b620a00022100800 at 3620a00022100800
Kernel panic: CPU context corrupt

Now I do have two other things to consider....

01) This is ALL boxed genuine Intel hardware... You would think Intel would wire
their own server boards and server CPUs correctly... But hey, you would "think"
that.
02) Both the 2.4.2-9 and 2.4.9-31 smp kernels work 100%... Also the "plain"
kernels work fine.

Comment 3 Alan Cox 2002-03-15 17:23:46 UTC

I very much doubt it's related to wiring (we've only seen the nomce thing needed
on Pentium and now default to off for Pentium). Your box is a PII and the trace
you provided is a genuine CPU exception.

Your processor reported a real error trap.

VALID, UNCORRECTABLE, ENABLED, PROCESSOR CONTEXT CORRUPT

Interconnect/Bus Error
Local processor initiated request
Generic Error
Memory Access

In other words not only did the CPU initiate a machine check, the processor
has recorded a genuine fault.
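
The decode above can be reproduced by hand from the "Bank 0" status value. The following is only a sketch, not the kernel's own decoder: it assumes the flag bit positions and the bus/interconnect error code layout (0000 1PPT RRRR IILL) documented for the machine check status registers in Intel's architecture manuals, and the table and function names are made up for this illustration. Run against both reported values, it also shows that they differ only in the overflow bit.

```python
# Flag bits in the top byte of a machine check bank status value
# (assumed layout, taken from Intel's MCA documentation).
STATUS_FLAGS = {
    63: "VALID",
    62: "OVERFLOW",
    61: "UNCORRECTED",
    60: "ENABLED",
    59: "MISC REGISTER VALID",
    58: "ADDRESS REGISTER VALID",
    57: "PROCESSOR CONTEXT CORRUPT",
}

# Sub-fields of a bus/interconnect error code: 0000 1PPT RRRR IILL
PARTICIPATION = {
    0: "Local processor originated request",
    1: "Local processor responded to request",
    2: "Local processor observed error as third party",
    3: "Generic",
}
REQUEST = {
    0: "Generic error", 1: "Generic read", 2: "Generic write",
    3: "Data read", 4: "Data write", 5: "Instruction fetch",
    6: "Prefetch", 7: "Eviction", 8: "Snoop",
}
MEM_IO = {0: "Memory access", 1: "Reserved", 2: "I/O", 3: "Other transaction"}


def decode_status(status):
    """Split a 64-bit bank status value into flags and error code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Bus/interconnect errors have bit 11 set and bits 15:12 clear.
    if (code & 0xF800) == 0x0800:
        result["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST.get((code >> 4) & 0xF, "Unknown"),
            "memory_or_io": MEM_IO[(code >> 2) & 0x3],
            "cache_level": code & 0x3,
        }
    return result


first = 0xF620A00022100800   # status from the original report
second = 0xB620A00022100800  # status reported in comment 2

for value in (first, second):
    print(hex(value), decode_status(value))

# The two values differ in a single bit, the overflow flag (bit 62).
print(hex(first ^ second))
```

Under these assumptions the low 16 bits (0x0800) decode to a bus/interconnect error with a locally originated request, a generic error type, and a memory access, which matches the fields listed in this comment.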

Comment 4 John C. Beima 2002-03-15 17:32:06 UTC

Actually the box is a P-III not a P-II... Also it is the new dual server version
of the P-III. Not just a pair of regular P-III on a generic board...

Now my question is, if this is a genuine error, which I have NO doubt it is, why
do the smp kernels work fine?

Actually the only reason I switched to the enterprise kernel was so it would use
the entire 6 GB of RAM instead of just the first 4 GB...

Comment 5 Alan Cox 2002-03-15 20:37:19 UTC

PII/PIII - basically the same thing. All I can tell you is that your CPU
hardware itself raised an "I have a fault" signal and that the kernel dump
confirms that your system thinks it's faulty.

The values I decoded came from the CPU, not from Linux. The trap came from
the processor. It thinks there is a fault, and it's normally right in such
circumstances.

Comment 6 John C. Beima 2002-03-16 01:27:18 UTC

Okay I understand the error is coming from the CPU...

But that still doesn't explain why the error doesn't show up with the smp kernel?

The box has run for days on end 100% fine using the smp kernel...

Comment 7 Alan Cox 2002-03-16 01:59:13 UTC

I suspect only an Intel engineer with an electron microscope could tell you
that. It could be something as trivial as the marginal signals being on address
lines higher than 30, or only in the 36-bit paging hardware - who knows.

This is something you need to take up with your hardware supplier. The main
reason for the MCE itself is so that marginal components are caught and replaced
before they do real harm to data.

Alan
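
The point about high address lines can be made concrete with a little arithmetic. This is a rough sketch that assumes RAM is mapped contiguously from physical address 0 (real chipsets leave holes and remap memory, so the exact figures on this board may differ) and that the smp kernel only addresses the first 4 GB, as comment 4 implies. The helper name is invented for this example.

```python
GIB = 1 << 30  # bytes in one gibibyte


def address_bits_needed(ram_bytes):
    """Physical address bits needed to reach the last byte of RAM,
    assuming RAM starts at address 0 with no holes."""
    return (ram_bytes - 1).bit_length()


print(address_bits_needed(4 * GIB))  # bits needed for the first 4 GB
print(address_bits_needed(6 * GIB))  # bits needed for all 6 GB
print((1 << 36) // GIB)              # GB reachable with 36-bit addressing
```

Under those assumptions, a kernel limited to 4 GB needs only 32 address bits, while reaching all 6 GB needs a 33rd, which requires the 36-bit paging mode. A fault confined to that extra address line or to the 36-bit paging path would therefore show up only under the enterprise kernel.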

Comment 8 Bugzilla owner 2004-09-30 15:39:26 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
