From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

Description of problem:
When you boot the enterprise kernel on the hardware configuration listed below, it causes a kernel panic right after the swap is enabled. The error I receive is:

    CPU 1: Machine Check Exception: 0000000000000007
    Bank 0: f620a00022100800 at 7620a00022100800
    Kernel panic: CPU context corrupt

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Just install the enterprise kernel on the hardware mentioned below and boot. Both the 2.4.2-9 and the 2.4.9-31 enterprise kernels panic with the same error at the same place.

Additional info:
ALL INTEL HARDWARE
Chassis: SR2200
Motherboard: SCB2
Boxed Processors: Dual Pentium III 1266 Tualatin, 512K Cache
RAM: 6 x 1024 ECC DIMMs
RAID: SRCU31
Floppy: 1.44M Standard
Hard Drives: 4 x 36G Seagate Cheetahs, 15K, Ultra 160
CD-ROM: IDE ATAPI
This message means that the hardware is telling the kernel that it's defective. That doesn't 100% mean it is, since some hardware people happen to miswire the pin on the CPU to the 3.3V line; passing "nomce" on the kernel command line will disable the machine check tests for such cases.
I tried your suggestion above and it gave me a slightly different error at the exact same point...

    CPU 1: Machine Check Exception: 0000000000000007
    Bank 0: b620a00022100800 at 3620a00022100800
    Kernel panic: CPU context corrupt

Now I do have two other things to consider...

01) This is ALL boxed genuine Intel hardware... You would think Intel would wire their own server boards and server CPUs correctly... But hey, you would "think" that.

02) Both the 2.4.2-9 and 2.4.9-31 smp kernels work 100%... Also the "plain" kernels work fine.
I very much doubt it's related to wiring (we've only seen the nomce thing needed on Pentium, and it now defaults to off for Pentium). Your box is a PII and the trace you provided is a genuine CPU exception. Your processor reported a real error trap.

    VALID, UNCORRECTABLE, ENABLED, PROCESSOR CONTEXT CORRUPT
    Interconnect/Bus Error
    Local processor initiated request
    Generic Error
    Memory Access

In other words, not only did the CPU initiate a machine check, the processor has recorded a genuine fault.
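For anyone who wants to reproduce the decoding above, here is a minimal Python sketch. It assumes the bank status register layout from Intel's machine-check architecture documentation (flag bits 63 to 57, MCA error code in the low 16 bits, bus/interconnect codes in the form 0000 1PPT RRRR IILL). The function and table names are made up for illustration and are not part of any kernel tool. Under that assumed layout, the first dump also has the overflow and address-valid bits set, which the summary above does not list.

```python
# Sketch only: the bit layout is an assumption taken from Intel's
# machine-check architecture description, not from the kernel source.

# (bit position, short name) for the flag bits of a bank status register
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # misc register valid
    (58, "ADDRV"),  # address register valid
    (57, "PCC"),    # processor context corrupt
]

# (bit position, short name) for the global machine-check status register
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]

PARTICIPATION = {0: "Local processor initiated request",
                 1: "Local processor responded to request",
                 2: "Local processor observed error as third party",
                 3: "Generic"}
REQUEST = {0: "Generic Error", 1: "Generic Read", 2: "Generic Write",
           3: "Data Read", 4: "Data Write", 5: "Instruction Fetch",
           6: "Prefetch", 7: "Eviction", 8: "Snoop"}
MEMORY_IO = {0: "Memory Access", 1: "Reserved", 2: "I/O", 3: "Other Transaction"}


def decode_flags(value, table):
    """Return the names of all flag bits in `table` that are set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_bus_error(status):
    """Decode the low 16 bits as a bus/interconnect error, or return None."""
    code = status & 0xFFFF
    if code & 0xF800 != 0x0800:      # top five bits must be 0000 1
        return None
    return {
        "type": "Interconnect/Bus Error",
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": REQUEST.get((code >> 4) & 0xF, "Unknown"),
        "memory_io": MEMORY_IO[(code >> 2) & 0x3],
        "level": code & 0x3,
    }


if __name__ == "__main__":
    # Values copied from the two panic messages in this report
    for status in (0xF620A00022100800, 0xB620A00022100800):
        print(hex(status), decode_flags(status, STATUS_FLAGS))
        print(decode_bus_error(status))
    print(decode_flags(0x7, MCG_FLAGS))
```

The only difference this sketch finds between the two dumps is the overflow bit; the error code itself is identical with and without "nomce".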
Actually the box is a P-III, not a P-II... Also, it is the new dual server version of the P-III, not just a pair of regular P-IIIs on a generic board... Now my question is: if this is a genuine error, which I have NO doubt it is, why do the smp kernels work fine? Actually, the only reason I switched to the enterprise kernel was so it would use the entire 6 gigs of RAM instead of just the first 4 gigs...
PII/PIII - basically the same thing. All I can tell you is that your CPU hardware itself raised an "I have a fault" signal and that the kernel dump confirms that your system thinks it's faulty. The values I decoded came from the CPU, not from Linux. The trap came from the processor. It thinks there is a fault, and it's normally right in such circumstances.
Okay, I understand the error is coming from the CPU... But that still doesn't explain why the error doesn't show up with the smp kernel. The box has run for days on end 100% fine using the smp kernel...
I suspect only an Intel engineer with an electron microscope could tell you that. It could be something as trivial as the marginal signals being on address lines higher than 30, or only in the 36-bit paging hardware - who knows. This is something you need to take up with your hardware supplier. The main reason for the MCE itself is so that marginal components are caught and replaced before they do real harm to data.

Alan
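To illustrate why the enterprise kernel could exercise hardware that the smp kernel never touches, here is a rough Python sketch of the address arithmetic. It assumes a flat memory map starting at address 0 and ignores any holes reserved for PCI devices, so the real layout on this board may differ. The helper name is made up for illustration.

```python
# Sketch only: assumes RAM is mapped as one flat range starting at 0,
# which ignores PCI holes and any remapping done by the chipset.

GIB = 1 << 30


def address_bits_needed(ram_bytes):
    """Number of address bits needed to reach the last byte of the range."""
    return (ram_bytes - 1).bit_length()


if __name__ == "__main__":
    for gib in (4, 6, 64):
        bits = address_bits_needed(gib * GIB)
        print(f"{gib} GiB needs {bits} address bits")
```

Under that assumption, the first 4 gigs fit in 32 address bits, while reaching all 6 gigs takes a 33rd bit, which only the 36-bit paging path can supply.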
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/