Bug 45612

Summary: kernel-2.4.3-12 does not boot on AST P5/90
Product: [Retired] Red Hat Linux
Reporter: Ed McKenzie <eem12>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED NOTABUG
QA Contact: Brock Organ <borgan>
Severity: high
Priority: high
Version: 7.1
CC: alan
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2001-06-24 22:00:29 UTC

Description Ed McKenzie 2001-06-23 21:32:53 UTC
2.4.3-12 (errata) doesn't make it very far along in the boot process on my
P5/90.  It crashes after printing maybe the first five or so lines of
kernel bootup, and hangs displaying a few screenfuls of what looks like a
protection fault.  I don't have a capture of the error message, but the
error code is 0x0000000D.

Exact error will be attached shortly.  Processor is a Pentium Classic,
90MHz, and it works fine with 2.4.2-2.

Comment 1 Ed McKenzie 2001-06-23 21:41:03 UTC
The error message was ...

CPU#0: Machine Check Exception: 0x  235EEC (type 0x       D).

Correcting my earlier report, the kernel prints about half to two-thirds of a
screen of messages before blowing up (I assume the above fault could happen at
any point, and knowing what the kernel is trying to init would be useful?  It
scrolls by way too fast to see.)
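
For anyone collecting these messages from several machines, the two hex values can be pulled out of the console line mechanically. The following is a minimal userspace sketch, not kernel code; the struct and function names are hypothetical, and it assumes the line has exactly the shape quoted above.

```c
#include <stdio.h>

/* Fields carried by the "Machine Check Exception" console line. */
struct mce_report {
    int cpu;            /* CPU number after "CPU#" */
    unsigned int addr;  /* first hex value in the message */
    unsigned int type;  /* second hex value, after "type" */
};

/*
 * Parse a line such as
 *   CPU#0: Machine Check Exception: 0x  235EEC (type 0x       D).
 * Returns 0 on success, -1 if the line does not match.
 * A space in the format matches any run of whitespace, and %x skips
 * leading blanks, so the space-padded hex fields are read correctly.
 */
int parse_mce_line(const char *line, struct mce_report *out)
{
    int n = sscanf(line, "CPU#%d: Machine Check Exception: 0x%x (type 0x%x)",
                   &out->cpu, &out->addr, &out->type);
    return n == 3 ? 0 : -1;
}
```

Called on the line from this comment, the sketch should fill in cpu 0, address 0x235EEC and type 0xD.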

Comment 2 Arjan van de Ven 2001-06-23 21:52:22 UTC
Machine Check Exception is a hardwarefault...

Comment 3 Ed McKenzie 2001-06-23 22:39:13 UTC
By 'hardwarefault' do you mean broken hardware or a triple fault in software?
This machine has never, to my knowledge, failed to boot any other Linux kernel
in this way.

Comment 4 Alan Cox 2001-06-24 00:01:23 UTC
A machine check exception is raised when the board or processor decides
something bad might be happening. In some cases the machine will run fine
because the threshold on the check seems to be tighter than an actual crash
(which I guess makes sense). This can also include overheat/fan faults but in
your case the fault is

Arjan - I am curious why it then blew up rather than continuing. I would be very
interested to know what a kernel patched in 
	arch/i386/kernel/entry.S

ENTRY(machine_check)
	pushl $0

(remove the pushl $0)

does on this problem box.
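
Some background on the `pushl $0`, as a hedged sketch rather than a statement about what this particular CPU does: on i386 the processor pushes an error code only for certain exceptions, and machine check (vector 18) is not among the architecturally defined ones. Entry stubs for the others conventionally push a dummy zero so every handler sees the same stack layout. The function name below is hypothetical, and the list covers only the exceptions defined for processors of this era.

```c
#include <stdbool.h>

/*
 * Return true if the CPU itself pushes an error code onto the stack
 * for the given i386 exception vector. For every other vector the
 * entry stub has to push a placeholder to keep the frame layout uniform.
 */
bool cpu_pushes_error_code(int vector)
{
    switch (vector) {
    case 8:   /* #DF double fault */
    case 10:  /* #TS invalid TSS */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack fault */
    case 13:  /* #GP general protection */
    case 14:  /* #PF page fault */
    case 17:  /* #AC alignment check */
        return true;
    default:
        return false;  /* includes 18, #MC machine check */
    }
}
```

If that model holds, a machine_check stub without the dummy push would hand the common handler a frame that is one word short.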


Comment 5 Ed McKenzie 2001-06-24 04:35:10 UTC
Removing the push instruction causes the kernel to panic with a traceback that
happily fills the screen.  "Aiee, killing interrupt handler!" seems to be the
favorite dying words for this particular kernel.

With 2.4.3-12 pristine, an actual panic occurs only occasionally.  I compiled
with gcc 2.96-85.

Comment 6 Alan Cox 2001-06-24 12:53:26 UTC
Ok, so the fault not recovering is real. I suspect your CPU is absolutely
borderline if it wasn't showing faults before the machine check. You can
certainly disable the check (mcheck_init in arch/i386/kernel/bluesmoke.c) but I
am not sure that is wise given that it is an integrity check for the system.


Comment 7 Ed McKenzie 2001-06-24 20:42:19 UTC
So, to recap and resolve this report, a CPU machine check was added sometime
circa 2.4.3, and it may crash on borderline machines that formerly booted
anyway, correct?

I also notice there's no explicit check for the AMD K6 in that code.  Are such
processors treated as Pentium-compatible, or do they not support MCE at all?

Comment 8 Alan Cox 2001-06-24 22:00:24 UTC
The arch/i386/kernel/setup.c code only calls the mcheck init for processors it
knows about. Currently that is:

AMD Athlon/Duron (basically Intel compatible)
VIA/Cyrix MIII / VIA C3 (limited features)
Intel (Pentium or higher)
Winchip/WinchipII/WinchipIII (limited features)

The older Cyrix processors and the K5/K6 apparently don't have the functionality.
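
The list above can be modelled as a small decision function. This is only an illustrative sketch of the rule as described in this comment, not the code in setup.c; the enum, struct and function names are hypothetical, and the family numbers assume the usual CPUID convention (5 for Pentium, K5 and K6 class parts, 6 for Athlon and Duron).

```c
#include <stdbool.h>

/* Hypothetical vendor tags; these are not the kernel's own constants. */
enum cpu_vendor {
    VENDOR_INTEL,
    VENDOR_AMD,
    VENDOR_VIA_CYRIX,   /* Cyrix MIII / VIA C3 */
    VENDOR_WINCHIP,
    VENDOR_OLD_CYRIX,
    VENDOR_OTHER
};

struct cpu_id {
    enum cpu_vendor vendor;
    int family;         /* CPUID family number */
};

/* Return true if, per the list above, machine check init would be called. */
bool wants_mcheck_init(const struct cpu_id *c)
{
    switch (c->vendor) {
    case VENDOR_INTEL:
        return c->family >= 5;   /* Pentium or higher */
    case VENDOR_AMD:
        return c->family >= 6;   /* Athlon/Duron, but not K5/K6 */
    case VENDOR_VIA_CYRIX:
    case VENDOR_WINCHIP:
        return true;             /* limited features */
    default:
        return false;            /* older Cyrix and unknown parts */
    }
}
```

Under this model a Pentium Classic (Intel, family 5) gets the check and a K6 (AMD, family 5) does not, which matches the answer given above.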



Comment 9 Mick Mearns 2001-08-27 01:24:53 UTC
Hello list;

I recently picked up an IBM ThinkPad 755CX; this is an older model (1995).

Pentium-75, 40M ram 3.2G.
I also have a "Dock II" docking station. This has a built-in SCSI controller;
I added a 2G Fireball drive and a NEC 24X SCSI CD-ROM.

I ran the on-board diags - all fine, and the floppy-based diags - all fine.

OK - so it works fine under DOS and W95. I installed W98SE to test it - OK,
then formatted back to DOS with CD support.

When I try to install RH7.1 (or Roswell-1) it fails. I tried both CD-ROM (via
autoboot from DOS) and a boot disk.

It gets to the point of "running /sbin/loader", then dies with a black screen:

"CPU#0: Machine Check Exception: 0x 1234 (type 0xD)."

This scrolls forever and you have to power off.

The Dock II has an Adaptec AIC-6360 in it - what is the correct line to use it?
I have tried various combinations of
"linux dd text aha152x=0x340,11,7,1"
and
"linux dd text aic6x60="0x340.11,7,1".
None of which worked.

It uses the "Adaptec 620/6360/6370" driver for DOS.

Both the 7.1 and Roswell CD sets are fine (it does the same thing for RH 7.0
and RH 6.0).

I searched for info on the web and found out how to set up the MWAVE and power
management, but could not find the install info.

NOTE: I tried Tom's rootdisk and it worked fine!

Tom's rootdisk (1.7.218)

output of dmesg

<snip>
Intel Pentium with F0 0F bug - workaround enabled.
alias mapping IDT readonly ...  ... done
Linux version 2.0.37 (root@6M) (gcc version 2.7.2.3) #13 Fri Oct 15
<snip>
scsi : 0 hosts.
scsi : detected total.
<snip>
aha152x: BIOS test: passed, auto configuration: ok, detected 1 controller(s)
aha152x0: vital data: PORTBASE=0x340, IRQ=11, SCSI ID=7, reconnect=enabled, parity=enabled, synchronous=disabled, delay=100, extended translation=disabled
aha152x: trying software interrupt, ok.
scsi0 : Adaptec 152x SCSI driver; $Revision: 1.18 $
scsi : 1 host.
  Vendor: QUANTUM   Model: FIREBALL_TM2110S  Rev: 300N
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
  Vendor: NEC       Model: CD-ROM DRIVE:464  Rev: 1.04
  Type:   CD-ROM                             ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 6, lun 0
SCSI device sda: hdwr sector= 512 bytes. Sectors= 4124736 [2014 MB] [2.0 GB]
 sda: sda1
scsi : 1 host.
  --------------------------------
  contents of /etc/mtab  
  <snip>
  /dev/sda1 /mnt/vfat vfat rw 0 0
  /dev/sr0 /mnt/cdrom iso9660 ro 0 0
  
  ------------------------------
Tom's found the controller, hard drive and CD-ROM. I was able to mount and
move files around (iso9660 and vfat).
So what do I have to do to get RH7+ onto this thing? Any ideas?

Also, does anyone know what the video chipset is? W98 just says 'Digital'.
It's an SVGA 1M 800x600x16bit TFT.
  
  Thank you
    
      Mick
 
Chris Cloiber suggested 'linux nomce'; I will try it later.
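
As a rough illustration of what a boot option like this amounts to, the sketch below checks whether a given word appears on a boot command line. It is a userspace model only, with a hypothetical function name; how the kernel itself registers and handles `nomce` is not shown anywhere in this report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `option` appears in `cmdline` as a whole
 * space-separated word, e.g. "nomce" in "linux dd text nomce".
 */
bool cmdline_has_option(const char *cmdline, const char *option)
{
    size_t len = strlen(option);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, option)) != NULL) {
        bool starts_word = (p == cmdline) || (p[-1] == ' ');
        bool ends_word = (p[len] == '\0') || (p[len] == ' ');
        if (starts_word && ends_word)
            return true;
        p++;  /* keep scanning after a partial match such as "nomcex" */
    }
    return false;
}
```

With this model, "linux nomce" contains the option while "linux dd text" does not.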



Comment 10 Ed McKenzie 2001-11-03 16:02:54 UTC
This bug is obsoleted by bug 55097 and the errata 2.4.9-13 kernel.