52341 – Single processor Big Sur aborts during 7.2 release candidate boot

Summary: Single processor Big Sur aborts during 7.2 release candidate boot

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	ia64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jason Baron
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-08-22 22:26 UTC by Steven Cook
Modified:	2013-03-06 05:55 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-02-03 20:17:32 UTC
Embargoed:

Attachments	(Terms of Use)
ia64 machine check log (24.96 KB, text/plain) 2001-11-06 15:54 UTC, Bill Nottingham	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2002:104	0	high	SHIPPED_LIVE	Several bugs fixed in new kernel	2002-05-30 04:00:00 UTC

Description Steven Cook 2001-08-22 22:26:25 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801

Description of problem:
The system crashes into a hard lock, logging a fatal event into the BIOS
Event Log, during OS boot of Roswell on UP Big Sur systems.  This lock
drops the video signal and activates the HDD (the HDD activity light is
constantly on and the drive is audibly spinning/reading).  The only way to
escape is a power button override.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install Roswell (everything install) on a UP Big Sur with BIOS 117A(RC8)
or 117C (P9)
2. Reboot at the end of install, when prompted.
3. Notice the system begin booting, running through boot services, and then
lose video signal and keyboard functionality.
	

Actual Results:  At some point between the start of "Starting Red Hat
Linux" and approximately 5 minutes after boot, the system logs a fatal
error to the Event Log in BIOS, drops video signal and keyboard
functionality, and "hard-locks", requiring a power button override to power
the system down.

Expected Results:  Linux should boot normally on a UP system (no issues
with the test systems under Seawolf; no issues with DP systems - Seawolf or
Roswell)

Additional info:

tested with DP in the failing system, and failure did not occur.
tested with DP in other test systems, and failure did not occur.
tested with UP in the failing system, and failure occurred repeatably
tested with UP in other test systems, and failure occurred repeatably


1GB RAM, Adaptec 39160/QLogic 12160 adapters (1 each configuration), 18GB
SCSI HDD, IDE CDROM, LS120, C0/733 & C0/800 processors

Comment 1 Arjan van de Ven 2001-08-22 22:31:12 UTC

I assume this is 2.4.6-3.1; we've fixed this for 2.4.7-2 which is in "roswell2"

Comment 2 Steven Cook 2001-08-22 23:16:17 UTC

apologies for not being more clear with the submission -- with only one
"roswell" entry, I thought things were being tracked on the _current_ roswell.

This entry was logged against roswell2 (7.1.94 w/ kernel 2.4.7-2smp and
2.4.7-2), although it was also observed under roswell1.  I will update other
roswell2 entries to clarify this.

Comment 3 Bill Nottingham 2001-08-23 02:49:39 UTC

I saw a large number of machine checks with firmware 114; enough so that I
downgraded back to 103E. Could 117 have the same problem?

Comment 4 Steven Cook 2001-08-23 17:19:52 UTC

Investigating the BIOS aspect.  This issue is not seen with Seawolf (7.1) on the
same BIOS revisions, so I doubt that this is solely a BIOS issue.

Comment 5 Glen Foster 2001-08-23 21:49:18 UTC

We (Red Hat) really need to fix this before next release.

Comment 6 Steven Cook 2001-09-13 19:18:06 UTC

Note that Seawolf does not exhibit this problem on BIOS 117C, while both Roswell
& Roswell2 do.

Comment 7 Steven Cook 2001-10-04 18:55:42 UTC

"enigma" -- 7.2 rc2 does not exhibit this in the 1 case of CD-ROM based install
I have tested.  I will verify on other install types, then close this entry.

Comment 8 Steven Cook 2001-10-11 19:08:19 UTC

I have unfortunately just reproduced the failure on the Enigma install that I
had previously reported inability to reproduce.  Failure:Success count currently
stands at 4:1 on the single UP system under test.

Comment 9 Arjan van de Ven 2001-10-16 08:30:54 UTC

what version is that ? Enigma ia64 isn't due for a while .....

Comment 10 Steven Cook 2001-10-18 19:34:33 UTC

I have reproduced this failure on the distribution with the following identifiers:

file location: ftp://ftp.beta.redhat.com/pub/pensacola/rc2/iso/rc2
file names:
rc2-ia64-disc1.iso   678,406,144   9/28/2001
rc2-ia64-disc2.iso   621,152,256   9/28/2001
rc2-ia64-disc3.iso   680,402,944   9/28/2001
rc2-ia64-disc4.iso   409,448,448   9/28/2001

Announcement forwarded by Pat Rago on 02-October-2001 as "Red Hat Linux 7.2 RC-2
(Pensacola)"

The installed version of these ISOs reports (Enigma) on the login screen.

So I entered the issue as being against "Enigma", because that was what I saw on
the login screen.  Perhaps I should have entered it as "Pensacola" -- but when
two reputable sources give two different answers to the "What do I call this
thing?" question, it gets a bit difficult to make things clear without this
extra explanation.  Apologies for any confusion.

The failure has been seen on both UP and MP kernels, using install type of: CD,
NFS, FTP, HTTP.  Process used was:  Install OS, allow OS to reboot at end of
installation, and automatically boot the default kernel.  Upon finding failure
(100% reproducibility), hard power off (4-5 second power switch hold) then power
back on, and boot the second (either MP or UP, whichever was not booted by
default) kernel at the ELILO prompt.  The failure also occurs with this kernel.
Hoping this clarifies the issue a little,
--steve--

Comment 11 Steven Cook 2001-11-03 01:57:42 UTC

This issue was reproduced again, using the ia64 RC3 distribution.  It has
reproduced on two test systems.

Comment 12 Bill Nottingham 2001-11-06 15:53:28 UTC

Attaching the machine check log. We need someone at Intel to decode this.

Comment 13 Bill Nottingham 2001-11-06 15:54:16 UTC

Created attachment 36621 [details]
ia64 machine check log

Comment 14 Steven Cook 2001-11-07 18:52:50 UTC

Translation from BIOS developer:

This is an AGP bus abort: FERR_PCI Non-config Master Abort.  It percolates up
through the F16 and into the SAC as a hard abort.

The only other bit of data he was able to glean from the dump attached in
previous note was that the Class Code is three bytes, rather than two, which is
setting up a one-byte offset on the data that is being decoded (Seg/Bus/Dev/Func
is off by one byte).

Question from the developer -- what produced this log file?

Looking into the feasibility of sending an error log dump tool (I believe it is
an EFI-based utility, but not sure).  Will update with another note if we can
distribute this tool.

Comment 15 Bill Nottingham 2001-11-08 20:16:15 UTC

The kernel produced the log file.

Arjan: how are console fonts loaded to the video memory? That's the *only*
thing that would be accessing the video card at the points where this appears.

Comment 16 Arjan van de Ven 2001-11-09 11:15:50 UTC

The console fonts are loaded by putting stuff in video ram and then outb'ing a
few commands to io ports of the vga card. scary code if you ask me.

Comment 17 Steven Cook 2001-11-15 20:34:23 UTC

Changed Summary: "Roswell" changed to "7.2 release candidate"

Comment 18 Michael K. Johnson 2001-12-05 20:45:04 UTC

Ben reports that he cannot reproduce this on his machine with 117C firmware
and AGP video.

Comment 19 Bill Nottingham 2001-12-05 22:13:56 UTC

Of note is that I can reproduce this *only* on initial boot (i.e., the call to
setsysfont in rc.sysinit.) I can't seem to reproduce it after boot.

Comment 20 Bill Nottingham 2001-12-05 22:21:48 UTC

(or before boot (i.e., init=/bin/bash), for that matter.)

Comment 21 Bill Nottingham 2001-12-06 17:55:46 UTC

Intel: does this occur on video adapters *other* than rage128 AGP?

Comment 22 Ben LaHaise 2001-12-06 18:19:14 UTC

I tested on an AGP Rage 128 and didn't experience the problem.

Comment 23 Steven Cook 2001-12-06 19:31:50 UTC

Failure exhibits repeatably on my box.

Configuration is:
UP C0 733
1GB RAM (8x128)
Adaptec 39160 SCSI controller
BIOS 103E+
USB Kb/Mouse
Quantum Atlas 10K2 SCSI HDD
LiteOn IDE CD-ROM
Matrox G450 AGP graphics controller

I also received an email from another Intel engineer today, asking for help
getting his system up and running, because he hasn't been able to boot Red Hat
7.2 betas/RCs since he decided to start using Red Hat on his systems.  Will
update with another note when I find out his system configuration, if there is
anything different (his system is also UP).

Comment 24 Steven Cook 2001-12-06 20:00:59 UTC

Verified the other Intel engineer has identical system configuration, wrt add-in
cards & memory.  He was running A4 processors, which I have advised him to
upgrade to C0.

Comment 25 Arjan van de Ven 2001-12-06 20:04:26 UTC

A4? eek
The very very minimum for RHL is B3

Comment 26 Steven Cook 2001-12-06 20:08:20 UTC

Yes, I know, Arjan *grin*  That's why I've got him upgrading to C0 processor to 
try to reproduce the issue there.  On a cheerful note, the system made it as 
far with an A4 processor as it did up here with a C0 processor, so something was
letting it work...*wry chuckle*  Anyway, I'm still seeing the failure on my 
system *sigh*.  Is there any more configuration information I can provide that 
would be helpful in diagnosing this issue?

Comment 27 Arjan van de Ven 2001-12-06 20:11:16 UTC

We have one machine that shows it too. And then only with certain video cards..

Comment 28 Michael K. Johnson 2001-12-07 23:29:58 UTC

Since we have only seen this on one box, and that box only sometimes,
and then only with one of the many video cards being tested, this sounds
more and more like a firmware problem that needs to be attacked by
Intel using the hardware-level debugging tools that we don't have
here.

Comment 29 Michael K. Johnson 2001-12-10 17:24:22 UTC

Do you disagree with my educated guess that we are dealing with a
firmware-level problem?

Comment 30 Steven Cook 2001-12-11 16:16:45 UTC

Actually, yes I do disagree with your theory that this is a firmware-level
problem.  My basis for this is that this issue has never appeared on a 7.1
"Seawolf" 64-bit system, and only started appearing with 7.2 betas and RCs.  If
it were a firmware-level issue, I would expect the same issue to be reproducing
on previous versions of the software as well.  

It has been my experience in four years of software validation work that if a
firmware issue exists, it will be reproducible on multiple OS revisions, with
multiple software loads; while if the issue is with the software, it will
reproduce with multiple firmware revisions, and multiple software revisions.

I guess what I'm looking for is a "our code cannot be the cause of this, and
here is why", rather than a "we don't want this to be our problem" which is what
I'm hearing (although it may not be what you are saying).  Initial look at this
issue by the firmware team indicated nothing pointing to the firmware as the
cause -- this was just a cursory look, not an in-depth investigation, but
management has made the statement that until Red Hat can demonstrate this is not
their issue, the firmware team has other issues that are known to be related to
the firmware to work on.

Comment 31 Arjan van de Ven 2001-12-11 16:21:09 UTC

7.1 did not enable the Machine Check code for ia64's. For 7.2 this was
explicitly added on Intel's request...... so that 7.1 didn't see the machine
check exceptions is no surprise.... I'll be more than glad to turn it off again
since Intel hasn't provided Red Hat with any tools to USE such exceptions
anyway....

Comment 32 Bill Nottingham 2001-12-13 17:51:30 UTC

The recommended workaround for this will be to use the SMP kernel; this will be
documented in the release notes.

Comment 33 Need Real Name 2002-03-27 20:10:35 UTC

SGI reported the same issue - boot hang - with 7.2 GM (2.4.9-18) on 1P Big Sur 
systems. And we are able to reproduce the issue on multiple Big Sur systems 
here. This boot hang occurs on both the UP and SMP kernels, but only when 
booting one processor and using an AGP video adapter. We tested ATI AGP cards 
and an N-Vidia Quadro MX-2, and saw the issue.

After debugging for some time here are the findings.

I built a development kernel (2.4.18 + patches), tested with that kernel, and
didn't see the issue. Rebooted many times with no hangs and no MCA.

When we switch to Red Hat 7.2 GM Kernel, we see hangs immediately.

The non-configuration master abort occurs when there is an access to 0xA0000,
which is not claimed by the AGP device. We verified that if the card's registers
are changed to claim the entire VGA region 0xA0000 - 0xBFFFF (originally the
card claims only the 0xB8000 - 0xBFFFF region), the error no longer occurs. If we
boot with this setting, the system does not hang.
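
As a rough illustration of the decode behaviour described above, the Python
sketch below checks whether an address falls inside the range a card claims.
The helper name `is_claimed` and the tuple representation of a decode window
are hypothetical, chosen only for illustration; they are not firmware or
kernel interfaces. The two ranges are the ones quoted in this comment.

```python
# Hypothetical model: a decode window is an inclusive (start, end) address pair.
original_window = (0xB8000, 0xBFFFF)   # range the card claims by default
widened_window = (0xA0000, 0xBFFFF)    # range after reprogramming the card


def is_claimed(addr, window):
    """Return True if addr lies inside the inclusive decode window."""
    start, end = window
    return start <= addr <= end


addr = 0xA0000
# An unclaimed access is what shows up as the master abort in this bug.
print(hex(addr), "claimed by default window:", is_claimed(addr, original_window))
print(hex(addr), "claimed by widened window:", is_claimed(addr, widened_window))
```

Under this model an access to 0xA0000 is unclaimed with the default window and
claimed with the widened one, which matches the observation that the hang goes
away once the card decodes the whole VGA region.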

One interesting point is that the MCA occurs every time the user-space
code starts (/etc/rc.sysinit). If we comment out the portion where the initrd is
unmounted and the buffers are flushed, we can boot successfully.

Comment 34 Bill Nottingham 2002-03-28 20:43:57 UTC

With the 2.4.18 kernel you mention:

a) did you turn on CONFIG_IA64_MCA
b) did you use an initrd?

Comment 35 Need Real Name 2002-03-29 20:57:55 UTC

a) Yes, it is turned on.
b) Yes, used an initrd

Some more data:
The illegal accesses to the VGA segment are coming from the optimized asm 
version of copy_page.  Here is the scope of the problem:

One failing scenario: The failure occurs when copy_page is called with a
target page which is exactly one page below the VGA range.   The problem is
with the pre-fetch instructions at the end of the loop, which are always one
cache line ahead of the source/target pointers.  When the target page is mapped
to 0x9C000 (0xA0000 - 16K), on the last iteration the pre-fetch will attempt to
fetch from the VGA range, but since the video card is not programmed to respond
to this range we get the master abort/hard fail returned to the CPU, causing the
MCA.
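
The address arithmetic in this scenario can be sketched in Python. This is a
simplified model, not the kernel code: it assumes 16K pages (from
"0xA0000 - 16K"), a 64-byte prefetch stride, one prefetch per 64 bytes copied,
and it ignores the software-pipelined epilogue. The function names and the
`lead` parameter (how far the prefetch pointer runs ahead) are hypothetical.

```python
PAGE_SIZE = 16 * 1024   # 16K pages, per "0xA0000 - 16K" (assumption of this model)
LINE = 64               # bytes advanced by each prefetch (assumed stride)
VGA_START = 0xA0000


def prefetch_addresses(page_base, lead):
    """Addresses prefetched while copying one page, `lead` bytes ahead."""
    return [page_base + lead + i * LINE for i in range(PAGE_SIZE // LINE)]


def overruns(page_base, lead):
    """Prefetch addresses that land at or beyond the end of the page."""
    end = page_base + PAGE_SIZE
    return [a for a in prefetch_addresses(page_base, lead) if a >= end]


target = VGA_START - PAGE_SIZE                         # 0x9C000
print([hex(a) for a in overruns(target, lead=LINE)])   # prints ['0xa0000']
```

With a lead of one cache line, only the final iteration steps past the page,
and for a target page at 0x9C000 that address is exactly the start of the VGA
range. A larger lead would produce more overrunning prefetches per page.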

One more interesting data point:
The hang occurs in different places (with the same signature) depending on
how you start ELILO from the EFI shell.

Next step: Need to unwind the stack. Any pointers will be helpful

Comment 36 Alan Cox 2002-03-29 21:57:43 UTC

That would make a lot of sense. In fact we fixed a remarkably similar Athlon
prefetch bug during the early 2.4 releases. The copy_* code needs to not
prefetch beyond the end of the block it is copying anyway - the other prefetch
is most likely to be wasted.

Comment 37 Need Real Name 2002-04-10 21:04:16 UTC

More Data:

The following TLB entry covers 0xE000000000000000 to 0xE000000003FFFFFF,
which includes 0xE0000000000A0000, and its memory attribute is WB. In other
words, the translation for the page spanning 0xE000000000000000 to
0xE000000003FFFFFF encompasses physical address 0xA0000.

# TR V P rid    va               pa               ps      ed pl ar a d ma key
72 1 1 1 000007 E00000000160F000 000000000160F000 1A 64M  0  0  3  1 1 0  000007
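
To make the range claim concrete, the Python sketch below derives the span
covered by a translation from its virtual address and page size, using the
values in the entry above. The helper `translation_range` is hypothetical and
assumes only that a translation covers the naturally aligned block of
`page_size` bytes containing `va`.

```python
def translation_range(va, page_size):
    """Return (first, last) address of the aligned block containing va."""
    base = va & ~(page_size - 1)          # align down to the page size
    return base, base + page_size - 1


PS_64M = 64 * 1024 * 1024                 # ps column: 0x1A = 26, 2**26 = 64M
first, last = translation_range(0xE00000000160F000, PS_64M)
vga = 0xE0000000000A0000                  # kernel virtual address of 0xA0000

print(hex(first), hex(last))
print("covers VGA start:", first <= vga <= last)
```

The computed bounds are 0xE000000000000000 and 0xE000000003FFFFFF, so the
single 64M write-back translation does span the VGA region.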

At this point, I would like to get some data/input/update from Red Hat. Why do 
we access/prefetch that region?

Comment 38 Asit Mallick 2002-04-17 18:30:20 UTC

MCAs are happening due to an lfetch that goes beyond a page when the page
happens to be the last page of 0-640K. This causes the lfetch to go to video
space and causes the MCA.

Here is a patch to fix this. This is a performance-critical patch, and any
modification may result in lower performance for copy_page or clear_page. This
patch will apply on 2.4.9 to 2.4.17. The clear_page patch is not needed for
2.4.18; the copy_page patch will also apply to 2.4.18.

Thanks,
Asit
--- linux-2.4.16-akm/arch/ia64/lib/clear_page.S Fri Nov  9 14:26:17 2001
+++ linux/arch/ia64/lib/clear_page.S    Wed Apr 10 17:32:26 2002
@@ -23,15 +23,18 @@
 #define dst2           r9
 #define dst3           r10
 #define dst_fetch      r11
+#define dst_last       r14
        
 GLOBAL_ENTRY(clear_page)
        .prologue
        .regstk 1,0,0,0
        mov r16 = PAGE_SIZE/64-1        // -1 = repeat/until
+       mov r17 = PAGE_SIZE
        ;;
        .save ar.lc, saved_lc
        mov saved_lc = ar.lc
        .body
+       add dst_last = r17, dst0
        mov ar.lc = r16
        adds dst1 = 16, dst0
        adds dst2 = 32, dst0
@@ -40,10 +43,12 @@
        ;;
 1:     stf.spill.nta [dst0] = f0, 64
        stf.spill.nta [dst1] = f0, 64
+       cmp.ltu p6,p0 = dst_fetch, dst_last
        stf.spill.nta [dst2] = f0, 64
        stf.spill.nta [dst3] = f0, 64
+       ;;
 
-       lfetch [dst_fetch], 64
+(p6)   lfetch [dst_fetch], 64
        br.cloop.dptk.few 1b
        ;;      
        mov ar.lc = r2          // restore lc
--- linux-2.4.16-akm/arch/ia64/lib/copy_page.S  Fri Nov  9 14:26:17 2001        
+++ linux/arch/ia64/lib/copy_page.S     Wed Apr 10 17:38:07 2002
@@ -30,6 +30,7 @@
 #define tgt2           r23
 #define srcf           r24
 #define tgtf           r25
+#define tgt_last       r26
                        
 #define Nrot           ((8*PIPE_DEPTH+7)&~7)                                   
  
@@ -55,18 +56,21 @@  
        
        mov src1=in1
        adds src2=8,in1
+       mov tgt_last = PAGE_SIZE
        ;;
        adds tgt2=8,in0
        add srcf=512,in1
        mov ar.lc=lcount
        mov tgt1=in0
        add tgtf=512,in0
+       add tgt_last = tgt_last, in0
        ;;
 1:
 (p[0]) ld8 t1[0]=[src1],16
 (EPI)  st8 [tgt1]=t1[PIPE_DEPTH-1],16
 (p[0]) ld8 t2[0]=[src2],16
 (EPI)  st8 [tgt2]=t2[PIPE_DEPTH-1],16
+       cmp.ltu p6,p0 = tgtf, tgt_last
        ;;
 (p[0]) ld8 t3[0]=[src1],16
 (EPI)  st8 [tgt1]=t3[PIPE_DEPTH-1],16
@@ -83,8 +87,8 @@
 (p[0]) ld8 t8[0]=[src2],16
 (EPI)  st8 [tgt2]=t8[PIPE_DEPTH-1],16

-       lfetch [srcf], 64
-       lfetch [tgtf], 64
+(p6)   lfetch [srcf], 64
+(p6)   lfetch [tgtf], 64
        br.ctop.sptk.few 1b
        ;;
        mov pr=saved_pr,0xffffffffffff0000      // restore predicates
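
The effect of the `(p6)` predicate in the patch above can be modelled in
Python. This is a sketch under stated assumptions, not a translation of the
assembly: it assumes 16K pages, one prefetch per 64 bytes, and a prefetch
pointer that starts 512 bytes ahead (as in `add tgtf=512,in0`). The function
name `issued_prefetches` and the `guarded` flag are hypothetical.

```python
PAGE_SIZE = 16 * 1024   # assumed page size for this model
LINE = 64               # post-increment applied by each lfetch


def issued_prefetches(page_base, lead, guarded):
    """Addresses actually prefetched; guarded=True mimics the (p6) predicate."""
    last = page_base + PAGE_SIZE      # plays the role of tgt_last / dst_last
    fetch = page_base + lead          # plays the role of tgtf / dst_fetch
    issued = []
    for _ in range(PAGE_SIZE // LINE):
        # cmp.ltu p6,p0 = fetch, last: prefetch only while still inside the page
        if not guarded or fetch < last:
            issued.append(fetch)
            fetch += LINE             # the pointer advances only when lfetch runs
    return issued


target = 0xA0000 - PAGE_SIZE          # page directly below the VGA range
print(hex(max(issued_prefetches(target, 512, guarded=False))))
print(hex(max(issued_prefetches(target, 512, guarded=True))))
```

In this model the unguarded loop reaches into the VGA range, while the guarded
loop stops at the last cache line of the page being copied.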

Comment 39 Scott Parsons 2002-04-18 21:13:34 UTC

Hi Arjan,

We need to be able to give our SGI support people something to tell Itanium
customers (and post on our support webpages) who wish to install Red Hat 7.2.
What is the next step in the process now that there appears to be a working
patch? I'm assuming you are verifying this patch at Red Hat. Assuming it works
correctly, when can we expect the patch to be officially placed on your
download site?

Thanks
Scott

Comment 40 Arjan van de Ven 2002-04-18 21:19:34 UTC

Probably when the next security issue arises.

Comment 41 Jason Baron 2004-02-03 20:17:32 UTC

Changing to WONTFIX, as 7.2 ia64 is EOL.
