Red Hat Bugzilla – Bug 52341
Single processor Big Sur aborts during 7.2 release candidate boot
Last modified: 2013-03-06 00:55:28 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801
Description of problem:
The system crashes into a hard lock during OS boot of Roswell on UP Big Sur
systems, logging a fatal event into the BIOS Event Log. This lock drops the
video signal and activates the HDD (the HDD activity light is constantly on,
and the drive is audibly spinning/reading). The only way to escape is the power button.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install Roswell (everything install) on a UP Big Sur with BIOS 117A(RC8)
or 117C (P9)
2. Reboot at the end of install, when prompted.
3. Notice the system begin booting, running through boot services, and then
lose video signal and keyboard functionality.
Actual Results: At some point between the start of "Starting Red Hat
Linux" and approximately 5 minutes after boot, the system logs a fatal
error to the Event Log in BIOS, drops video signal and keyboard
functionality, and "hard-locks", requiring a power button override to power
the system down.
Expected Results: Linux should boot normally on a UP system (no issues
with the test systems under Seawolf; no issues with DP systems, Seawolf or Roswell).
tested with DP in the failing system, and failure did not occur.
tested with DP in other test systems, and failure did not occur.
tested with UP in the failing system, and failure occurred repeatably
tested with UP in other test systems, and failure occurred repeatably
1GB RAM, Adaptec 39160/QLogic 12160 adapters (1 each configuration), 18GB
SCSI HDD, IDE CDROM, LS120; C0/733 & C0/800 processors
I assume this is 2.4.6-3.1; we've fixed this for 2.4.7-2 which is in "roswell2"
apologies for not being more clear with the submission -- with only one
"roswell" entry, I thought things were being tracked on the _current_ roswell.
This entry was logged against roswell2 (7.1.94 w/ kernel 2.4.7-2smp and
2.4.7-2), although it was also observed under roswell1. I will update other
roswell2 entries to clarify this.
I saw a large number of machine checks with firmware 114; enough so that I
downgraded back to 103E. Could 117 have the same problem?
Investigating the BIOS aspect. This issue is not seen with Seawolf (7.1) on the
same BIOS revisions, so there is concern that this isn't solely a BIOS issue.
We (Red Hat) really need to fix this before next release.
Note that Seawolf does not exhibit this problem on BIOS 117C, while both Roswell
& Roswell2 do.
"enigma" -- 7.2 rc2 does not exhibit this in the one case of a CD-ROM based
install I have tested. I will verify on other install types, then close this entry.
I have unfortunately just reproduced the failure on the Enigma install that I
had previously reported inability to reproduce. Failure:Success count currently
stands at 4:1 on the single UP system under test.
What version is that? Enigma ia64 isn't due for a while...
I have reproduced this failure on the distribution with the following identifiers:
file location: ftp://ftp.beta.redhat.com/pub/pensacola/rc2/iso/rc2
rc2-ia64-disc1.iso 678,406,144 9/28/2001
rc2-ia64-disc2.iso 621,152,256 9/28/2001
rc2-ia64-disc3.iso 680,402,944 9/28/2001
rc2-ia64-disc4.iso 409,448,448 9/28/2001
Announcement forwarded by Pat Rago on 02-October-2001 as "Red Hat Linux 7.2 RC-2".
The installed version of these ISOs reports (Enigma) on the login screen.
So I entered the issue as being against "Enigma", because that was what I saw on
the login screen. Perhaps I should have entered it as "Pensacola" -- but when
two reputable sources give two different answers to the "What do I call this
thing?" question, it gets a bit difficult to make things clear without this
extra explanation. Apologies for any confusion.
The failure has been seen on both UP and MP kernels, using install types CD,
NFS, FTP, and HTTP. The process used was: install the OS, allow the OS to
reboot at the end of installation, and automatically boot the default kernel.
Upon finding the failure
(100% reproducibility), hard power off (4-5 second power switch hold) then power
back on, and boot the second (either MP or UP, whichever was not booted by
default) kernel at the ELILO prompt. The failure also occurs with this kernel.
Hoping this clarifies the issue a little,
This issue was again reproduced, using the ia64 RC3 distribution. The repro has
occurred on both systems tested.
Attaching the machine check log. We need someone at Intel to decode this.
Created attachment 36621 [details]
ia64 machine check log
Translation from BIOS developer:
This is an AGP bus abort: FERR_PCI Non-config Master Abort. It percolates up
through the F16 and into the SAC as a hard abort.
The only other bit of data he was able to glean from the dump attached in
previous note was that the Class Code is three bytes, rather than two, which is
setting up a one-byte offset on the data that is being decoded (Seg/Bus/Dev/Func
is off by one byte).
Question from the developer -- what produced this log file?
Looking into the feasibility of sending an error log dump tool (I believe it is
an EFI-based utility, but am not sure). Will update with another note if we can
distribute this tool.
The kernel produced the log file.
Arjan: how are console fonts loaded to the video memory? That's the *only*
thing that would be accessing the video card at the points where this appears.
The console fonts are loaded by putting stuff in video ram and then outb'ing a
few commands to io ports of the vga card. scary code if you ask me.
Changed Summary: "Roswell" changed to "7.2 release candidate"
Ben reports that he cannot reproduce this on his machine with 117C firmware
and AGP video.
Of note is that I can reproduce this *only* on initial boot (i.e., the call to
setsysfont in rc.sysinit). I can't seem to reproduce it after boot
(or before boot (i.e., init=/bin/bash), for that matter).
Intel: does this occur on video adapters *other* than rage128 AGP?
I tested on an AGP Rage 128 and didn't experience the problem.
Failure exhibits repeatably on my box.
UP C0 733
1GB RAM (8x128)
Adaptec 39160 SCSI controller
Quantum Atlas 10K2 SCSI HDD
LiteOn IDE CD-ROM
Matrox G450 AGP graphics controller
I also received an email from another Intel engineer today, asking for help
getting his system up and running, because he hasn't been able to boot Red Hat
7.2 betas/RCs since he decided to start using Red Hat on his systems. Will
update with another note when I find out his system configuration, if there is
anything different (his system is also UP).
Verified the other Intel engineer has identical system configuration, wrt add-in
cards & memory. He was running A4 processors, which I have advised him to
upgrade to C0.
The very very minimum for RHL is B3
Yes, I know, Arjan *grin* That's why I've got him upgrading to C0 processor to
try to reproduce the issue there. On a cheerful note, the system made it as
far with an A4 processor as it did here with a C0 processor, so something was
letting it work...*wry chuckle* Anyway, I'm still seeing the failure on my
system *sigh*. Is there any more configuration information I can provide that
would be helpful in diagnosing this issue?
We have one machine that shows it too. And then only with certain video cards..
Since we have only seen this on one box, and that box only sometimes,
and then only with one of the many video cards being tested, this sounds
more and more like a firmware problem that needs to be attacked by
Intel using the hardware-level debugging tools that we don't have.
Do you disagree with my educated guess that we are dealing with a firmware problem?
Actually, yes I do disagree with your theory that this is a firmware-level
problem. My basis for this is that this issue has never appeared on a 7.1
"Seawolf" 64-bit system, and only started appearing with 7.2 betas and RCs. If
it were a firmware-level issue, I would expect the same issue to be reproducing
on previous versions of the software as well.
It has been my experience in four years of software validation work that if a
firmware issue exists, it will be reproducible on multiple OS revisions, with
multiple software loads; while if the issue is with the software, it will
reproduce with multiple firmware revisions, and multiple software revisions.
I guess what I'm looking for is a "our code cannot be the cause of this, and
here is why", rather than a "we don't want this to be our problem" which is what
I'm hearing (although it may not be what you are saying). Initial look at this
issue by the firmware team indicated nothing pointing to the firmware as the
cause -- this was just a cursory look, not an in-depth investigation, but
management has made the statement that until Red Hat can demonstrate this is not
their issue, the firmware team has other issues that are known to be related to
the firmware to work on.
7.1 did not enable the Machine Check code for ia64. For 7.2 this was
explicitly added at Intel's request... so it is no surprise that 7.1 didn't see
the machine check exceptions. I'll be more than glad to turn it off again,
since Intel hasn't provided Red Hat with any tools to USE such exceptions.
The recommended workaround for this will be to use the SMP kernel; this will be
documented in the release notes.
SGI reported the same issue - boot hang - with 7.2 GM (2.4.9-18) on 1P Big Sur
systems. And we are able to reproduce the issue on multiple Big Sur systems
here. This boot hang occurs on both the UP and SMP kernels, but only when
booting one processor and using an AGP video adapter. We tested ATI AGP cards
and an NVIDIA Quadro MX-2, and saw the issue.
After debugging for some time here are the findings.
I built a development kernel (2.4.18 + patches) and tested with that kernel and
didn't see the issue. Rebooted many times: no hangs, no MCA.
When we switch to the Red Hat 7.2 GM kernel, we see hangs immediately.
The non-configuration master abort occurs when there is an access to 0xA0000
which is not claimed by the AGP device. We tested that if the card's registers
are changed to claim the entire VGA region 0xA0000 - 0xBFFFF (originally the
card claims 0xB8000 - 0xBFFFF region) then we don't observe the error. If we
boot with this setting, the system does not hang.
One interesting point is that the MCA occurs every time the user-space
code starts (/etc/rc.sysinit). If we comment out the portion where the initrd is
unmounted and the buffers are flushed, we can boot successfully.
With the 2.4.18 kernel you mention:
a) did you turn on CONFIG_IA64_MCA
b) did you use an initrd?
a) Yes, it is turned on.
b) Yes, used an initrd
Some more data:
The illegal accesses to the VGA segment are coming from the optimized asm
version of copy_page. Here is the scope of the problem:
One failing scenario: the failure occurs when copy_page is called with a
target page which is exactly one page below the VGA range. The problem is
with the prefetch instructions at the end of the loop, which are always one
cache line ahead of the source/target pointers. When the target page is mapped
to 0x9C000 (0xA0000 - 16K), on the last iteration the prefetch will attempt to
fetch from the VGA range; but since the video card is not programmed to respond
to this range, we get the master abort/hard fail returned to the CPU, causing the MCA.
One other interesting data point:
The hang occurs at different places (with the same signature) depending on
how you start elilo from the EFI shell.
Next step: Need to unwind the stack. Any pointers will be helpful
That would make a lot of sense. In fact we fixed a remarkably similar Athlon
prefetch bug during the early 2.4 releases. The copy_* code needs to not
prefetch beyond the end of the block it is copying anyway - the other prefetch
is most likely to be wasted.
The following entry in the TLB table covers 0xE000000000000000 to
0xE000000003FFFFFF, which includes 0xE0000000000A0000, and the memory
attributes are WB; that is, a single translation encompasses 0xA0000.
# TR V P rid va pa ps ed pl ar a d ma key
72 1 1 1 000007 E00000000160F000 000000000160F000 1A 64M 0 0 3 1 1 0 000007
At this point, I would like to get some data/input/update from Red Hat. Why do
we access/prefetch that region?
MCAs are happening due to an lfetch that goes beyond a page when that page
happens to be the last page of 0-640K. This sends the lfetch into video space
and causes the MCA.
Here is a patch to fix this. This is a performance-critical patch, and any
modification may result in lower performance for copy_page or clear_page. The
patch applies to 2.4.9 through 2.4.17; the clear_page part is not needed for
2.4.18, and the copy_page part applies to 2.4.18 as well.
--- linux-2.4.16-akm/arch/ia64/lib/clear_page.S Fri Nov 9 14:26:17 2001
+++ linux/arch/ia64/lib/clear_page.S Wed Apr 10 17:32:26 2002
@@ -23,15 +23,18 @@
#define dst2 r9
#define dst3 r10
#define dst_fetch r11
+#define dst_last r14
mov r16 = PAGE_SIZE/64-1 // -1 = repeat/until
+ mov r17 = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
+ add dst_last = r17, dst0
mov ar.lc = r16
adds dst1 = 16, dst0
adds dst2 = 32, dst0
@@ -40,10 +43,12 @@
1: stf.spill.nta [dst0] = f0, 64
stf.spill.nta [dst1] = f0, 64
+ cmp.ltu p6,p0 = dst_fetch, dst_last
stf.spill.nta [dst2] = f0, 64
stf.spill.nta [dst3] = f0, 64
- lfetch [dst_fetch], 64
+(p6) lfetch [dst_fetch], 64
mov ar.lc = r2 // restore lc
--- linux-2.4.16-akm/arch/ia64/lib/copy_page.S Fri Nov 9 14:26:17 2001
+++ linux/arch/ia64/lib/copy_page.S Wed Apr 10 17:38:07 2002
@@ -30,6 +30,7 @@
#define tgt2 r23
#define srcf r24
#define tgtf r25
+#define tgt_last r26
#define Nrot ((8*PIPE_DEPTH+7)&~7)
@@ -55,18 +56,21 @@
+ mov tgt_last = PAGE_SIZE
+ add tgt_last = tgt_last, in0
(p) ld8 t1=[src1],16
(EPI) st8 [tgt1]=t1[PIPE_DEPTH-1],16
(p) ld8 t2=[src2],16
(EPI) st8 [tgt2]=t2[PIPE_DEPTH-1],16
+ cmp.ltu p6,p0 = tgtf, tgt_last
(p) ld8 t3=[src1],16
(EPI) st8 [tgt1]=t3[PIPE_DEPTH-1],16
@@ -83,8 +87,8 @@
(p) ld8 t8=[src2],16
(EPI) st8 [tgt2]=t8[PIPE_DEPTH-1],16
- lfetch [srcf], 64
- lfetch [tgtf], 64
+(p6) lfetch [srcf], 64
+(p6) lfetch [tgtf], 64
mov pr=saved_pr,0xffffffffffff0000 // restore predicates
We need to be able to give our SGI support people something to tell Itanium
customers (and post on our support webpages) who wish to install Red Hat 7.2.
What is the next step in the process now that there appears to be a working
patch? I assume you are verifying this patch at Red Hat; assuming it works
correctly, when can we expect the patch to be officially placed on your
download site?
Probably when the next security issue arises.
Changing to WONTFIX, as 7.2 ia64 is EOL.