From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801

Description of problem:
The system crashes into a hard lock, logging a fatal event into the BIOS Event Log, during OS boot of Roswell on UP Big Sur systems. This lock drops the video signal and activates the HDD (the HDD activity light is constantly on, and the drive is audibly spinning/reading). The only way to escape is a power button override.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Install Roswell (everything install) on a UP Big Sur with BIOS 117A (RC8) or 117C (P9).
2. Reboot at the end of install, when prompted.
3. Notice the system begin booting, running through boot services, and then lose video signal and keyboard functionality.

Actual Results:
At some point between the start of "Starting Red Hat Linux" and approximately 5 minutes after boot, the system logs a fatal error to the Event Log in BIOS, drops video signal and keyboard functionality, and "hard-locks", requiring a power button override to power the system down.

Expected Results:
Linux should boot normally on a UP system (no issues with the test systems under Seawolf; no issues with DP systems - Seawolf or Roswell).

Additional info:
Tested with DP in the failing system, and failure did not occur.
Tested with DP in other test systems, and failure did not occur.
Tested with UP in the failing system, and failure occurred repeatedly.
Tested with UP in other test systems, and failure occurred repeatedly.
1GB RAM, Adaptec 39160/QLogic 12160 adapters (1 each configuration), 18GB SCSI HDD, IDE CDROM, LS120
C0/733 & C0/800 processors
I assume this is 2.4.6-3.1; we've fixed this for 2.4.7-2 which is in "roswell2"
apologies for not being more clear with the submission -- with only one "roswell" entry, I thought things were being tracked on the _current_ roswell. This entry was logged against roswell2 (7.1.94 w/ kernel 2.4.7-2smp and 2.4.7-2), although it was also observed under roswell1. I will update other roswell2 entries to clarify this.
I saw a large number of machine checks with firmware 114; enough so that I downgraded back to 103E. Could 117 have the same problem?
Investigating the BIOS aspect. This issue is not seen with Seawolf (7.1) on the same BIOS revisions, so I am concerned that this isn't solely a BIOS issue.
We (Red Hat) really need to fix this before next release.
Note that Seawolf does not exhibit this problem on BIOS 117C, while both Roswell & Roswell2 do.
"enigma" -- 7.2 rc2 does not exhibit this in the 1 case of CD-ROM based install I have tested. I will verify on other install types, then close this entry.
I have unfortunately just reproduced the failure on the Enigma install that I had previously reported inability to reproduce. Failure:Success count currently stands at 4:1 on the single UP system under test.
what version is that ? Enigma ia64 isn't due for a while .....
I have reproduced this failure on the distribution with the following identifiers:

file location: ftp://ftp.beta.redhat.com/pub/pensacola/rc2/iso/rc2
file names:
rc2-ia64-disc1.iso  678,406,144  9/28/2001
rc2-ia64-disc2.iso  621,152,256  9/28/2001
rc2-ia64-disc3.iso  680,402,944  9/28/2001
rc2-ia64-disc4.iso  409,448,448  9/28/2001

Announcement forwarded by Pat Rago on 02-October-2001 as "Red Hat Linux 7.2 RC-2 (Pensacola)"

The installed version of these ISOs reports (Enigma) on the login screen. So I entered the issue as being against "Enigma", because that was what I saw on the login screen. Perhaps I should have entered it as "Pensacola" -- but when two reputable sources give two different answers to the "What do I call this thing?" question, it gets a bit difficult to make things clear without this extra explanation. Apologies for any confusion.

The failure has been seen on both UP and MP kernels, using install type of: CD, NFS, FTP, HTTP.

Process used was: Install OS, allow OS to reboot at end of installation, and automatically boot the default kernel. Upon finding failure (100% reproducibility), hard power off (4-5 second power switch hold) then power back on, and boot the second (either MP or UP, whichever was not booted by default) kernel at the ELILO prompt. The failure also occurs with this kernel.

Hoping this clarifies the issue a little,
--steve--
This issue was again reproduced, using the ia64 RC3 distribution. Repro has occurred on two systems tested against.
Attaching the machine check log. We need someone at Intel to decode this.
Created attachment 36621 [details] ia64 machine check log
Translation from BIOS developer: This is an AGP bus abort: FERR_PCI Non-config Master Abort. It percolates up through the F16 and into the SAC as a hard abort. The only other bit of data he was able to glean from the dump attached in previous note was that the Class Code is three bytes, rather than two, which is setting up a one-byte offset on the data that is being decoded (Seg/Bus/Dev/Func is off by one byte). Question from the developer -- what produced this log file? Looking into the feasibility of sending an error log dump tool (I believe it is an EFI-based utility, but not sure). Will update with another note if we can distribute this tool.
The kernel produced the log file. Arjan: how are console fonts loaded to the video memory? That's the *only* thing that would be accessing the video card at the points where this appears.
The console fonts are loaded by putting stuff in video ram and then outb'ing a few commands to io ports of the vga card. scary code if you ask me.
Changed Summary: "Roswell" changed to "7.2 release candidate"
Ben reports that he cannot reproduce this on his machine with 117C firmware and AGP video.
Of note is that I can reproduce this *only* on initial boot (i.e., the call to setsysfont in rc.sysinit.) I can't seem to reproduce it after boot.
(or before boot, i.e., init=/bin/bash, for that matter.)
Intel: does this occur on video adapters *other* than rage128 AGP?
I tested on an AGP Rage 128 and didn't experience the problem.
Failure exhibits repeatably on my box. Configuration is:
UP C0 733
1GB RAM (8x128)
Adaptec 39160 SCSI controller
BIOS 103E+
USB Kb/Mouse
Quantum Atlas 10K2 SCSI HDD
LiteOn IDE CD-ROM
Matrox G450 AGP graphics controller

I also received an email from another Intel engineer today, asking for help getting his system up and running, because he hasn't been able to boot Red Hat 7.2 betas/RCs since he decided to start using Red Hat on his systems. Will update with another note when I find out his system configuration, if there is anything different (his system is also UP).
Verified the other Intel engineer has identical system configuration, wrt add-in cards & memory. He was running A4 processors, which I have advised him to upgrade to C0.
A4? eek The very very minimum for RHL is B3
Yes, I know, Arjan *grin* That's why I've got him upgrading to C0 processor to try to reproduce the issue there. On a cheerful note, the system made it as far with an A4 processor as it did up here with a C0 processor, so something was letting it work...*wry chuckle* Anyway, I'm still seeing the failure on my system *sigh*. Is there any more configuration information I can provide that would be helpful in diagnosing this issue?
We have one machine that shows it too. And then only with certain video cards..
Since we have only seen this on one box, and that box only sometimes, and then only with one of the many video cards being tested, this sounds more and more like a firmware problem that needs to be attacked by Intel using the hardware-level debugging tools that we don't have here.
Do you disagree about my educated guess that we are dealing with a firmware-level problem?
Actually, yes I do disagree with your theory that this is a firmware-level problem. My basis for this is that this issue has never appeared on a 7.1 "Seawolf" 64-bit system, and only started appearing with 7.2 betas and RCs. If it were a firmware-level issue, I would expect the same issue to be reproducing on previous versions of the software as well. It has been my experience in four years of software validation work that if a firmware issue exists, it will be reproducible on multiple OS revisions, with multiple software loads; while if the issue is with the software, it will reproduce with multiple firmware revisions, and multiple software revisions. I guess what I'm looking for is a "our code cannot be the cause of this, and here is why", rather than a "we don't want this to be our problem" which is what I'm hearing (although it may not be what you are saying). Initial look at this issue by the firmware team indicated nothing pointing to the firmware as the cause -- this was just a cursory look, not an in-depth investigation, but management has made the statement that until Red Hat can demonstrate this is not their issue, the firmware team has other issues that are known to be related to the firmware to work on.
7.1 did not enable the Machine Check code for ia64's. For 7.2 this was explicitly added on Intel's request...... so that 7.1 didn't see the machine check exceptions is no surprise.... I'll be more than glad to turn it off again since Intel hasn't provided Red Hat with any tools to USE such exceptions anyway....
The recommended workaround for this will be to use the SMP kernel; this will be documented in the release notes.
SGI reported the same issue - boot hang - with 7.2 GM (2.4.9-18) on 1P Big Sur systems, and we are able to reproduce the issue on multiple Big Sur systems here. This boot hang occurs on both the UP and SMP kernels, but only when booting one processor and using an AGP video adapter. We tested ATI AGP cards and N-Vidia Quatro MX-2, and saw the issue on both.

After debugging for some time, here are the findings. I built a development kernel (2.4.18+patches) and tested with it, and didn't see the issue: rebooted many times with no hangs and no MCA. When we switch to the Red Hat 7.2 GM kernel, we see hangs immediately.

The non-configuration master abort occurs when there is an access to 0xA0000, which is not claimed by the AGP device. We verified that if the card's registers are changed to claim the entire VGA region 0xA0000 - 0xBFFFF (originally the card claims the 0xB8000 - 0xBFFFF region), then we don't observe the error. If we boot with this setting, the system does not hang.

One interesting point is that the MCA occurs every time the user space code starts (/etc/rc.sysinit). If we comment out the portion where the initrd is unmounted and the buffers are flushed, we can boot successfully.
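The decode-window finding above can be illustrated with a small sketch. It only models the address comparison, using the two ranges quoted in the comment; the struct and function names are hypothetical, and nothing here touches real AGP or PCI registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a device's legacy VGA decode window (inclusive). */
struct decode_window {
    uint32_t start;
    uint32_t end;
};

/* Window the card claimed originally, per the report: 0xB8000 - 0xBFFFF. */
static const struct decode_window vga_original = { 0xB8000u, 0xBFFFFu };

/* Window after reprogramming the card to claim the entire VGA region. */
static const struct decode_window vga_full = { 0xA0000u, 0xBFFFFu };

/* True if the device would claim a bus access to addr. */
static bool window_claims(const struct decode_window *w, uint32_t addr)
{
    assert(w->start <= w->end);
    return addr >= w->start && addr <= w->end;
}
```

With these definitions, `window_claims(&vga_original, 0xA0000)` is false (nobody answers, so the access ends in a master abort), while `window_claims(&vga_full, 0xA0000)` is true, which matches the observation that the hang disappears once the card claims the whole region.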
With the 2.4.18 kernel you mention: a) did you turn on CONFIG_IA64_MCA b) did you use an initrd?
a) Yes, it is turned on.
b) Yes, used an initrd.

Some more data: The illegal accesses to the VGA segment are coming from the optimized asm version of copy_page. Here is the scope of the problem.

One failing scenario: The failure occurs when copy_page is called with a target page which is exactly one page below the VGA range. The problem is with the prefetch instructions at the end of the loop, which are always one cache line ahead of the source/target pointers. When the target page is mapped to 0x9C000 (0xA0000 - 16K), on the last iteration the prefetch will attempt to fetch from the VGA range, but since the video card is not programmed to respond to this range, we get the master abort/hard fail returned to the CPU, causing the MCA.

One more interesting data point: The hang occurs in different places (with the same signature) depending on how you start elilo from the EFI shell.

Next step: Need to unwind the stack. Any pointers will be helpful.
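The address arithmetic in this scenario can be checked with a rough sketch. The 16K page size and the 0x9C000 target come from the comment; the 64-byte step per iteration and the function name are assumptions made for illustration, and the prefetch distance is left as a parameter rather than fixed at one cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_16K 0x4000u  /* 16K page, as in the report */
#define LINE_STEP     64u      /* bytes advanced per loop iteration (assumed) */

/*
 * Count the prefetch addresses that fall at or beyond the end of the page
 * when an unguarded prefetch pointer runs `ahead` bytes in front of the
 * copy pointer. *first_overrun receives the lowest such address (0 if none).
 */
static unsigned count_overruns(uint64_t page, uint64_t ahead,
                               uint64_t *first_overrun)
{
    uint64_t end = page + PAGE_SIZE_16K;
    unsigned n = 0;

    assert(first_overrun != NULL);
    *first_overrun = 0;
    for (uint64_t off = 0; off < PAGE_SIZE_16K; off += LINE_STEP) {
        uint64_t pf = page + off + ahead;  /* address the prefetch touches */
        if (pf >= end) {
            if (n == 0)
                *first_overrun = pf;
            n++;
        }
    }
    return n;
}
```

For a page at 0x9C000 and a distance of one 64-byte line, exactly one prefetch (on the last iteration) leaves the page, and it lands on 0xA0000, the first byte of the VGA range. A larger distance only increases the number of out-of-page prefetches; the first one still hits 0xA0000.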
That would make a lot of sense. In fact we fixed a remarkably similar Athlon prefetch bug during the early 2.4 releases. The copy_* code needs to not prefetch beyond the end of the block it is copying anyway - the other prefetch is most likely to be wasted.
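As a minimal sketch of that rule in C, not the kernel's actual copy code: prefetch ahead of the copy only while the prefetch address is still inside the block. It assumes GCC or Clang (for `__builtin_prefetch`) and a 64-byte line; the function name and the returned counter are hypothetical and exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64u  /* assumed line size for this sketch */

/*
 * Copy len bytes in CACHE_LINE chunks, prefetching `ahead` bytes in front of
 * the current position only while that address is still inside the block.
 * Returns the number of iterations that issued prefetches.
 * len must be a multiple of CACHE_LINE.
 */
static size_t copy_block_guarded(void *dst, const void *src,
                                 size_t len, size_t ahead)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t issued = 0;

    assert(len % CACHE_LINE == 0);
    for (size_t off = 0; off < len; off += CACHE_LINE) {
        if (ahead < len - off) {                     /* off + ahead < len */
            __builtin_prefetch(s + off + ahead, 0);  /* read prefetch */
            __builtin_prefetch(d + off + ahead, 1);  /* write prefetch */
            issued++;
        }
        memcpy(d + off, s + off, CACHE_LINE);
    }
    return issued;
}
```

For a 16K block with a 512-byte distance, the last eight iterations skip the prefetch, so no address outside either buffer is ever touched. The guard costs one comparison per iteration, which is the kind of trade-off a performance-critical routine has to weigh.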
More data: The following entry in the TLB table covers 0xE000000000000000 to 0xE000000003FFFFFF, which includes 0xE0000000000A0000, and the memory attributes are WB. A translation for a page from 0xE000000000000000 to 0xE000000003FFFFFF encompasses 0xA0000.

#  TR V P rid    va               pa               ps     ed pl ar a d ma key
72 1  1 1 000007 E00000000160F000 000000000160F000 1A 64M 0  0  3  1 1 0  000007

At this point, I would like to get some data/input/update from Red Hat. Why do we access/prefetch that region?
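The coverage claim can be verified with a short sketch. It assumes the `ps` column value 1A is the page-size exponent (2^26 bytes, matching the 64M shown in the same row); the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a translation whose entry holds virtual address `va` with
 * page-size exponent `ps` (page size = 2^ps bytes) also covers `addr`.
 */
static bool translation_covers(uint64_t va, unsigned ps, uint64_t addr)
{
    assert(ps < 64);
    uint64_t mask = ~((UINT64_C(1) << ps) - 1);  /* clears in-page offset bits */
    return (va & mask) == (addr & mask);
}
```

Called with the entry's `va` of 0xE00000000160F000 and `ps` of 0x1A, this returns true for 0xE0000000000A0000 and false for 0xE000000004000000, the first address past the 64M page.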
MCAs are happening due to an lfetch that goes beyond a page when that page happens to be the last page of 0-640K. This causes the lfetch to go to video space and causes the MCA. Here is a patch to fix this. This is a performance-critical patch, and any modification may result in lower performance for copy_page or clear_page. This patch will apply on 2.4.9 to 2.4.17. The clear_page patch is not needed for 2.4.18, and the copy_page patch will apply to 2.4.18 also.

Thanks,
Asit

--- linux-2.4.16-akm/arch/ia64/lib/clear_page.S Fri Nov 9 14:26:17 2001
+++ linux/arch/ia64/lib/clear_page.S Wed Apr 10 17:32:26 2002
@@ -23,15 +23,18 @@
 #define dst2 r9
 #define dst3 r10
 #define dst_fetch r11
+#define dst_last r14
 GLOBAL_ENTRY(clear_page)
 	.prologue
 	.regstk 1,0,0,0
 	mov r16 = PAGE_SIZE/64-1 // -1 = repeat/until
+	mov r17 = PAGE_SIZE
 	;;
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
 	.body
+	add dst_last = r17, dst0
 	mov ar.lc = r16
 	adds dst1 = 16, dst0
 	adds dst2 = 32, dst0
@@ -40,10 +43,12 @@
 	;;
 1:	stf.spill.nta [dst0] = f0, 64
 	stf.spill.nta [dst1] = f0, 64
+	cmp.ltu p6,p0 = dst_fetch, dst_last
 	stf.spill.nta [dst2] = f0, 64
 	stf.spill.nta [dst3] = f0, 64
+	;;
-	lfetch [dst_fetch], 64
+(p6)	lfetch [dst_fetch], 64
 	br.cloop.dptk.few 1b
 	;;
 	mov ar.lc = r2 // restore lc
--- linux-2.4.16-akm/arch/ia64/lib/copy_page.S Fri Nov 9 14:26:17 2001
+++ linux/arch/ia64/lib/copy_page.S Wed Apr 10 17:38:07 2002
@@ -30,6 +30,7 @@
 #define tgt2 r23
 #define srcf r24
 #define tgtf r25
+#define tgt_last r26
 #define Nrot ((8*PIPE_DEPTH+7)&~7)
@@ -55,18 +56,21 @@
 	mov src1=in1
 	adds src2=8,in1
+	mov tgt_last = PAGE_SIZE
 	;;
 	adds tgt2=8,in0
 	add srcf=512,in1
 	mov ar.lc=lcount
 	mov tgt1=in0
 	add tgtf=512,in0
+	add tgt_last = tgt_last, in0
 	;;
 1:
 (p[0])	ld8 t1[0]=[src1],16
 (EPI)	st8 [tgt1]=t1[PIPE_DEPTH-1],16
 (p[0])	ld8 t2[0]=[src2],16
 (EPI)	st8 [tgt2]=t2[PIPE_DEPTH-1],16
+	cmp.ltu p6,p0 = tgtf, tgt_last
 	;;
 (p[0])	ld8 t3[0]=[src1],16
 (EPI)	st8 [tgt1]=t3[PIPE_DEPTH-1],16
@@ -83,8 +87,8 @@
 (p[0])	ld8 t8[0]=[src2],16
 (EPI)	st8 [tgt2]=t8[PIPE_DEPTH-1],16
-	lfetch [srcf], 64
-	lfetch [tgtf], 64
+(p6)	lfetch [srcf], 64
+(p6)	lfetch [tgtf], 64
 	br.ctop.sptk.few 1b
 	;;
 	mov pr=saved_pr,0xffffffffffff0000 // restore predicates
Hi Arjan,

We need to be able to give our SGI support people something to tell Itanium customers (and post on our support webpages) who wish to install Red Hat 7.2. What is the next step in the process now that there appears to be a working patch? I'm assuming you are verifying this patch at Red Hat. Provided it works correctly, when can we expect the patch to be officially placed on your download site?

Thanks
Scott
Probably when the next security issue arises.
Changing to WONTFIX, as 7.2 ia64 is EOL.