Bug 129663 - [RHEL4] ACPI problem: Installer kernel hangs on IBM x365
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Geoff Gustafson
QA Contact: Brian Brock
Blocks: 140583

Reported: 2004-08-11 12:58 EDT by john stultz
Modified: 2007-11-30 17:07 EST

Doc Type: Bug Fix
Last Closed: 2004-12-21 14:02:31 EST

Attachments
1st virtual console output of the hang (36.45 KB, image/jpeg), 2004-11-11 15:34 EST, john stultz
2nd virtual console output (34.78 KB, image/jpeg), 2004-11-11 15:37 EST, john stultz
3rd virtual console output (48.93 KB, image/jpeg), 2004-11-11 15:41 EST, john stultz
lspci -vv output (9.13 KB, text/plain), 2004-11-18 16:33 EST, john stultz
dmesg from installed kernel (18.21 KB, text/plain), 2004-11-18 16:38 EST, john stultz
dmesg from installed kernel w/ "noapic" (13.96 KB, text/plain), 2004-11-18 18:55 EST, john stultz
/proc/interrupts output from installed kernel w/ "noapic" (1009 bytes, text/plain), 2004-11-18 18:56 EST, john stultz
acpidump output (142.23 KB, text/plain), 2004-11-18 20:08 EST, john stultz
acpidmp output (161.17 KB, text/plain), 2004-11-18 21:58 EST, john stultz
acpidmp information from relentless - Brad's (160.58 KB, text/plain), 2004-11-19 15:39 EST, Bradley Thomas
dmesg from booting installed kernel w/ noapic and acpi=off (12.77 KB, text/plain), 2004-11-22 13:51 EST, john stultz
/proc/interrupts output for installed kernel w/ noapic and acpi=off (789 bytes, text/plain), 2004-11-22 13:53 EST, john stultz
noapic boot with dmesg -s64000 (13.93 KB, text/plain), 2004-11-22 14:16 EST, Bradley Thomas
noapic boot with /proc/interrupts (587 bytes, text/plain), 2004-11-22 14:17 EST, Bradley Thomas
lspci -vv from booting installed kernel with noapic (12.02 KB, text/plain), 2004-11-22 14:19 EST, Bradley Thomas
Brad's dmesg from booting installed kernel with acpi=off (11.22 KB, text/plain), 2004-11-22 14:21 EST, Bradley Thomas
Brad's /proc/interrupts output for installed kernel with acpi=off (481 bytes, text/plain), 2004-11-22 14:22 EST, Bradley Thomas
biosdecode output (874 bytes, text/plain), 2004-11-24 15:15 EST, john stultz
debug patch to not disable unused Links vs. 2.6.9 (422 bytes, patch), 2004-11-29 17:50 EST, Len Brown
debug patch to register PCI interrupts in error path (1.31 KB, patch), 2004-11-29 18:08 EST, Len Brown
console screen capture of nogsi patch booted w/ just "noapic" (44.77 KB, image/jpeg), 2004-11-30 17:11 EST, john stultz
dmesg output for nogsi patch (no extra boot options) (17.98 KB, text/plain), 2004-11-30 17:33 EST, john stultz
/proc/interrupts output for nogsi patch (w/ no extra boot options) (1.27 KB, text/plain), 2004-11-30 17:34 EST, john stultz
dmesg output for nogsi patch booting w/ "noapic" and "acpi=off" (12.84 KB, text/plain), 2004-11-30 17:40 EST, john stultz
/proc/interrupts output for nogsi patch booting w/ "noapic acpi=off" (728 bytes, text/plain), 2004-11-30 17:41 EST, john stultz
serial console log from nogsi patch booting w/ "noapic" (11.41 KB, text/plain), 2004-11-30 18:54 EST, john stultz
diff of console logs from a "noacpi" boot and a "noacpi pci=noacpi" boot (4.51 KB, text/plain), 2004-11-30 19:37 EST, john stultz
extended dmesg from installed kernel w/ "noapic init=/bin/sh" (17.11 KB, text/plain), 2004-11-30 21:02 EST, john stultz
updated debug patch to register interrupts in error path vs 2.6.9 (1.33 KB, patch), 2004-11-30 23:58 EST, Len Brown
console log from updated nogsi patched kernel booting w/ "noapic" (13.16 KB, text/plain), 2004-12-01 13:55 EST, john stultz
console log failure.try1.warmboot log (11.13 KB, text/plain), 2004-12-01 16:20 EST, Bradley Thomas
console log failure.try2.warmboot log (16.54 KB, text/plain), 2004-12-01 16:22 EST, Bradley Thomas
console log failure.try3.coldboot.log (11.17 KB, text/plain), 2004-12-01 16:23 EST, Bradley Thomas
dmesg log success.dmesg.log (12.54 KB, text/plain), 2004-12-01 16:24 EST, Bradley Thomas
console log success.dmesg.log (12.54 KB, text/plain), 2004-12-01 16:26 EST, Bradley Thomas

Description john stultz 2004-08-11 12:58:53 EDT
Description of problem: 
When booting the RHEL4-Alpha4 installer CD on an IBM x365, the 
installer hangs at "running /sbin/loader". Booting w/ "acpi=off" 
works around the issue.  
 
How reproducible: 
Every time.  
 
Steps to Reproduce: 
1. Insert RHEL4-alpha4 CD #1 into an IBM x365 
2. Boot system 
 
   
Actual results: 
Hang at "running /sbin/loader" 
 
Expected results: 
No hang 
 
Additional info: 
Booting w/ "acpi=off" works around the issue. 
 
Changing to the debug console, the last messages were about the USB 
controller.
Comment 1 Wendy Hung 2004-10-05 15:54:50 EDT
reproduced with beta 1 refresh.
Comment 2 john stultz 2004-10-12 16:57:55 EDT
Also confirmed the problem exists w/ Beta1 Refresh1.  
 
Any suggestions on debugging this issue? While the installer hangs, 
the system is live and responsive (vttys can be switched, the system 
traps ctrl-alt-del and reboots normally). It just seems 
the /sbin/loader app is hung. 
Comment 3 Bob Johnson 2004-11-02 09:36:38 EST
Old bug, need info from IBM.
Comment 4 Wendy Hung 2004-11-02 10:16:07 EST
What info do you need from IBM?  I don't see any requests for info or 
comments from Red Hat here.
Comment 5 mark wisner 2004-11-08 11:35:25 EST
My guess is Red Hat would like to have attached the debug console
results. From looking at the comments, the assumption is the problem
has something to do with ACPI.
Comment 6 john stultz 2004-11-09 19:28:19 EST
This issue has been reproduced w/ RHEL 4 beta2. 
I'm looking for a way to get the console output, although it really 
isn't all that interesting. 
 
Comment 7 john stultz 2004-11-11 15:34:36 EST
Created attachment 106520 [details]
1st virtual console output of the hang

I know, jpeg screen captures are cornball, but we don't have a serial line
attached to the box, so this was the fastest way to provide the info.

This is the first console output of the installer hang.
Comment 8 john stultz 2004-11-11 15:37:36 EST
Created attachment 106521 [details]
2nd virtual console output

2nd virtual console output
Comment 9 john stultz 2004-11-11 15:41:13 EST
Created attachment 106522 [details]
3rd virtual console output

The last line about the AT keyboard showed up because I needed to plug in a PS2
keyboard in order to switch consoles, as the USB keyboard was not responding.
Comment 10 Len Brown 2004-11-15 12:30:57 EST
<can we uncheck the RHEL Beta access limitation on this bug?>

Thanks for the screen captures.  Nothing, however, jumps out
at me as broken -- at least as shown by the last screen.
Can you get a root prompt in a console and get access
to the dmesg from the beginning?

Does FC3 install on this system, or does it have the same hang?

How about other install options -- does it work if you use
the "vnc" boot option on the installer kernel or disable rhgb?

acpi=off is a big clue.  Does it also work if you use "acpi=noirq"
or "pci=noacpi"?

Comment 11 Bradley Thomas 2004-11-15 16:54:48 EST
John, what BIOS level is your x365 at?  We're not seeing this problem 
in our lab on our x365, and BIOS level is the only thing that comes 
to mind on a difference.

Are you at the latest level?
Comment 12 john stultz 2004-11-15 17:34:24 EST
Len: Just tested "acpi=noirq" and that seems to work. I'll check 
"pci=noacpi" next. 
 
Bradley: Good thought. I had seen this initially w/ a pre-GA version
of the hardware (BIOS 1.01), but this is with a GA'ed system (BIOS 
1.05). I'll go poke around and see if I can't find a more recent 
version. 
Comment 13 john stultz 2004-11-15 19:40:17 EST
Bradley: I reproduced the problem w/ BIOS 1.08 (the latest off of 
updatexpress 3.05a). Not sure why you're not seeing this while both 
Wendy and myself are. 
 
Len: I don't have an issue w/ removing the access limitation, but I 
don't want to step on any toes, so I'll let someone from Red Hat
change it.  
Comment 14 john stultz 2004-11-15 19:46:16 EST
Len: "pci=noacpi" works fine as well. 
Comment 15 Bradley Thomas 2004-11-16 17:19:31 EST
John: Realize that Wendy is also not seeing the problem as of Beta 2.
Comment 16 john stultz 2004-11-16 18:18:38 EST
Oh, that's news. Huh. Well, I just updated the BIOS on our pre-GA 
system to 1.08 and the problem still exists. So I'm seeing it on two 
of our systems. Do you have any extra hardware in your box? The ones 
in Beaverton are 4 CPUs, with 2 and 10 GB of memory, and no added PCI cards.
 
What BIOS level are you using? (Just to make sure I'm really up to 
date) 
Comment 17 john stultz 2004-11-16 20:02:01 EST
Additional clarification: it only seems the boot kernel is having 
this issue. After the system has been installed it boots and 
functions fine w/o any additional boot options. 
Comment 19 Len Brown 2004-11-17 23:32:40 EST
Please attach the output from lspci -vv and acpidmp
Please attach the dmesg -s64000 from the installed kernel boot.

Does the FC3 install kernel fail the same way as the RHEL4?

Please try booting the installed kernel with "noapic"
report if it works and if it does, attach the dmesg
and /proc/interrupts.

Please try booting the installer kernel with init=/bin/sh
capture the /proc/interrupts and as much of the dmesg as you can.
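
A /proc/interrupts capture is easiest to read when the IRQs that never fired are called out, since a driver stuck waiting on an interrupt shows up as a count of zero. A minimal sketch, assuming the usual layout (a header row of CPU names, then one row per IRQ with one count per CPU). The sample rows and device names are made up for illustration and are not taken from the attachments on this bug:

```python
def parse_interrupts(text):
    """Map IRQ number -> (total count across CPUs, description)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())           # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():    # skip NMI:, ERR: and similar rows
            continue
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpu]]
        table[int(label)] = (sum(counts), " ".join(fields[ncpu:]))
    return table


def silent_irqs(table):
    """IRQs that are registered but have never fired."""
    return sorted(irq for irq, (total, _) in table.items() if total == 0)


# Hypothetical sample, not real data from this bug.
SAMPLE = """\
           CPU0
  0:     153042          XT-PIC  timer
  1:         12          XT-PIC  i8042
 10:          0          XT-PIC  ioc0
NMI:          0
"""

if __name__ == "__main__":
    print(silent_irqs(parse_interrupts(SAMPLE)))
```

A zero count is only a hint: an idle device also shows zero, so it is worth comparing against a boot that is known to work.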
Comment 20 john stultz 2004-11-18 16:33:04 EST
Created attachment 106989 [details]
lspci -vv output

lspci -vv output from installed kernel
Comment 21 john stultz 2004-11-18 16:38:11 EST
Created attachment 106990 [details]
dmesg from installed kernel
Comment 22 john stultz 2004-11-18 17:04:09 EST
When booting w/ noapic, rhgb seems to hang at "probing new hardware". 
Looks similar to the /sbin/loader hang, as X is still active and 
working, but the init scripts are just blocked waiting for something. 
Comment 23 john stultz 2004-11-18 18:55:17 EST
Created attachment 107012 [details]
dmesg from installed kernel w/ "noapic"

Booted the install kernel w/ "noapic" and "init=/bin/sh" to capture this dmesg.
Comment 24 john stultz 2004-11-18 18:56:28 EST
Created attachment 107013 [details]
/proc/interrupts output from installed kernel w/ "noapic"
Comment 25 john stultz 2004-11-18 19:03:10 EST
The FC3 test2 (sorry, it's the only one I had around) installer
appeared to hang in the exact same way. 
Comment 26 john stultz 2004-11-18 20:08:21 EST
Created attachment 107023 [details]
acpidump output

acpidump output of installed kernel
Comment 27 Len Brown 2004-11-18 21:38:22 EST
acpidump is no good. 
Please attach the output from acpidmp, available in /usr/sbin or in pmtools: 
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ 
Comment 28 john stultz 2004-11-18 21:58:00 EST
Created attachment 107027 [details]
acpidmp output
Comment 29 Len Brown 2004-11-18 22:23:50 EST
It appears that the Red Hat install-kernel has ACPI support w/o IO-APIC support, 
and the IBM BIOS has ACPI support only with IO-APIC support -- no PIC support. 
 
I.e. This BIOS does not supply any _PRT entries in PIC mode -- 
only in IOAPIC mode.  This means that ACPI will not be able to
route any PCI interrupts in PIC mode.  As there are no _PRT entries, 
the PIC IRQs will all be programmed with identity mappings in legacy mode, 
per the ACPI spec. 
 
So it is likely that the Symbios SCSI controllers are issuing level/low
PCI interrupts, and the PIC is looking for edge/high interrupts -- no go. 
 
If you installed this box onto a legacy IDE drive 
it would probably work -- unless you needed other 
PCI devices such as the network during the install. 
 
Can IBM confirm with its BIOS engineers if this is by design? 
If it is, then there are several options: 
1. fix the BIOS to add _PRT entries in PIC mode. 
  (note that there is an unreferenced LKUS link in the DSDT 
   so some BIOS engineer appears to be part way through something) 
2. add IOAPIC support to the Red Hat installer kernel 
3. add a blacklist entry to the Red Hat installer kernel
    to set acpi=off or acpi=noirq automatically for this box. 
4. Document that the options in #3 must be manually used 
    to install this system. 
 
In the event #3 is necessary, please attach the output 
from dmidecode, available in /usr/sbin/ or here: 
http://www.nongnu.org/dmidecode/ 
 
That said, I'd like to verify what the legacy methods are doing. 
Please attach the dmesg and /proc/interrupts from booting 
the installed kernel with "acpi=off" "noapic" 
 
Comment 32 Bradley Thomas 2004-11-19 15:01:12 EST
Len and John,

Well, this is getting weirder and weirder from our end...  John, our 
system is also at BIOS level 1.08 (a build level of 28) and we are 
working with both nothing additional in the system, as well as with 
various PCI adapters.

This system does not have IDE hard drives, they are SCSI only.  The 
fact that this works for us, and isn't for you, John, really has me 
confused as well.

Len, is there anything that we could provide from our system that is 
working?
Comment 33 john stultz 2004-11-19 15:13:40 EST
Could you attach the acpidmp output like I did? That way we can 
compare and make sure we've really got the same BIOS. I'll verify my 
build level as well. 
Comment 34 john stultz 2004-11-19 15:22:49 EST
BIOS build level is RDJT28EUS from 8/18/04. 
 
Comment 35 Bradley Thomas 2004-11-19 15:39:27 EST
Created attachment 107089 [details]
acpidmp information from relentless - Brad's

Here is the acpidmp information from my Relentless.
Comment 36 john stultz 2004-11-19 16:25:17 EST
Ah. Your acpidmp is def not the same as mine. We must not be running 
the same thing. Let me see if I can't dig up a newer bios. 
Comment 37 john stultz 2004-11-19 17:29:49 EST
I just upgraded to the latest internal version of the BIOS and the 
problem is still there. I'll follow up offline w/ Wendy and Bradley 
on Monday to see if we cannot resolve this difference. 
 
Len: Could you verify that Bradley's acpidmp output does in fact have
PIC support? That would at least confirm your theory for the
difference we're seeing. 
Comment 38 Len Brown 2004-11-20 04:18:31 EST
I don't see any difference in the PCI IRQ PIC-mode support 
between the BIOS in comment #35 and the earlier one in comment #28
 
lenb@d845pe brad $ acpixtract DSDT relentless-acpidmp.txt >DSDT 
lenb@d845pe brad $ iasl -d DSDT 
Disassembly completed, written to "DSDT.dsl" 
lenb@d845pe brad $ grep PICM DSDT.dsl 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
        Name (PICM, Package (0x00) {}) 
                Return (PICM) 
 
Indeed, the only difference in the DSDT is the number of Processors 
went from 6 to 2. 
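
The grep above can be turned into a quick count of how many PIC-mode routing packages are empty. A minimal sketch, assuming (as the grep output suggests) that PICM is the name of the package each _PRT method returns in PIC mode, and that an empty one always disassembles as `Package (0x00) {}`. The non-empty entry in the sample is hypothetical and only there to show the contrast:

```python
import re

# Any definition of a PICM package.
ANY_PICM = re.compile(r"Name\s*\(\s*PICM\s*,")
# A PICM package with zero elements, as seen in the grep output above.
EMPTY_PICM = re.compile(
    r"Name\s*\(\s*PICM\s*,\s*Package\s*\(\s*0x00\s*\)\s*\{\s*\}\s*\)"
)


def picm_summary(dsl_text):
    """Return (number of PICM packages, number of those that are empty)."""
    return len(ANY_PICM.findall(dsl_text)), len(EMPTY_PICM.findall(dsl_text))


# Two empty packages in the style of the grep output, plus one made-up
# non-empty package for comparison.
SAMPLE = """
        Name (PICM, Package (0x00) {})
                Return (PICM)
        Name (PICM, Package (0x00) {})
                Return (PICM)
        Name (PICM, Package (0x01) { Package (0x04) { 0xFFFF, 0x00, 0x00, 0x0A } })
"""

if __name__ == "__main__":
    total, empty = picm_summary(SAMPLE)
    print(f"{empty} of {total} PICM packages are empty")
```

Run against a real DSDT.dsl, a result where every package is empty matches the situation described here: no PCI interrupt routing information at all in PIC mode.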
 
Comment 39 Len Brown 2004-11-20 04:36:52 EST
Bradley,  
Are you running an x86 kernel, or an x86_64 kernel? 
John's failure is using the x86 install kernel. (and x86 installed kernel w/ "noapic")
Please boot the installed x86 kernel with "noapic" and attach the resulting 
dmesg -s64000 and /proc/interrupts and lspci -vv 
 
John, Bradley, 
per comment #29, it would be helpful if you can also boot the installed 
kernel with "acpi=off" "noapic" and attach the resulting dmesg 
and /proc/interrupts 
 
Comment 40 john stultz 2004-11-22 13:51:53 EST
Created attachment 107216 [details]
dmesg from booting installed kernel w/ noapic and acpi=off
Comment 41 john stultz 2004-11-22 13:53:54 EST
Created attachment 107217 [details]
/proc/interrupts output for installed kernel w/ noapic and acpi=off
Comment 42 john stultz 2004-11-22 13:56:08 EST
Just so you know, when using noapic and acpi=off to get the above 
output the system booted normally. 
Comment 43 Bradley Thomas 2004-11-22 14:16:05 EST
Created attachment 107222 [details]
noapic boot with dmesg -s64000
Comment 44 Bradley Thomas 2004-11-22 14:17:40 EST
Created attachment 107223 [details]
noapic boot with /proc/interrupts
Comment 45 Bradley Thomas 2004-11-22 14:19:31 EST
Created attachment 107224 [details]
lspci -vv from booting installed kernel with noapic
Comment 46 Bradley Thomas 2004-11-22 14:21:32 EST
Created attachment 107225 [details]
Brad's dmesg from booting installed kernel with acpi=off
Comment 47 Bradley Thomas 2004-11-22 14:22:47 EST
Created attachment 107226 [details]
Brad's /proc/interrupts output for installed kernel with acpi=off
Comment 48 john stultz 2004-11-22 14:23:55 EST
Bradley: It looks like you guys have a qla card installed. I'll see 
if I can find one to install. Conversely you could try removing it
and seeing if the problem shows itself.  
Comment 49 Bradley Thomas 2004-11-22 15:24:10 EST
My tester removed all the adapters, as well as the qla card that was 
in the system, and then tested both the installed kernel as well as 
the installation kernel.

With the installed kernel, the system booted properly 6 out of 6 
times.

With the installation kernel, the system only booted 2 out of 4 
times.  Also, this was the first time the tester has seen a failure 
on an installation on Beta 2 on this system... curiouser and 
curiouser.
Comment 51 Len Brown 2004-11-22 16:15:43 EST
were the failures to boot using the installer (PIC-mode x86) kernel
after a power-on, or after a reboot?
Comment 52 Bradley Thomas 2004-11-22 16:19:18 EST
Unfortunately... on both.  1 failure was on a cold boot, the other 
failure was on a reboot.
Comment 54 Len Brown 2004-11-22 17:00:54 EST
how about the converse...
when the system doesn't fail to boot the installer kernel,
was it after a cold power-on or a reboot?
Comment 55 Len Brown 2004-11-22 17:17:45 EST
Bradley's machine booted w/ "noapic" 
 
ACPI: PCI Interrupt Link [LKUS] (IRQs *3) 
Linux Plug and Play Support v0.97 (c) Adam Belay 
usbcore: registered new driver usbfs 
usbcore: registered new driver hub 
PCI: Using ACPI for IRQ routing 
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11 
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3 
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5 
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 7 
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10 
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10 
ACPI: PCI interrupt 0000:0f:08.0[A]: no GSI - using IRQ 5 
ACPI: PCI interrupt 0000:0c:08.0[A]: no GSI - using IRQ 3 
ACPI: PCI interrupt 0000:09:08.0[A]: no GSI - using IRQ 11 
 
The "no GSI" message is consistent with what we saw in the DSDT -- 
the BIOS simply doesn't tell ACPI anything about this system in PIC mode.
Ie. everything in comment #29 is still true, including this question: 
 
Can IBM confirm with its BIOS engineers if this is by design? 
 
The "using IRQ 10" etc. messages are basically an un-tested error path 
where we use whatever PCI has left in config space for the device 
and hope to heck it works.  Your mileage may vary -- depends on
the state of the hardware when we read it, which is why you 
may be seeing different results depending on different BIOS stimulus. 
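
To see at a glance which devices ended up on the fallback path and which of them share an IRQ, the "no GSI" lines can be pulled out of a dmesg capture. A minimal sketch that relies only on the message format quoted above; the function names are illustrative:

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
NO_GSI = re.compile(
    r"ACPI: PCI interrupt ([0-9a-f:.]+)\[([A-D])\]: no GSI - using IRQ (\d+)"
)


def fallback_irqs(dmesg):
    """Map PCI address -> IRQ for every device that hit the 'no GSI' path."""
    return {m.group(1): int(m.group(3)) for m in NO_GSI.finditer(dmesg)}


def devices_by_irq(mapping):
    """Group PCI addresses by the IRQ they were given."""
    grouped = defaultdict(list)
    for dev, irq in mapping.items():
        grouped[irq].append(dev)
    return dict(grouped)


# The lines quoted in this comment.
LOG = """\
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 7
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:0f:08.0[A]: no GSI - using IRQ 5
ACPI: PCI interrupt 0000:0c:08.0[A]: no GSI - using IRQ 3
ACPI: PCI interrupt 0000:09:08.0[A]: no GSI - using IRQ 11
"""

if __name__ == "__main__":
    for irq, devs in sorted(devices_by_irq(fallback_irqs(LOG)).items()):
        print(irq, devs)
```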
 
 
Comment 56 john stultz 2004-11-22 17:57:29 EST
I'm trying to get ahold of the BIOS author. I'll have an update as 
soon as I learn anything. 
Comment 57 john stultz 2004-11-22 18:30:05 EST
Just talked w/ the BIOS author, it sounds like it was purposefully 
done. I'm trying to see if they can add it. Len: Would you have any 
clue why we've never seen this issue w/ RHEL3?  
Comment 58 Tim Burke 2004-11-22 20:12:25 EST
Probably because RHEL3 x86 doesn't use ACPI.
Comment 59 Len Brown 2004-11-22 20:37:12 EST
RHEL3 did not run into this because it didn't support ACPI on x86.   
So its installer kernel used legacy-PIC mode,
and its installed SMP kernel used legacy-MPS for IOAPIC-mode.   
   
I'm not sure why Red Hat has no IOAPIC support in its installer kernel -- maybe it is   
part of a tradition to minimize the size of the installer kernel to fit on a floppy; but other   
Red Hat-based distros will have the same problem.   
   
Note that you will also run into this with FC2 and  FC3, as Fedora Core is just like   
RHEL4 WRT ACPI and IOAPIC support.   
   
You would not have run into this using the x86_64 install kernel, because it includes   
IOAPIC support.   
   
I'd be curious to see an installed RHEL4 kernel boot with "acpi=off" and "noapic" --   
can you snag the dmesg from that?  The IOAPIC case uses MPS and that is already   
attached, but I'd like to see how PIRQ routers are used in the PIC case.  This is what   
the BIOS writer will have to add to the BIOS to support ACPI PIC-mode.   
  
It would be interesting if you notice a difference in the dmesg between one  
of these random failures and success cases with "noapic".   That may lead  
to another initialization bug.  
   
While you have the ear of the BIOS writer, ask them what LKUS in the DSDT is for.    
It is a PCI Interrupt Link Device, but is not referenced from anywhere.  
You might also mention to them that they're using an ASL compiler that  
is almost 2 years old "INTL 0x20030122", and that updates are available for  
free here: http://www.intel.com/technology/IAPC/acpi/downloads.htm  
Comment 60 john stultz 2004-11-22 21:29:59 EST
Len: See comment #40 and 41 for acpi=off noapic info. 
 
From my talks w/ the BIOS guy, it seems the strange lack of PIC
entries in ACPI is due to the RXE PCI expansion enclosure not working
in PIC mode (see bug #99362 for more details on the apic/noapic
installer controversy). The x360, x440, and x445 also have this
quirk; however, they are all on the ACPI blacklist due to bad
behaviour early in their careers (all of the issues have been
resolved, to my knowledge, but they remain on the list to be
cautious). This is why we do not see the same issue on those boxes. 
 
For some reason, the MPS tables seem to provide enough info for the 
PIC mode to work well enough when no RXE or second CEC is connected. 
I'm looking to see if we cannot have the ACPI table setup to mirror 
what the MPS tables do. 
Comment 61 john stultz 2004-11-22 22:08:35 EST
So slight correction on that last comment if I'm understanding this 
correctly (I'm learning this stuff as I go, forgive me and let me know
if I'm wrong).  It is not the MPS table, but the pci config space 
that is used w/ acpi=off and noapic. And using the pci config space 
we seem to get a correct enough irq routing table to install from. My 
question is why doesn't ACPI mode fall back to the same PCI config 
space as is used w/ acpi=off? 
 
thoughts? 
Comment 62 Len Brown 2004-11-23 03:12:18 EST
Yes, legacy-IOAPIC-mode uses MPS 
Legacy PIC mode uses PIRQ routers and the values in PCI config space. 
 
However, in ACPI mode, the system doesn't know anything 
about PIRQ routers, so if the BIOS didn't set them up for us, 
then they're not set up. 
 
In comment #55, the "using IRQ11" messages are the values 
from PCI configuration space.  If these numbers don't change 
between a successful and failed boot, then we know that the 
issue isn't the values in PCI configuration space. 
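
That comparison can be done mechanically once a console log from a good boot and one from a bad boot are both in hand. A minimal sketch, assuming both logs contain the same "no GSI - using IRQ" lines quoted in comment #55; the two sample logs here are hypothetical, not taken from the attachments:

```python
import re

NO_GSI = re.compile(
    r"PCI interrupt ([0-9a-f:.]+)\[[A-D]\]: no GSI - using IRQ (\d+)"
)


def irq_map(log):
    """Map PCI address -> IRQ read from config space on the fallback path."""
    return {dev: int(irq) for dev, irq in NO_GSI.findall(log)}


def irq_changes(good_log, bad_log):
    """Return {device: (irq_in_good, irq_in_bad)} for every device that differs.

    A device missing from one log is reported with None on that side.
    """
    good, bad = irq_map(good_log), irq_map(bad_log)
    return {
        dev: (good.get(dev), bad.get(dev))
        for dev in sorted(set(good) | set(bad))
        if good.get(dev) != bad.get(dev)
    }


# Hypothetical logs for illustration only.
GOOD = """\
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
"""
BAD = """\
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
"""

if __name__ == "__main__":
    print(irq_changes(GOOD, BAD) or "IRQ values identical in both boots")
```

An empty result points away from the config-space values and toward whatever state the hardware is left in before the kernel reads them.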
 
It is probably the state of the PIRQ routers. 
 
But I don't see evidence of PIRQ routers in comment #40. 
Please attach the output from 
biosdecode, which is available in the dmidecode package: 
http://www.nongnu.org/dmidecode/ 
It will print out if the system offers any PIRQ router and entries 
in Legacy PIC mode. 
 
Comment 63 Bradley Thomas 2004-11-23 10:24:19 EST
Len, to answer an earlier question from comment #54, it would work 
sometimes on a cold boot, fail sometimes on a cold boot, work 
sometimes on a reboot, and fail sometimes on a reboot.  Basically... 
intermittent.
Comment 65 Len Brown 2004-11-23 21:22:45 EST
If the hardware can't access the entire system in PIC mode,
then how can booting with acpi=off in legacy-PIC mode be a
viable workaround?

If the hardware _can_ access the entire system in legacy-PIC mode,
why can't the BIOS supply the same capability in ACPI-PIC mode?
Comment 66 john stultz 2004-11-23 21:52:15 EST
Len: I'm not the authority on this (James Cleverdon, who's on 
vacation would be better to answer), but my understanding is this:
 
APIC is necessary to be able to route interrupts from devices outside 
the single system enclosure (for example: the RXE external pci 
enclosure and a 2 x445 enclosure 16way system, where two 8 way x445s 
are linked together). Thus, from the BIOS folks' perspective APIC is
necessary, however since the RedHat installer kernel doesn't support 
apic (again, see bug #99362 for details) the workaround is to 
deconfigure your system down to the point that PIC mode will suffice 
for the install. The PIC mode information is only provided by the 
BIOS as legacy support and is not a development priority. Thus the 
reason for no PIC entries in the ACPI tables. 
 
Sorry I didn't get the biosdecode data to you today. It's on my list
for tomorrow. 
 
 
Comment 67 john stultz 2004-11-24 15:15:13 EST
Created attachment 107422 [details]
biosdecode output
Comment 68 Bradley Thomas 2004-11-29 13:52:57 EST
Len, this was put into NEEDINFO state... and I'm a bit confused on 
what further is needed from us on this issue... I blame the 
tryptophan myself for that bit of confusion :).

Could you help me get back on track here as to what is needed?  I 
thought we had supplied everything on this issue at least...
Comment 69 Len Brown 2004-11-29 17:40:39 EST
To summarize -- so I don't have to read 68 comments the
next time I get back to this bug report:-)

This exotic non-PC compatible hardware
will never install with devices in an RXE
because the RHEL4 x86 install kernel lacks IOAPIC support.
This is true both with and without ACPI and was reported
in bug #99362 against RHEL3 and bug #123050 against RHEL4.
Those with x86_64 hardware can install that IOAPIC-enabled
kernel and not see the problem, and those without an RXE
will also not see the problem.
But the only fix for those with x86
hardware and an RXE is for Red Hat to add IOAPIC support
to the x86 installer kernel, the subject of bug #123050.

Here in this humble bug, the mystery is why on an x86 system
without an RXE, acpi=noirq is necessary to make IRQ10 work
in the PIC-mode ACPI kernel.

The other mystery is why John's machine always fails,
and Bradley's machines fail only 1/3rd of the time.

IBM: per comment #62
do the "using IRQ 10" messages stay the same
on a failed vs. successful PIC-mode ACPI-enabled boot?
Is it possible to compare the failed vs. success
console logs?

biosdecode in comment #67 shows that this system
has no PIRQ routers.  The IOC is hard-coded to IRQ10
and the OS is expected to detect that by reading the
IRQ value from the device's PCI config space.  This
is the normal path in non-PIRQ legacy mode.
However, it is the error path in ACPI mode.
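
For reference, the value in question lives in the standard PCI configuration header: the Interrupt Line register is the byte at offset 0x3C and the Interrupt Pin register is the byte at offset 0x3D. A minimal sketch of decoding them from a raw dump of a device's config space; the function name and the sample bytes are illustrative only:

```python
INTERRUPT_LINE = 0x3C  # IRQ number left in the header by firmware
INTERRUPT_PIN = 0x3D   # 0 = no interrupt pin, 1..4 = INTA#..INTD#


def interrupt_info(config):
    """Return (irq_line, pin_letter) from the first 64 bytes of config space.

    pin_letter is None when the device uses no interrupt pin or the value
    is outside the defined 1..4 range.
    """
    if len(config) < 0x40:
        raise ValueError("need at least the 64-byte standard header")
    line = config[INTERRUPT_LINE]
    pin = config[INTERRUPT_PIN]
    pin_letter = "ABCD"[pin - 1] if 1 <= pin <= 4 else None
    return line, pin_letter


if __name__ == "__main__":
    # Made-up header: everything zero except line = 10, pin = INTA#.
    header = bytearray(64)
    header[INTERRUPT_LINE] = 10
    header[INTERRUPT_PIN] = 1
    print(interrupt_info(bytes(header)))
```

On a kernel with sysfs mounted, the raw bytes can usually be read from the `config` file under `/sys/bus/pci/devices/<address>/`; the same two registers appear in `lspci -vv` output as the interrupt pin and IRQ.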

The correct thing for the ACPI BIOS to do would have been
to add a _PRT entry for this PCI device with a hard-coded
value of GSI 10 in PIC mode.  One could argue that the
lack of such an entry is a BIOS bug and this bug
should be closed pending a BIOS update.

But the question remains, however, why Linux's error
path isn't enabling this IRQ.  It may either be
1. hardware/BIOS magic.
   After all, the device is connected to IRQ10 through
   magic and is un-connected from IRQ10 through some
   magic.  Maybe the BIOS runs something in SMM mode
   to disconnect the pin from IRQ10 when ACPI mode is
   entered or when the IOAPIC is enabled.  Who knows?
   The BIOS supplies no PCI Interrupt Link Devices or
   PIRQ table entries to tell us.

   IBM: per comment #59...
   what did the BIOS writer say when you asked
   them what LKUS in the DSDT was?  Note that Linux
   Disables all PCI Interrupt Link devices that are
   not referenced by PCI devices.

   I'll attach a debug patch to not do this and see
   if it has any effect.  Of course LKUS claims to
   talk to IRQ3, but one has to assume that this
   BIOS is not telling the truth.

2. It is possible we have the ELCR for IRQ10 set
   incorrectly to Edge instead of Level.  Indeed,
   this error path is typically exercised only for
   IDE devices, which are happy as a clam using
   Edge triggered mode.

   I can attach a debug patch to check for this.
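
As background for item 2: on PC-compatible hardware the ELCR is a pair of 8-bit registers at I/O ports 0x4D0 (IRQ 0-7) and 0x4D1 (IRQ 8-15), where a set bit marks that IRQ as level-triggered and a clear bit as edge-triggered. A minimal sketch of decoding the two bytes once they have been read by some other means; the register values in the example are hypothetical, and whether this chipset implements the ELCR in the standard way is an assumption:

```python
ELCR1_PORT = 0x4D0  # trigger mode bits for IRQ 0-7
ELCR2_PORT = 0x4D1  # trigger mode bits for IRQ 8-15


def trigger_mode(elcr1, elcr2, irq):
    """Return 'level' or 'edge' for a legacy IRQ given the two ELCR bytes."""
    if not 0 <= irq <= 15:
        raise ValueError("legacy PIC IRQs are 0..15")
    elcr = ((elcr2 & 0xFF) << 8) | (elcr1 & 0xFF)
    return "level" if elcr & (1 << irq) else "edge"


if __name__ == "__main__":
    # Hypothetical register contents: only bit 2 of ELCR2 (IRQ 10) is set.
    for irq in (3, 5, 10, 11):
        print(irq, trigger_mode(0x00, 0x04, irq))
```

If IRQ10 decodes as edge on a failing boot, that would be consistent with the theory in item 2.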

keeping bug in NEEDINFO -- please reply to the items with "IBM:"

p.s. IBM: BTW, what did the bios writer reply when you asked them
if they can use an AML compiler newer than 2 years old?
Comment 70 Len Brown 2004-11-29 17:50:18 EST
Created attachment 107581 [details]
debug patch to not disable unused Links vs. 2.6.9

Please apply this debug patch to the installed kernel and boot it with
"noapic".
If the ioc on IRQ10 works with this patch when the standard kernel
booted with same flags did not, then we know that this system does
not want us to disable the mysterious "LKUS" PCI Interrupt Link.

Note that if this debug patch were applied to the release, it would break other
systems.
Comment 71 john stultz 2004-11-29 17:56:17 EST
Len:  
 
Regarding the LKUS bits, the BIOS author said the "LKUS (Interrupt 
Link for USB device) was created when we tried to use IOAPIC of south 
bridge,but now we use external IOAPICs only. So, the LKUS is not 
used. I do not think it is a problem." 
 
Also regarding the AML compiler: "When we were developing x365, the 
compiler was pretty new. x365 was GA'ed last year,and we do not 
replace the compiler with the latest one unless we encounter a 
compiler bug." He also indicated they plan to use the latest compiler 
for future systems.  
 
 
Comment 72 Len Brown 2004-11-29 18:08:23 EST
Created attachment 107582 [details]
debug patch to register PCI interrupts in error path

Thanks for confirming that LKUS is dead BIOS code that we can ignore --
no need to test the previous debug patch.

Please apply this debug patch to the installed kernel and boot with "noapic".
This will register a PCI interrupt even though the BIOS erroneously excluded
ACPI support for it.  please attach the resulting dmesg.

Note that this debug patch may cause functioning systems in the field to fail.
Comment 74 Bradley Thomas 2004-11-30 10:53:25 EST
John, we're trying to figure out if you have an installation case 
where you do not use acpi=off in order for the install to be 
successful.  Since HT is disabled, Red Hat won't certify in that case 
with that flag...

So, does noapic and acpi=noirq work for example?  That at least would 
be a "valid" work around from a certification stand point...
Comment 75 john stultz 2004-11-30 11:17:02 EST
Bradley: From comment #12, acpi=noirq does seem to work w/ the 
installer kernel. pci=noacpi also works. 
 
Len: I should have test results for attachment 107582 [details] for you later 
today. 
Comment 76 Bob Johnson 2004-11-30 11:55:54 EST
So, John, we have a work around here ? per your last comment ?
Comment 79 john stultz 2004-11-30 15:03:05 EST
Bob: Not yet. I was out this morning, so I haven't yet tested the 
patch. It's been slow going (been sorting out how to build custom
kernels on RHEL4) but I've got a kernel now and I should be able to
get results soon.
Comment 80 john stultz 2004-11-30 16:05:40 EST
Bob: Sorry, I might have confused your question. What do you mean by 
workaround? If you mean "acpi=noirq", then yea, that works for me. 
 
As for Len's patch (attachment 107582 [details]), when used w/ noapic I get 
a hang when initializing the ide layer, before rhgb starts up (where 
it used to hang probing hardware). I'll post the debug info I can 
capture from it soon. 
Comment 81 Bradley Thomas 2004-11-30 16:14:17 EST
So, if we have to, we can use acpi=noirq.  Still want to get this 
fixed though if possible... I hate using boot parameters.

Things have been crazy here today, so we're trying to get the portion 
from comment #69 of console messages from a failed and successful 
boot tomorrow morning.
Comment 82 john stultz 2004-11-30 17:11:54 EST
Created attachment 107659 [details]
console screen capture of nogsi patch booted w/ just "noapic"

Here's a console image capture of the hang when using Len's nogsi patch w/ only
"noapic". I'll see if I can connect a serial line or get netconsole output as
well.
Comment 83 john stultz 2004-11-30 17:33:16 EST
Created attachment 107662 [details]
dmesg output for nogsi patch (no extra boot options)
Comment 84 john stultz 2004-11-30 17:34:46 EST
Created attachment 107663 [details]
/proc/interrupts output for nogsi patch (w/ no extra boot options)
Comment 85 john stultz 2004-11-30 17:40:27 EST
Created attachment 107664 [details]
dmesg output for nogsi patch booting w/ "noapic" and "acpi=off"
Comment 86 john stultz 2004-11-30 17:41:55 EST
Created attachment 107665 [details]
/proc/interrupts output for nogsi patch booting w/ "noapic acpi=off"
Comment 87 john stultz 2004-11-30 18:54:57 EST
Created attachment 107674 [details]
serial console log from nogsi patch booting w/ "noapic"

This is a bit odd. When sending the console out the serial port, it doesn't get
as far as it did when it just went to tty0. Are there any serial console gotchas
I'm unaware of?
Comment 88 john stultz 2004-11-30 19:37:56 EST
Created attachment 107678 [details]
diff of console logs from a "noacpi" boot and a "noacpi pci=noacpi" boot

Thought this might be interesting. I did a diff of the console logs between the
plain 2.6.9-1.675_ELsmp kernel using "noapic" (which hangs) and "noapic
pci=noacpi" (which boots).  The "Skipping IOAPIC probe..." bit looks curious. 
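
For anyone repeating this comparison, the following is a minimal Python sketch of the same idea: keep only the lines that differ between two captured console logs. The function name and the two sample lines are placeholders invented for illustration, not content from the attached logs.

```python
import difflib


def changed_lines(hang_log, boot_log):
    """Return only the lines that differ between two console logs.

    Lines prefixed "- " appear only in the first log and lines prefixed
    "+ " only in the second. Common lines and ndiff's "? " hint lines
    are dropped.
    """
    return [
        line
        for line in difflib.ndiff(hang_log, boot_log)
        if line.startswith(("- ", "+ "))
    ]


# Placeholder log lines (hypothetical, for illustration only).
hang_log = ["line common to both boots", "line only in the hanging boot"]
boot_log = ["line common to both boots", "line only in the working boot"]

for line in changed_lines(hang_log, boot_log):
    print(line)
```

With real captures, each list would come from reading a log file with splitlines().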

Here's a question for someone at RedHat: For some reason the serial console
stops working after the SELinux initialization. Using the dmesg from the
booted kernel, I can see there is a good amount of stuff being loaded after
that point, so I wonder if there's a way to get around this? It might clear up
exactly at what point the system is hanging.
Comment 89 john stultz 2004-11-30 21:02:01 EST
Created attachment 107680 [details]
extended dmesg from installed kernel w/ "noapic init=/bin/sh"

Trying to sort out what's hanging the box, I noticed the following interrupts
are present w/ "noapic acpi=off" but not present in the "noapic init=/bin/sh"
/proc/interrupts logs I've captured: ohci_hcd, tg3, and
radeon@pci:0000:00:08.0.

I once again booted w/ "noapic init=/bin/sh" and modprobed the tg3, ohci-hcd and
radeon modules. 

The only odd bit I saw was the following line:
ohci_hcd 0000:00:0f.2: Unlink after no-IRQ?  Different ACPI or APIC settings
may help.

The system still seemed to be fine, but the hang I've been seeing doesn't lock
the box up; instead it seems to just block the /sbin/loader application on the
installer and the hardware probing bit (kudzu?) from the installed kernel. The
system is still alive and getting interrupts (from the keyboard and mouse,
at least).

I do notice however, that when we load the tg3, we get a "ACPI: PCI interrupt
0000:03:01.1[B]: no GSI - using IRQ 11" message for eth1. Then when the radeon
module loads we get a very similar message (again using IRQ 11). Is this problematic?
Comment 90 john stultz 2004-11-30 21:33:26 EST
More and more it looks like the USB bits are the problem. That 
"Unlink after no-IRQ" message makes it look like USB isn't receiving 
interrupts. In fact, after I load ohci-hcd and that message occurs, 
USB events are not recognized. I get zero interrupts 
in /proc/interrupts and inserting a usb device does nothing. 
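
A minimal Python sketch of that check: total the per-handler counts in /proc/interrupts and list the handlers that have received nothing. The sample text is hypothetical. It assumes the 2.6-era XT-PIC layout (IRQ number, one count per CPU, controller type, comma-separated handler names), and every count in it is invented for illustration.

```python
# Hypothetical sample in the 2.6-era XT-PIC layout; all counts are invented.
SAMPLE = """\
           CPU0
  0:     123456          XT-PIC  timer
  1:        350          XT-PIC  i8042
  3:          0          XT-PIC  ohci_hcd
 11:       4200          XT-PIC  eth1, radeon@pci:0000:00:08.0
NMI:          0
ERR:          0
"""


def irq_counts(proc_interrupts_text):
    """Map each handler name to its interrupt count summed across CPUs."""
    lines = proc_interrupts_text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Skip non-numeric rows (NMI, ERR, ...) and rows with no handler.
        if not label.strip().isdigit() or len(fields) < ncpus + 2:
            continue
        total = sum(int(count) for count in fields[:ncpus])
        # fields[ncpus] is the controller type; handler names follow it.
        for name in " ".join(fields[ncpus + 1:]).split(","):
            totals[name.strip()] = total
    return totals


silent = sorted(name for name, count in irq_counts(SAMPLE).items() if count == 0)
print("handlers with zero interrupts:", silent)
```

On a live system the input would be the contents of /proc/interrupts rather than SAMPLE.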
 
Looking at the difference in the logs between the "noapic 
init=/bin/sh" and the "noapic pci=noacpi" cases, I don't see any 
indication of a problem. Both use irq3. I'm confused. 
Comment 91 john stultz 2004-11-30 21:40:35 EST
Looks to be similar to FC bug #135171 
Comment 92 Len Brown 2004-11-30 23:58:50 EST
Created attachment 107683 [details]
updated debug patch to register interrupts in error path vs 2.6.9

John, comment #87 shows that I botched the debug patch.

ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10	  
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.0[A] -> GSI 0 (level, low) -> IRQ 0

The "IRQ 0" part should be "IRQ 10" and we should then see
a message that we set it to LEVEL triggered, ioc should get
interrupts and the system booted with just "noapic" should boot.

Please patch -R on it and apply this updated patch.
Boot the installed kernel with "noapic" and capture the console.
I do not think the other test cases are necessary.

Please remember that we're debugging an error case that may
not be fixable w/o breaking other systems.  The real fix, per comment #69
is to fix the BIOS, and you should be poking the BIOS team to do so.
Comment 93 john stultz 2004-12-01 12:06:49 EST
Len: I'll retest the patch and get back to you later today.  
 
I have started discussions w/ the BIOS team, but I still worry the 
kernel isn't doing the right thing. Maybe you could clear this up by 
answering: Why would the ACPI failure case (due to the lack of PIC 
PRTs) differ from the pci=noacpi case when booting w/o apic support? 
Comment 94 john stultz 2004-12-01 13:48:33 EST
Len: First attempt using your new no_gsi patch w/ "noapic" hung while 
detecting hardware. I'll have console logs for you shortly. 
Comment 95 john stultz 2004-12-01 13:55:55 EST
Created attachment 107728 [details]
console log from updated nogsi patched kernel booting w/ "noapic"

Here's the console log output from the updated no_gsi patched kernel using
"noapic". Again, console log stops after SELinux initializes, however the
system continues booting getting to the "Checking for new hardware" (kudzu)
portion of init.
Comment 96 john stultz 2004-12-01 13:59:17 EST
Also last night I booted w/ init=/bin/sh, and after mounting /proc 
and /proc/bus/usb, catting any file within the /proc/bus/usb hung 
cat. I'm guessing this is what is happening to the kudzu 
and /sbin/loader processes. 
Comment 97 Len Brown 2004-12-01 16:10:17 EST
Thanks for testing the latest debug patch.

For this test the system behaved exactly like
it did w/o the debug patch (except for the additional
output) yes?

PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:00:0f.2[A] -> GSI 3 (level, low) -> IRQ 3
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:01.0[A] -> GSI 5 (level, low) -> IRQ 5
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 11
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:01.1[B] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.0[A] -> GSI 10 (level, low) -> IRQ 10
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.1[B] -> GSI 10 (level, low) -> IRQ 10

This indicates that the ELCR setting is not the issue,
because if any of these IRQs were set to EDGE, we would
have seen output here where we set them to LEVEL.
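
As a side check, the "no GSI" lines quoted above can be grouped per IRQ to see which devices end up sharing a line in PIC mode. This is only a rough Python sketch: the log lines are copied from the output above, while the regular expression and function name are my own assumptions about the message format.

```python
import re
from collections import defaultdict

# Lines copied from the console output quoted above.
LOG = """\
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 11
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10
"""

# Assumed message format: device address, interrupt pin, fallback IRQ.
NO_GSI = re.compile(
    r"ACPI: PCI interrupt (\S+)\[([A-D])\]: no GSI - using IRQ (\d+)"
)


def group_by_irq(text):
    """Return {irq: [device[pin], ...]} for every 'no GSI' line in text."""
    by_irq = defaultdict(list)
    for match in NO_GSI.finditer(text):
        device, pin, irq = match.groups()
        by_irq[int(irq)].append(f"{device}[{pin}]")
    return dict(by_irq)


shared = {irq: devs for irq, devs in group_by_irq(LOG).items() if len(devs) > 1}
for irq, devs in sorted(shared.items()):
    print(f"IRQ {irq}: {', '.join(devs)}")
```

For this log it reports IRQ 10 and IRQ 11 as each carrying two devices.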

Re: why doesn't the ACPI error path behave exactly as
if ACPI were not enabled?  The short answer is because this
is the first system to exercise that error path for a PCI
device, but to find the long answer we'll need an additional
debug patch...

BTW. since you can get to a shell prompt, it would be interesting
to know what /proc/interrupts says about interrupts delivered.
Comment 98 Bradley Thomas 2004-12-01 16:20:55 EST
Created attachment 107737 [details]
console log failure.try1.warmboot log

Len, this is the first of a couple logs that we're going to be posting from the
console... there are some problems that we're having on capturing the full
failure from a remote terminal, since once the framebuffer starts, we no longer
get a full output to the remote console.

If we're doing something wrong... please let us know :).
Comment 99 Bradley Thomas 2004-12-01 16:22:15 EST
Created attachment 107738 [details]
console log failure.try2.warmboot log
Comment 100 Bradley Thomas 2004-12-01 16:23:36 EST
Created attachment 107739 [details]
console log failure.try3.coldboot.log
Comment 101 Bradley Thomas 2004-12-01 16:24:42 EST
Created attachment 107740 [details]
dmesg log success.dmesg.log
Comment 102 Bradley Thomas 2004-12-01 16:26:28 EST
Created attachment 107741 [details]
console log success.dmesg.log
Comment 103 Len Brown 2004-12-01 16:31:50 EST
> console=ttyS0,57600 console=tty

I use this:

console=tty0 console=ttyS0,115200n8

I think the order is important -- the 2nd one remains the primary 
console.
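
To make the ordering point concrete, here is a small Python sketch that picks the primary console out of a kernel command line. It assumes the documented kernel behaviour that messages go to every console= listed and the last one becomes /dev/console. The helper names are made up for illustration.

```python
def console_args(cmdline):
    """Return the value of every console= argument, in command-line order."""
    return [
        arg.split("=", 1)[1]
        for arg in cmdline.split()
        if arg.startswith("console=")
    ]


def primary_console(cmdline):
    """Return the last console= value, or None if none is given."""
    consoles = console_args(cmdline)
    return consoles[-1] if consoles else None


# The two orderings discussed in this comment.
print(primary_console("console=ttyS0,57600 console=tty"))
print(primary_console("console=tty0 console=ttyS0,115200n8"))
```

The first ordering leaves the local tty as primary and the second leaves the serial port as primary.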

Also, you might experiment with your firmware console re-direction -- 
it is possible that it is conflicting with Linux, or it may be that 
you can leave it enabled after boot and it will re-direct everything 
seen on tty0 when you disable the Linux serial console.
Comment 104 Len Brown 2004-12-01 16:33:15 EST
btw. it is easier for me to read console logs if they're
in ASCII instead of Unicode.
Comment 105 Len Brown 2004-12-01 16:47:22 EST
The two success cases were dmesg from SMP kernel boots
and the three failure cases were console captures
from UP installer-kernel boots?

I don't think we're going to learn anything from the installed
SMP kernel, unless you boot it with "noapic", and I guess
"maxcpus=1" to act like the installed kernel.

You need to add "debug" to cmdline for console captures
to get all the lines that would be in dmesg.

Congratulations on making the 100th entry in this bug report:-)
Comment 106 john stultz 2004-12-01 16:51:40 EST
re comment #97 : Yes, behaviour wise, the updated no_gsi patch only  
seems to print additional debug info. The hang is the same as the  
redhat kernel w/o the patch.   
  
I'm working remote today, so unfortunately I cannot get  
the /proc/interrupts output from booting w/ init=/bin/sh. However, I  
strongly suspect it will be just like the output from attachment  
#107013 [details]. I'll update tomorrow if it's not.  
Comment 107 john stultz 2004-12-02 17:45:53 EST
Gah! Oh the time wasted! And it was right under my nose! 
 
After comment #72, when you said I wouldn't need to test the patch in 
attachment #107581 [details] from comment #72, I went along ignoring that 
patch. This morning, after explaining the noapic + acpi situation to 
GregKH, he said Linus had posted a patch related to the apic bits w/ 
ACPI. So I dug up the patch from 
http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/1650.html 
and gave it a whirl, and lo and behold it worked. When I looked to 
see what it was doing, it looked oddly familiar. Well, it ends up being 
the same thing as the patch in attachment #107581 [details] !!!! 
 
Crud. So Len, forgive me if I passed on bad info about the LKUS bits 
being irrelevant. The BIOS guy did mention that they were related to 
the USB and that should have tipped me off when I started seeing the 
trouble was USB related.  
 
Len: Could you tell me what exactly the disabled code is doing, so I 
might have better ammo to get the BIOS folks to fix it? 
 
 
 
 
Comment 108 john stultz 2004-12-06 13:33:12 EST
Len: The BIOS guy got back to me (he's on vacation), and insists that 
nothing calls LKUS. Is the Linux ACPI subsystem calling it then? I 
don't really understand the code that the patch in comment #70 
disables. Might you clarify? 
 
After the 10th I should be able to get the BIOS folks to tweak things 
and send me some test BIOSes.  
Comment 109 Bradley Thomas 2004-12-06 14:21:37 EST
With this last bit of information added, I'm assuming that no further 
capture information is needed from my end.  If this is incorrect, 
please let me know.
Comment 110 Bradley Thomas 2004-12-08 15:25:55 EST
As of RHEL 4 pre-RC1 we are still needing to pass two boot 
parameters, but when we do... it works.  If we pass both noapic 
acpi=noirq the install works properly.

It reboots without using both of the parameters.  John, is this the 
case with yours still?
Comment 111 john stultz 2004-12-10 18:06:12 EST
Bradley: Yea, pci=noacpi is still needed when installing. I'm not 
sure why you're passing noapic to the installer kernel (apic is 
disabled there), or are you talking about the installed SMP kernel? 
 
Len: Any feedback from comment #107 and comment #108? Hopefully 
Monday I'll have a test BIOS to play with, but a summary of why the 
dead code is causing problems would help my understanding. 
Comment 112 Len Brown 2004-12-12 00:43:21 EST
Please share this with the BIOS writer:

Linux disables (evaluates the _DIS method) for ALL
PCI Interrupt Link Devices in the system, and then re-enables
them (evaluates the _STA method) only when/if the device is
enabled by a device driver.  We do this because not doing
it results in spurious interrupts on some systems.

Apparently the register write in the _DIS method of
the bogus LKUS link in this BIOS is doing a "bad thing".

Note that Linus' patch referenced above was NOT
applied to the upstream kernel, so don't count on it
to save this system from failing in the future.

This BIOS needs to:
1. remove LKUS
2. add _PRT entries for PIC mode
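
A toy Python model of the sequence described above, to show why a link that nothing references still gets touched. This is not kernel code. The class, the function names and the "LNKA" link are hypothetical, LKUS is the link named in this bug, and the "re-enable" label simply stands for the re-enable step described above.

```python
class InterruptLink:
    """Toy stand-in for an ACPI PCI Interrupt Link Device (not kernel code)."""

    def __init__(self, name):
        self.name = name
        self.calls = []  # methods evaluated on this link, in order

    def evaluate(self, method):
        self.calls.append(method)


def boot_disable_all(links):
    """At boot, _DIS is evaluated on every link, whether used or not."""
    for link in links:
        link.evaluate("_DIS")


def driver_enable(link):
    """A link is re-enabled only when a driver enables a device behind it."""
    link.evaluate("re-enable")


lnka = InterruptLink("LNKA")  # hypothetical link that a device actually uses
lkus = InterruptLink("LKUS")  # the unused link from this BIOS

boot_disable_all([lnka, lkus])
driver_enable(lnka)  # no driver ever asks for LKUS

for link in (lnka, lkus):
    print(link.name, link.calls)
```

In this model LKUS ends up with only the _DIS evaluation, so whatever its _DIS register write does is never undone.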
Comment 113 john stultz 2004-12-13 16:46:49 EST
I wrangled up a test BIOS that had the LKUS removed. That seemed to 
fix the problem. I'm seeing what I can do to have that change 
released in the next BIOS update. 
Comment 114 john stultz 2004-12-21 14:02:31 EST
Since it looks like a BIOS update will be the fix here, I'm closing this 
as NOTABUG. I may reopen this if the BIOS engineers push back with a 
decent reason. 
