Description of problem:
When booting the RHEL4-Alpha4 installer CD on an IBM x365, the installer hangs at "running /sbin/loader". Booting w/ "acpi=off" works around the issue.
How reproducible:
Every time.
Steps to Reproduce:
1. Insert RHEL4-alpha4 CD #1 into an IBM x365
2. Boot system
Actual results:
Hang at "running /sbin/loader"
Expected results:
No hang
Additional info:
Booting w/ "acpi=off" works around the issue. Changing to the debug console, the last messages were about the USB controller.
reproduced with beta 1 refresh.
Also confirmed the problem exists w/ Beta1 Refresh1. Any suggestions on debugging this issue? While the installer hangs, the system is live and responsive (vttys can be switched, the system traps ctrl-alt-del and reboots normally). It just seems the /sbin/loader app is hung.
Old bug, need info from IBM.
What info do you need from IBM? I don't see any requests for info or comments from Red Hat here.
My guess is Red Hat would like to have the debug console results attached. From looking at the comments, the assumption is the problem has something to do with ACPI.
This issue has been reproduced w/ RHEL 4 beta2. I'm looking for a way to get the console output, although it really isn't all that interesting.
Created attachment 106520 [details] 1st virtual console output of the hang I know, jpeg screen captures are cornball, but we don't have a serial line attached to the box, so this was the fastest way to provide the info. This is the first console output of the installer hang.
Created attachment 106521 [details] 2nd virtual console output 2nd virtual console output
Created attachment 106522 [details] 3rd virtual console output The last line about the AT keyboard showed up because I needed to plug in a PS2 keyboard in order to switch consoles, as the USB keyboard was not responding.
<can we uncheck the RHEL Beta access limitation on this bug?> Thanks for the screen captures. Nothing, however, jumps out at me as broken -- at least as shown by the last screen. Can you get a root prompt in a console and get access to the dmesg from the beginning? Does FC3 install on this system, or does it have the same hang? How about other install options -- does it work if you use the "vnc" boot option on the installer kernel or disable rhgb? acpi=off is a big clue. Does it also work if you use "acpi=noirq" or "pci=noacpi"?
John, what BIOS level is your x365 at? We're not seeing this problem in our lab on our x365, and BIOS level is the only thing that comes to mind on a difference. Are you at the latest level?
Len: Just tested "acpi=noirq" and that seems to work. I'll check "pci=noacpi" next. Bradley: Good thought. I had seen this initially w/ a pre-GA version of the hardware (BIOS 1.01), but this is with a GA'ed system (BIOS 1.05). I'll go poke around and see if I can't find a more recent version.
Bradley: I reproduced the problem w/ BIOS 1.08 (the latest off of updatexpress 3.05a). Not sure why you're not seeing this while both Wendy and I are. Len: I don't have an issue w/ removing the access limitation, but I don't want to step on any toes, so I'll let someone from Red Hat change it.
Len: "pci=noacpi" works fine as well.
John: Realize that Wendy is also not seeing the problem as of Beta 2.
Oh, that's news. Huh. Well, I just updated the BIOS on our pre-GA system to 1.08 and the problem still exists. So I'm seeing it on two of our systems. Do you have any extra hardware in your box? The ones in Beaverton have 4 CPUs, 2 and 10 gigs of memory, and no added PCI cards. What BIOS level are you using? (Just to make sure I'm really up to date)
Additional clarification: it only seems the boot kernel is having this issue. After the system has been installed it boots and functions fine w/o any additional boot options.
Please attach the output from lspci -vv and acpidmp.
Please attach the dmesg -s64000 from the installed kernel boot.
Does the FC3 install kernel fail the same way as the RHEL4?
Please try booting the installed kernel with "noapic". Report if it works, and if it does, attach the dmesg and /proc/interrupts.
Please try booting the installer kernel with init=/bin/sh and capture the /proc/interrupts and as much of the dmesg as you can.
Created attachment 106989 [details] lspci -vv output lspci -vv output from installed kernel
Created attachment 106990 [details] dmesg from installed kernel
When booting w/ noapic, rhgb seems to hang at "probing new hardware". Looks similar to the /sbin/loader hang, as X is still active and working, but the init scripts are just blocked waiting for something.
Created attachment 107012 [details] dmesg from installed kernel w/ "noapic" Booted the install kernel w/ "noapic" and "init=/bin/sh" to capture this dmesg.
Created attachment 107013 [details] /proc/interrupts output from installed kernel w/ "noapic"
The FC3 test2 (sorry, it's the only one I had around) installer appeared to hang in the exact same way.
Created attachment 107023 [details] acpidump output acpidump output of installed kernel
acpidump is no good. Please attach the output from acpidmp, available in /usr/sbin or in pmtools: http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/
Created attachment 107027 [details] acpidmp output
It appears that the Red Hat install-kernel has ACPI support w/o IO-APIC support, and the IBM BIOS has ACPI support only with IO-APIC support -- no PIC support. I.e. this BIOS does not supply any _PRT entries in PIC mode -- only in IOAPIC mode. This means that ACPI will not be able to route any PCI interrupts in PIC mode. As there are no _PRT entries, the PIC IRQs will all be programmed with identity mappings in legacy mode, per the ACPI spec. So it is likely that the Symbios SCSI controllers are issuing level/low PCI interrupts, and the PIC is looking for edge/high interrupts -- no go. If you installed this box onto a legacy IDE drive it would probably work -- unless you needed other PCI devices such as the network during the install.
Can IBM confirm with its BIOS engineers if this is by design? If it is, then there are several options:
1. Fix the BIOS to add _PRT entries in PIC mode. (Note that there is an unreferenced LKUS link in the DSDT, so some BIOS engineer appears to be part way through something.)
2. Add IOAPIC support to the Red Hat installer kernel.
3. Add a blacklist entry to the Red Hat installer kernel to set acpi=off or acpi=noirq automatically for this box.
4. Document that the options in #3 must be manually used to install this system.
In the event #3 is necessary, please attach the output from dmidecode, available in /usr/sbin/ or here: http://www.nongnu.org/dmidecode/
That said, I'd like to verify what the legacy methods are doing. Please attach the dmesg and /proc/interrupts from booting the installed kernel with "acpi=off" "noapic"
Len and John, Well, this is getting weirder and weirder from our end... John, our system is also at BIOS level 1.08 (a build level of 28) and we are working with both nothing additional in the system, as well as with various PCI adapters. This system does not have IDE hard drives, they are SCSI only. The fact that this works for us, and isn't for you, John, really has me confused as well. Len, is there anything that we could provide from our system that is working?
Could you attach the acpidmp output like I did? That way we can compare and make sure we've really got the same BIOS. I'll verify my build level as well.
BIOS build level is RDJT28EUS from 8/18/04.
Created attachment 107089 [details] acpidmp information from relentless - Brad's Here is the acpidmp information from my Relentless.
Ah. Your acpidmp is def not the same as mine. We must not be running the same thing. Let me see if I can't dig up a newer bios.
I just upgraded to the latest internal version of the BIOS and the problem is still there. I'll follow up offline w/ Wendy and Bradley on Monday to see if we cannot resolve this difference. Len: Could you verify that Bradley's acpidmp output does in fact have PIC support? That would at least confirm your theory for the difference we're seeing.
I don't see any difference in the PCI IRQ PIC-mode support between the BIOS in comment #35 and the earlier one in comment #28:
lenb@d845pe brad $ acpixtract DSDT relentless-acpidmp.txt >DSDT
lenb@d845pe brad $ iasl -d DSDT
Disassembly completed, written to "DSDT.dsl"
lenb@d845pe brad $ grep PICM DSDT.dsl
Name (PICM, Package (0x00) {})
Return (PICM)
Name (PICM, Package (0x00) {})
Return (PICM)
Name (PICM, Package (0x00) {})
Return (PICM)
Name (PICM, Package (0x00) {})
Return (PICM)
Name (PICM, Package (0x00) {})
Return (PICM)
Name (PICM, Package (0x00) {})
Return (PICM)
Indeed, the only difference in the DSDT is that the number of Processors went from 6 to 2.
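For anyone repeating this check against another BIOS, the grep step can be sketched in Python. This is only a rough illustration: it assumes the DSDT has already been disassembled with iasl -d, and that an empty PIC-mode routing package appears literally as `Name (PICM, Package (0x00) {})`, as in the output above. The function name `picm_summary` is made up for this example.

```python
import re

# An empty PIC-mode package, spelled the way iasl printed it in this thread.
EMPTY_PICM = re.compile(r"Name\s*\(PICM,\s*Package\s*\(0x00\)\s*\{\s*\}\)")
# Any PICM package declaration, empty or not.
ANY_PICM = re.compile(r"Name\s*\(PICM,\s*Package\s*\(")


def picm_summary(dsl_text):
    """Return (total PICM declarations, how many of them are empty)."""
    total = len(ANY_PICM.findall(dsl_text))
    empty = len(EMPTY_PICM.findall(dsl_text))
    return total, empty


if __name__ == "__main__":
    # Hypothetical stand-in for the contents of DSDT.dsl.
    sample = "Name (PICM, Package (0x00) {})\nReturn (PICM)\n" * 6
    total, empty = picm_summary(sample)
    print(f"{empty} of {total} PICM packages are empty")
```

If every PICM package is empty, the BIOS offers no PIC-mode _PRT entries, which is the situation described for both machines here.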
Bradley, are you running an x86 kernel, or an x86_64 kernel? John's failure is using the x86 install kernel (and x86 installed kernel w/ "noapic"). Please boot the installed x86 kernel with "noapic" and attach the resulting dmesg -s64000 and /proc/interrupts and lspci -vv. John, Bradley, per comment #29, it would be helpful if you can also boot the installed kernel with "acpi=off" "noapic" and attach the resulting dmesg and /proc/interrupts
Created attachment 107216 [details] dmesg from booting installed kernel w/ noapic and acpi=off
Created attachment 107217 [details] /proc/interrupts output for installed kernel w/ noapic and acpi=off
Just so you know, when using noapic and acpi=off to get the above output the system booted normally.
Created attachment 107222 [details] noapic boot with dmesg -s64000
Created attachment 107223 [details] noapic boot with /proc/interrupts
Created attachment 107224 [details] lspci -vv from booting installed kernel with noapic
Created attachment 107225 [details] Brad's dmesg from booting installed kernel with acpi=off
Created attachment 107226 [details] Brad's /proc/interrupts output for installed kernel with acpi=off
Bradley: It looks like you guys have a qla card installed. I'll see if I can find one to install. Conversely, you could try removing it and seeing if the problem shows itself.
My tester removed all the adapters, as well as the qla card that was in the system, and then tested both the installed kernel as well as the installation kernel. With the installed kernel, the system booted properly 6 out of 6 times. With the installation kernel, the system only booted 2 out of 4 times. Also, this was the first time the tester has seen a failure on an installation on Beta 2 on this system... curiouser and curiouser.
were the failures to boot using the installer (PIC-mode x86) kernel after a power-on, or after a reboot?
Unfortunately... on both. 1 failure was on a cold boot, the other failure was on a reboot.
how about the converse... when the system doesn't fail to boot the installer kernel, was it after a cold power-on or a reboot?
Bradley's machine booted w/ "noapic":
ACPI: PCI Interrupt Link [LKUS] (IRQs *3)
Linux Plug and Play Support v0.97 (c) Adam Belay
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 7
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10
ACPI: PCI interrupt 0000:0f:08.0[A]: no GSI - using IRQ 5
ACPI: PCI interrupt 0000:0c:08.0[A]: no GSI - using IRQ 3
ACPI: PCI interrupt 0000:09:08.0[A]: no GSI - using IRQ 11
The "no GSI" message is consistent with what we saw in the DSDT -- the BIOS simply doesn't tell ACPI anything about this system in PIC mode. I.e. everything in comment #29 is still true, including this question: Can IBM confirm with its BIOS engineers if this is by design?
The "using IRQ 10" etc. messages are basically an un-tested error path where we use whatever PCI has left in config space for the device and hope to heck it works. Your mileage may vary -- depends on the state of the hardware when we read it, which is why you may be seeing different results depending on different BIOS stimulus.
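Since the fallback IRQ values may differ from boot to boot, comparing them across a failed and a successful boot can be automated. The following is a minimal sketch, assuming the console logs contain "no GSI - using IRQ" lines in exactly the format shown above; the function names are hypothetical.

```python
import re

# One match per fallback line: PCI address, interrupt pin, IRQ taken from config space.
NO_GSI = re.compile(
    r"ACPI: PCI interrupt (\S+?)\[([A-D])\]: no GSI - using IRQ (\d+)"
)


def fallback_irqs(log_text):
    """Map (PCI address, pin) -> IRQ for every 'no GSI' fallback line in a log."""
    return {
        (m.group(1), m.group(2)): int(m.group(3))
        for m in NO_GSI.finditer(log_text)
    }


def changed_irqs(log_a, log_b):
    """Return devices whose fallback IRQ differs (or is missing) between two logs."""
    a, b = fallback_irqs(log_a), fallback_irqs(log_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

An empty result from `changed_irqs(failed_log, good_log)` would suggest the values in PCI config space are not what distinguishes a failed boot from a successful one.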
I'm trying to get ahold of the BIOS author. I'll have an update as soon as I learn anything.
Just talked w/ the BIOS author, it sounds like it was purposefully done. I'm trying to see if they can add it. Len: Would you have any clue why we've never seen this issue w/ RHEL3?
Probably because RHEL3 x86 doesn't use ACPI.
RHEL3 did not run into this because it didn't support ACPI on x86. So its installer kernel used legacy-PIC mode, and its installed SMP kernel used legacy-MPS for IOAPIC-mode. I'm not sure why Red Hat has no IOAPIC support in its installer kernel -- maybe it is part of a tradition to minimize the size of the installer kernel to fit on a floppy; but other Red Hat-based distros will have the same problem. Note that you will also run into this with FC2 and FC3, as Fedora Core is just like RHEL4 WRT ACPI and IOAPIC support. You would not have run into this using the x86_64 install kernel, because it includes IOAPIC support. I'd be curious to see an installed RHEL4 kernel boot with "acpi=off" and "noapic" -- can you snag the dmesg from that? The IOAPIC case uses MPS and that is already attached, but I'd like to see how PIRQ routers are used in the PIC case. This is what the BIOS writer will have to add to the BIOS to support ACPI PIC-mode. It would be interesting if you notice a difference in the dmesg between one of these random failures and success cases with "noapic". That may lead to another initialization bug. While you have the ear of the BIOS writer, ask them what LKUS in the DSDT is for. It is a PCI Interrupt Link Device, but is not referenced from anywhere. You might also mention to them that they're using an ASL compiler that is almost 2 years old "INTL 0x20030122", and that updates are available for free here: http://www.intel.com/technology/IAPC/acpi/downloads.htm
Len: See comment #40 and 41 for acpi=off noapic info. From my talks w/ the BIOS guy, it seems the strange lack of PIC entries in ACPI is due to the RXE PCI expansion enclosure not working in PIC mode (see bug #99362 for more details on the apic/noapic installer controversy). The x360, x440, and x445 also have this quirk, however they are all on the ACPI blacklist due to bad behaviour early in their careers (all of the issues have been resolved, to my knowledge, but they remain on the list to be cautious). This is why we do not see the same issue on those boxes. For some reason, the MPS tables seem to provide enough info for the PIC mode to work well enough when no RXE or second CEC is connected. I'm looking to see if we cannot have the ACPI table set up to mirror what the MPS tables do.
So, a slight correction on that last comment, if I'm understanding this correctly (I'm learning this stuff as I go, forgive me and let me know if I'm wrong). It is not the MPS table, but the PCI config space that is used w/ acpi=off and noapic. And using the PCI config space we seem to get a correct enough IRQ routing table to install from. My question is why doesn't ACPI mode fall back to the same PCI config space as is used w/ acpi=off? Thoughts?
Yes, legacy-IOAPIC-mode uses MPS. Legacy PIC mode uses PIRQ routers and the values in PCI config space. However, in ACPI mode, the system doesn't know anything about PIRQ routers, so if the BIOS didn't set them up for us, then they're not set up.
In comment #55, the "using IRQ11" messages are the values from PCI configuration space. If these numbers don't change between a successful and failed boot, then we know that the issue isn't the values in PCI configuration space. It is probably the state of the PIRQ routers. But I don't see evidence of PIRQ routers in comment #40.
Please attach the output from biosdecode, which is available in the dmidecode package: http://www.nongnu.org/dmidecode/ It will print out if the system offers any PIRQ router and entries in Legacy PIC mode.
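The four routing cases discussed so far can be summarized as a small decision table. This is a simplified sketch of the reasoning in this thread, not the kernel's actual logic: it assumes that acpi=off, acpi=noirq and pci=noacpi all turn off ACPI-based IRQ routing, and that "noapic" (or a kernel built without IOAPIC support, like the x86 installer kernel) forces PIC mode. The function name is invented for illustration.

```python
# Boot options that, per this thread, keep ACPI out of PCI IRQ routing.
ACPI_ROUTING_OFF = {"acpi=off", "acpi=noirq", "pci=noacpi"}


def irq_routing_source(cmdline, kernel_has_ioapic):
    """Name the information source the kernel would rely on for PCI IRQ routing."""
    opts = set(cmdline.split())
    ioapic = kernel_has_ioapic and "noapic" not in opts
    if not (opts & ACPI_ROUTING_OFF):
        # ACPI routing relies on _PRT entries for the current interrupt mode.
        return "ACPI _PRT (IOAPIC mode)" if ioapic else "ACPI _PRT (PIC mode)"
    # Legacy routing: MPS table with an IOAPIC, PIRQ/config space without one.
    return "MPS table" if ioapic else "PIRQ routers / PCI config space"


if __name__ == "__main__":
    for cmdline, has_ioapic in [("", False), ("acpi=noirq", False), ("noapic", True)]:
        print(repr(cmdline), has_ioapic, "->", irq_routing_source(cmdline, has_ioapic))
```

On this BIOS the "ACPI _PRT (PIC mode)" case is the one with no entries, which matches the observation that the installer kernel only works once one of the three ACPI-routing-off options is passed.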
Len, to answer an earlier question from comment #54, it would work sometimes on a cold boot, fail sometimes on a cold boot, work sometimes on a reboot, and fail sometimes on a reboot. Basically... intermittent.
If the hardware can't access the entire system in PIC mode, then how can booting with acpi=off in legacy-PIC mode be a viable workaround? If the hardware _can_ access the entire system in legacy-PIC mode, why can't the BIOS supply the same capability in ACPI-PIC mode?
Len: I'm not the authority on this (James Cleverdon, who's on vacation, would be better to answer), but my understanding is this: APIC is necessary to be able to route interrupts from devices outside the single system enclosure (for example: the RXE external PCI enclosure and a 2 x445 enclosure 16-way system, where two 8-way x445s are linked together). Thus, from the BIOS folks' perspective APIC is necessary, however since the Red Hat installer kernel doesn't support APIC (again, see bug #99362 for details) the workaround is to deconfigure your system down to the point that PIC mode will suffice for the install. The PIC mode information is only provided by the BIOS as legacy support and is not a development priority. Thus the reason for no PIC entries in the ACPI tables. Sorry I didn't get the biosdecode data to you today. It's on my list for tomorrow.
Created attachment 107422 [details] biosdecode output
Len, this was put into NEEDINFO state... and I'm a bit confused on what further is needed from us on this issue... I blame the tryptophan myself for that bit of confusion :). Could you help me get back on track here as to what is needed? I thought we had supplied everything on this issue at least...
To summarize -- so I don't have to read 68 comments the next time I get back to this bug report:-)
This exotic non-PC compatible hardware will never install with devices in an RXE because the RHEL4 x86 install kernel lacks IOAPIC support. This is true both with and without ACPI and was reported in bug #99362 against RHEL3 and bug #123050 against RHEL4. Those with x86_64 hardware can install that IOAPIC-enabled kernel and not see the problem, and those without an RXE will also not see the problem. But the only fix for those with x86 hardware and an RXE is for Red Hat to add IOAPIC support to the x86 installer kernel, the subject of bug #123050.
Here in this humble bug, the mystery is why on an x86 system without an RXE, acpi=noirq is necessary to make IRQ10 work in the PIC-mode ACPI kernel. The other mystery is why John's machine always fails, and Bradley's machines fail only 1/3rd of the time.
IBM: per comment #62 do the "using IRQ 10" messages stay the same on a failed vs. successful PIC-mode ACPI-enabled boot? Is it possible to compare the failed vs. success console logs?
biosdecode in comment #67 shows that this system has no PIRQ routers. The IOC is hard-coded to IRQ10 and the OS is expected to detect that by reading the IRQ value from the device's PCI config space. This is the normal path in non-PIRQ legacy mode. However, it is the error path in ACPI mode. The correct thing for the ACPI BIOS to do would have been to add a _PRT entry for this PCI device with a hard-coded value of GSI 10 in PIC mode. One could argue that the lack of such an entry is a BIOS bug and this bug should be closed pending a BIOS update.
But the question remains, however, why Linux's error path isn't enabling this IRQ. It may either be
1. hardware/BIOS magic. After all, the device is connected to IRQ10 through magic and is un-connected from IRQ10 through some magic.
Maybe the BIOS runs something in SMM mode to disconnect the pin from IRQ10 when ACPI mode is entered or when the IOAPIC is enabled. Who knows? The BIOS supplies no PCI Interrupt Link Devices or PIRQ table entries to tell us.
IBM: per comment #59... what did the BIOS writer say when you asked them what LKUS in the DSDT was? Note that Linux disables all PCI Interrupt Link devices that are not referenced by PCI devices. I'll attach a debug patch to not do this and see if it has any effect. Of course LKUS claims to talk to IRQ3, but one has to assume that this BIOS is not telling the truth.
2. It is possible we have the ELCR for IRQ10 set incorrectly to Edge instead of Level. Indeed, this error path is typically exercised only for IDE devices, which are happy as clams using Edge triggered mode. I can attach a debug patch to check for this.
keeping bug in NEEDINFO -- please reply to the items with "IBM:"
p.s. IBM: BTW, what did the BIOS writer reply when you asked them if they can use an AML compiler newer than 2 years old?
Created attachment 107581 [details] debug patch to not disable unused Links vs. 2.6.9 Please apply this debug patch to the installed kernel and boot it with "noapic". If the ioc on IRQ10 works with this patch when the standard kernel booted with same flags did not, then we know that this system does not want us to disable the mysterious "LKUS" PCI Interrupt Link. Note that if this debug patch were applied to the release, it would break other systems.
Len: Regarding the LKUS bits, the BIOS author said the "LKUS (Interrupt Link for USB device) was created when we tried to use IOAPIC of south bridge,but now we use external IOAPICs only. So, the LKUS is not used. I do not think it is a problem." Also regarding the AML compiler: "When we were developing x365, the compiler was pretty new. x365 was GA'ed last year,and we do not replace the compiler with the latest one unless we encounter a compiler bug." He also indicated they plan to use the latest compiler for future systems.
Created attachment 107582 [details] debug patch to register PCI interrupts in error path Thanks for confirming that LKUS is dead BIOS code that we can ignore -- no need to test the previous debug patch. Please apply this debug patch to the installed kernel and boot with "noapic". This will register a PCI interrupt even though the BIOS erroneously excluded ACPI support for it. Please attach the resulting dmesg. Note that this debug patch may cause functioning systems in the field to fail.
John, we're trying to figure out if you have an installation case where you do not use acpi=off in order for the install to be successful. Since HT is disabled, Red Hat won't certify in that case with that flag... So, does noapic and acpi=noirq work for example? That at least would be a "valid" work around from a certification stand point...
Bradley: From comment #12, acpi=noirq does seem to work w/ the installer kernel. pci=noacpi also works. Len: I should have test results for attachment 107582 [details] for you later today.
So, John, we have a work around here ? per your last comment ?
Bob: Not yet. I was out this morning, so I haven't yet tested the patch. It's been slow going (been sorting out how to build custom kernels on RHEL4) but I've got a kernel now and I should be able to get results soon.
Bob: Sorry, I might have confused your question. What do you mean by workaround? If you mean "acpi=noirq", then yea, that works for me. As for Len's patch (attachment 107582 [details]), when used w/ noapic I get a hang when initializing the IDE layer, before rhgb starts up (where it used to hang probing hardware). I'll post the debug info I can capture from it soon.
So, if we have to, we can use acpi=noirq. Still want to get this fixed though if possible... I hate using boot parameters. Things have been crazy here today, so we're trying to get the portion from comment #69 of console messages from a failed and successful boot tomorrow morning.
Created attachment 107659 [details] console screen capture of nogsi patch booted w/ just "noapic" Here's a console image capture of the hang when using Len's nogsi patch w/ only "noapic". I'll see if I can connect a serial line or get netconsole output as well.
Created attachment 107662 [details] dmesg output for nogsi patch (no extra boot options)
Created attachment 107663 [details] /proc/interrupts output for nogsi patch (w/ no extra boot options)
Created attachment 107664 [details] dmesg output for nogsi patch booting w/ "noapic" and "acpi=off"
Created attachment 107665 [details] /proc/interrupts output for nogsi patch booting w/ "noapic acpi=off"
Created attachment 107674 [details] serial console log from nogsi patch booting w/ "noapic" This is a bit odd. When sending the console out the serial port, it doesn't get as far as it did when it just went to tty0. Are there any serial console gotchas I'm unaware of?
Created attachment 107678 [details] diff of console logs from a "noacpi" boot and a "noacpi pci=noacpi" boot Thought this might be interesting. I did a diff of the console logs between the plain 2.6.9-1.675_ELsmp kernel using "noapic" (which hangs) and "noapic pci=noacpi" (which boots). The "Skipping IOAPIC probe..." bit looks curious. Here's a question for someone at Red Hat: For some reason the serial console stops working after the SELinux initialization. Using the dmesg from the booted kernel, I can see there is a good amount of stuff being loaded after that point, so I wonder if there's a way to get around this? It might clear up exactly at what point the system is hanging.
Created attachment 107680 [details] extended dmesg from installed kernel w/ "noapic init=/bin/sh" Trying to sort out what's hanging the box, I noticed the following interrupts are present w/ "noapic acpi=off" but not present in the "noapic init=/bin/sh" /proc/interrupts logs I've captured: ohci_hcd, tg3, and radeon@pci:0000:00:08.0. I once again booted w/ "noapic init=/bin/sh" and modprobed the tg3, ohci-hcd and radeon modules. The only odd bit I saw was the following line:
ohci_hcd 0000:00:0f.2: Unlink after no-IRQ? Different ACPI or APIC settings may help.
The system still seemed to be fine, but the hang I've been seeing doesn't lock the box up; instead it seems to just block the /sbin/loader application on the installer and the hardware probing bit (kudzu?) from the installed kernel. The system is still alive and getting interrupts (from the keyboard and mouse, at least). I do notice, however, that when we load the tg3, we get a "ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 11" message for eth1. Then when the radeon module loads we get a very similar message (again using IRQ 11). Is this problematic?
More and more it looks like the USB bits are the problem. That "Unlink after no-IRQ" message makes it look like USB isn't receiving interrupts. In fact, after I load ohci-hcd and that message occurs, USB events are not recognized. I get zero interrupts in /proc/interrupts and inserting a USB device does nothing. Looking at the difference in the logs between the "noapic init=/bin/sh" and the "noapic pci=noacpi" cases, I don't see any indication of a problem. Both use IRQ3. I'm confused.
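Spotting a device that is registered but never receives interrupts, as with ohci_hcd here, can be done mechanically from a saved /proc/interrupts capture. The sketch below assumes the usual layout (a header row of CPU columns, then "IRQ: counts... controller device-names") and that the controller name is a single token such as XT-PIC; the sample counts in it are invented, not taken from the attachments.

```python
def silent_irqs(proc_interrupts_text):
    """Return {irq: device names} for numbered IRQs whose counts are all zero."""
    lines = proc_interrupts_text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    silent = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Skip NMI/ERR-style rows and rows without a controller and device name.
        if not label.strip().isdigit() or len(fields) < ncpu + 2:
            continue
        counts = [int(f) for f in fields[:ncpu]]
        if sum(counts) == 0:
            silent[int(label)] = " ".join(fields[ncpu + 1:])
    return silent


if __name__ == "__main__":
    # Hypothetical capture; the numbers are made up for illustration.
    sample = """\
           CPU0
  0:     123456          XT-PIC  timer
  3:          0          XT-PIC  ohci_hcd
 10:       4521          XT-PIC  ioc0, ioc1
NMI:          0
"""
    print(silent_irqs(sample))
```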
Looks to be similar to FC bug #135171
Created attachment 107683 [details] updated debug patch to register interrupts in error path vs 2.6.9
John, comment #87 shows that I botched the debug patch.
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.0[A] -> GSI 0 (level, low) -> IRQ 0
The "IRQ 0" part should be "IRQ 10" and we should then see a message that we set it to LEVEL triggered, ioc should get interrupts and the system booted with just "noapic" should boot. Please patch -R on it and apply this updated patch. Boot the installed kernel with "noapic" and capture the console. I do not think the other test cases are necessary.
Please remember that we're debugging an error case that may not be fixable w/o breaking other systems. The real fix, per comment #69, is to fix the BIOS, and you should be poking the BIOS team to do so.
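The symptom of the botched patch (a fallback of IRQ 10 followed by a registration on IRQ 0) is easy to check for in a captured console log. Here is a minimal sketch, assuming the two message formats quoted above are printed unchanged by the debug patch; the function name is made up for this example.

```python
import re

# "no GSI" fallback line: device[pin] and the IRQ read from PCI config space.
FALLBACK = re.compile(r"PCI interrupt (\S+?\[[A-D]\]): no GSI - using IRQ (\d+)")
# Registration line printed afterwards: device[pin], GSI, trigger, polarity, IRQ.
REGISTERED = re.compile(
    r"PCI interrupt (\S+?\[[A-D]\]) -> GSI (\d+) \((\w+), (\w+)\) -> IRQ (\d+)"
)


def mismatched_registrations(log_text):
    """List (device, fallback IRQ, registered IRQ) where the two disagree."""
    expected = {dev: int(irq) for dev, irq in FALLBACK.findall(log_text)}
    problems = []
    for dev, _gsi, _trigger, _polarity, irq in REGISTERED.findall(log_text):
        if dev in expected and int(irq) != expected[dev]:
            problems.append((dev, expected[dev], int(irq)))
    return problems
```

Run against the excerpt above, this would report device 0000:03:02.0[A] as expected on IRQ 10 but registered on IRQ 0; an empty list means every registration used the IRQ from the fallback line.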
Len: I'll retest the patch and get back to you later today. I have started discussions w/ the BIOS team, but I still worry the kernel isn't doing the right thing. Maybe you could clear this up by answering: Why would the ACPI failure case (due to the lack of PIC PRTs) differ from the pci=noacpi case when booting w/o apic support?
Len: First attempt using your new no_gsi patch w/ "noapic" hung while detecting hardware. I'll have console logs for you shortly.
Created attachment 107728 [details] console log from updated nogsi patched kernel booting w/ "noapic" Here's the console log output from the updated no_gsi patched kernel using "noapic". Again, console log stops after SELinux initializes, however the system continues booting getting to the "Checking for new hardware" (kudzu) portion of init.
Also last night I booted w/ init=/bin/sh, and after mounting /proc and /proc/bus/usb, catting any file within the /proc/bus/usb hung cat. I'm guessing this is what is happening to the kudzu and /sbin/loader processes.
Thanks for testing the latest debug patch. For this test the system behaved exactly like it did w/o the debug patch (except for the additional output), yes?
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:08.0[A]: no GSI - using IRQ 11
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI interrupt 0000:00:0f.2[A]: no GSI - using IRQ 3
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:00:0f.2[A] -> GSI 3 (level, low) -> IRQ 3
ACPI: PCI interrupt 0000:03:01.0[A]: no GSI - using IRQ 5
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:01.0[A] -> GSI 5 (level, low) -> IRQ 5
ACPI: PCI interrupt 0000:03:01.1[B]: no GSI - using IRQ 11
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:01.1[B] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI interrupt 0000:03:02.0[A]: no GSI - using IRQ 10
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.0[A] -> GSI 10 (level, low) -> IRQ 10
ACPI: PCI interrupt 0000:03:02.1[B]: no GSI - using IRQ 10
acpi_pci_irq_enable: NOT bailing out on error
ACPI: PCI interrupt 0000:03:02.1[B] -> GSI 10 (level, low) -> IRQ 10
This indicates that the ELCR setting is not the issue, because if any of these IRQs were set to EDGE, we would have seen output here where we set them to LEVEL.
Re: why doesn't the ACPI error path behave exactly as if ACPI were not enabled? The short answer is because this is the first system to exercise that error path for a PCI device, but to find the long answer we'll need an additional debug patch...
BTW, since you can get to a shell prompt, it would be interesting to know what /proc/interrupts says about interrupts delivered.
Created attachment 107737 [details] console log failure.try1.warmboot log Len, this is the first of a couple logs that we're going to be posting from the console... there are some problems that we're having on capturing the full failure from a remote terminal, since once the framebuffer starts, we no longer get a full output to the remote console. If we're doing something wrong... please let us know :).
Created attachment 107738 [details] console log failure.try2.warmboot log
Created attachment 107739 [details] console log failure.try3.coldboot.log
Created attachment 107740 [details] dmesg log success.dmesg.log
Created attachment 107741 [details] console log success.dmesg.log
> console=ttyS0,57600 console=tty I use this: console=tty0 console=ttyS0,115200n8 I think the order is important -- the 2nd one remains the primary console. Also, you might experiment with your firmware console re-direction -- it is possible that it is conflicting with Linux, or it may be that you can leave it enabled after boot and it will re-direct everything seen on tty0 when you disable the Linux serial console.
btw. it is easier for me to read console logs if they're in ASCII instead of Unicode.
The two success cases were dmesg from SMP kernel boots and the three failure cases were console captures from UP installer-kernel boots? I don't think we're going to learn anything from the installed SMP kernel, unless you boot it with "noapic", and I guess "maxcpus=1" to act like the installer kernel. You need to add "debug" to cmdline for console captures to get all the lines that would be in dmesg. Congratulations on making the 100th entry in this bug report:-)
re comment #97: Yes, behaviour-wise, the updated no_gsi patch only seems to print additional debug info. The hang is the same as the Red Hat kernel w/o the patch. I'm working remote today, so unfortunately I cannot get the /proc/interrupts output from booting w/ init=/bin/sh. However, I strongly suspect it will be just like the output from attachment #107013 [details]. I'll update tomorrow if it's not.
Gah! Oh the time wasted! And it was right under my nose! After comment #72, when you said I wouldn't need to test the patch in attachment #107581 [details] from comment #72, I went along ignoring that patch. This morning, after explaining the noapic + acpi situation to GregKH, he said Linus had posted a patch related to the apic bits w/ ACPI. So I dug up the patch from http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/1650.html and gave it a whirl, and lo and behold it worked. When I looked to see what it was doing, it looked oddly familiar. Well, ends up it's the same thing as the patch in attachment #107581 [details] !!!! Crud. So Len, forgive me if I passed on bad info about the LKUS bits being irrelevant. The BIOS guy did mention that they were related to the USB and that should have tipped me off when I started seeing the trouble was USB related. Len: Could you tell me what exactly the disabled code is doing, so I might have better ammo to get the BIOS folks to fix it?
Len: The BIOS guy got back to me (he's on vacation), and insists that nothing calls LKUS. Is the Linux ACPI subsystem calling it then? I don't really understand the code that the patch in comment #70 disables. Might you clarify? After the 10th I should be able to get the BIOS folks to tweak things and send me some test BIOSes.
With this last bit of information added, I'm assuming that no further capture information is needed from my end. If this is incorrect, please let me know.
As of RHEL 4 pre-RC1 we are still needing to pass two boot parameters, but when we do... it works. If we pass both noapic acpi=noirq the install works properly. It reboots without using both of the parameters. John, is this the case with yours still?
Bradley: Yea, pci=noacpi is still needed when installing. I'm not sure why you're passing noapic to the installer kernel (apic is disabled there), or are you talking about the installed SMP kernel? Len: Any feedback from comment #107 and comment #108? Hopefully Monday I'll have a test BIOS to play with, but a summary of why the dead code is causing problems would help my understanding.
Please share this with the BIOS writer: Linux disables (evaluates the _DIS method for) ALL PCI Interrupt Link Devices in the system, and then re-enables them (evaluates the _STA method) only when/if the device is enabled by a device driver. We do this because not doing it results in spurious interrupts on some systems. Apparently the register write in the _DIS method of the bogus LKUS link in this BIOS is doing a "bad thing". Note that Linus' patch referenced above was NOT applied to the upstream kernel, so don't count on it to save this system from failing in the future. This BIOS needs to:
1. remove LKUS
2. add _PRT entries for PIC mode
I wrangled up a test BIOS that had the LKUS removed. That seemed to fix the problem. I'm seeing what I can do to have that change released in the next BIOS update.
Since it looks like a BIOS update will be the fix here, I'm closing this as NOTABUG. I may reopen this if the BIOS engineers push back with a decent reason.