Description of problem: On HP xw8600 and xw9400, the kernel refuses to boot on both IA-32 and x86-64 with "acpi=off".
kernel /vmlinuz-2.6.18-121.el5PAE ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 acpi=off
[Linux-bzImage, setup=0x1e00, size=0x1befb4]
initrd /initrd-2.6.18-121.el5PAE.img
[Linux-initrd @ 0x37cd8000, 0x317f55 bytes]
Linux version 2.6.18-121.el5PAE (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Mon Oct 27 22:03:07 EDT 2008
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009d000 (usable)
BIOS-e820: 000000000009d000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8e00 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffc7000 (usable)
BIOS-e820: 00000000dffc7000 - 00000000e0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
3712MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000fe700
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
NX (Execute Disable) protection: active
DMI 2.5 present.
Using APIC driver default
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: HP Product ID: Workstation APIC at: 0xFEE00000
Processor #0 15:1 APIC version 16
Processor #2 15:1 APIC version 16
Processor #1 15:1 APIC version 16
Processor #3 15:1 APIC version 16
I/O APIC #8 Version 17 at 0xFEC00000.
I/O APIC #9 Version 17 at 0xFA400000.
Enabling APIC mode: Flat. Using 2 I/O APICs
Processors: 4
Allocating PCI resources starting at e1000000 (gap: e0000000:10000000)
Detected 2600.075 MHz processor.
Built 1 zonelists. Total pages: 1179648
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 acpi=off
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c0759000 soft=c0739000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 4147548k/4718592k available (2134k kernel code, 45112k reserved, 892k data, 228k init, 3276572k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 5202.38 BogoMIPS (lpj=2601193)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0(2) -> Core 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
CPU0: AMD Dual-Core AMD Opteron(tm) Processor 2218 stepping 02
SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c075a000 soft=c073a000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5199.28 BogoMIPS (lpj=2599644)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 1(2) -> Core 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: AMD Dual-Core AMD Opteron(tm) Processor 2218 stepping 02
SMP alternatives: switching to SMP code
Booting processor 2/2 eip 3000
CPU 2 irqstacks, hard=c075b000 soft=c073b000
Initializing CPU#2
Calibrating delay using timer specific routine.. 5199.30 BogoMIPS (lpj=2599650)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 2(2) -> Core 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#2.
CPU2: AMD Dual-Core AMD Opteron(tm) Processor 2218 stepping 02
SMP alternatives: switching to SMP code
Booting processor 3/3 eip 3000
CPU 3 irqstacks, hard=c075c000 soft=c073c000
Initializing CPU#3
Calibrating delay using timer specific routine.. 5199.28 BogoMIPS (lpj=2599642)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 3(2) -> Core 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#3.
CPU3: AMD Dual-Core AMD Opteron(tm) Processor 2218 stepping 02
Total of 4 processors activated (20800.25 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 4 CPUs:
CPU#0 had -131 usecs TSC skew, fixed it up.
CPU#1 had -131 usecs TSC skew, fixed it up.
CPU#2 had 131 usecs TSC skew, fixed it up.
CPU#3 had 131 usecs TSC skew, fixed it up.
Brought up 4 CPUs
migration_cost=506
checking if image is initramfs... it is
Freeing initrd memory: 3167k freed
NET: Registered protocol family 16
ACPI Exception (utmutex-0262): AE_BAD_PARAMETER, Thread C3518AA0 could not acquire Mutex [2] [20060707]
No dock devices found.
ACPI Exception (utmutex-0262): AE_BAD_PARAMETER, Thread C3518AA0 could not acquire Mutex [2] [20060707]
PCI: PCI BIOS revision 2.20 entry at 0xef3da, last bus=107
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI: disabled
xen_mem: Initialising balloon driver.
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Probing PCI hardware
HP xw9400 Workstation detected: disabling PCI segments
PCI: Transparent bridge - 0000:00:06.0
PCI: Discovered peer bus 40
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c04eda77
*pde = 00734001
Oops: 0000 [#1] SMP
last sysfs file:
Modules linked in:
CPU: 0
EIP: 0060:[<c04eda77>] Not tainted VLI
EFLAGS: 00010286 (2.6.18-121.el5PAE #1)
EIP is at pci_create_bus+0x47/0x19a
eax: 00000000 ebx: f7f9e600 ecx: 00000000 edx: 00000040
esi: f7f9e400 edi: c06aa030 ebp: 00000040 esp: c3517f68
ds: 007b es: 007b ss: 0068
Process swapper (pid: 1, ti=c3517000 task=c3518aa0 task.ti=c3517000)
Stack: 00000000 00000000 c06a9e04 00000040 00000000 00000000 c04ee864 00000000 c06a9e04 c0717675 00000000 c065d9b6 00000040 000010de 00000000 c0729a24 00000000 c06fa5a8 c06f5fd8 c0404e06 00000202 c06fa42b 00000000 00000000
Call Trace:
[<c04ee864>] pci_scan_bus_parented+0xa/0x1f
[<c0717675>] pci_legacy_init+0xb6/0xdf
[<c06fa5a8>] init+0x17d/0x24a
[<c0404e06>] ret_from_fork+0x6/0x1c
[<c06fa42b>] init+0x0/0x24a
[<c06fa42b>] init+0x0/0x24a
[<c0405c53>] kernel_thread_helper+0x7/0x10
=======================
Code: 00 00 a1 b4 a1 68 c0 ba d0 00 00 00 e8 53 fe f7 ff 85 c0 89 c6 0f 84 51 01 00 00 8b 44 24 1c 89 ea 89 7b 40 89 43 44 8b 4c 24 1c <8b> 01 e8 00 41 00 00 85 c0 89 04 24 0f 85 28 01 00 00 b8 28 22
EIP: [<c04eda77>] pci_create_bus+0x47/0x19a SS:ESP 0068:c3517f68
<0>Kernel panic - not syncing: Fatal exception
Version-Release number of selected component (if applicable): kernel-2.6.18-121.el5
How reproducible: always
Additional info: It may be worth comparing with bug 463418 - [5.3] Kdump Kernel Panic at pci_create_bus+0x59/0x1f3.
Cai, this seems to work on the xw9400 in my cube. Which xw9400 did you test on? P.
Cai, this WORKSFORME with 122.el5 on the xw9400 in my cube, and on hp-xw8600-01.rhts.bos.redhat.com in RHTS. I'll attach a dmesg from the xw8600. P.
Created attachment 323094 [details] dmesg from xw8600 in RHTS
I have seen it on hp-xw9400-02.rhts.bos.redhat.com, although that was on the -121 kernel. I would like to try the -122 kernel on it, but the machine is unavailable at the moment.
Prarit, the problem is still there. Yes, it looks like the x86-64 bare metal kernel is working, but on both hp-xw9400-02.rhts.bos.redhat.com and hp-xw8600-01.rhts.bos.redhat.com, the IA-32 bare metal and x86-64 Xen Domain 0 kernels are not working (the IA-32 Xen Domain 0 kernel has not been tested). Please see the attachments for boot logs.
Created attachment 323158 [details] IA-32 bare metal Kernel was panicking.
Created attachment 323159 [details] x86-64 Xen Domain 0 was panicking.
I have also tried the -123.el5 kernel on IA-32 bare metal, and it has the same problem: it works without "acpi=off" and panics with it. Both boot logs are attached.
Created attachment 323162 [details] IA-32 -123.el5 bare metal Kernel with acpi=off is panicking.
Created attachment 323163 [details] IA-32 -123.el5 bare metal Kernel without acpi=off is working.
For your information, Kdump is now working on hp-xw8600-01.rhts.bos.redhat.com with the -123.el5 IA-32 kernel.
# readelf -a /var/crash/127.0.0.1-2008-11-11-04:47:37/vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0
There are no sections in this file.
There are no sections in this file.
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x00000000000004b4 0x00000000000004b4         0
  LOAD           0x000000000000060c 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000000a060c 0x00000000c0100000 0x0000000000100000
                 0x0000000000f00000 0x0000000000f00000  RWE    0
  LOAD           0x0000000000fa060c 0x00000000c9000000 0x0000000009000000
                 0x000000002f000000 0x000000002f000000  RWE    0
  LOAD           0x000000002ffa060c 0xffffffffffffffff 0x0000000038000000
                 0x0000000047fc2840 0x0000000047fc2840  RWE    0
There is no dynamic section in this file.
There are no relocations in this file.
There are no unwind sections in this file.
No version information found in this file.
Notes at offset 0x00000158 with length 0x000004b4:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  VMCOREINFO    0x00000354      Unknown note type: (0x00000000)
Created attachment 323164 [details] IA-32 -123.el5 bare metal Kdump is working.
Thanks Cai -- I'll get this back on my list. P.
(In reply to comment #6)
> Created an attachment (id=323158) [details]
> IA-32 bare metal Kernel was panicking.
Cai, this isn't panicking like the description says. It looks like some other issue has caused the system install to fail... P.
(In reply to comment #7)
> Created an attachment (id=323159) [details]
> x86-64 Xen Domain 0 was panicking.
Cai, ostensibly this is happening because the fix for BZ 463418 has not been applied to the xen-specific code. I'll get a system up and running and see if I can reproduce this. P.
(In reply to comment #14)
> (In reply to comment #6)
> > Created an attachment (id=323158) [details]
> > IA-32 bare metal Kernel was panicking.
>
> Cai, this isn't panicking like the description says. It looks like some other
> issue has caused the system install to fail...
>
> P.
Yes, this is a different form of panic, but it also only happens with acpi=off. It looks like acpi=off causes some problems for SATA devices.
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: failed to recover some devices, retrying in 5 secs
I have reserved hp-xw9400-01.rhts.bos.redhat.com, and found both PAE and non-PAE Kernels are panicking like the description says. I'll attach boot logs of them here.
Created attachment 323184 [details] xw9400 is panicking with non-PAE Kernel.
Created attachment 323185 [details] xw9400 is panicking with PAE Kernel.
Created attachment 323223 [details] RHEL5 fix for this issue
just fyi. might be related crash: https://www.redhat.com/archives/rhelv5-list/2008-November/msg00033.html
(In reply to comment #22)
> just fyi. might be related crash:
> https://www.redhat.com/archives/rhelv5-list/2008-November/msg00033.html
Yeah Anton -- that is the same issue. I'm putting this back into ASSIGNED for the moment. I'm going to rework the patch to come up with a more comprehensive solution. P.
This is a regression from previous behavior. This also breaks kdump on some systems. P.
Created attachment 323378 [details] RHEL5 fix for this issue
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 480914 has been marked as a duplicate of this bug. ***
Updating PM score.
Patch from #25 doesn't fix the issue.
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<ffffffff803433bd>] pci_create_bus+0x59/0x1f3
PGD 0
Oops: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-128.el5.it274645xen #1
RIP: e030:[<ffffffff803433bd>] [<ffffffff803433bd>] pci_create_bus+0x59/0x1f3
RSP: e02b:ffff880006141d50 EFLAGS: 00010286
RAX: ffff88002ff8a000 RBX: ffff88002ff93200 RCX: 0000000000000000
RDX: ffffffffff578000 RSI: 0000000000000005 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff88002ff93400 R09: 0000000000000000
R10: ffff880006141da0 R11: 0000000000000100 R12: ffff88002ff8a000
R13: 0000000000000005 R14: ffffffff80543a70 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process swapper (pid: 1, threadinfo ffff880006140000, task ffff8800000297a0)
Stack: 0000000000000005 0000000000000005 0000000000000004 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffffffff8034428d 0000000000000005 ffffffff8065031d
Call Trace:
[<ffffffff8034428d>] pci_scan_bus_parented+0x6/0x21
[<ffffffff8065031d>] pcibios_irq_init+0x177/0x491
[<ffffffff806347e5>] init+0x1f9/0x2fe
[<ffffffff8025fb2c>] child_rip+0xa/0x12
[<ffffffff806345ec>] init+0x0/0x2fe
[<ffffffff8025fb22>] child_rip+0x0/0x12
Code: 8b 7d 00 e8 e2 43 00 00 48 85 c0 0f 85 68 01 00 00 48 c7 c7
RIP [<ffffffff803433bd>] pci_create_bus+0x59/0x1f3
RSP <ffff880006141d50>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
Created attachment 337771 [details] proposed patch
Customer tested and confirmed the patch works. It's the same patch as yours, based on your idea, only it's adding modifications to mach-xen pci.h files. Could you also say when the patch will be applied? Thank you!
(In reply to comment #33)
> Created an attachment (id=337771) [details]
> proposed patch
>
> Customer tested and confirmed the patch works. It's the same patch as yours,
> based on your idea, only it's adding modifications to mach-xen pci.h files.
Patch looks good. I didn't even think about virt kernels.
> Could you also say when the patch will be applied?
Whenever dzickus gets around to applying it :) P.
> Thank you!
*** Bug 494114 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-138.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
My system started crashing during boot after installing 5.3, with a similar stack trace. I have verified that kernel-2.6.18-138.el5 fixes the problem.
*** Bug 494697 has been marked as a duplicate of this bug. ***
Bug 494697 that I opened wasn't booting with acpi=off but acpi=ht. Not sure if this makes a difference but wanted to let you know. Haven't had a chance to test the patch.
(In reply to comment #40)
> Bug 494697 that I opened wasn't booting with acpi=off but acpi=ht. Not sure if
> this makes a difference but wanted to let you know.
>
> Haven't had a chance to test the patch.
Hi Shad, The actual problem is an issue with multiple PCI domains, not ACPI. The change in ACPI causes the system to go from single to multiple PCI domains. I have a good feeling that this patch will fix your problem :) P.
I was also crashing while booting with acpi=ht and this patch fixed it. You should remove 'with "acpi=off"' from the bug summary.
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
I got bitten by this with an old Tyan PIII motherboard. The 5.2 boot.iso appears to boot fine. I also tried removing the Adaptec SCSI card, and the system then booted with the 5.3 installation image too.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html
The patch (linux-2.6-x86-fix-calls-to-pci_scan_bus.patch) shows:
+	struct pci_sysdata *sd;
+
+	sd = kzalloc(sizeof(&sd), GFP_KERNEL);
+	if (!sd)
+		panic("Cannot allocate PCI domain sysdata");
Should the sizeof(&sd) be sizeof(*sd)? That is, allocate space for a struct pci_sysdata as opposed to space for a pointer.