Description of problem:
Sometimes kdump does not work on the IBM eServer x3105: the second (capture) kernel hangs during boot. RHEL5U1 has the same problem.

SysRq : Trigger a crashdump
Linux version 2.6.18-88.el5 (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:18 EDT 2008
Command line: ro root=LABEL=/ console=ttyS0,115200 irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=5116K@16384K memmap=125300K@22140K elfcorehdr=147440K memmap=24K#523328K memmap=424K#523352K
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000100 - 000000000009dc00 (usable)
BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
BIOS-e820: 0000000000100000 - 000000001ff10000 (usable)
BIOS-e820: 000000001ff10000 - 000000001ff16000 (ACPI data)
BIOS-e820: 000000001ff16000 - 000000001ff80000 (ACPI NVS)
BIOS-e820: 000000001ff80000 - 0000000020000000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
user-defined physical RAM map:
user: 0000000000000000 - 00000000000a0000 (usable)
user: 0000000001000000 - 00000000014ff000 (usable)
user: 000000000159f000 - 0000000008ffc000 (usable)
user: 000000001ff10000 - 000000001ff16000 (ACPI data)
user: 000000001ff16000 - 000000001ff80000 (ACPI data)
DMI present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-20000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
Nvidia board detected. Ignoring ACPI timer override.
ACPI: PM-Timer IO Port: 0x8008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 00000000014ff000 - 000000000159f000
Allocating PCI resources starting at 20000000 (gap: 1ff80000:e0080000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
Built 1 zonelists. Total pages: 32252
Kernel command line: ro root=LABEL=/ console=ttyS0,115200 irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=5116K@16384K memmap=125300K@22140K elfcorehdr=147440K memmap=24K#523328K memmap=424K#523352K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq(): ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
IRQ_DISABLED set
unexpected IRQ trap at vector 1a
Console: colour VGA+ 80x25
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq(): ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
IRQ_DISABLED set
IRQ_PENDING set
unexpected IRQ trap at vector 1a
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ 585e000000 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Memory: 119612k/147440k available (2456k kernel code, 11444k reserved, 1246k data, 196k init)
Calibrating delay using timer specific routine.. 1999.26 BogoMIPS (lpj=999634)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
result 12472918
Detected 12.472 MHz APIC timer.
Brought up 1 CPUs
testing NMI watchdog ... OK.
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 997.832 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 2298k freed
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq(): ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
IRQ_DISABLED set
IRQ_PENDING set
unexpected IRQ trap at vector 1a
NET: Registered protocol family 16
No dock devices found.
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Transparent bridge - 0000:00:09.0
ACPI: PCI Interrupt Link [LNK1] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK2] (IRQs 16 17 18 *19)
ACPI: PCI Interrupt Link [LNK3] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK4] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK5] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LUS0] (IRQs 20 21 *22 23)
ACPI: PCI Interrupt Link [LUS2] (IRQs 20 21 22 *23)
ACPI: PCI Interrupt Link [LMAC] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LACI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LMCI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LPID] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LTID] (IRQs 20 *21 22 23)
ACPI: PCI Interrupt Link [LSI1] (IRQs *20 21 22 23), disabled.
ACPI: PCI Interrupt Link [APCP] (IRQs 20 21 22 23) *0, disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq(): ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
IRQ_DISABLED set
IRQ_PENDING set
unexpected IRQ trap at vector 1a
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
PCI-DMA: Disabling IOMMU.
pnp: 00:03: ioport range 0x8000-0x807f could not be reserved
pnp: 00:03: ioport range 0x8080-0x80ff has been reserved
pnp: 00:03: ioport range 0x8400-0x847f has been reserved
pnp: 00:03: ioport range 0x8480-0x84ff has been reserved
pnp: 00:03: ioport range 0x8800-0x887f has been reserved
pnp: 00:03: ioport range 0x8880-0x88ff has been reserved
pnp: 00:03: ioport range 0x1440-0x147f has been reserved
pnp: 00:03: ioport range 0x1400-0x143f has been reserved
PCI: Bridge: 0000:00:09.0
IO window: 2000-2fff
MEM window: d8000000-d80fffff
PREFETCH window: d0000000-d7ffffff
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: d8100000-d81fffff
PREFETCH window: 20000000-200fffff
PCI: Bridge: 0000:00:0d.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0e.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
NET: Registered protocol family 2

Then, no further output.

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080402.0 (x86_64)
kernel-2.6.18-88.el5
kexec-tools-1.102pre-20.el5

How reproducible:
The failure rate is fairly high on ibm-alishan.rhts.boston.redhat.com.
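
Additional info:
For anyone cross-checking the capture kernel's memory layout, the memmap= arguments on the command line above can be decoded into address ranges and compared with the "user-defined physical RAM map" in the log. The following is only a rough sketch: the helper names (parse_size, decode_memmap) are made up for illustration, and it assumes decimal sizes with K/M/G suffixes as they appear in this log, not every form the kernel accepts.

```python
import re

# K/M/G suffix multipliers, as used in the memmap= arguments in this log.
UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a decimal size such as '640K' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text)
    if match is None:
        raise ValueError("unsupported size: %r" % text)
    return int(match.group(1)) * UNITS[match.group(2)]


def decode_memmap(cmdline):
    """Return (start, end, kind) for every memmap=size@start / size#start."""
    # '@' marks a usable RAM region, '#' marks a region as ACPI data.
    kinds = {"@": "usable", "#": "ACPI data"}
    ranges = []
    for token in cmdline.split():
        match = re.fullmatch(r"memmap=(\w+)([@#])(\w+)", token)
        if match is None:
            continue  # skips memmap=exactmap and unrelated options
        size = parse_size(match.group(1))
        start = parse_size(match.group(3))
        ranges.append((start, start + size, kinds[match.group(2)]))
    return ranges


# Command line copied from the capture kernel's boot log.
CMDLINE = (
    "ro root=LABEL=/ console=ttyS0,115200 irqpoll maxcpus=1 reset_devices "
    "memmap=exactmap memmap=640K@0K memmap=5116K@16384K "
    "memmap=125300K@22140K elfcorehdr=147440K "
    "memmap=24K#523328K memmap=424K#523352K"
)

for start, end, kind in decode_memmap(CMDLINE):
    print("%016x - %016x (%s)" % (start, end, kind))
```

With the command line from this report, the decoded ranges match the five "user:" entries in the log, so the memory map handed to the second kernel looks consistent with what kexec requested.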
Created attachment 301769: sosreport
ibm-alishan has a pre-production CPU in it and likely old firmware. I am in the process of getting an updated CPU for it, and at the same time I will make sure the firmware is updated to the most recent level. Has this been seen on any other IBM AMD systems that you are aware of?
There is another bug that may be related: bug 440399, "[5.2][kdump] capture kernel reset for IBM eServer x3455".
I'll close this out per comment #2; I have not seen the same problem on any other machine.