Bug 441641 - [5.2][kdump] capture kernel can hang on IBM eServer x3105
Summary: [5.2][kdump] capture kernel can hang on IBM eServer x3105
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Ed Pollard
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-04-09 10:07 UTC by Qian Cai
Modified: 2013-08-06 00:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-22 10:36:34 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreport (1.43 MB, application/octet-stream)
2008-04-09 10:07 UTC, Qian Cai

Description Qian Cai 2008-04-09 10:07:29 UTC
Description of problem:
kdump intermittently fails on the IBM eServer x3105: the second (capture)
kernel hangs during boot. RHEL5U1 has the same problem.

SysRq : Trigger a crashdump
Linux version 2.6.18-88.el5 (brewbuilder.redhat.com) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:18 EDT 2008
Command line: ro root=LABEL=/ console=ttyS0,115200  irqpoll maxcpus=1
reset_devices memmap=exactmap memmap=640K@0K memmap=5116K@16384K
memmap=125300K@22140K elfcorehdr=147440K memmap=24K#523328K memmap=424K#523352K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009dc00 (usable)
 BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001ff10000 (usable)
 BIOS-e820: 000000001ff10000 - 000000001ff16000 (ACPI data)
 BIOS-e820: 000000001ff16000 - 000000001ff80000 (ACPI NVS)
 BIOS-e820: 000000001ff80000 - 0000000020000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 00000000014ff000 (usable)
 user: 000000000159f000 - 0000000008ffc000 (usable)
 user: 000000001ff10000 - 000000001ff16000 (ACPI data)
 user: 000000001ff16000 - 000000001ff80000 (ACPI data)
DMI present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-20000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
Nvidia board detected. Ignoring ACPI timer override.
ACPI: PM-Timer IO Port: 0x8008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 00000000014ff000 - 000000000159f000
Allocating PCI resources starting at 20000000 (gap: 1ff80000:e0080000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
Built 1 zonelists.  Total pages: 32252
Kernel command line: ro root=LABEL=/ console=ttyS0,115200  irqpoll maxcpus=1
reset_devices memmap=exactmap memmap=640K@0K memmap=5116K@16384K
memmap=125300K@22140K elfcorehdr=147440K memmap=24K#523328K memmap=424K#523352K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq():  ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
  IRQ_DISABLED set
unexpected IRQ trap at vector 1a
Console: colour VGA+ 80x25
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq():  ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
  IRQ_DISABLED set
   IRQ_PENDING set
unexpected IRQ trap at vector 1a
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ 585e000000 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Memory: 119612k/147440k available (2456k kernel code, 11444k reserved, 1246k
data, 196k init)
Calibrating delay using timer specific routine.. 1999.26 BogoMIPS (lpj=999634)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
result 12472918
Detected 12.472 MHz APIC timer.
Brought up 1 CPUs
testing NMI watchdog ... OK.
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 997.832 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 2298k freed
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq():  ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
  IRQ_DISABLED set
   IRQ_PENDING set
unexpected IRQ trap at vector 1a
NET: Registered protocol family 16
No dock devices found.
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Transparent bridge - 0000:00:09.0
ACPI: PCI Interrupt Link [LNK1] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK2] (IRQs 16 17 18 *19)
ACPI: PCI Interrupt Link [LNK3] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK4] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNK5] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LUS0] (IRQs 20 21 *22 23)
ACPI: PCI Interrupt Link [LUS2] (IRQs 20 21 22 *23)
ACPI: PCI Interrupt Link [LMAC] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LACI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LMCI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LPID] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LTID] (IRQs 20 *21 22 23)
ACPI: PCI Interrupt Link [LSI1] (IRQs *20 21 22 23), disabled.
ACPI: PCI Interrupt Link [APCP] (IRQs 20 21 22 23) *0, disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
->handle_irq():  ffffffff800b71df, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff802f1b80, 0xffffffff802f1b80
->action(): 0000000000000000
  IRQ_DISABLED set
   IRQ_PENDING set
unexpected IRQ trap at vector 1a
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
PCI-DMA: Disabling IOMMU.
pnp: 00:03: ioport range 0x8000-0x807f could not be reserved
pnp: 00:03: ioport range 0x8080-0x80ff has been reserved
pnp: 00:03: ioport range 0x8400-0x847f has been reserved
pnp: 00:03: ioport range 0x8480-0x84ff has been reserved
pnp: 00:03: ioport range 0x8800-0x887f has been reserved
pnp: 00:03: ioport range 0x8880-0x88ff has been reserved
pnp: 00:03: ioport range 0x1440-0x147f has been reserved
pnp: 00:03: ioport range 0x1400-0x143f has been reserved
PCI: Bridge: 0000:00:09.0
  IO window: 2000-2fff
  MEM window: d8000000-d80fffff
  PREFETCH window: d0000000-d7ffffff
PCI: Bridge: 0000:00:0b.0
  IO window: disabled.
  MEM window: d8100000-d81fffff
  PREFETCH window: 20000000-200fffff
PCI: Bridge: 0000:00:0d.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0e.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
NET: Registered protocol family 2

Then, no further output.
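
For reference, the memmap= entries on the capture kernel's command line can be cross-checked against the "user-defined physical RAM map" printed in the log above. The following is a minimal Python sketch and not part of the original report: decode_memmap is a hypothetical helper name, and it assumes only the decimal nn[KMG]@ss[KMG], nn[KMG]#ss[KMG], and nn[KMG]$ss[KMG] forms (the kernel itself also accepts other spellings, which this sketch ignores).

```python
import re

# Size suffixes handled by this sketch (assumption: decimal numbers only).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Separator character -> region type, as described for the memmap= parameter.
_KINDS = {"@": "usable", "#": "ACPI data", "$": "reserved"}

# Matches memmap=<size><unit><sep><start><unit>; "memmap=exactmap" does not match.
_MEMMAP_RE = re.compile(r"memmap=(\d+)([KMG]?)([@#$])(\d+)([KMG]?)")

# Command line copied from the capture kernel log in this report.
CMDLINE = (
    "ro root=LABEL=/ console=ttyS0,115200  irqpoll maxcpus=1 "
    "reset_devices memmap=exactmap memmap=640K@0K memmap=5116K@16384K "
    "memmap=125300K@22140K elfcorehdr=147440K memmap=24K#523328K "
    "memmap=424K#523352K"
)


def decode_memmap(cmdline):
    """Return (start, end, kind) tuples, in bytes, for each memmap= region."""
    regions = []
    for size, size_unit, sep, start, start_unit in _MEMMAP_RE.findall(cmdline):
        begin = int(start) * _UNITS[start_unit]
        end = begin + int(size) * _UNITS[size_unit]
        regions.append((begin, end, _KINDS[sep]))
    return regions


if __name__ == "__main__":
    for begin, end, kind in decode_memmap(CMDLINE):
        print(" user: %016x - %016x (%s)" % (begin, end, kind))
```

The decoded ranges agree with the five "user:" lines in the log, and the last usable region ends at 147440K, the same address passed as elfcorehdr=. The memory map handed to the capture kernel therefore looks internally consistent, which suggests looking elsewhere for the hang.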

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080402.0 (x86_64)
kernel-2.6.18-88.el5
kexec-tools-1.102pre-20.el5

How reproducible:
The failure rate is fairly high on ibm-alishan.rhts.boston.redhat.com.
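
The console log in this report repeats "unexpected IRQ trap at vector 1a" alongside "irq 26" (hex 1a is decimal 26). When comparing good and bad runs, it may help to tally these messages from saved console logs. A minimal Python sketch follows; tally_irq_traps is a hypothetical helper name, and the sample text is a short excerpt of the log above, not a complete capture.

```python
import re
from collections import Counter

# The kernel prints the vector in hexadecimal, e.g. "vector 1a".
_TRAP_RE = re.compile(r"unexpected IRQ trap at vector ([0-9a-fA-F]+)")

# Short excerpt from the capture kernel log in this report.
SAMPLE = """\
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
unexpected IRQ trap at vector 1a
Console: colour VGA+ 80x25
irq 26, desc: ffffffff803b7d80, depth: 1, count: 0, unhandled: 0
unexpected IRQ trap at vector 1a
"""


def tally_irq_traps(log_text):
    """Count 'unexpected IRQ trap' messages, keyed by vector number (as int)."""
    return Counter(int(vector, 16) for vector in _TRAP_RE.findall(log_text))


if __name__ == "__main__":
    for vector, count in sorted(tally_irq_traps(SAMPLE).items()):
        print("vector 0x%02x (irq %d): %d occurrence(s)" % (vector, vector, count))
```

Running the same tally over logs from boots that hang and boots that complete would show whether the spurious interrupt correlates with the failure. This report does not establish such a link.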

Comment 1 Qian Cai 2008-04-09 10:07:29 UTC
Created attachment 301769
sosreport

Comment 2 Ed Pollard 2008-04-24 14:36:47 UTC
ibm-alishan has a pre-production CPU in it and likely old firmware. I am in the
process of getting an updated CPU for it and will update the firmware to the
most recent level at the same time. Has this been seen on any other IBM AMD
systems that you are aware of?

Comment 3 Qian Cai 2008-04-24 23:06:34 UTC
There is another bug that may be related:

440399: [5.2][kdump] capture kernel reset for IBM eServer x3455

Comment 4 Qian Cai 2008-10-22 10:36:34 UTC
I'll close this out per comment #2; I have not seen the same problem on any other machine.

