Bug 477032 - kdump hang on HP xw9400
Summary: kdump hang on HP xw9400
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Neil Horman
QA Contact: Han Pingtian
URL:
Whiteboard:
Depends On:
Blocks: 527955 533192 591850
Reported: 2008-12-18 19:35 UTC by Doug Chapman
Modified: 2011-07-21 10:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 10:12:23 UTC
Target Upstream Version:
Embargoed:


Attachments
console log (41.61 KB, application/octet-stream)
2010-11-11 23:40 UTC, Han Pingtian


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description Doug Chapman 2008-12-18 19:35:52 UTC
Description of problem:
HP XW9400 systems hang on kdump.  The last line on the console is:


Memory: 247608k/278512k available (2494k kernel code, 14504k reserved, 1262k data, 200k init)

Hacking in some printk's shows that this is hanging in calibrate_delay_direct().  I am still digging, but it appears it isn't getting good values from the clock.  Adding "nohpet" didn't fix this; however, manually passing an lpj (loops per jiffy) parameter to the kdump kernel does appear to work around the issue.

Specifically, I am using lpj=2602224, which I found in dmesg from a standard kernel boot.
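
For illustration, a minimal sketch of pulling the lpj value out of saved dmesg text so it can be handed to the kdump kernel.  The function name find_lpj is hypothetical, and the sample line is constructed for this example (it was not captured from the xw9400); only the "(lpj=N)" suffix of the kernel's calibration message is relied on.

```python
import re


def find_lpj(dmesg_text):
    """Return the last lpj value reported in dmesg text, or None if absent."""
    matches = re.findall(r"lpj=(\d+)", dmesg_text)
    return int(matches[-1]) if matches else None


# Sample line constructed for illustration, not real output from this system.
sample = "Calibrating delay using specific routine.. 5204.44 BogoMIPS (lpj=2602224)\n"

lpj = find_lpj(sample)
print("lpj=%d" % lpj)
```

The printed string has the same form as the argument used in the workaround above.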


Version-Release number of selected component (if applicable):
kernel-2.6.18-120.el5 (but likely all RHEL5.x kernels)

How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Aquilina 2008-12-18 20:45:30 UTC
Adding John Brown @ WGBU as an fyi...

Comment 2 Doug Chapman 2008-12-22 15:56:43 UTC
Here are my findings so far.  I will continue to dig into this after the holiday break.

This does not appear to be directly related to bug 475843; however, the xw9400 has that problem as well.  I am using the fixed kexec-tools on this system too.

Upstream kernels work OK on the xw9400 so hopefully a fix can be backported.

The root of the problem is that we are not getting timer interrupts at all during kdump.  It hangs in calibrate_delay, in a loop that is waiting for "jiffies" to change.  Initially I suspected the hpet code; however, booting kdump with "nohpet" results in the same hang.
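
To illustrate why a missing timer interrupt turns this wait into a hang, here is a small user-space simulation.  It is a sketch, not the kernel's calibrate_delay code: wait_for_tick and FakeTimer are hypothetical names, and the spin bound exists only so the example terminates.

```python
def wait_for_tick(read_jiffies, max_spins):
    """Spin until the tick counter changes; give up after max_spins reads.

    Returns the number of reads it took, or None if the counter never moved.
    Without the bound, this loop would spin forever when no timer interrupt
    advances the counter.
    """
    start = read_jiffies()
    for spins in range(1, max_spins + 1):
        if read_jiffies() != start:
            return spins
    return None


class FakeTimer:
    """Hypothetical stand-in for the jiffies counter (not kernel code)."""

    def __init__(self, reads_per_tick):
        # reads_per_tick == 0 models a system where timer interrupts never arrive.
        self.reads_per_tick = reads_per_tick
        self.reads = 0

    def jiffies(self):
        self.reads += 1
        if self.reads_per_tick == 0:
            return 0
        return self.reads // self.reads_per_tick


working = FakeTimer(reads_per_tick=1000)
dead = FakeTimer(reads_per_tick=0)
print(wait_for_tick(working.jiffies, 10**5))  # a tick arrives, so the wait ends
print(wait_for_tick(dead.jiffies, 10**5))     # no tick ever arrives within the bound
```

Supplying lpj= on the command line sidesteps this wait entirely, which would be consistent with the workaround described in this bug.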

The kdump kernel is booted with the "irqpoll" option by default.  Removing it makes no difference; however, I found that if I boot both the initial kernel and the kdump kernel with irqpoll, it does not hang.  So something appears to be wrong with how interrupts are configured.

I will dig more in a couple of weeks, but suggestions are certainly welcome.

Comment 3 Doug Chapman 2009-01-16 17:57:18 UTC
I have done a little more digging on this and have found:

This is specific to the xw9400 platform.  I tried a bunch of other xw servers, and RHEL5.3 can create a good dump on those (using the updated kexec-tools from bug 475843).

It works with kernel 2.6.27 from upstream, so we know it _can_ be fixed in software.

The core of the problem is that on the kdump boot we are not getting any interrupts through the 8259A (aka XT-PIC).  It appears that the XW9400 doesn't cope with switching from the IO-APIC back to the 8259A.  Evidently the upstream kernel does some different operations in this switchover, which allows it to work.  Since the upstream code is so dramatically different from the RHEL5.X kernel code, I have not been able to find anything obvious.


There are 2 workarounds (FYI, both require the fixed kexec-tools):

1. Boot the initial kernel with "noapic".  This prevents the hardware from ever switching to the IO-APIC for interrupts, so we don't have the problem of switching back to the 8259A.  Obviously this is not a good workaround for customers.

2. Add an "lpj=" (loops per jiffy) argument to the kexec boot.  I am using lpj=2602232, which I found from dmesg of a normal boot.  Honestly, for kdump I don't think we need to be too picky about this value.  This seems to be a usable workaround.  It works because the code that calculates lpj is the _only_ bit of code (that I have hit) that makes use of "jiffies" prior to the system switching back over to the IO-APIC.
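
As a sketch of workaround 2, the helper below appends an lpj= argument to a kdump kernel argument string unless one is already present.  The function name append_kernel_arg is hypothetical.  The starting arguments are the ones visible on the kdump command line in this bug; on RHEL 5 they are usually set through KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump, but treat that location as an assumption to verify on the system.

```python
def append_kernel_arg(cmdline, arg):
    """Append arg to a kernel command line unless its key is already set."""
    key = arg.split("=", 1)[0]
    for token in cmdline.split():
        if token.split("=", 1)[0] == key:
            return cmdline  # key already present; leave the line untouched
    return (cmdline + " " + arg).strip()


kdump_args = "irqpoll maxcpus=1 reset_devices"
print(append_kernel_arg(kdump_args, "lpj=2602232"))
```

Leaving an existing lpj= value alone is a deliberate choice here, so that a value tuned by hand is not silently replaced.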


So, a question for John Brown: do you know of anything unique to the xw9400 that might cause this issue with switching from the IO-APIC back to the 8259A?  As I mentioned, this works upstream, and if I had a better understanding of what is different about the hardware I might be able to find exactly which bit of upstream code fixed this.

Comment 4 Neil Horman 2009-01-21 15:23:04 UTC
This sounds an awful lot like bz 462519.  Feel free to take a look and close it as a dup if you think it fits.

Comment 5 Doug Chapman 2009-01-21 17:36:41 UTC
Neil,

I tried the patch from BZ 462519 on this box and the kernel (normal boot, not kdump) fails to boot:

ACPI: Core revision 20060707
initalizing for 255 cpus
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter


I also get the same failure if I boot the standard kernel and just use the patched kernel for kdump.

Comment 6 Doug Chapman 2009-01-21 17:51:57 UTC
Oops, ignore my previous comment; I was using the wrong patch.  With the right patch it does look like this might fix the issue.  I am going to try this a few more times, but it looks like this is indeed the same issue.

I see the other issue is a customer issue and the initial description is marked private.  Can you share anything regarding what sort of system the customer was seeing that bug on?

Comment 8 Qian Cai 2009-02-03 02:43:17 UTC
Doug, 

Bug 473403 - [5.3] Kdump Kernel Hangs on Dell AMD Machines

has also been marked as a duplicate of bug 462519.

Comment 15 Qian Cai 2009-07-02 03:11:11 UTC
The kdump kernel still hangs on the in-house hp-xw9400 (hp-xw9400-02.rhts.bos.redhat.com) with the latest RHEL5.4 components:

kexec-tools-1.102pre-75.el5
kernel-2.6.18-156.el5

Red Hat Enterprise Linux Server release 5.4 Beta (Tikanga)
Kernel 2.6.18-156.el5 on an x86_64

hp-xw9400-02.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
CFG = c1
Linux version 2.6.18-156.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Mon Jun 29 18:16:54 EDT 2009
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5272K@16384K memmap=125144K@22296K elfcorehdr=147440K memmap=232K$3669784K memmap=131072K$3932160K memmap=20480K$4173824K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009b000 (usable)
 BIOS-e820: 000000000009b000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffc6100 (usable)
 BIOS-e820: 00000000dffc6100 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000001526000 (usable)
 user: 00000000015c6000 - 0000000008ffc000 (usable)
 user: 00000000dffc6000 - 00000000e0000000 (reserved)
 user: 00000000f0000000 - 00000000f8000000 (reserved)
 user: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.5 present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 4 -> Node 0
SRAT: PXM 0 -> APIC 5 -> Node 0
SRAT: PXM 1 -> APIC 8 -> Node 1
SRAT: PXM 1 -> APIC 9 -> Node 1
SRAT: PXM 1 -> APIC 10 -> Node 1
SRAT: PXM 1 -> APIC 11 -> Node 1
SRAT: PXM 1 -> APIC 12 -> Node 1
SRAT: PXM 1 -> APIC 13 -> Node 1
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-80000000
SRAT: Node 1 PXM 1 80000000-e0000000
SRAT: Node 1 PXM 1 80000000-120000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x08] enabled)
Processor #8 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x09] enabled)
Processor #9 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x02] enabled)
Processor #2 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x0a] enabled)
Processor #10 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x03] enabled)
Processor #3 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x0b] enabled)
Processor #11 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x04] enabled)
Processor #4 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0c] enabled)
Processor #12 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x05] enabled)
Processor #5 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0d] enabled)
Processor #13 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x10] lapic_id[0x0f] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfa400000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 17, address 0xfa400000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to physical flat
ACPI: HPET id: 0x10de8201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001526000 - 00000000015c6000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:d6fca000)
SMP: Allowing 16 CPUs, 4 hotplug CPUs
Built 1 zonelists.  Total pages: 32248
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5272K@16384K memmap=125144K@22296K elfcorehdr=147440K memmap=232K$3669784K memmap=131072K$3932160K memmap=20480K$4173824K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdb=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ c000000 size 64 MB
CPU 1: aperture @ c000000 size 64 MB
ACPI: DMAR not present
Memory: 117688k/147440k available (2550k kernel code, 13368k reserved, 1291k data, 208k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 4399.99 BogoMIPS (lpj=2199999)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0/3 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 3
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Using local APIC timer interrupts.
result 12499990
Detected 12.499 MHz APIC timer.
Brought up 1 CPUs
testing NMI watchdog ... OK.
time.c: Using 25.000000 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 800.005 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 3579k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
FOUND MCP55 CHIP
cfg value is c1
PCI: Transparent bridge - 0000:00:06.0
bus 0 -> pxm 0 -> node 0
ACPI: PCI Root Bridge [PCI1] (0000:40)
bus 64 -> pxm 1 -> node -1
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPA] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPC] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPD] (IRQs 5 7 10 16 17 18 19 20 21 22 *23)
ACPI: PCI Interrupt Link [LXA2] (IRQs *40), disabled.
ACPI: PCI Interrupt Link [LXB2] (IRQs *41), disabled.
ACPI: PCI Interrupt Link [LXC2] (IRQs *42), disabled.
ACPI: PCI Interrupt Link [LXD2] (IRQs *43)
ACPI: PCI Interrupt Link [LSMB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSB0] (IRQs 5 7 10 16 17 18 19 20 *21 22 23)
ACPI: PCI Interrupt Link [LSB2] (IRQs 5 7 10 16 17 18 19 20 21 *22 23)
ACPI: PCI Interrupt Link [LMC0] (IRQs 5 7 10 16 *17 18 19 20 21 22 23)
ACPI: PCI Interrupt Link [LMC1] (IRQs 5 7 10 16 17 18 19 20 21 22 *23)
ACPI: PCI Interrupt Link [LAZA] (IRQs 5 7 10 *16 17 18 19 20 21 22 23)
ACPI: PCI Interrupt Link [LIDE] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSA0] (IRQs 5 7 10 16 17 18 19 *20 21 22 23)
ACPI: PCI Interrupt Link [LSA1] (IRQs 5 7 10 16 17 18 *19 20 21 22 23)
ACPI: PCI Interrupt Link [LSA2] (IRQs 5 7 10 16 17 *18 19 20 21 22 23)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 17 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
ACPI: DMAR not present
PCI-DMA: Disabling IOMMU.
pnp: 00:0b: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:0c: ioport range 0x400-0x47f could not be reserved
pnp: 00:0c: ioport range 0x480-0x48f has been reserved
pnp: 00:0c: ioport range 0x4c0-0x4cb has been reserved
pnp: 00:0c: ioport range 0xe000-0xe07f has been reserved
pnp: 00:0c: ioport range 0xe080-0xe0ff has been reserved
pnp: 00:0c: ioport range 0xf200-0xf27f has been reserved
pnp: 00:0c: ioport range 0xf280-0xf2ff has been reserved
PCI: Bridge: 0000:00:06.0
  IO window: disabled.
  MEM window: fa100000-fa1fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:2b:00.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:2b:00.1
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0d.0
  IO window: disabled.
  MEM window: fa000000-fa0fffff
  PREFETCH window: disabled.
PCI: mem resource #6:20000@f0000000 for 0000:18:00.0 was not allocated.
PCI: Bridge: 0000:00:0f.0
  IO window: 3000-3fff
  MEM window: f8000000-f9ffffff
  PREFETCH window 0x00000000e0000000-0x00000000efffffff
ACPI: PCI Interrupt Link [LXPD] enabled at IRQ 23
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:2b:00.1[A] -> Link [LXPD] -> GSI 23 (level, high) -> IRQ 169
PCI: Bridge: 0000:40:0d.0
  IO window: 1000-1fff
  MEM window: fa300000-fa3fffff
  PREFETCH window 0x00000000fa500000-0x00000000fa5fffff
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 1, 8192 bytes)
TCP established hash table entries: 4096 (order: 4, 65536 bytes)
TCP bind hash table entries: 2048 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 2048)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1246503741.523:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key 151471C81E0E52F6
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
pci 0000:00:00.0: Enabling HT MSI Mapping
pci 0000:00:05.0: Enabling HT MSI Mapping
pci 0000:00:05.1: Enabling HT MSI Mapping
pci 0000:00:05.2: Enabling HT MSI Mapping
pci 0000:00:06.0: Enabling HT MSI Mapping
pci 0000:00:06.1: Enabling HT MSI Mapping
pci 0000:00:08.0: Enabling HT MSI Mapping
pci 0000:00:09.0: Enabling HT MSI Mapping
pci 0000:00:0d.0: Enabling HT MSI Mapping
pci 0000:00:0f.0: Enabling HT MSI Mapping
pci 0000:40:00.0: Enabling HT MSI Mapping
pci 0000:40:0d.0: Enabling HT MSI Mapping
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
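
As a cross-check of the log above, the memmap=size@start arguments on the kdump command line can be converted to byte ranges and compared with the "user-defined physical RAM map" lines.  This is a minimal sketch: parse_size and usable_ranges are hypothetical names, and only decimal sizes with a K, M, or G suffix are handled.

```python
import re

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a decimal size such as '640K' to bytes."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def usable_ranges(cmdline):
    """Return (start, end) byte ranges for each memmap=size@start argument."""
    ranges = []
    for size, start in re.findall(r"memmap=(\w+)@(\w+)", cmdline):
        base = parse_size(start)
        ranges.append((base, base + parse_size(size)))
    return ranges


# Arguments copied from the kdump command line in the log.
cmdline = ("memmap=exactmap memmap=640K@0K memmap=5272K@16384K "
           "memmap=125144K@22296K elfcorehdr=147440K")
for start, end in usable_ranges(cmdline):
    print("%016x - %016x (usable)" % (start, end))
```

The three ranges agree with the usable entries of the user-defined map, and elfcorehdr=147440K falls at the end of the last range (0x8ffc000).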

Comment 17 Qian Cai 2009-07-29 03:41:25 UTC
On another in-house xw9400 machine, kdump is also not working.

hp-xw9400-02.rhts.bos.redhat.com

SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
<hung...>

It was using:

kernel-2.6.18-159.el5
kexec-tools-1.102pre-77.el5

Comment 19 Prarit Bhargava 2010-10-07 13:39:26 UTC
Cai, is kdump still broken on the HP xw9400s?

Thanks,

P.

Comment 21 Neil Horman 2010-10-07 14:59:35 UTC
Cai, also, is there a cciss or hpsa array in this system?  IIRC HP has declared both of those non-functional with kdump at the moment.

Comment 22 Chao Ye 2010-10-08 07:30:20 UTC
(In reply to comment #19)
> Cai, is kdump still broken on the HP xw9400s?
> 
> Thanks,
> 
> P.

I tested with the latest kernel and kexec-tools on hp-xw9400-02.rhts.eng.bos.redhat.com; it still hangs!
===========================================================
[root@hp-xw9400-02 ~]# rpm -q kernel kexec-tools
kernel-2.6.18-226.el5
kexec-tools-1.102pre-108.el5
===========================================================
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
Kernel 2.6.18-226.el5 on an x86_64

hp-xw9400-02.rhts.eng.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
<===================Hang here!

Comment 23 Chao Ye 2010-10-08 07:40:52 UTC
(In reply to comment #21)
> Cai, also, is there a cciss or hpsa array in this system?  IIRC HP has declared
> both of those non-functional with kdump at the moment.

Hi Neil,

There is no cciss or hpsa array in hp-xw9400-02.rhts.eng.bos.redhat.com.

Comment 27 Neil Horman 2010-11-03 20:41:26 UTC
Probably not; I have no idea what's causing this error.  A useful test, however, would be to back out my mcp55 patch to see if that fixes the problem, although I have no idea why it would.

Comment 28 Neil Horman 2010-11-04 21:00:48 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2873718

Here's a build for you to try based on my suggestion in comment #27.

Comment 29 Han Pingtian 2010-11-05 07:37:56 UTC
Hi Neil, I cannot find a kernel package at that URL.

Comment 31 Neil Horman 2010-11-08 21:30:38 UTC
Yeah, apparently there are two patches I need to rip out, not just one.  I'll resubmit shortly.

Comment 32 Neil Horman 2010-11-09 15:17:19 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2883183

New build

Comment 33 Han Pingtian 2010-11-09 23:28:19 UTC
The new kernel works fine.

Comment 34 Neil Horman 2010-11-10 12:05:13 UTC
Ok, this is interesting.  If the last message we get is REWRITING MCP55 CFG REG, that means that we're hanging in machine_crash_shutdown.  We should get a second printk indicating what the config value we're writing is, but we never see it, so it would seem that we're hanging on the pci_read_config_dword call.  Since we have already shot down the other cpus, and not gotten an error about stopping them, we should be the only cpu running, so we're not getting any weird smp behavior here.  I don't see how we can hang on a pci bus access.

Prarit, I think you're our pci expert.  Any thoughts on how I might hang doing a pci bus access?

Pingtian, can you provide the entire serial log of this event from boot to failure?  Thanks!
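
The reasoning in this comment (use the last console message to bracket where execution stopped) can be sketched as a small log check.  The function name locate_hang is hypothetical; the two message formats are taken from the console output quoted earlier in this bug, where "REWRITING MCP55 CFG REG" is followed by "CFG = c1" on a boot that got further.

```python
MARKER = "REWRITING MCP55 CFG REG"


def locate_hang(console_log):
    """Classify a console log by what follows the last MCP55 marker line."""
    lines = [line.strip() for line in console_log.splitlines() if line.strip()]
    if MARKER not in lines:
        return "marker never printed"
    last = len(lines) - 1 - lines[::-1].index(MARKER)
    following = lines[last + 1:]
    if not following:
        return "hung between the two printks"
    if following[0].startswith("CFG ="):
        return "config value printed; hang is later"
    return "unexpected output after marker"


hung_log = "SysRq : Trigger a crashdump\nREWRITING MCP55 CFG REG\n"
later_log = ("SysRq : Trigger a crashdump\nREWRITING MCP55 CFG REG\n"
             "CFG = c1\nLinux version 2.6.18-156.el5\n")
print(locate_hang(hung_log))
print(locate_hang(later_log))
```

A log that ends at the marker points at the code between the two printks, which according to this comment is the pci_read_config_dword call.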

Comment 35 Han Pingtian 2010-11-11 23:40:43 UTC
Created attachment 459878
console log

Comment 36 Han Pingtian 2010-11-24 09:46:46 UTC
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
Kernel 2.6.18-232.el5 on an x86_64

sun-x4440-01.rhts.eng.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
<----------- hangs here

Comment 37 Neil Horman 2010-11-24 12:18:34 UTC
Pingtian, I'm confused.  Above you indicate a hang immediately after the REWRITING line, but the log you attached clearly shows us getting much more output after that.  Can you clarify?

Comment 38 Han Pingtian 2010-11-25 02:26:13 UTC
I apologize. In comment 36, I wanted to report that I encountered the same problem on another machine, sun-x4440-01.rhts.eng.bos.redhat.com, not the original hp xw9400 machine.

Comment 39 Han Pingtian 2010-11-30 10:01:34 UTC
I have encountered this problem on dell-pem805-01.rhts.eng.bos.redhat.com, with rhel5.6 snapshot3:

Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga) 
Kernel 2.6.18-233.el5PAE on an i686 
 
dell-pem805-01.rhts.eng.bos.redhat.com login: 11/29/10 06:48:06  testID:772787 finish:
11/29/10 06:48:26  JobID:34899 Test:/kernel/kdump/config-ssh Response:1
11/29/10 06:48:26  testID:772788 start:
[-- MARK -- Mon Nov 29 06:50:00 2010] 
11/29/10 06:50:29  testID:772788 finish:
11/29/10 06:51:23  JobID:34899 Test:/kernel/kdump/config-filter Response:1
11/29/10 06:51:23  testID:772789 start:
11/29/10 06:51:49  testID:772789 finish:
11/29/10 06:52:41  JobID:34899 Test:/kernel/kdump/crash-sysrq-c Response:1
11/29/10 06:52:41  testID:772790 start:
SysRq : Trigger a crashdump 
REWRITING MCP55 CFG REG 
<------------------- hangs here

Comment 40 Neil Horman 2010-11-30 12:08:01 UTC
OK, but that's not the same problem.  We're getting much further in those boots, so it's pretty clearly a different problem (or at least a different system).

Ping Prarit: any thoughts on comment 34?

Comment 41 Neil Horman 2011-01-27 15:50:43 UTC
Prarit, ping?

Comment 42 Prarit Bhargava 2011-02-01 13:45:00 UTC
(In reply to comment #34)
> Ok, this is interesting.  If the last message we get is REWRITING MCP55 CFG
> REG, that means that we're hanging in machine_crash_shutdown.  We should get a
> second printk indicating what the config value we're writing is, but we never
> see it so it would seem that we're hanging on the pci_read_config_dword call. 
> Since we have already shot down the other cpus, and not gotten an error about
> stopping them, we should be the only cpu running, so we're not getting any
> weird smp behavior here.  I don't see how we can hang on a pci bus access.
> 
> Prarit, I think you're our pci expert.  Any thoughts on how I might hang doing
> a pci bus access?

Without a PCI Bus analyser it would be pretty difficult.  Does this work upstream?

P.

Comment 43 Neil Horman 2011-02-01 14:57:37 UTC
I honestly don't know, it's been so long, but I can try.

Comment 44 Neil Horman 2011-02-01 16:03:53 UTC
So, yes, it appears that this works upstream (2.6.35).  I'm testing on the latest RHEL5 to make sure that it still fails.

Comment 45 Neil Horman 2011-02-01 20:43:31 UTC
grumble, I think this needs upstream commit 49c2fa08a77a7eefa4cbc73601f64984aceacfa7

Comment 46 Neil Horman 2011-02-01 21:25:50 UTC
Yup, that's it.  I'll post the fix in the morning.

Comment 47 RHEL Program Management 2011-02-04 20:31:05 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 53 Jarod Wilson 2011-02-21 20:55:51 UTC
in kernel-2.6.18-245.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 55 Steve Cormack 2011-03-08 18:39:19 UTC
After installing this kernel on top of x64 RHEL 5.6, I was able to successfully invoke a kdump.  After the expected reboot, the resulting vmcore file was created in /var/crash/[timestamp].  (Note that the xw9400 system used has 16 GB RAM, if that matters.)

Comment 56 errata-xmlrpc 2011-07-21 10:12:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

