Bug 473038

Summary: [5.3] Kdump Kernel Hangs on HP XW Machines
Product: Red Hat Enterprise Linux 5
Reporter: Qian Cai <qcai>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED NEXTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 5.3
CC: clalance, cwyse, dchapman, duck, dzickus, mgahagan, prarit, syeghiay, tao
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-02 14:45:46 UTC
Bug Blocks: 483701, 485920    
Attachments:
xw9300 dmidecode information
xw9400 dmidecode information
patch to disable hpet timer on crash shutdown

Description Qian Cai 2008-11-26 08:23:02 UTC
Description of problem:
On x86-64, we have seen the kdump kernel hang on the following machines in RHTS. The boot stops after one of the two console excerpts shown below the list.

[machine]      [BIOS]
hp-xw4200-01   1.9 -  06/24/2005
hp-xw4600-01   1.11 - 04/30/2008
hp-xw6600-01   1.18 - 05/05/2008
hp-xw6600-02   1.18 - 05/05/2008
hp-xw6800-01   0.28 - 08/28/2008
hp-xw6800-02   0.28 - 08/28/2008
hp-xw8400-01   2.31 - 03/14/2008
hp-xw8600-01   1.18 - 05/05/2008
hp-xw8800-01   0.31 - 10/13/2008
hp-xw9400-02   3.1  - 10/12/2007

...
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled

or

...
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 122332k/146796k available (2134k kernel code, 8604k reserved, 891k data, 228k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xc9800000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
Using HPET for base-timer

Links to full logs:
https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5236532
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5176912

Version-Release number of selected component (if applicable):
kernel-2.6.18-123.el5
kexec-tools-1.102pre-50.el5

How reproducible:
always

Steps to Reproduce:
1. configure kdump with crashkernel=128M@16M
2. echo c >/proc/sysrq-trigger
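
A minimal sketch of the two steps above, assuming the crashkernel= reservation is already on the boot command line and the capture kernel has been loaded by the kdump service. The helper names `crashkernel_arg` and `trigger_crash_if_reserved` are hypothetical and are not part of kexec-tools.

```shell
#!/bin/sh
# Print the value of the crashkernel= parameter found in a kernel
# command-line file (defaults to /proc/cmdline); print nothing if absent.
crashkernel_arg() {
    cmdline_file="${1:-/proc/cmdline}"
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p' | head -n 1
}

# Hypothetical wrapper: trigger the crash only when a reservation exists.
# WARNING: this deliberately panics the running kernel.
trigger_crash_if_reserved() {
    if [ -n "$(crashkernel_arg "$1")" ]; then
        echo c > /proc/sysrq-trigger
    else
        echo "no crashkernel= reservation on the command line" >&2
        return 1
    fi
}
```

On the setup in this report, `crashkernel_arg` would be expected to print `128M@16M` before the crash is triggered.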
  
Actual results:
The kdump kernel hangs during boot.

Expected results:
Kdump should work.

Comment 1 Neil Horman 2008-12-01 16:53:47 UTC
Can you bisect and tell me what the last kdump kernel to boot properly on this system was?

Comment 3 Qian Cai 2008-12-02 01:35:19 UTC
(In reply to comment #1)
> Can you bisect and tell me what the last kdump kernel to boot properly on this
> system was?

It is not a regression; it seems never to have worked before.

Comment 13 Qian Cai 2008-12-04 06:41:31 UTC
hp-xw9300 is indeed also affected, even with the -125.el5 kernel. It may just have a lower failure rate than the other machines listed in comment #0.

BIOS Information
        Vendor: Hewlett-Packard
        Version: 786B9 v2.09
        Release Date: 11/28/2006

Comment 14 Neil Horman 2008-12-04 13:47:18 UTC
I'm reserving hp-xw9400-02 to try to reproduce this myself.

Comment 16 Neil Horman 2008-12-05 16:54:12 UTC
Created attachment 325878 [details]
patch to disable hpet timer on crash shutdown

This patch solved the problem for me on hp-xw9400-02.  It's a backport of upstream commit 0c1b2724069951b1902373e688042b2ec382f68f, and disables the hpet on shutdown so that it can re-init on kdump kernel boot without hanging.  I've tested it on x86_64, but not x86.  Bear in mind that these machines also suffer from the gart errors that Doug Chapman has been chasing in bz 463144, so you may need the patches from that bug as well in your testing.  Please confirm that this solves your problem, and we'll figure out when to get this in.

Comment 18 Don Zickus 2008-12-09 21:05:32 UTC
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 21 Doug Chapman 2008-12-10 16:10:51 UTC
We have 2 different problems here.  There is the HPET issue where it hangs at:

Using HPET for base-timer


and another seemingly unrelated issue where it hangs at:

Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled


The hpet fix in the -126 kernel does not fix the Serial driver hang.
I still see that hang on: hp-xw8600-01
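
To tell the two hangs apart when going through captured console logs, a small sketch like the following may help. It assumes each log file ends at the point where the kdump kernel stopped printing, and the function name `classify_kdump_hang` is made up for illustration. The two strings it matches are the signatures quoted above.

```shell
#!/bin/sh
# Classify a captured kdump console log by its last non-empty line.
# Prints "hpet", "serial", or "unknown".
classify_kdump_hang() {
    last_line=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)
    case "$last_line" in
        *"Using HPET for base-timer"*) echo "hpet" ;;
        *"Serial: 8250/16550 driver"*) echo "serial" ;;
        *) echo "unknown" ;;
    esac
}
```

A log that ends anywhere else is reported as `unknown`, so this only sorts logs into the two cases discussed here and does not diagnose anything.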



Also, for the HPET issue, I think that could be worked around by adding
"acpi=off" to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump.  From looking
at the code, it appears this will prevent the kdump kernel from using hpet and
hopefully will resolve the issue.  So far I have not been able to confirm that,
because I have not been able to get my hands on a system that exhibits the HPET
kdump hang.
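
A hedged sketch of how the proposed workaround could be applied. It assumes KDUMP_COMMANDLINE_APPEND is defined on a single double-quoted line in the config file; the function name `add_kdump_append_option` is hypothetical. As stated above, the workaround itself is unconfirmed.

```shell
#!/bin/sh
# Append an option to KDUMP_COMMANDLINE_APPEND in a kdump sysconfig file.
# Assumes the variable is set on one line, in double quotes.
add_kdump_append_option() {
    kdump_conf="$1"
    new_opt="$2"
    # Nothing to do if the option is already listed.
    if grep -q "^KDUMP_COMMANDLINE_APPEND=\"[^\"]*$new_opt" "$kdump_conf"; then
        return 0
    fi
    scratch=$(mktemp) || return 1
    sed "s/^KDUMP_COMMANDLINE_APPEND=\"\([^\"]*\)\"/KDUMP_COMMANDLINE_APPEND=\"\1 $new_opt\"/" \
        "$kdump_conf" > "$scratch" && cat "$scratch" > "$kdump_conf"
    rm -f "$scratch"
}

# Example (as root); the kdump service then has to be restarted so the
# capture kernel is reloaded with the new command line:
#   add_kdump_append_option /etc/sysconfig/kdump acpi=off
#   service kdump restart
```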

Comment 22 Doug Chapman 2008-12-10 19:01:13 UTC
I created a new BZ, #475843, for the serial hang, since that seems to be separate from the HPET issue.

Comment 23 Linda Wang 2008-12-10 20:01:25 UTC
The HPET issue is tracked in bug 475652. Since this patch causes a regression,
we are planning to pull it out of 5.3. Bug 475652 will be
used to back out the patch. Once the patch is out, we will move this bug
back to ASSIGNED.

Comment 28 Linda Wang 2008-12-23 15:47:47 UTC
The patch went into kernel-2.6.18-126.el5 under the following changelog:

* Mon Dec 08 2008 Don Zickus  [2.6.18-126.el5]
- [x86] disable hpet on machine_crash_shutdown (Neil Horman ) [473038]

It has been reverted from 5.3 in kernel-2.6.18-127.el5 under the following changelog:

* Mon Dec 15 2008 Don Zickus  [2.6.18-127.el5]
- Revert: [x86] disable hpet on machine_crash_shutdown (Neil Horman ) [475652]

Please see bug 475652 and bug 475843 for more detailed info.

Comment 29 Chris Lalancette 2009-01-22 08:40:26 UTC
*** Bug 468000 has been marked as a duplicate of this bug. ***

Comment 30 Chris Lalancette 2009-01-22 08:41:14 UTC
*** Bug 283191 has been marked as a duplicate of this bug. ***

Comment 31 Issue Tracker 2009-01-28 21:37:34 UTC
Tested with kexec-tools-1.102pre-56.el5_3.1 from an update from RHEL5.3. 
Looks like the problem is still in the 2.6.18-128.el5 kernel on xw9400s
and xw9300s.


This event sent from IssueTracker by cwyse 
 issue 236843

Comment 32 Qian Cai 2009-02-01 09:24:21 UTC
Sorry for the late reply. I was on vacation for the last few days.

I have been able to reproduce the same problem on hp-xw9400-02.rhts.bos.redhat.com using the same versions of the kexec-tools and kernel packages as above.

# echo c >/proc/sysrq-trigger 
SysRq : Trigger a crashdump
Linux version 2.6.18-128.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Dec 17 11:41:38 EST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K memmap=228K$3669788K memmap=131072K$3932160K memmap=20480K$4173824K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009d000 (usable)
 BIOS-e820: 000000000009d000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffc7000 (usable)
 BIOS-e820: 00000000dffc7000 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 000000000150e000 (usable)
 user: 00000000015ae000 - 0000000008ffc000 (usable)
 user: 00000000dffc7000 - 00000000e0000000 (reserved)
 user: 00000000f0000000 - 00000000f8000000 (reserved)
 user: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.5 present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-80000000
SRAT: Node 1 PXM 1 80000000-e0000000
SRAT: Node 1 PXM 1 80000000-120000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Processor #3 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfa400000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 17, address 0xfa400000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to physical flat
ACPI: HPET id: 0x10de8201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 000000000150e000 - 00000000015ae000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:d6fcb000)
SMP: Allowing 8 CPUs, 4 hotplug CPUs
Built 1 zonelists.  Total pages: 32250
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K memmap=228K$3669788K memmap=131072K$3932160K memmap=20480K$4173824K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdb=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ c000000 size 64 MB
CPU 1: aperture @ c000000 size 64 MB
Memory: 118524k/147440k available (2494k kernel code, 12532k reserved, 1263k data, 200k init)
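
As a cross-check of the log above, the `memmap=<size>K@<start>K` entries on the kdump command line can be converted to byte ranges and compared with the "user-defined physical RAM map". This is only a sketch of the arithmetic; the function name `memmap_ranges` is made up, and it deliberately skips `memmap=exactmap` and the `$` (reserved) entries.

```shell
#!/bin/sh
# Convert "memmap=<size>K@<start>K" entries from a kernel command line
# into "start - end" byte ranges, printed as 16-digit hex.
memmap_ranges() {
    for arg in $1; do
        case "$arg" in
            memmap=*K@*K)
                spec=${arg#memmap=}
                size_k=${spec%%K@*}
                start_k=${spec#*@}
                start_k=${start_k%K}
                printf '%016x - %016x\n' \
                    $(( start_k * 1024 )) $(( (start_k + size_k) * 1024 ))
                ;;
        esac
    done
}

memmap_ranges 'memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K'
# prints:
# 0000000000000000 - 00000000000a0000
# 0000000001000000 - 000000000150e000
# 00000000015ae000 - 0000000008ffc000
```

The three ranges match the usable "user:" lines in the log, and the last range ends at 147440K, which is the elfcorehdr= address.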

This is the only machine on which I could reproduce the problem. I have tested the following machines, including xw9300, several times without any problem.

hp-xw9300-01.rhts.bos.redhat.com
hp-xw8600-01.rhts.bos.redhat.com
hp-xw8400-01.rhts.bos.redhat.com
hp-xw6800-02.rhts.bos.redhat.com
hp-xw6800-01.rhts.bos.redhat.com
hp-xw6600-02.rhts.bos.redhat.com
hp-xw6600-01.rhts.bos.redhat.com
hp-xw6400-01.rhts.bos.redhat.com
hp-xw4800-01.rhts.bos.redhat.com
hp-xw4550-01.rhts.bos.redhat.com

Comment 33 Neil Horman 2009-02-01 23:41:05 UTC
See comment 21.  Can you try this with acpi=off added to the command line for the kdump kernel as a workaround?  As per bz 475652, we had to remove the hpet fix from the kernel to avoid breaking some systems without hpets available.  We'll use this bug to get that squared away for 5.4.  If you could confirm that disabling acpi in the kdump kernel is an effective workaround, that would be helpful.  Thank you!

Comment 34 Qian Cai 2009-02-02 01:41:26 UTC
Well, the kexec-tools package used for testing already includes the fix for

Bug 475843 - kdump boot hangs in msleep on several HP XW systems

https://bugzilla.redhat.com/show_bug.cgi?id=475843#c3 stated that it is going to
fix both the HPET and serial console hangs. Indeed, it looks like it fixes the problem on all HP XW machines except xw9400 (and perhaps xw9300, mentioned in comment #31, although I have not been able to reproduce it in house).

"acpi=off" is not going to be a workaround at the moment, because it will run into

Bug 470202 - Kernel Panic at pci_scan_bus_parented+0xa/0x1f on HP XW Workstation with "acpi=off"

Comment 35 Neil Horman 2009-02-02 03:19:01 UTC
Ugh, that's right.  OK, so I just need to get the hpet fix in for systems that have hpet timers, and adjust it so that systems without hpets are unaffected.  I'll work up a patch soon.  Thanks!

Comment 36 Doug Chapman 2009-02-02 15:59:04 UTC
Cai, Neil,

The problem on the XW9400 is being tracked in BZ 477032

Comment 37 Neil Horman 2009-02-02 19:51:29 UTC
Doug, are we sure this is the same problem?  I was under the impression that this bug was specifically related to the issue in which the hpet wouldn't re-init if it wasn't shut down properly prior to kdump boot, whereas bz 477032 is the apic initialization bug.  Am I missing something?

Comment 38 Doug Chapman 2009-02-02 20:17:30 UTC
(In reply to comment #37)
> Doug, are we sure this is the same problem?  I was under the impression that
> this bug was specifically related to the issue in which the hpet wouldn't
> re-init if it wasn't shut down properly prior to kdump boot, whereas bz 477032
> is the apic initialization bug.  Am I missing something?

I have heard the hpet problem described, but in my investigation I have _never_ seen it on any of the HP systems.  I can see how it might appear that the problem lies in the HPET, and it might on other systems, but the problem that Cai describes in comment #32 is most certainly the same ioapic issue I have been tracking down in BZ 477032.

Comment 39 RHEL Program Management 2009-02-16 15:08:10 UTC
Updating PM score.

Comment 43 Neil Horman 2009-04-02 13:46:52 UTC
Cai, I've tried reproducing this on both the 8800 and the 9400 recently, and am unable to.  Are you able to reproduce this anywhere at the moment?

Comment 44 Qian Cai 2009-04-02 14:45:46 UTC
I don't think there is anything outstanding here. All HP XW machines except the 9400
are working right now in 5.4/5.3.z, thanks to the fix (in kexec-tools-1.102pre-57.el5) for

Bug 475843 -  kdump boot hangs in msleep on several HP XW systems

The 9400 issue is being tracked (currently under investigation) in

Bug 477032 -  kdump hang on HP xw9400 in calibrate_delay()

So, I'll close this one out. Please re-open it if needed.