Bug 473038
Summary: [5.3] Kdump Kernel Hangs on HP XW Machines
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Target Milestone: rc
Reporter: Qian Cai <qcai>
Assignee: Neil Horman <nhorman>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: clalance, cwyse, dchapman, duck, dzickus, mgahagan, prarit, syeghiay, tao
Doc Type: Bug Fix
Last Closed: 2009-04-02 14:45:46 UTC
Bug Blocks: 483701, 485920

Description
Qian Cai
2008-11-26 08:23:02 UTC
Can you bisect and tell me what the last kdump kernel to boot properly on this system was?

(In reply to comment #1)
> Can you bisect and tell me what the last kdump kernel to boot properly on this
> system was?
It is not a regression, and it seems never to have worked before.

Hp-xw9300 is indeed also affected, even with the -125.el5 kernel. It may just have a lower rate of failure than the other machines listed in comment #0.
BIOS Information
Vendor: Hewlett-Packard
Version: 786B9 v2.09
Release Date: 11/28/2006

I'm reserving xw9400-02 to try to reproduce this for myself.

Created attachment 325878
patch to disable hpet timer on crash shutdown
This patch solved the problem for me on hp-xw9400-02. It's a backport of upstream commit 0c1b2724069951b1902373e688042b2ec382f68f, and disables the hpet on shutdown so as to allow it to re-init on kdump kernel boot without hanging. I've tested it on x86_64, but not x86. Bear in mind that these machines also suffer from the gart errors that Doug Chapman has been chasing in bz 463144, so you may need the patches from that bug as well in your testing. Please confirm that this solves your problem, and we'll figure out when to get this in.

in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

We have 2 different problems here. There is the HPET issue where it hangs at:
Using HPET for base-timer
and another seemingly unrelated issue where it hangs at:
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
The hpet fix in the -126 kernel does not fix the Serial driver hang. I still see that hang on:
hp-xw8600-01

Also, for the HPET issue, I think that could be worked around by adding "acpi=off" to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump. From looking at the code it appears this will prevent the kdump kernel from using hpet and hopefully will resolve the issue.
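For reference, the suggested workaround might look roughly like the fragment below. This is only a sketch: the pre-existing value "irqpoll maxcpus=1 reset_devices" is an assumption inferred from the kdump kernel command line shown in this bug, so keep whatever your file already contains and append acpi=off to it. Note that later comments in this bug report that acpi=off runs into Bug 470202 on HP XW workstations, so this is not a confirmed fix.

```shell
# /etc/sysconfig/kdump (fragment)
# Assumption: the existing value is "irqpoll maxcpus=1 reset_devices";
# only the trailing acpi=off is the change suggested in this bug.
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 reset_devices acpi=off"
```

After editing the file, the kdump service presumably has to be restarted (for example with "service kdump restart" on RHEL 5) so that the crash kernel is reloaded with the new command line.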
So far I have not been able to confirm that because I have not been able to get my hands on a system that exhibits the HPET kdump hang.

I created a new BZ # 475843 for the serial hang since that seems to be separate from the HPET issue.

The HPET issue is tracked with bug 475652. Since this causes a regression, we are planning to pull this patch out of 5.3. Bug 475652 will be used to back out the patch. Once the patch is out, we will move this bug back to ASSIGNED.

The patch that went into kernel-2.6.18-126.el5 in the following changelog:
* Mon Dec 08 2008 Don Zickus [2.6.18-126.el5]
- [x86] disable hpet on machine_crash_shutdown (Neil Horman) [473038]
has been reverted from 5.3 w/ kernel-2.6.18-127.el5 under the following changelog:
* Mon Dec 15 2008 Don Zickus [2.6.18-127.el5]
- Revert: [x86] disable hpet on machine_crash_shutdown (Neil Horman) [475652]
Please see bug 475652 and bug 475843 for more detailed info.

*** Bug 468000 has been marked as a duplicate of this bug. ***

*** Bug 283191 has been marked as a duplicate of this bug. ***

Tested with kexec-tools-1.102pre-56.el5_3.1 from an update from RHEL5.3. Looks like the problem is still in the 2.6.18-128.el5 kernel on xw9400s and xw9300s.
This event sent from IssueTracker by cwyse
issue 236843

Sorry for the late reply. I was on vacation for the last few days. I have been able to reproduce the same problem on hp-xw9400-02.rhts.bos.redhat.com using the same versions of the kexec-tools and kernel packages as above.
# echo c >/proc/sysrq-trigger
SysRq : Trigger a crashdump
Linux version 2.6.18-128.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Dec 17 11:41:38 EST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K memmap=228K$3669788K memmap=131072K$3932160K memmap=20480K$4173824K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009d000 (usable)
 BIOS-e820: 000000000009d000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffc7000 (usable)
 BIOS-e820: 00000000dffc7000 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 000000000150e000 (usable)
 user: 00000000015ae000 - 0000000008ffc000 (usable)
 user: 00000000dffc7000 - 00000000e0000000 (reserved)
 user: 00000000f0000000 - 00000000f8000000 (reserved)
 user: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.5 present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-80000000
SRAT: Node 1 PXM 1 80000000-e0000000
SRAT: Node 1 PXM 1 80000000-120000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Processor #3 15:1 APIC version 16
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfa400000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 17, address 0xfa400000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to physical flat
ACPI: HPET id: 0x10de8201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 000000000150e000 - 00000000015ae000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:d6fcb000)
SMP: Allowing 8 CPUs, 4 hotplug CPUs
Built 1 zonelists.
Total pages: 32250
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K memmap=228K$3669788K memmap=131072K$3932160K memmap=20480K$4173824K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdb=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ c000000 size 64 MB
CPU 1: aperture @ c000000 size 64 MB
Memory: 118524k/147440k available (2494k kernel code, 12532k reserved, 1263k data, 200k init)

This is the only machine on which I could reproduce the problem. I have tested on the following machines, including xw9300, several times without any problem.
hp-xw9300-01.rhts.bos.redhat.com
hp-xw8600-01.rhts.bos.redhat.com
hp-xw8400-01.rhts.bos.redhat.com
hp-xw6800-02.rhts.bos.redhat.com
hp-xw6800-01.rhts.bos.redhat.com
hp-xw6600-02.rhts.bos.redhat.com
hp-xw6600-01.rhts.bos.redhat.com
hp-xw6400-01.rhts.bos.redhat.com
hp-xw4800-01.rhts.bos.redhat.com
hp-xw4550-01.rhts.bos.redhat.com

See comment 21. Can you try this with acpi=off added to the command line for the kdump kernel as a workaround? As per bz 475652, we had to remove the hpet fix in the kernel to avoid breaking some systems without hpets available. We'll use this bug to get that squared away for 5.4. If you could confirm that disabling acpi in the kdump kernel is an effective workaround, that would be helpful. Thank you!

Well, the kexec-tools package used for testing has included the fix for
Bug 475843 - kdump boot hangs in msleep on several HP XW systems
https://bugzilla.redhat.com/show_bug.cgi?id=475843#c3 stated that it is going to fix both the HPET and the serial console hanging bugs.
Indeed, it looks like it fixes the problem for all HP XW machines except xw9400 (and perhaps xw9300, mentioned in comment #31, although I have not been able to reproduce it in house). "acpi=off" is not going to be a workaround at the moment, because it will suffer from this issue:
Bug 470202 - Kernel Panic at pci_scan_bus_parented+0xa/0x1f on HP XW Workstation with "acpi=off"

Ugh, that's right. OK, so I just need to get the hpet fix in for systems that have hpet timers, and adjust it so that systems without hpets are unaffected. OK, I'll work up a patch soon. Thanks!

Cai, Neil,
The problem on the XW9400 is being tracked in BZ 477032

Doug, are we sure this is the same problem? I was under the impression that this bug was specifically related to the issue in which the hpet wouldn't re-init if it wasn't shut down properly prior to kdump boot, whereas bz 477032 is the apic initialization bug. Am I missing something?

(In reply to comment #37)
> Doug, are we sure this is the same problem? I was under the impression that
> this bug was specifically related to the issue in which the hpet wouldn't
> re-init if it wasn't shut down properly prior to kdump boot, whereas bz 477032
> is the apic initialization bug. Am I missing something?
I have heard the hpet problem described, but from my investigation I have _never_ seen that on any of the HP systems. I can see how it might appear that the problem lies in HPET, and it might on other systems, but the problem that Cai describes in comment #32 is most certainly the same issue I have been tracking down with the ioapic in BZ 477032.

Updating PM score.

Cai, I've tried reproducing this on both the 8800 and the 9400 recently, and am unable to. Are you able to reproduce this anywhere at the moment?

I don't think there is anything outstanding here.
All other HP XW machines except 9400 are working right now in 5.4/5.3.z thanks to the fix (in kexec-tools-1.102-pre57.el5) for
Bug 475843 - kdump boot hangs in msleep on several HP XW systems
The 9400 issue is being tracked (currently under investigation) in
Bug 477032 - kdump hang on HP xw9400 in calibrate_delay()

So, I'll close this one out. Please re-open it if needed.
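
Since several different hangs were mixed together in this bug, it may help to tell them apart mechanically from a saved serial console log. The sketch below checks the last non-empty line of a log against the two hang signatures quoted earlier in this bug. The function name, the output labels, and the idea of saving the console output to a file are assumptions made for illustration; nothing like this was used in the bug itself.

```shell
#!/bin/sh
# Hypothetical helper: classify a kdump boot hang from a saved serial
# console log. The two signature strings are the ones quoted in this bug;
# the function name and the hpet/serial/unknown labels are made up here.
classify_kdump_hang() {
    log="$1"
    # The last non-empty line is the last message printed before the hang.
    last=$(grep -v '^[[:space:]]*$' "$log" | tail -n 1)
    case "$last" in
        *"Using HPET for base-timer"*) echo "hpet" ;;
        *"Serial: 8250/16550 driver"*) echo "serial" ;;
        *) echo "unknown" ;;
    esac
}
```

A log that stops at the "Memory: ... available" line, like the xw9400 log pasted in this bug, matches neither signature and would be reported as "unknown", which fits the conclusion that the xw9400 hang is a separate issue (BZ 477032).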