Description of problem:
Kdump kernel hangs on Dell machines with AMD CPU.

...
Initializing CPU#0
CPU 0 irqstacks, hard=c1359000 soft=c1339000
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 122328k/146796k available (2134k kernel code, 8648k reserved, 891k data, 228k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xc9800000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
Using HPET for base-timer

With "acpi=off noacpi" or "hpet=off hpet=disabled" added, the kdump kernel hangs at a different place.

...
Initializing CPU#0
CPU 0 irqstacks, hard=c1359000 soft=c1339000
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 122328k/146796k available (2134k kernel code, 8648k reserved, 891k data, 228k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.

Adding "maxcpus=0" does not help either. PAE kernel, non-PAE kernel, and Xen Domain 0 kernel are all affected. Seen on two machines so far:

dell-pem605-01.rhts.bos.redhat.com
dell-per805-01.rhts.bos.redhat.com

I don't know whether it is a regression, because those machines were probably only recently added to RHTS, nor whether it affects IA-32 only.

Version-Release number of selected component (if applicable):
kernel-2.6.18-124.el5
kernel-PAE-2.6.18-124.el5
kernel-xen-2.6.18-124.el5
kexec-tools-1.102pre-50.el5

How reproducible:
Around 50% with bare metal kernel. Here are the test results on those two machines. All testing was done on IA-32.

dell-pem605:
bare metal kernel (sysrq-c): 2 FAIL - 3 PASS
Xen Domain 0 kernel (sysrq-c): 4 FAIL - 0 PASS

dell-per805:
bare metal kernel (sysrq-c): 1 FAIL - 2 PASS
Xen Domain 0 kernel (sysrq-c): 1 FAIL - 2 PASS

Steps to Reproduce:
1. configure kdump with crashkernel=128M@16M
2. echo c >/proc/sysrq-trigger
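For reference, the "around 50%" estimate can be checked against the pass/fail counts reported above. The following is only a small arithmetic sketch; the dictionary layout and the helper name failure_rate are illustrative and not part of any test harness used in the report.

```python
# Reported sysrq-c results from the description: (failures, passes).
results = {
    ("dell-pem605", "bare metal"): (2, 3),
    ("dell-pem605", "Xen Domain 0"): (4, 0),
    ("dell-per805", "bare metal"): (1, 2),
    ("dell-per805", "Xen Domain 0"): (1, 2),
}


def failure_rate(fails, passes):
    """Return the fraction of runs that failed."""
    return fails / (fails + passes)


for (machine, kernel), (fails, passes) in results.items():
    rate = failure_rate(fails, passes)
    print(f"{machine:12} {kernel:13} {rate:.0%} of {fails + passes} runs failed")

# Pool the bare-metal runs across both machines.
bare = [counts for (_, kernel), counts in results.items() if kernel == "bare metal"]
bare_fails = sum(f for f, _ in bare)
bare_total = sum(f + p for f, p in bare)
print(f"bare metal overall: {bare_fails}/{bare_total} runs failed")
```

With so few runs per configuration, these rates are rough; they mainly show that the hang is intermittent rather than deterministic.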
Created attachment 324968 [details] dell-pem605-01 kdump kernel hangs
Created attachment 324969 [details] dell-pem605-01 kdump kernel hangs with "acpi=off noacpi"
Created attachment 324971 [details] dell-pem605-01 normal kernel boots
Created attachment 324972 [details] dell-pem605-01 dmidecode
Created attachment 324974 [details] dell-pem605-01 cpuinfo
# uname -ra Linux dell-pem605-01.rhts.bos.redhat.com 2.6.18-124.el5PAE #1 SMP Mon Nov 17 17:11:02 EST 2008 i686 athlon i386 GNU/Linux
Created attachment 324975 [details] dell-per805-01 kdump kernel hangs
Created attachment 324976 [details] dell-per805-01 kdump kernel hangs with "acpi=off noacpi"
Created attachment 324977 [details] dell-per805-01 normal kernel boots
Created attachment 324979 [details] dell-per805-01 dmidecode
Created attachment 324980 [details] dell-per805-01 cpuinfo
# uname -ra Linux dell-per805-01.rhts.bos.redhat.com 2.6.18-124.el5PAE #1 SMP Mon Nov 17 17:11:02 EST 2008 i686 athlon i386 GNU/Linux
Adding "noapic noacpi acpi=off" to kdump kernel did not help either.
OK, so this system has never worked. Does it fail only with the PAE kernel, or with all kernels? In fact, why are you running the PAE kernel on this system? Isn't it 64-bit hardware?
As you can read from the bug description, PAE kernel, non-PAE kernel, and Xen Domain 0 kernel are all affected on both IA-32 and x86-64 architectures.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Note that this problem can be workaround if encountered by setting hpet=off in the KDUMP_COMMANDLINE_APPEND variable in /et/csysconfig//kdump.conf
I was talking with dchapman today, who recently found that on several HP systems with a similar problem, there are ACPI regions in the e820 map which are not marked as ACPI NVS/DATA, but simply as 'reserved'. As kdump ignores reserved sections, it doesn't map them in kdump kernels, causing all sorts of odd behavior. Explicitly mapping those segments made the systems work. A similar patch is working here. Doug is going to push a kexec and/or kernel patch upstream to universally map the reserved areas of RAM. Since this appears to be the same problem, I'm going to close this as a dup of the bug in which he is tracking this, bz 475843. Once Doug has his fixes pushed upstream, I'll pull them into the kernel and kexec for RHEL.

*** This bug has been marked as a duplicate of bug 475843 ***
I am afraid this is not the same as bug 475843. According to https://bugzilla.redhat.com/show_bug.cgi?id=475843#c16, that bug should be fixed by using kexec-tools-1.102pre-57.el5. However, I have tried kernel-PAE-2.6.18-128.el5 and kexec-tools-1.102pre-57.el5 here, and it did not solve the problem.

dell-pem605-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-128.el5PAE (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP8
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000100 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 00000000dfaa0000 (usable)
BIOS-e820: 00000000dfaa0000 - 00000000dfab6000 (reserved)
BIOS-e820: 00000000dfab6000 - 00000000dfad5c00 (ACPI data)
BIOS-e820: 00000000dfad5c00 - 00000000e0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 00000000000a0000 (usable)
user: 0000000001000000 - 0000000008f5b000 (usable)
...
Memory: 122328k/146796k available (2119k kernel code, 8612k reserved, 879k data, 228k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xc9800000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
Using HPET for base-timer

In addition, https://bugzilla.redhat.com/show_bug.cgi?id=475843#c3 says that the problem in that bug was that the kernel was unable to map the ACPI tables. However, the ACPI data region is clearly present here, at least for this bug:

BIOS-e820: 00000000dfab6000 - 00000000dfad5c00 (ACPI data)
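To make the e820 comparison easier to repeat on other logs, here is a minimal sketch that parses the BIOS-e820 lines quoted in this comment and lists the non-usable regions with their sizes. The regular expression and the helper name parse_e820 are illustrative assumptions; the sketch treats the second address on each line as an exclusive end, which is an assumption about how this kernel prints the map.

```python
import re

# BIOS-e820 lines copied from the dell-pem605-01 log in this comment.
LOG = """\
BIOS-e820: 0000000000000100 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 00000000dfaa0000 (usable)
BIOS-e820: 00000000dfaa0000 - 00000000dfab6000 (reserved)
BIOS-e820: 00000000dfab6000 - 00000000dfad5c00 (ACPI data)
BIOS-e820: 00000000dfad5c00 - 00000000e0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
"""

E820_RE = re.compile(r"BIOS-e820: ([0-9a-f]{16}) - ([0-9a-f]{16}) \((.+)\)")


def parse_e820(text):
    """Return (start, end, type) tuples; end is treated as exclusive."""
    return [(int(s, 16), int(e, 16), kind) for s, e, kind in E820_RE.findall(text)]


regions = parse_e820(LOG)
for start, end, kind in regions:
    if kind != "usable":
        size_kib = (end - start) // 1024
        print(f"{start:#012x} - {end:#012x}  {kind:10} {size_kib} KiB")
```

The same parser can be pointed at the dell-per805-01 logs later in this report to compare the two machines' layouts.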
(In reply to comment #21)
> Release note added. If any revisions are required, please set the
> "requires_release_notes" flag to "?" and edit the "Release Notes" field
> accordingly.
> All revisions will be proofread by the Engineering Content Services team.
>
> New Contents:
> Note that this problem can be workaround if encountered by setting hpet=off in
> the KDUMP_COMMANDLINE_APPEND variable in /et/csysconfig//kdump.conf

This is wrong. As I stated in comment #0,

Adding "acpi=off noacpi" or "hpet=off hpet=disabled", kdump kernel hangs at a different place.

There is basically no workaround.
Regarding your ACPI comment, it's not that we're not expressly mapping the ACPI regions; in fact, we are. The problem that Doug noted was that BIOS vendors will sometimes place ACPI data (or other ancillary data required to make various bits of hardware function properly) inside areas marked in the e820 tables as reserved (rather than ACPI). kdump was not mapping these areas into the kdump kernel, resulting in various odd failures. Cai, do you have the failing machine reserved in RHTS at the moment, and if so, which one? I'd like to poke about on it a bit and verify that we're now reserving all the memory regions properly. Thanks!
Cai, I've been working on dell-per805-01.rhts.bos.redhat.com, and using kexec-tools-1.102pre-57.el5, I can capture a vmcore with no problem. Can you please confirm? 57 is what's supposed to be shipping with 5.3, so if you're comfortable with this, I think we should be able to close this, since 57 is the version that starts mapping reserved sections of memory as we should be. Please confirm.
Neil, as you can see from comment #0, it might work sometimes, but the failure rate looks to be around 50%. Just be a little patient. :) I have just reproduced the problem on the same machine using kexec-tools-1.102pre-57.el5.

# rpm -q kexec-tools
kexec-tools-1.102pre-57.el5
# echo c >/proc/sysrq-trigger
SysRq : Trigger a crashdump
Linux version 2.6.18-128.el5PAE (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.28
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000100 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 00000000cfaa0000 (usable)
BIOS-e820: 00000000cfaa0000 - 00000000cfab6000 (reserved)
BIOS-e820: 00000000cfab6000 - 00000000cfad5c00 (ACPI data)
BIOS-e820: 00000000cfad5c00 - 00000000d0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 00000000000a0000 (usable)
user: 0000000001000000 - 0000000008f5b000 (usable)
...
Memory: 122584k/146796k available (2119k kernel code, 8352k reserved, 879k data, 228k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xc9800000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
Using HPET for base-timer
Interesting, kexec doesn't seem to pick up on adding the reserved areas every time, and that's what corresponds to the HPET hang. Should be pretty easy to track down. Odd behavior though; we just parse /proc/iomem to get that info. I wonder what's changing the behavior.
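As an illustration of the kind of /proc/iomem parsing mentioned above, the following sketch extracts top-level reserved and ACPI ranges from text in /proc/iomem format. It is not the kexec-tools implementation; the sample text is a hypothetical excerpt written for this example (not captured from the affected machines), and the function name top_level_ranges is made up.

```python
import re

# Hypothetical excerpt in /proc/iomem format; NOT taken from the affected
# machines. Ranges in /proc/iomem are inclusive at both ends.
SAMPLE = """\
00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
00100000-cfa9ffff : System RAM
  01000000-08ffffff : Crash kernel
cfaa0000-cfab5fff : reserved
cfab6000-cfad5bff : ACPI Tables
cfad5c00-cfffffff : reserved
"""

LINE_RE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.+)$")


def top_level_ranges(text, wanted):
    """Return (start, end, name) for unindented entries whose name is in wanted."""
    found = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        indent, start, end, name = match.groups()
        if indent == "" and name in wanted:
            found.append((int(start, 16), int(end, 16), name))
    return found


ranges = top_level_ranges(SAMPLE, {"reserved", "ACPI Tables"})
for start, end, name in ranges:
    print(f"{start:#010x}-{end:#010x} {name}")
```

Running something like this against the real /proc/iomem before each kexec load, on both a passing and a failing boot, would be one way to see whether the input really differs between runs.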
I think I see at least part of the problem. The PAE kernel maps some of its RAM to a different location than what we normally find in the original e820 map, and so on kdump reboot we have some RAM remapped in the physical e820 map using an address that is inaccessible until later during the boot. I need to figure out how to override that.
Created attachment 328586 [details] patch to add reserved sections to i386 kexec

Additionally, we'll also need this patch to kexec. Even though this is a 64-bit system, it's running a 32-bit OS, and kexec has a per-arch command line generation setup, so the code we added to x86_64 to add reserved & ACPI e820 sections needs to be copied over to x86. This patch does that. So all I need to figure out now is how to fix up the remapped physical memory issue (I think).
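To illustrate what passing reserved and ACPI sections on the kernel command line can look like, here is a sketch that formats memmap= arguments for the non-usable regions of the dell-per805-01 BIOS-e820 map quoted in this report. The separators follow the kernel's documented memmap= parameter syntax ('@' for usable RAM, '#' for ACPI data, '$' for reserved). Whether the attached patch emits arguments in exactly this form is an assumption, and the helper name memmap_arg is made up for this example.

```python
# (start, exclusive end, type) for the non-usable regions below 4 GiB that
# directly follow usable RAM in the dell-per805-01 BIOS-e820 map.
REGIONS = [
    (0xcfaa0000, 0xcfab6000, "reserved"),
    (0xcfab6000, 0xcfad5c00, "ACPI data"),
    (0xcfad5c00, 0xd0000000, "reserved"),
]

# Separator characters from the kernel's memmap= parameter syntax.
SEPARATOR = {"usable": "@", "ACPI data": "#", "reserved": "$"}


def memmap_arg(start, end, kind):
    """Format one region as memmap=<size><sep><start>, with sizes in bytes (hex)."""
    return f"memmap={end - start:#x}{SEPARATOR[kind]}{start:#x}"


args = [memmap_arg(*region) for region in REGIONS]
print(" ".join(args))
```

Note that '$' usually needs quoting when such a string passes through a shell or a bootloader configuration file.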
Created attachment 328782 [details] new version of patch to add reserved sections to i386 kexec

Grr, it looks like even with the above patch, we still hang on the HPET timer. I'll need to dig in further to find exactly where we're hanging.
Grr, just tracked this down. We're getting stuck in calibrate_delay. Given that these are quad-core processors on an HT bus, I strongly suspect that this is a duplicate of bz 462519, the fix for which is a much earlier initialization of the APIC, which I am trying to figure out. I'm going to close this as a dupe of that, and we can re-open if/when I figure out how to handle the APIC movement properly.

*** This bug has been marked as a duplicate of bug 462519 ***
FYI, I have seen the kdump kernel fail on another two Dell machines during RHEL5.4 testing, which look like they have the same issue:

dell-pem805-01.rhts.bos.redhat.com
dell-pem905-01.rhts.bos.redhat.com

We can, however, get a little further, beyond the line "Using HPET for base-timer", by using

kernel-2.6.18-153.el5
kexec-tools-1.102pre-73.el5

...
Using HPET for base-timer
Calibrating delay loop (skipped), value calculated using timer frequency.. 4000)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0(4) -> Core 2
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... irq 106, desc: c12f2380, depth: 1, count: 0, unha0
->handle_irq(): c104b61e, handle_bad_irq+0x0/0x1a6
->chip(): c1288d80, 0xc1288d80
->action(): 00000000
IRQ_DISABLED set
unexpected IRQ trap at vector 6a
irq 114, desc: c12f2780, depth: 1, count: 0, unhandled: 0
->handle_irq(): c104b61e, handle_bad_irq+0x0/0x1a6
->chip(): c1288d80, 0xc1288d80
->action(): 00000000
IRQ_DISABLED set
unexpected IRQ trap at vector 72

So the number of affected machines in RHTS has apparently increased to 4:

dell-pem805-01.rhts.bos.redhat.com
dell-pem905-01.rhts.bos.redhat.com
dell-pem605-01.rhts.bos.redhat.com
dell-per805-01.rhts.bos.redhat.com
Neil, this looks like it turns out to be something different than the 32-bit variant of

Bug 462519 - Tracking Early Init Apic fix for kdump issues

because the kdump kernel still hangs when using kernel-2.6.18-156.el5 and kexec-tools-1.102pre-75.el5.

Red Hat Enterprise Linux Server release 5.4 Beta (Tikanga)
Kernel 2.6.18-156.el5PAE on an i686
dell-per805-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-156.el5PAE (mockbuild.redhat.com) (gcc v9
BIOS-provided physical RAM map:
BIOS-e820: 0000000000010000 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 00000000cfaa0000 (usable)
BIOS-e820: 00000000cfaa0000 - 00000000cfab6000 (reserved)
BIOS-e820: 00000000cfab6000 - 00000000cfad5c00 (ACPI data)
BIOS-e820: 00000000cfad5c00 - 00000000d0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 00000000000a0000 (usable)
user: 0000000001000000 - 0000000008f5b000 (usable)
0MB HIGHMEM available.
143MB LOWMEM available.
found SMP MP-table at 000fe710
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
NX (Execute Disable) protection: active
DMI 2.5 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0x508
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 0:2 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled)
Processor #4 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x05] enabled)
Processor #5 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x02] enabled)
Processor #2 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] enabled)
Processor #6 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x03] enabled)
Processor #3 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled)
Processor #7 0:2 APIC version 16
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xd7ffe000] gsi_base[32])
IOAPIC[1]: apic_id 9, version 17, address 0xd7ffe000, GSI 32-55
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Enabling APIC mode: Flat. Using 2 I/O APICs
ACPI: HPET id: 0x10de8201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 10000000 (gap: 08f5b000:f70a5000)
Detected 2300.279 MHz processor.
Built 1 zonelists. Total pages: 36699
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS1,115200 irqK
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c135d000 soft=c133d000
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 122200k/146796k available (2151k kernel code, 8824k reserved, 886k data)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xc9800000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
Using HPET for base-timer
Calibrating delay loop (skipped), value calculated using timer frequency.. 4600)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0(4) -> Core 3
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... <hanging ...>

Do you want me to create a new bug or set it to Assigned?
OK, looks like the patch has not been integrated yet. Please disregard comment #35 and #36.
Open a new bug, please; the above log looks like we are getting further than we did previously.
OK. A new bug has been filed here, Bug 510645 - Kdump Kernel Stops on Dell Machines at: Checking 'hlt' instruction
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Note that this problem can be workaround if encountered by setting hpet=off in the KDUMP_COMMANDLINE_APPEND variable in /et/csysconfig//kdump.conf
+Note that this problem can be workaround if encountered by setting hpet=off in the KDUMP_COMMANDLINE_APPEND variable in /etc/sysconfig/kdump.conf
in kernel-2.6.18-168.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Deleted Release Notes Contents. Old Contents: Note that this problem can be workaround if encountered by setting hpet=off in the KDUMP_COMMANDLINE_APPEND variable in /etc/sysconfig/kdump.conf
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
After verification, this issue has been fixed in Red Hat 5.5 32-bit, but it still happens in Red Hat 5.5 64-bit. It needs to be re-opened.
Adding Ken to the thread.
Just curious if the BIOSes running on these systems are the latest. -Shyam Iyer Dell Onsite Engineer