Description of problem:
HP XW9400 systems hang on kdump. The last line on the console is:

Memory: 247608k/278512k available (2494k kernel code, 14504k reserved, 1262k data, 200k init)

Adding some printk calls shows that the hang is in calibrate_delay_direct(). I am still digging, but it appears that the kernel is not getting good values from the clock. Adding "nohpet" did not fix this. However, manually passing an lpj (loops per jiffy) parameter to the kdump kernel does appear to work around the issue. Specifically, I am using lpj=2602224, which I found in dmesg from a standard kernel boot.

Version-Release number of selected component (if applicable):
kernel-2.6.18-120.el5 (but likely all RHEL5.x kernels)
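For anyone who wants to repeat the lpj workaround, here is a minimal Python sketch of pulling the value out of saved dmesg text. It assumes the value appears in the "(lpj=NNN)" form that the kernel prints on its BogoMIPS line; the sample line is copied from the kdump boot log posted later in this bug, and the function names are hypothetical. How the resulting argument is passed to the kdump kernel depends on the local kexec-tools configuration and is not shown.

```python
import re

# Matches the "(lpj=NNN)" suffix that the kernel prints on its BogoMIPS line.
LPJ_RE = re.compile(r"\(lpj=(\d+)\)")


def find_lpj(dmesg_text):
    """Return the first lpj value found in dmesg text, or None if absent."""
    match = LPJ_RE.search(dmesg_text)
    return int(match.group(1)) if match else None


def lpj_argument(dmesg_text):
    """Format the value as a kernel command-line argument, or None if absent."""
    lpj = find_lpj(dmesg_text)
    return None if lpj is None else "lpj=%d" % lpj


# Sample line taken from the kdump boot log quoted later in this bug.
sample = ("Calibrating delay loop (skipped), value calculated using "
          "timer frequency.. 4399.99 BogoMIPS (lpj=2199999)")
print(lpj_argument(sample))
```

The value should be taken from a normal boot of the same machine, as described above, since it depends on the CPU speed.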
Adding John Brown @ WGBU as an fyi...
Here are my findings so far. I will continue to dig into this after the holiday break.

This does not appear to be directly related to bug 475843, although the 9400 has that problem as well, so I am using the fixed kexec-tools on this system. Upstream kernels work OK on the xw9400, so hopefully a fix can be backported.

The root of the problem is that we are not getting timer interrupts at all during kdump. The hang is in calibrate_delay, in a loop that waits for "jiffies" to change. Initially I suspected the hpet code, but booting the kdump kernel with "nohpet" results in the same hang.

The kdump kernel is booted with the "irqpoll" option by default. Removing it makes no difference. However, if I boot both the initial kernel and the kdump kernel with irqpoll, it does not hang. So something appears to be wrong with how interrupts are configured.

I will dig more in a couple of weeks, but suggestions are certainly welcome.
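To illustrate why a missing timer interrupt turns into a hard hang at this point, here is a toy Python model of a loop that waits for a tick counter to change. It is not the kernel code: the tick sources and the max_spins bound are hypothetical, and the bound exists only so the demonstration terminates. The loop described above has no such limit.

```python
import itertools


def wait_for_tick(read_jiffies, max_spins):
    """Spin until the tick counter changes; give up after max_spins reads.

    Returns the number of reads it took, or None if the counter never moved.
    """
    start = read_jiffies()
    for spins in range(1, max_spins + 1):
        if read_jiffies() != start:
            return spins
    return None


def ticking_counter(period):
    """Hypothetical tick source: the value advances once every `period` reads."""
    reads = itertools.count()
    return lambda: next(reads) // period


def frozen_counter():
    """Hypothetical tick source with no timer interrupts: it never advances."""
    return lambda: 0


print(wait_for_tick(ticking_counter(5), 1000))  # a tick arrives, the loop exits
print(wait_for_tick(frozen_counter(), 1000))    # no tick ever arrives
```

With a working tick source the wait ends as soon as the counter moves. With a frozen one, only the artificial bound ends it, which matches the symptom of the boot stopping silently after the "Memory:" line.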
I have done a little more digging on this and have found the following.

This is specific to the xw9400 platform. I tried a bunch of other xw servers and RHEL5.3 can create a good dump on those (using the updated kexec-tools from bug 475843). It works with kernel 2.6.27 from upstream, so we know it _can_ be fixed in software.

The core of the problem is that on the kdump boot we are not getting any interrupts through the 8259A (aka XT-PIC). It appears that the XW9400 does not cope with switching from the IO-APIC back to the 8259A. Evidently the upstream kernel performs this switchover differently, which allows it to work. Since the upstream code is so dramatically different from the RHEL5.x kernel code, I have not been able to find anything obvious.

There are 2 workarounds (FYI, both require the fixed kexec-tools):

1. Boot the initial kernel with "noapic". This prevents the hardware from ever switching to the IO-APIC for interrupts, so we never have to switch back to the 8259A. This is obviously not a good workaround for customers.

2. Add an "lpj=" (loops per jiffy) argument to the kexec boot. I am using lpj=2602232, which I found in dmesg from a normal boot. Honestly, for kdump I don't think we need to be too picky about this value. This seems to be a usable workaround. It works because the code that calculates lpj is the _only_ code (that I have hit) that uses "jiffies" before the system switches back over to the IO-APIC.

So, a question for John Brown: do you know of anything unique to the xw9400 that might cause this issue with switching from the IO-APIC back to the 8259A? As I mentioned, this works upstream, and if I had a better understanding of what is different about the hardware I might be able to find exactly which bit of upstream code fixed this.
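On the remark in workaround 2 that the lpj value does not need to be exact, a small Python sketch of the arithmetic may help. It uses the kernel's BogoMIPS relation (lpj * HZ / 500000) and assumes HZ=1000, which is consistent with the "4399.99 BogoMIPS (lpj=2199999)" line in the kdump boot log posted later in this bug. The helper names are hypothetical.

```python
HZ = 1000  # assumption: timer ticks per second configured in this kernel


def bogomips(lpj, hz=HZ):
    """BogoMIPS figure corresponding to a loops-per-jiffy value."""
    return lpj * hz / 500000.0


def relative_difference(a, b):
    """Relative gap between two lpj values measured on different boots."""
    return abs(a - b) / float(max(a, b))


# The two values quoted in this bug, each taken from a normal boot.
first, second = 2602224, 2602232

print(bogomips(first), bogomips(second))
print(relative_difference(first, second))
```

The two values quoted in this bug differ by 8 loops out of about 2.6 million, so run-to-run variation on this machine is already far smaller than anything a crash-capture kernel is likely to care about.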
This sounds an awful lot like bz 462519. Feel free to take a look and close it as a dup if you think it fits.
Neil, I tried the patch from BZ 462519 on this box and the kernel (normal boot, not kdump) fails to boot:

ACPI: Core revision 20060707
initalizing for 255 cpus
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter

I also get the same failure if I boot the standard kernel and just use the patched kernel for kdump.
Oops, ignore my previous comment; I was using the wrong patch. With the right patch it does look like this might fix the issue. I am going to try this a few more times, but it looks like this is indeed the same issue. I see the other bug is a customer issue and its initial description is marked private. Can you share anything about what sort of system the customer was seeing that bug on?
Doug, Bug 473403 - [5.3] Kdump Kernel Hangs on Dell AMD Machines has also been marked as a duplicate of bug 462519.
The kdump kernel still hangs on the in-house hp-xw9400 (hp-xw9400-02.rhts.bos.redhat.com) using the latest RHEL5.4 components:

kexec-tools-1.102pre-75.el5
kernel-2.6.18-156.el5

Red Hat Enterprise Linux Server release 5.4 Beta (Tikanga)
Kernel 2.6.18-156.el5 on an x86_64

hp-xw9400-02.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
CFG = c1
Linux version 2.6.18-156.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Mon Jun 29 18:16:54 EDT 2009
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5272K@16384K memmap=125144K@22296K elfcorehdr=147440K memmap=232K$3669784K memmap=131072K$3932160K memmap=20480K$4173824K
BIOS-provided physical RAM map:
BIOS-e820: 0000000000010000 - 000000000009b000 (usable)
BIOS-e820: 000000000009b000 - 00000000000a0000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffc6100 (usable)
BIOS-e820: 00000000dffc6100 - 00000000e0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 00000000000a0000 (usable)
user: 0000000001000000 - 0000000001526000 (usable)
user: 00000000015c6000 - 0000000008ffc000 (usable)
user: 00000000dffc6000 - 00000000e0000000 (reserved)
user: 00000000f0000000 - 00000000f8000000 (reserved)
user: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.5 present.
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 4 -> Node 0
SRAT: PXM 0 -> APIC 5 -> Node 0
SRAT: PXM 1 -> APIC 8 -> Node 1
SRAT: PXM 1 -> APIC 9 -> Node 1
SRAT: PXM 1 -> APIC 10 -> Node 1
SRAT: PXM 1 -> APIC 11 -> Node 1
SRAT: PXM 1 -> APIC 12 -> Node 1
SRAT: PXM 1 -> APIC 13 -> Node 1
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-80000000
SRAT: Node 1 PXM 1 80000000-e0000000
SRAT: Node 1 PXM 1 80000000-120000000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x08] enabled)
Processor #8 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x09] enabled)
Processor #9 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x02] enabled)
Processor #2 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x0a] enabled)
Processor #10 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x03] enabled)
Processor #3 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x0b] enabled)
Processor #11 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x04] enabled)
Processor #4 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0c] enabled)
Processor #12 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x05] enabled)
Processor #5 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0d] enabled)
Processor #13 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x10] lapic_id[0x0f] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfa400000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 17, address 0xfa400000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to physical flat
ACPI: HPET id: 0x10de8201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001526000 - 00000000015c6000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:d6fca000)
SMP: Allowing 16 CPUs, 4 hotplug CPUs
Built 1 zonelists. Total pages: 32248
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices hdb=cdrom memmap=exactmap memmap=640K@0K memmap=5272K@16384K memmap=125144K@22296K elfcorehdr=147440K memmap=232K$3669784K memmap=131072K$3932160K memmap=20480K$4173824K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdb=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
CPU 0: aperture @ c000000 size 64 MB
CPU 1: aperture @ c000000 size 64 MB
ACPI: DMAR not present
Memory: 117688k/147440k available (2550k kernel code, 13368k reserved, 1291k data, 208k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 4399.99 BogoMIPS (lpj=2199999)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0/3 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 3
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Using local APIC timer interrupts.
result 12499990
Detected 12.499 MHz APIC timer.
Brought up 1 CPUs
testing NMI watchdog ... OK.
time.c: Using 25.000000 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 800.005 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 3579k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
FOUND MCP55 CHIP
cfg value is c1
PCI: Transparent bridge - 0000:00:06.0
bus 0 -> pxm 0 -> node 0
ACPI: PCI Root Bridge [PCI1] (0000:40)
bus 64 -> pxm 1 -> node -1
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPA] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPC] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LXPD] (IRQs 5 7 10 16 17 18 19 20 21 22 *23)
ACPI: PCI Interrupt Link [LXA2] (IRQs *40), disabled.
ACPI: PCI Interrupt Link [LXB2] (IRQs *41), disabled.
ACPI: PCI Interrupt Link [LXC2] (IRQs *42), disabled.
ACPI: PCI Interrupt Link [LXD2] (IRQs *43)
ACPI: PCI Interrupt Link [LSMB] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSB0] (IRQs 5 7 10 16 17 18 19 20 *21 22 23)
ACPI: PCI Interrupt Link [LSB2] (IRQs 5 7 10 16 17 18 19 20 21 *22 23)
ACPI: PCI Interrupt Link [LMC0] (IRQs 5 7 10 16 *17 18 19 20 21 22 23)
ACPI: PCI Interrupt Link [LMC1] (IRQs 5 7 10 16 17 18 19 20 21 22 *23)
ACPI: PCI Interrupt Link [LAZA] (IRQs 5 7 10 *16 17 18 19 20 21 22 23)
ACPI: PCI Interrupt Link [LIDE] (IRQs 5 7 10 16 17 18 19 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSA0] (IRQs 5 7 10 16 17 18 19 *20 21 22 23)
ACPI: PCI Interrupt Link [LSA1] (IRQs 5 7 10 16 17 18 *19 20 21 22 23)
ACPI: PCI Interrupt Link [LSA2] (IRQs 5 7 10 16 17 *18 19 20 21 22 23)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 17 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
ACPI: DMAR not present
PCI-DMA: Disabling IOMMU.
pnp: 00:0b: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:0c: ioport range 0x400-0x47f could not be reserved
pnp: 00:0c: ioport range 0x480-0x48f has been reserved
pnp: 00:0c: ioport range 0x4c0-0x4cb has been reserved
pnp: 00:0c: ioport range 0xe000-0xe07f has been reserved
pnp: 00:0c: ioport range 0xe080-0xe0ff has been reserved
pnp: 00:0c: ioport range 0xf200-0xf27f has been reserved
pnp: 00:0c: ioport range 0xf280-0xf2ff has been reserved
PCI: Bridge: 0000:00:06.0
IO window: disabled.
MEM window: fa100000-fa1fffff
PREFETCH window: disabled.
PCI: Bridge: 0000:2b:00.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:2b:00.1
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0d.0
IO window: disabled.
MEM window: fa000000-fa0fffff
PREFETCH window: disabled.
PCI: mem resource #6:20000@f0000000 for 0000:18:00.0 was not allocated.
PCI: Bridge: 0000:00:0f.0
IO window: 3000-3fff
MEM window: f8000000-f9ffffff
PREFETCH window 0x00000000e0000000-0x00000000efffffff
ACPI: PCI Interrupt Link [LXPD] enabled at IRQ 23
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:2b:00.1[A] -> Link [LXPD] -> GSI 23 (level, high) -> IRQ 169
PCI: Bridge: 0000:40:0d.0
IO window: 1000-1fff
MEM window: fa300000-fa3fffff
PREFETCH window 0x00000000fa500000-0x00000000fa5fffff
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 1, 8192 bytes)
TCP established hash table entries: 4096 (order: 4, 65536 bytes)
TCP bind hash table entries: 2048 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 2048)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1246503741.523:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key 151471C81E0E52F6
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
pci 0000:00:00.0: Enabling HT MSI Mapping
pci 0000:00:05.0: Enabling HT MSI Mapping
pci 0000:00:05.1: Enabling HT MSI Mapping
pci 0000:00:05.2: Enabling HT MSI Mapping
pci 0000:00:06.0: Enabling HT MSI Mapping
pci 0000:00:06.1: Enabling HT MSI Mapping
pci 0000:00:08.0: Enabling HT MSI Mapping
pci 0000:00:09.0: Enabling HT MSI Mapping
pci 0000:00:0d.0: Enabling HT MSI Mapping
pci 0000:00:0f.0: Enabling HT MSI Mapping
pci 0000:40:00.0: Enabling HT MSI Mapping
pci 0000:40:0d.0: Enabling HT MSI Mapping
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
On another in-house xw9400 machine, kdump is also not working.

hp-xw9400-02.rhts.bos.redhat.com

SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG
<hung...>

It was using:
kernel-2.6.18-159.el5
kexec-tools-1.102pre-77.el5
Cai, is kdump still broken on the HP xw9400s? Thanks, P.
Cai, also, is there a cciss or hpsa array in this system? IIRC HP has declared both of those non-functional with kdump at the moment.
(In reply to comment #19)
> Cai, is kdump still broken on the HP xw9400s?
>
> Thanks,
>
> P.

I tested with the latest kernel and kexec-tools on hp-xw9400-02.rhts.eng.bos.redhat.com, and it still hangs!

===========================================================
[root@hp-xw9400-02 ~]# rpm -q kernel kexec-tools
kernel-2.6.18-226.el5
kexec-tools-1.102pre-108.el5
===========================================================

Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
Kernel 2.6.18-226.el5 on an x86_64

hp-xw9400-02.rhts.eng.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG <===================Hang here!
(In reply to comment #21)
> Cai, also, is there a cciss or hpsa array in this system? IIRC HP has declared
> both of those non-functional with kdump at the moment.

Hi Neil,

There is no cciss or hpsa array in hp-xw9400-02.rhts.eng.bos.redhat.com.
Probably not; I have no idea what's causing this error. A useful test, however, would be to back out my mcp55 patch to see if that fixes the problem, although I have no idea why it would.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2873718

Here's a build for you to try, based on my suggestion in comment #27.
Hi Neil, I cannot find a kernel package at that URL.
Yeah, apparently there are two patches I need to rip out, not just one. I'll resubmit shortly.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2883183 New build
The new kernel works fine.
OK, this is interesting. If the last message we get is REWRITING MCP55 CFG REG, that means we're hanging in machine_crash_shutdown. We should get a second printk indicating the config value we're writing, but we never see it, so it would seem that we're hanging on the pci_read_config_dword call. Since we have already shot down the other cpus, and have not gotten an error about stopping them, we should be the only cpu running, so we're not getting any weird smp behavior here. I don't see how we can hang on a pci bus access.

Prarit, I think you're our pci expert. Any thoughts on how I might hang doing a pci bus access?

Pingtan, can you provide the entire serial log of this event, from boot to failure? Thanks!
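The reasoning above can be applied mechanically to a saved console capture. The following is a minimal Python sketch, assuming the second printk is the "CFG = " line that follows the marker in the 2.6.18-156.el5 kdump log posted earlier in this bug. The function name and the returned labels are hypothetical, chosen only for this illustration.

```python
MARKER = "REWRITING MCP55 CFG REG"  # first printk, seen in every capture here
FOLLOW_UP = "CFG = "                # assumed prefix of the second printk


def classify_crash_console(console_text):
    """Classify a console capture by how far the crash path got.

    Returns "no-marker" if the first printk never appeared,
    "passed-config-read" if the follow-up printk was also emitted, and
    "hung-after-marker" if nothing identifiable follows the marker.
    """
    index = console_text.rfind(MARKER)
    if index < 0:
        return "no-marker"
    tail = console_text[index + len(MARKER):]
    if FOLLOW_UP in tail:
        return "passed-config-read"
    return "hung-after-marker"


hung = "SysRq : Trigger a crashdump\nREWRITING MCP55 CFG REG\n"
booted = hung + "CFG = c1\nLinux version 2.6.18-156.el5\n"
print(classify_crash_console(hung))
print(classify_crash_console(booted))
```

A capture classified as "hung-after-marker" stopped between the two printks, while "passed-config-read" means the crash path got past the config read and any later hang has to be looked for elsewhere in the log.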
Created attachment 459878 [details] console log
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
Kernel 2.6.18-232.el5 on an x86_64

sun-x4440-01.rhts.eng.bos.redhat.com login: SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG <----------- hangs here
pingtan, I'm confused. Above you indicate a hang immediately after the REWRITING line, but the log you attached clearly shows us getting much more output after that. Can you clarify?
I apologize. In comment 36, I wanted to report that I encountered the same problem on another machine, sun-x4440-01.rhts.eng.bos.redhat.com, not the original hp xw9400 machine.
I have encountered this problem on dell-pem805-01.rhts.eng.bos.redhat.com with RHEL5.6 snapshot3:

Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
Kernel 2.6.18-233.el5PAE on an i686

dell-pem805-01.rhts.eng.bos.redhat.com login:
11/29/10 06:48:06 testID:772787 finish:
11/29/10 06:48:26 JobID:34899 Test:/kernel/kdump/config-ssh Response:1
11/29/10 06:48:26 testID:772788 start:
[-- MARK -- Mon Nov 29 06:50:00 2010]
11/29/10 06:50:29 testID:772788 finish:
11/29/10 06:51:23 JobID:34899 Test:/kernel/kdump/config-filter Response:1
11/29/10 06:51:23 testID:772789 start:
11/29/10 06:51:49 testID:772789 finish:
11/29/10 06:52:41 JobID:34899 Test:/kernel/kdump/crash-sysrq-c Response:1
11/29/10 06:52:41 testID:772790 start:
SysRq : Trigger a crashdump
REWRITING MCP55 CFG REG <------------------- hangs here
OK, but that's not the same problem. We're getting much further in those boots, so it's pretty clearly a different problem (or at least a different system).

Ping prarit, any thoughts on comment 34?
prarit ping?
(In reply to comment #34)
> Ok, this is interesting. If the last message we get is REWRITING MCP55 CFG
> REG, that means that we're hanging in machine_crash_shutdown. We should get a
> second printk indicating what the config value we're writing is, but we never
> see it so it would seem that we're hanging on the pci_read_config_dword call.
> Since we have already shot down the other cpus, and not gotten an error about
> stopping them, we should be the only cpu running, so we're not getting any
> weird smp behavior here. I don't see how we can hang on a pci bus access.
>
> Prarit, I think you're our pci expert. Any thoughts on how I might hang doing
> a pci bus access?

Without a PCI bus analyser it would be pretty difficult. Does this work upstream?

P.
I honestly don't know, it's been so long, but I can try.
So, yes, it appears that this works upstream (2.6.35). I'm testing on the latest RHEL5 to make sure that it still fails.
grumble, I think this needs upstream commit 49c2fa08a77a7eefa4cbc73601f64984aceacfa7
Yup, that's it. I'll post the fix in the AM.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-245.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
After installing this kernel on top of x64 RHEL 5.6, I was able to successfully invoke a kdump. After the expected reboot, the resulting vmcore file was created in /var/crash/[timestamp]. (Note that the xw9400 system used has 16 GB RAM, if that matters.)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html