Bug 424511
| Summary: | kdump broken on x86_64 between -45 and -58 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Neil Horman <nhorman> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | low | Priority: | low |
| Version: | 5.2 | CC: | anderson, bmaly, dzickus, i-kitayama, qcai |
| Target Milestone: | rc | Keywords: | Regression |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | RHBA-2008-0314 | Doc Type: | Bug Fix |
| Last Closed: | 2008-05-21 15:04:10 UTC | Bug Blocks: | 223925 |

Description
Neil Horman
2007-12-14 00:08:38 UTC
kernel-2.6.18-52 + the vmcoreinfo patch fails. Next try will be 2.6.18-49 + the vmcoreinfo patch.

kernel-2.6.18-49 + the vmcoreinfo patch fails. Next try will be 2.6.18-47 + the vmcoreinfo patch.

kernel-2.6.18-47 + the vmcoreinfo patch fails. So either:
a) there is something unique about my branch that causes this to work there, or
b) a patch in -45 or -46 causes this patch to stop working.

I've done a bisect all the way back to -45 and discovered something interesting. The patch that was incorporated into the kernel in -58 doesn't match the one I did my original work on, which is why it's not working currently. If I use my original patch, it works on -58. I'm not sure if I posted the wrong version of the patch, or if something got corrupted or changed during the submission process. So far all differences appear trivial, but I need to do a more thorough search.

Created attachment 289688
original working patch

So, I've done some digging, and it appears this patch has undergone a good deal of change as it's gone through the submission process. I'm attaching the original version of my patch, which I've confirmed works on my original development branch, and reconfirmed to behave properly on a -57 kernel (the last kernel that does not contain the non-functional variant of the patch). Looking at the mailing list archives:
http://post-office.corp.redhat.com/archives/rhkernel-list/2007-September/msg00397.html
what I submitted is somewhat different. It appears to be a rediff against an updated kernel version, but there may be more to it than that (I apologize, Don, I don't know what happened, or how I wound up with a different patch like that). Further to that, what got integrated into the kernel still looks different from what was posted to the list. Again, it looks on the surface like a simple rediff, but there may be more to it than that. Clearly, somewhere in the process something changed that affected the behavior of the patch.

Don, not sure how you want to handle this.
Should we figure out what went wrong, do you just want to take my original patch version attached above, or shall I repost? Just let me know. Thanks!

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

in 2.6.18-64.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Testing 2.6.18-64.el5, the capture kernel hangs/freezes every time, just after printing:

    checking for hardware changes: [OK]

Since kudzu is S05, it's hanging in S06cpuspeed. When the primary kernel boots, there is an error message printed when that service starts, but the kernel continues:

    checking for hardware changes: [OK]
    FATAL: error inserting acpi_cpufreq (<path-to>/acpi-cpufreq.ko): No such device
    ... (continues) ...

But when running the same kernel as a kdump capture kernel, it freezes the system. Don has a report from the LTP guys that the capture kernel is actually panicking; I don't have a serial console attached, so it looks like a hard freeze.

After disabling the "cpuspeed" service, the stock 2.6.18-64.el5 kernel kdumps OK, with a valid vmcoreinfo note section:

    # readelf -a vmcore
    ELF Header:
      Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
      Class:                             ELF64
      Data:                              2's complement, little endian
      Version:                           1 (current)
      OS/ABI:                            UNIX - System V
      ABI Version:                       0
      Type:                              CORE (Core file)
      Machine:                           Advanced Micro Devices X86-64
      Version:                           0x1
      Entry point address:               0x0
      Start of program headers:          64 (bytes into file)
      Start of section headers:          0 (bytes into file)
      Flags:                             0x0
      Size of this header:               64 (bytes)
      Size of program headers:           56 (bytes)
      Number of program headers:         5
      Size of section headers:           0 (bytes)
      Number of section headers:         0
      Section header string table index: 0

    There are no sections in this file.

    There are no sections in this file.

    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                     0x00000000000006a0 0x00000000000006a0         0
      LOAD           0x00000000000007f8 0xffffffff80000000 0x0000000000200000
                     0x00000000004d3000 0x00000000004d3000  RWE    0
      LOAD           0x00000000004d37f8 0xffff810000000000 0x0000000000000000
                     0x00000000000a0000 0x00000000000a0000  RWE    0
      LOAD           0x00000000005737f8 0xffff810000100000 0x0000000000100000
                     0x0000000000f00000 0x0000000000f00000  RWE    0
      LOAD           0x00000000014737f8 0xffff810009000000 0x0000000009000000
                     0x0000000036e8cc00 0x0000000036e8cc00  RWE    0

    There is no dynamic section in this file.

    There are no relocations in this file.

    There are no unwind sections in this file.

    No version information found in this file.

    Notes at offset 0x00000158 with length 0x000006a0:
      Owner         Data size       Description
      CORE          0x00000150      NT_PRSTATUS (prstatus structure)
      CORE          0x00000150      NT_PRSTATUS (prstatus structure)
      VMCOREINFO    0x000003c0      Unknown note type: (0x00000000)

and the crash utility runs fine:

    # crash vm*

    crash 4.0-4.12d
    Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.

    GNU gdb 6.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...

          KERNEL: vmlinux
        DUMPFILE: vmcore
            CPUS: 2
            DATE: Wed Jan 9 07:08:54 2008
          UPTIME: 00:07:45
    LOAD AVERAGE: 0.00, 0.07, 0.06
           TASKS: 100
        NODENAME: dhcp83-53.boston.redhat.com
         RELEASE: 2.6.18-64.el5
         VERSION: #1 SMP Mon Jan 7 18:03:43 EST 2008
         MACHINE: x86_64 (2793 Mhz)
          MEMORY: 1 GB
           PANIC: "SysRq : Trigger a crashdump"
             PID: 2609
         COMMAND: "bash"
            TASK: ffff810031864080 [THREAD_INFO: ffff81003218c000]
             CPU: 0
           STATE: TASK_RUNNING (SYSRQ)

    crash>

I'll test the dom0 kdump next...

Yeah, Don just emailed me about the cpufreq thing. It's going to be a separate issue from this.

(In reply to comment #14)
> Yeah, Don just emailed me about the cpufreq thing, Its going to be a
> separate issue from this.

Yep, it's always something, isn't it? Anyway, I tested a dom0 kdump (without the cpuspeed service), and initially it looked OK.
But unlike x86, where the dom0 VMCOREINFO notes section seems to be there but is corrupt (BZ #423731: i386 dom0 kdump vmcore file created with bogus notes section), this x86_64 dom0 kdump has no VMCOREINFO notes section at all:

    # readelf -a vmcore
    ELF Header:
      Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
      Class:                             ELF64
      Data:                              2's complement, little endian
      Version:                           1 (current)
      OS/ABI:                            UNIX - System V
      ABI Version:                       0
      Type:                              CORE (Core file)
      Machine:                           Advanced Micro Devices X86-64
      Version:                           0x1
      Entry point address:               0x0
      Start of program headers:          64 (bytes into file)
      Start of section headers:          0 (bytes into file)
      Flags:                             0x0
      Size of this header:               64 (bytes)
      Size of program headers:           56 (bytes)
      Number of program headers:         4
      Size of section headers:           0 (bytes)
      Number of section headers:         0
      Section header string table index: 0

    There are no sections in this file.

    There are no sections in this file.

    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      NOTE           0x0000000000000120 0x0000000000000000 0x0000000000000000
                     0x0000000000000380 0x0000000000000380         0
      LOAD           0x00000000000004a0 0xffff810000000000 0x0000000000000000
                     0x00000000000a0000 0x00000000000a0000  RWE    0
      LOAD           0x00000000000a04a0 0xffff810000100000 0x0000000000100000
                     0x0000000000f00000 0x0000000000f00000  RWE    0
      LOAD           0x0000000000fa04a0 0xffff810009000000 0x0000000009000000
                     0x0000000036e8c000 0x0000000036e8c000  RWE    0

    There is no dynamic section in this file.

    There are no relocations in this file.

    There are no unwind sections in this file.

    No version information found in this file.

    Notes at offset 0x00000120 with length 0x00000380:
      Owner         Data size       Description
      CORE          0x00000150      NT_PRSTATUS (prstatus structure)
      Xen           0x00000020      Unknown note type: (0x01000002)
      CORE          0x00000150      NT_PRSTATUS (prstatus structure)
      Xen           0x00000020      Unknown note type: (0x01000002)
      Xen           0x00000048      Unknown note type: (0x01000001)

and crash runs fine:

    # crash vm*

    crash 4.0-4.12d
    Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.

    GNU gdb 6.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...

          KERNEL: vmlinux
        DUMPFILE: vmcore
            CPUS: 2
            DATE: Wed Jan 9 09:35:04 2008
          UPTIME: 00:08:47
    LOAD AVERAGE: 0.04, 0.15, 0.14
           TASKS: 107
        NODENAME: dhcp83-53.boston.redhat.com
         RELEASE: 2.6.18-64.el5xen
         VERSION: #1 SMP Mon Jan 7 18:18:40 EST 2008
         MACHINE: x86_64 (2793 Mhz)
          MEMORY: 832.5 MB
           PANIC: "SysRq : Trigger a crashdump"
             PID: 2735
         COMMAND: "bash"
            TASK: ffff880033db9040 [THREAD_INFO: ffff880023db8000]
             CPU: 1
           STATE: TASK_RUNNING (SYSRQ)

    crash>

Is that by design?

No, that's not by design. It's probably just a result of the data contained in /sys/kernel/vmcoreinfo being such that kexec recognizes it as bogus. Ken'ichi's patch will just remove /sys/kernel/vmcoreinfo for xen kernels and bring the x86/i386 behavior into alignment, and will produce no VMCOREINFO section for either (since dom0 kernels can't produce the needed info in that file).

(In reply to comment #14)
> Yeah, Don just emailed me about the cpufreq thing, Its going to be a
> separate issue from this.
>

For whatever reason, when loading that "ACPI fall-back" acpi-cpufreq module from the kdump capture kernel, it crashes/hangs the system, whereas on the normal kernel boot, it returns from the modprobe to print the FATAL error message.

    start() {
        if [ ! -f /var/lock/subsys/cpuspeed ]; then
            # Attempt to load scaling_driver if not loaded but it is configured
            for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver; do
                # We want to run the code below only if the
                # wildcard above got no matches.
                [ ! -f "$file" ] || break
                if [ -n "$DRIVER" ]; then
                    /sbin/modprobe "$DRIVER"
                else
                    if [ -d /proc/acpi ]; then
                        # use ACPI as a fallback
                        /sbin/modprobe acpi-cpufreq
                    else
                        # This is a no-ACPI machine. Just exit.
                        return 0
                    fi
                fi
            done

Marking this verified, it looks like kdump has been working for a while now.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
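
For anyone re-verifying a dump, the by-hand check used in the comments above (looking for a VMCOREINFO entry in the notes that readelf prints for a vmcore) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `has_vmcoreinfo_note` is made up for illustration and is not part of kexec-tools or crash, and it assumes the notes listing has the owner name in the first column, as in the readelf output quoted in this bug.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): succeed if the notes listing on
# stdin, e.g. the output of `readelf -n vmcore`, has a VMCOREINFO row.
has_vmcoreinfo_note() {
    # The owner name is the first whitespace-separated field of each row.
    # Exit status 0 means a VMCOREINFO row was seen, 1 means it was not.
    awk '$1 == "VMCOREINFO" { found = 1 } END { exit !found }'
}

# Example usage, assuming a dump file named "vmcore" in the current directory:
#   if readelf -n vmcore | has_vmcoreinfo_note; then
#       echo "VMCOREINFO note present"
#   else
#       echo "no VMCOREINFO note"
#   fi
```

Run against the two dumps described above, this would be expected to succeed for the bare-metal 2.6.18-64.el5 vmcore and fail for the 2.6.18-64.el5xen dom0 vmcore, which lists only CORE and Xen notes.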