Bug 424511

Summary: kdump broken on x86_64 between -45 and -58
Product: Red Hat Enterprise Linux 5 Reporter: Neil Horman <nhorman>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: low Docs Contact:
Priority: low    
Version: 5.2 CC: anderson, bmaly, dzickus, i-kitayama, qcai
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2008-0314 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-05-21 15:04:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 223925    
Attachments:
Description Flags
origional working patch none

Description Neil Horman 2007-12-14 00:08:38 UTC
The vmcoreinfo patch added against bz 253850 works in the -45 kernel, but fails
in the -58 kernel, producing a zero-length /proc/vmcore file.  This is a
regression we need to track down.

Comment 1 Neil Horman 2007-12-14 18:03:44 UTC
kernel-2.6.18-52 + the vmcoreinfo patch fails.  Next try will be 2.6.18-49 +
vmcoreinfo patch

Comment 2 Neil Horman 2007-12-14 18:23:47 UTC
kernel-2.6.18-49 + the vmcoreinfo patch fails.  Next try will be 2.6.18-47 +
vmcoreinfo

Comment 3 Neil Horman 2007-12-14 21:12:59 UTC
kernel-2.6.18-47 + the vmcoreinfo patch fails.  So either:
a) there is something unique about my branch that causes this to work there
b) a patch in -45 or -46 causes this patch to stop working.....

Comment 4 Neil Horman 2007-12-15 03:02:17 UTC
I've done a bisect all the way back to -45, and discovered something
interesting.  The patch that was incorporated into the kernel in -58 doesn't
match what I did my original work on, which is why it's not working currently.
If I use my original patch, it works on -58.  I'm not sure if I posted the wrong
version of the patch, or if something got corrupted or changed during the
submission process.  So far all differences appear trivial, but I need to do a
more thorough search.
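
For reference, the build-by-build narrowing in comments 1 through 4 amounts to
a binary search over kernel build numbers.  A minimal sketch, assuming a
hypothetical is_bad(build) callback that boots 2.6.18-<build> plus the patch
and reports whether /proc/vmcore came out empty; the function and parameter
names are illustrative and do not come from this bug:

```python
from typing import Callable


def first_bad_build(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the lowest build number in (good, bad] for which is_bad() is true.

    Assumes `good` is known to work, `bad` is known to fail, and that every
    build after the first bad one is also bad (the usual bisect assumption).
    """
    if good >= bad:
        raise ValueError("good must be lower than bad")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid      # failure is at mid or earlier
        else:
            good = mid     # failure was introduced after mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for a real boot-and-crash test: pretend the
    # breakage first appears in build 46.
    print(first_bad_build(45, 58, lambda build: build >= 46))
```

A search like this is only meaningful when the same patch is applied at every
step, which is exactly the assumption that turned out not to hold here.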

Comment 5 Neil Horman 2007-12-15 14:23:13 UTC
Created attachment 289688 [details]
origional working patch

So, I've done some digging, and it appears this patch has undergone a good deal
of change as it's gone through the submission process.  I'm attaching the
original version of my patch, which I've confirmed works on my original
development branch, and reconfirmed to behave properly on a -57 kernel (the
last kernel to not contain the non-functional variant of the patch).  Looking
at the mailing list archives:
http://post-office.corp.redhat.com/archives/rhkernel-list/2007-September/msg00397.html

What I submitted is somewhat different.  It appears to be a rediff against an
updated kernel version, but there may be more to it than that (I apologize,
Don, I don't know what happened, or how I wound up with a different patch like
that).

Further to that, what got integrated in the kernel still looks different from
what was posted to the list.  Again, on the surface it looks like a simple
rediff, but there may be more to it than that.

Clearly somewhere in the process something changed that affected the behavior
of the patch.  Don, I'm not sure how you want to handle this.  Should we figure
out what went wrong, do you just want to take my original patch version
attached above, or shall I repost?  Just let me know.  Thanks!
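
One way to tell a plain rediff from a real change is to compare only the lines
each patch adds or removes, ignoring hunk offsets and context.  A minimal
sketch using Python's difflib; the helper names are illustrative, and it
assumes both patches are unified diffs:

```python
import difflib
from typing import List


def changed_lines(patch_text: str) -> List[str]:
    """Keep only the added/removed lines of a unified diff.

    File headers (---/+++), hunk headers (@@) and context lines are dropped,
    so two patches that differ only by a rediff compare as equal.
    Limitation: a removed line whose own text starts with "-- " looks like a
    file header and is skipped as well.
    """
    kept = []
    for line in patch_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            continue
        if line.startswith(("+", "-")):
            kept.append(line)
    return kept


def patch_delta(old_patch: str, new_patch: str) -> List[str]:
    """Unified diff between the changed lines of two patches (empty if equal)."""
    return list(difflib.unified_diff(
        changed_lines(old_patch), changed_lines(new_patch),
        fromfile="original", tofile="integrated", lineterm=""))
```

An empty result from patch_delta() suggests the two versions differ only in
offsets and context; anything else points at the lines worth reading closely.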

Comment 8 RHEL Program Management 2008-01-02 20:45:55 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 RHEL Program Management 2008-01-02 20:46:11 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 11 Don Zickus 2008-01-08 20:46:59 UTC
Included in 2.6.18-64.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 Dave Anderson 2008-01-09 16:46:53 UTC
Testing 2.6.18-64.el5, the capture kernel hangs/freezes every time,
just after printing:

checking for hardware changes:  [OK]

Since kudzu is S05, it's hanging in S06cpuspeed.

When the primary kernel boots, there is an error message printed when that
service starts, but the kernel continues:

checking for hardware changes:  [OK]
FATAL: error inserting acpi_cpufreq (<path-to>/acpi-cpufreq.ko): No such device  
... (continues) ...

But when running the same kernel as a kdump capture kernel, it freezes the
system.  Don has a report from the LTP guys that the capture kernel is
actually panicking; I don't have a serial console attached, so it looks
like a hard freeze.




Comment 13 Dave Anderson 2008-01-09 17:00:01 UTC
After disabling the "cpuspeed" service, the stock 2.6.18-64.el5
kernel kdumps OK, with a valid vmcoreinfo note section:
  
  # readelf -a vmcore
  ELF Header:
    Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
    Class:                             ELF64
    Data:                              2's complement, little endian
    Version:                           1 (current)
    OS/ABI:                            UNIX - System V
    ABI Version:                       0
    Type:                              CORE (Core file)
    Machine:                           Advanced Micro Devices X86-64
    Version:                           0x1
    Entry point address:               0x0
    Start of program headers:          64 (bytes into file)
    Start of section headers:          0 (bytes into file)
    Flags:                             0x0
    Size of this header:               64 (bytes)
    Size of program headers:           56 (bytes)
    Number of program headers:         5
    Size of section headers:           0 (bytes)
    Number of section headers:         0
    Section header string table index: 0
  
  There are no sections in this file.
  
  There are no sections in this file.
  
  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
    NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                   0x00000000000006a0 0x00000000000006a0         0
    LOAD           0x00000000000007f8 0xffffffff80000000 0x0000000000200000
                   0x00000000004d3000 0x00000000004d3000  RWE    0
    LOAD           0x00000000004d37f8 0xffff810000000000 0x0000000000000000
                   0x00000000000a0000 0x00000000000a0000  RWE    0
    LOAD           0x00000000005737f8 0xffff810000100000 0x0000000000100000
                   0x0000000000f00000 0x0000000000f00000  RWE    0
    LOAD           0x00000000014737f8 0xffff810009000000 0x0000000009000000
                   0x0000000036e8cc00 0x0000000036e8cc00  RWE    0
  
  There is no dynamic section in this file.
  
  There are no relocations in this file.
  
  There are no unwind sections in this file.
  
  No version information found in this file.
  
  Notes at offset 0x00000158 with length 0x000006a0:
    Owner         Data size       Description
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
    VMCOREINFO            0x000003c0      Unknown note type: (0x00000000)

and the crash utility runs fine:

  # crash vm*
  
  crash 4.0-4.12d
  Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  GNU gdb 6.1
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
        KERNEL: vmlinux                           
      DUMPFILE: vmcore
          CPUS: 2
          DATE: Wed Jan  9 07:08:54 2008
        UPTIME: 00:07:45
  LOAD AVERAGE: 0.00, 0.07, 0.06
         TASKS: 100
      NODENAME: dhcp83-53.boston.redhat.com
       RELEASE: 2.6.18-64.el5
       VERSION: #1 SMP Mon Jan 7 18:03:43 EST 2008
       MACHINE: x86_64  (2793 Mhz)
        MEMORY: 1 GB
         PANIC: "SysRq : Trigger a crashdump"
           PID: 2609
       COMMAND: "bash"
          TASK: ffff810031864080  [THREAD_INFO: ffff81003218c000]
           CPU: 0
         STATE: TASK_RUNNING (SYSRQ)
  
  crash> 
  
I'll test the dom0 kdump next...
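
The check done by eye in the readelf output above (is there a VMCOREINFO note
or not?) can also be scripted.  A minimal sketch that walks the PT_NOTE
segments of a little-endian ELF64 image using the standard ELF64 header and
note layouts; the function names are illustrative, and it handles only the
ELF64 little-endian case seen in this dump:

```python
import struct
from typing import List, Tuple

PT_NOTE = 4  # program header type for note segments


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are 4-byte padded)."""
    return (n + 3) & ~3


def elf64_notes(data: bytes) -> List[Tuple[str, int, int]]:
    """Return (owner, type, desc_size) for each note in an ELF64 LE image."""
    if len(data) < 64 or data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    e_phoff, = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    notes = []
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from("<I", data, base)
        if p_type != PT_NOTE:
            continue
        p_offset, = struct.unpack_from("<Q", data, base + 8)
        p_filesz, = struct.unpack_from("<Q", data, base + 32)
        pos, end = p_offset, p_offset + p_filesz
        while pos + 12 <= end:
            namesz, descsz, ntype = struct.unpack_from("<III", data, pos)
            name = data[pos + 12:pos + 12 + namesz].rstrip(b"\0")
            notes.append((name.decode("ascii", "replace"), ntype, descsz))
            pos += 12 + _align4(namesz) + _align4(descsz)
    return notes


def has_vmcoreinfo(data: bytes) -> bool:
    """True if any note is owned by VMCOREINFO."""
    return any(owner == "VMCOREINFO" for owner, _, _ in elf64_notes(data))
```

There is no need to read the whole vmcore: in the dump above the note segment
ends at 0x158 + 0x6a0 = 0x7f8, so passing the first 64 KiB of the file to
has_vmcoreinfo() would be enough.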


Comment 14 Neil Horman 2008-01-09 18:45:23 UTC
Yeah, Don just emailed me about the cpufreq thing.  It's going to be a separate
issue from this.


Comment 15 Dave Anderson 2008-01-09 19:35:06 UTC
(In reply to comment #14)
> Yeah, Don just emailed me about the cpufreq thing.  It's going to be a
> separate issue from this.

Yep, it's always something isn't it?

Anyway, I tested a dom0 kdump (without the cpuspeed service), and initially
it looked OK.  But -- unlike the x86, where the dom0 VMCOREINFO notes section
seems to be there but is corrupt (BZ #423731: i386 dom0 kdump vmcore file
created with bogus notes section) -- this x86_64 dom0 kdump has no VMCOREINFO
notes section at all:

  # readelf -a vmcore
  ELF Header:
    Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
    Class:                             ELF64
    Data:                              2's complement, little endian
    Version:                           1 (current)
    OS/ABI:                            UNIX - System V
    ABI Version:                       0
    Type:                              CORE (Core file)
    Machine:                           Advanced Micro Devices X86-64
    Version:                           0x1
    Entry point address:               0x0
    Start of program headers:          64 (bytes into file)
    Start of section headers:          0 (bytes into file)
    Flags:                             0x0
    Size of this header:               64 (bytes)
    Size of program headers:           56 (bytes)
    Number of program headers:         4
    Size of section headers:           0 (bytes)
    Number of section headers:         0
    Section header string table index: 0
  
  There are no sections in this file.
  
  There are no sections in this file.
  
  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
    NOTE           0x0000000000000120 0x0000000000000000 0x0000000000000000
                   0x0000000000000380 0x0000000000000380         0
    LOAD           0x00000000000004a0 0xffff810000000000 0x0000000000000000
                   0x00000000000a0000 0x00000000000a0000  RWE    0
    LOAD           0x00000000000a04a0 0xffff810000100000 0x0000000000100000
                   0x0000000000f00000 0x0000000000f00000  RWE    0
    LOAD           0x0000000000fa04a0 0xffff810009000000 0x0000000009000000
                   0x0000000036e8c000 0x0000000036e8c000  RWE    0
  
  There is no dynamic section in this file.
  
  There are no relocations in this file.
  
  There are no unwind sections in this file.
  
  No version information found in this file.
  
  Notes at offset 0x00000120 with length 0x00000380:
    Owner         Data size       Description
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
    Xen           0x00000020      Unknown note type: (0x01000002)
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
    Xen           0x00000020      Unknown note type: (0x01000002)
    Xen           0x00000048      Unknown note type: (0x01000001)
  # 
  
and crash runs fine: 
  
  # crash vm*
  
  crash 4.0-4.12d
  Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  GNU gdb 6.1
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
        KERNEL: vmlinux                           
      DUMPFILE: vmcore
          CPUS: 2
          DATE: Wed Jan  9 09:35:04 2008
        UPTIME: 00:08:47
  LOAD AVERAGE: 0.04, 0.15, 0.14
         TASKS: 107
      NODENAME: dhcp83-53.boston.redhat.com
       RELEASE: 2.6.18-64.el5xen
       VERSION: #1 SMP Mon Jan 7 18:18:40 EST 2008
       MACHINE: x86_64  (2793 Mhz)
        MEMORY: 832.5 MB
         PANIC: "SysRq : Trigger a crashdump"
           PID: 2735
       COMMAND: "bash"
          TASK: ffff880033db9040  [THREAD_INFO: ffff880023db8000]
           CPU: 1
         STATE: TASK_RUNNING (SYSRQ)
  
  crash> 

Is that by design?


Comment 16 Neil Horman 2008-01-09 20:14:34 UTC
No, that's not by design; it's probably just a result of the data contained in
/sys/kernel/vmcoreinfo being such that kexec recognizes it as being bogus.
Ken'ichi's patch will just remove /sys/kernel/vmcoreinfo for xen kernels and
bring the x86/i386 behavior into alignment, and will produce no VMCOREINFO
section for either (since dom0 kernels can't produce the needed info in that file).
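
For context, a sanity check on the contents of /sys/kernel/vmcoreinfo could
look like the sketch below.  It assumes the file holds two hexadecimal fields,
the physical address and the size of the note, as it does in upstream kernels;
the RHEL 5 backport may differ.  The rejection rule is an illustrative
assumption, not the test kexec actually applies, and the function name is
made up for this example:

```python
from typing import Optional, Tuple


def parse_vmcoreinfo_sysfs(text: str) -> Optional[Tuple[int, int]]:
    """Parse '<paddr> <size>' (both hex); return None if it looks unusable."""
    fields = text.split()
    if len(fields) != 2:
        return None
    try:
        paddr, size = (int(field, 16) for field in fields)
    except ValueError:
        return None
    # Assumed plausibility rule: a zero (or negative) address or size cannot
    # describe a real note.
    if paddr <= 0 or size <= 0:
        return None
    return paddr, size
```

On a running system the input would be the text read from
/sys/kernel/vmcoreinfo; a None result would mean no VMCOREINFO note should be
expected in the resulting vmcore.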

Comment 17 Dave Anderson 2008-01-10 15:10:51 UTC
(In reply to comment #14)
> Yeah, Don just emailed me about the cpufreq thing.  It's going to be a
> separate issue from this.
> 

For whatever reason, when loading that "ACPI fall-back" acpi-cpufreq module
from the kdump capture kernel, it crashes/hangs the system, whereas on
the normal kernel boot, it returns from the modprobe to print the FATAL
error message.

start() {
        if [ ! -f /var/lock/subsys/cpuspeed ]; then
                # Attempt to load scaling_driver if not loaded but it is configured
                for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver; do
                        # We want to run the code below only if the
                        # wildcard above got no matches.
                        [ ! -f "$file" ] || break

                        if [ -n "$DRIVER" ]; then
                                /sbin/modprobe "$DRIVER"
                        else
                                if [ -d /proc/acpi ]; then
                                        # use ACPI as a fallback
                                        /sbin/modprobe acpi-cpufreq
                                else
                                        # This is a no-ACPI machine. Just exit.
                                        return 0
                                fi
                        fi
                done
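
The decision made by the quoted part of start() can be restated as a small
pure function, which makes it easier to see why a capture kernel with no
cpufreq driver loaded and no $DRIVER configured ends up at the acpi-cpufreq
modprobe.  A minimal sketch; the function and parameter names are illustrative,
it only reports which module would be tried rather than loading anything, and
it leaves out the /var/lock/subsys/cpuspeed guard:

```python
import glob
import os
from typing import Optional


def module_to_load(driver: str = "",
                   cpu_root: str = "/sys/devices/system/cpu",
                   acpi_dir: str = "/proc/acpi") -> Optional[str]:
    """Mirror the quoted branch: which module would start() try to modprobe?"""
    pattern = os.path.join(cpu_root, "cpu*", "cpufreq", "scaling_driver")
    if any(os.path.isfile(path) for path in glob.glob(pattern)):
        return None            # a scaling driver is already loaded
    if driver:
        return driver          # $DRIVER configured explicitly
    if os.path.isdir(acpi_dir):
        return "acpi-cpufreq"  # ACPI fallback, the case discussed above
    return None                # no-ACPI machine: nothing to load
```

Calling module_to_load() with the defaults on the affected machine would be
expected to return "acpi-cpufreq" in both the primary and the capture kernel;
the difference described in this comment is in what the modprobe then does.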



Comment 19 Mike Gahagan 2008-04-29 17:06:43 UTC
Marking this verified; it looks like kdump has been working for a while now.


Comment 21 errata-xmlrpc 2008-05-21 15:04:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html