Bug 243118

Summary: kexec-tools package needs update to work with xen
Product: Red Hat Enterprise Linux 5 Reporter: Gerd Hoffmann <kraxel>
Component: kexec-tools Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact:
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.0 CC: anderson, bstein, ddomingo, djuran, dzickus, hbrock, jan.kratochvil, jarod, jfeeney, nhorman, nobody+mkumar, tao, vgoyal, xen-maint
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0548 Doc Type: Bug Fix
Last Closed: 2007-11-07 18:03:00 UTC Type: ---
Bug Depends On: 212843, 244301    
Bug Blocks:    
Attachments:
Description                      Flags
patch to enable xen crashdumps   none
new version of patch             none
new patch                        none
additional bits for xen support  none
64bit bits                       none

Comment 1 Gerd Hoffmann 2007-06-07 12:52:11 UTC
Current rhel-5 kexec-tools don't work for x86_64/xen.

Suggested kexec-tools package is the kexec-tools-testing tree at kernel.org.

I think the important patch is this one:

http://git.kernel.org/?p=linux/kernel/git/horms/kexec-tools-testing.git;a=commitdiff;h=c41620b1d2717a6eb1969ad03758a1b707ba55ab

There are some dependencies on other patches, though; just cherry-picking that
single patch doesn't work :-(

Comment 2 Gerd Hoffmann 2007-06-07 13:04:05 UTC
One more thing: in the xen case the crashkernel= command-line option is passed
to the xen kernel, not the linux kernel, and thus it isn't visible in
/proc/cmdline.  The sanity check in /etc/init.d/kdump fails because of that.
Suggested fix: look for a sane crash kernel region in /proc/iomem instead, like
/sbin/kexec does.
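
A minimal sketch of the suggested check, in Python rather than the initscript's
shell. It assumes the reserved region appears in /proc/iomem as a line labelled
"Crash kernel" with a hexadecimal start-end range; the function names and the
sample text are hypothetical and are not taken from /sbin/kexec or the kdump
script.

```python
import re

# Matches lines such as "  02000000-07ffffff : Crash kernel" (hex addresses).
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def find_crash_kernel_region(iomem_text):
    """Return (start, end) of the 'Crash kernel' region, or None if absent."""
    for line in iomem_text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m and m.group(3) == "Crash kernel":
            return int(m.group(1), 16), int(m.group(2), 16)
    return None


def crash_region_is_sane(iomem_text):
    """Hypothetical sanity check: the region exists and is not empty."""
    region = find_crash_kernel_region(iomem_text)
    return region is not None and region[1] > region[0]


# Hypothetical /proc/iomem excerpt, for illustration only.
SAMPLE_IOMEM = """\
00000000-0009ffff : System RAM
00100000-7fffffff : System RAM
  02000000-07ffffff : Crash kernel
"""

if __name__ == "__main__":
    print(find_crash_kernel_region(SAMPLE_IOMEM))
    print(crash_region_is_sane(SAMPLE_IOMEM))
```

On a live system the text would come from reading /proc/iomem; the check does
not depend on /proc/cmdline at all, which is the point of the suggestion.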

Comment 3 Neil Horman 2007-06-07 19:27:07 UTC
So we pretty clearly need the patch above in comment #1 for this to work, but
the script errors described in the initial comment should have been fixed by
now.  If you would please test with the latest kexec-tools package
(kexec-tools-1.101-173.el5) to confirm that those script errors are resolved,
I'll pull in the additional kexec patch referenced above.  Thanks!

Comment 4 Neil Horman 2007-06-07 20:26:55 UTC
Created attachment 156504 [details]
patch to enable xen crashdumps

I actually take back what I said before.  Looking at the upstream kexec-tools,
I think there is much more to xen support than the referenced patch.  There is
quite a bit of infrastructure in place upstream for this, which can be
backported, but it's not quite as simple as one patch.  Also, based on the
initial comment, our xen (dom0) kernels have no support in them for kexec yet
(as evidenced by the lack of /sys/kernel/kexec_crash_loaded).  Until the kernel
inherits kexec support from upstream, I'm not sure there's much value in
incorporating this, as there will be no way to test our kexec with our kernel.
I'd say at this point, let's test with this patch in place to verify that it
doesn't cause any regressions in our xen kernel as it is, verify that the
latest kdump initscript doesn't fail in the way described, and then wait
until our kernel gets kexec support in xen to square away any remaining rough
edges from this backport.

Comment 5 Neil Horman 2007-06-07 20:40:43 UTC
Ok, I'm still catching up on this bug; I see where Gerd has posted the upstream
xen kdump patches.  To be honest, I'm not thrilled with us taking these patches
so close to the 5.1 submit deadline (we should incorporate them right after we
release to maximize testing).  But if it's a 5.1 requirement I don't know what
else we can do.  Try the patch I uploaded and see if it does what we need it to
do.  If it misses the mark, I'll load a xen kernel on my debug system in the AM
and fish the rest of the xen bits out of upstream.

Comment 6 Gerd Hoffmann 2007-06-08 07:38:48 UTC
Doesn't build for me.  Patch incomplete maybe?
It's a fresh distcvs checkout (173) plus comment 4 patch.

gcc -Wall -g -fno-strict-aliasing -I./include -I./util_lib/include
-DVERSION='"1.101"' -DRELEASE_DATE='"15 February 2005"' -DPACKAGE_NAME=\"\"
-DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\"
-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1
-DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1
-DHAVE_ZLIB_H=1   -Ikexec/arch/x86_64/include -o
/home/kraxel/BUILD/kexec-tools-1.101/objdir-x86_64-redhat-linux-gnu/kexec/crashdump-xen.o
-c kexec/crashdump-xen.c
kexec/crashdump-xen.c:35: warning: ‘struct crash_elf_info’ declared inside
parameter list
kexec/crashdump-xen.c:35: warning: its scope is only this definition or
declaration, which is probably not what you want
kexec/crashdump-xen.c: In function ‘xen_architecture’:
kexec/crashdump-xen.c:37: error: dereferencing pointer to incomplete type
kexec/crashdump-xen.c: In function ‘xen_get_nr_phys_cpus’:
kexec/crashdump-xen.c:106: warning: statement with no effect
kexec/crashdump-xen.c:92: warning: unused variable ‘match’
make: ***
[/home/kraxel/BUILD/kexec-tools-1.101/objdir-x86_64-redhat-linux-gnu/kexec/crashdump-xen.o]
Error 1


Comment 7 Neil Horman 2007-06-08 10:49:16 UTC
Created attachment 156565 [details]
new version of patch

Sorry, forgot to back up one of the files that needed to be changed, so it
didn't get picked up in the diff.  New patch attached.

Comment 8 Gerd Hoffmann 2007-06-08 11:30:54 UTC
This patch is incomplete too.  The xen infrastructure is there now and it
builds, but it fails to load the crash kernel because the important chunk
linked in comment #1 isn't included.

Comment 9 Neil Horman 2007-06-08 12:13:54 UTC
Created attachment 156570 [details]
new patch

dang, my bad.  Here it is, fixed.

Comment 10 Gerd Hoffmann 2007-06-08 15:22:05 UTC
now it works, thanks.

Comment 11 Neil Horman 2007-06-08 18:52:25 UTC
Ok, then we just need to get this pm and qa acked for me to commit.

Comment 12 RHEL Program Management 2007-06-08 19:04:10 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Neil Horman 2007-06-08 21:07:44 UTC
Thanks James!  Fixed in -174.el5

Comment 15 Dave Anderson 2007-06-12 14:57:28 UTC
Can you clarify the exact setup procedure for xen kernels?

I'm running the 2.6.18-20.el5.kraxel.6xen and kexec-tools-1.101-174.el5.
I've modified the /etc/sysconfig/kdump file to use the stock kernel as the
kdump kernel:

  KDUMP_KERNELVER="2.6.18-20.el5.kraxel.6"

But on both x86 and x86_64, I get: 

  kdump: Cannot load /boot/vmlinuz-2.6.18-20.el5.kraxel.6
  kdump: kexec: failed to load kdump kernel
  kdump: failed to start up
 
On both machines, the /boot/vmlinuz-2.6.18-20.el5.kraxel.6 and
/boot/initrd-2.6.18-20.el5.kraxel.6kdump.img files exist,
and I'm setting crashkernel=96M@16M (which works for the
non-xen kernels).

Comment 16 Dave Anderson 2007-06-12 15:53:10 UTC
> and I'm setting crashkernel=96M@16M (which works for the
> non-xen kernels)

BTW, I left the crashkernel=96M@16M on the vmlinuz line in grub,
which I now see is the wrong thing to do, since the kernel logs show
the message from parse_cmdline_early():

  "Ignoring crashkernel command line, parameter will be supplied by xen"

But moving the crashkernel=96M@16M parameter to the "/xen.gz-2.6.18-20.el5.kraxel.6"
kernel line in grub.conf, at least on x86_64, results in what Jarod reports
in BZ #243880 "[RHEL5.1 Xen Kdump] Panic: unable to reserve kdump memory":

  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243880



Comment 17 Stephen Tweedie 2007-06-12 15:54:51 UTC
You'll need to load @32M, not @16M, due to the memory layout of the hypervisor.

Beyond that, "sh -x /etc/rc.d/init.d/kdump start" is the easiest way to find out
what's going wrong with the kdump script.
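
For reference, the size@offset form of the crashkernel= value used in this
thread can be decoded as sketched below. This is not the kernel's or the
hypervisor's parser: parse_crashkernel and region_bounds are hypothetical
helper names, and only the plain size@offset form with optional K/M/G suffixes
is handled.

```python
import re

# Binary multipliers for the optional size suffixes.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_CRASHKERNEL = re.compile(r"^(\d+)([KMG]?)@(\d+)([KMG]?)$", re.IGNORECASE)


def parse_crashkernel(value):
    """Parse 'size@offset' (e.g. '96M@32M') into (size_bytes, offset_bytes)."""
    m = _CRASHKERNEL.match(value)
    if m is None:
        raise ValueError("unsupported crashkernel value: %r" % (value,))
    size = int(m.group(1)) * _UNITS[m.group(2).upper()]
    offset = int(m.group(3)) * _UNITS[m.group(4).upper()]
    return size, offset


def region_bounds(size, offset):
    """Return the inclusive (start, end) physical range of the reservation."""
    return offset, offset + size - 1


if __name__ == "__main__":
    for value in ("96M@16M", "96M@32M"):
        start, end = region_bounds(*parse_crashkernel(value))
        print("%s -> %#010x-%#010x" % (value, start, end))
```

With 96M@32M the reservation spans 0x02000000 through 0x07ffffff, while 96M@16M
would start at 0x01000000.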

Comment 18 Dave Anderson 2007-06-12 16:13:40 UTC
Ok, excellent -- thanks, that loads OK on x86_64.  

So just to be clear, the state of the kexec-tools now is that the
crashkernel= line needs to be placed on *both* the xen.gz and vmlinuz
lines, because the init.d/kdump script parses /proc/cmdline to get
the parameters.



Comment 19 Jarod Wilson 2007-06-12 18:16:15 UTC
(In reply to comment #18)
> Ok, excellent -- thanks, that loads OK on x86_64.

Likewise here.

> So just to be clear, the state of the kexec-tools now is that the
> crashkernel= line needs to be placed on *both* the xen.gz and vmlinuz
> lines, because the init.d/kdump script parses /proc/cmdline to get
> the parameters.

I believe there was a suggestion to have the kdump initscript parse /proc/iomem
instead; not sure if that has been investigated just yet. Neil?

Comment 20 Stephen Tweedie 2007-06-12 18:22:58 UTC
Right, parsing /proc/iomem would be far superior: it's just asking for trouble
if we expect to parse both the xen and vmlinuz lines for this info.

Comment 21 Dave Anderson 2007-06-12 18:49:07 UTC
Another thing -- has anybody actually been able to analyze the resultant
xen vmcores?

I was successful in *creating* an x86 xen vmcore:

# strings vmcore | grep "Linux ver"
Linux version 2.6.18-20.el5.kraxel.6xen (root.boston.redhat.com)
(gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP Fri Jun 8 15:43:18 EDT 2007
#

But the vmcore appears to be missing the NT_PRSTATUS notes
and the xen-specific XEN_ELFNOTE_CRASH_INFO and XEN_ELFNOTE_CRASH_REGS
notes:
 
# readelf -a vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  LOAD           0x0000000000000158 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000000a0158 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x0000000001fa0158 0x00000000c8000000 0x0000000008000000
                 0x0000000030000000 0x0000000030000000  RWE    0
  LOAD           0x0000000031fa0158 0xffffffffffffffff 0x0000000038000000
                 0x0000000007ee0000 0x0000000007ee0000  RWE    0

There is no dynamic section in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.
#

... and so the crash utility cannot handle it, i.e., it doesn't even
recognize it as a xen kdump dumpfile.  
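
The symptom in the readelf output above is a NOTE program header whose FileSiz
is zero. A rough sketch of how that could be detected programmatically is
below; it assumes a little-endian ELF64 image (as both dumps in this comment
are), and note_segments, has_empty_note_segment and build_sample are
hypothetical names, not part of the crash utility or kexec-tools.

```python
import struct

# ELF64 file header and program header layouts (little-endian).
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # 64 bytes
ELF64_PHDR = struct.Struct("<IIQQQQQQ")          # 56 bytes
PT_NOTE = 4


def note_segments(data):
    """Return [(file_offset, file_size), ...] for every PT_NOTE segment."""
    (ident, _type, _machine, _version, _entry, phoff, _shoff, _flags,
     _ehsize, phentsize, phnum, _shentsize, _shnum,
     _shstrndx) = ELF64_EHDR.unpack_from(data, 0)
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    segments = []
    for i in range(phnum):
        (p_type, _p_flags, p_offset, _vaddr, _paddr, p_filesz, _memsz,
         _align) = ELF64_PHDR.unpack_from(data, phoff + i * phentsize)
        if p_type == PT_NOTE:
            segments.append((p_offset, p_filesz))
    return segments


def has_empty_note_segment(data):
    """True if any PT_NOTE segment carries no bytes in the file."""
    return any(filesz == 0 for _offset, filesz in note_segments(data))


def build_sample(note_filesz):
    """Build a synthetic header with a single PT_NOTE entry (illustration only)."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
    ehdr = ELF64_EHDR.pack(ident, 4, 3, 1, 0, 64, 0, 0, 64, 56, 1, 0, 0, 0)
    phdr = ELF64_PHDR.pack(PT_NOTE, 0, 0x158, 0, 0, note_filesz, note_filesz, 0)
    return ehdr + phdr


if __name__ == "__main__":
    print(has_empty_note_segment(build_sample(0)))
    print(has_empty_note_segment(build_sample(0x1bc)))
```

Only the headers are inspected, so reading the first few kilobytes of a vmcore
would be enough input for this check.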

Here's the output from a sample x86 xen vmcore that I used for development,
which was given to me by Magnus Damm.  Note the extra xen sections at the
end of the readelf output:

# readelf -a \
vmcore-12733-i386-kexec-tools-testing-b5c22baac1a632363a91da666886bb0ae285bd67
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x00000000000001bc 0x00000000000001bc         0
  LOAD           0x0000000000000314 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000000a0314 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x0000000001fa0314 0x00000000c6000000 0x0000000006000000
                 0x0000000032000000 0x0000000032000000  RWE    0
  LOAD           0x0000000033fa0314 0xffffffffffffffff 0x0000000038000000
                 0x00000000077f0000 0x00000000077f0000  RWE    0

There is no dynamic segment in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.

Notes at offset 0x00000158 with length 0x000001bc:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
  Xen           0x00000024      Unknown note type: (0x01000001)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
#

There's an NT_PRSTATUS note for each of the 2 cpus, a single
XEN_ELFNOTE_CRASH_INFO (0x01000001) note, and two XEN_ELFNOTE_CRASH_REGS
(0x01000002) notes, also 1 per cpu.

The XEN_ELFNOTE_CRASH_INFO Note is what's crucial, as it contains
the key to translating the dom0 pfns into the physical memory
described by the PT_LOAD segments.
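
To make the note layout concrete, here is a small sketch that walks a note
segment and looks for the XEN_ELFNOTE_CRASH_INFO entry. It assumes the usual
ELF note encoding (three 32-bit words for name size, descriptor size and type,
with name and descriptor each padded to 4 bytes) and uses the two note type
values quoted above. iter_notes, make_note and has_xen_crash_info are
hypothetical helpers, and the sample blob is synthetic; it only mirrors the
sizes shown by readelf.

```python
import struct

NHDR = struct.Struct("<III")  # namesz, descsz, type
NT_PRSTATUS = 1
XEN_ELFNOTE_CRASH_INFO = 0x01000001
XEN_ELFNOTE_CRASH_REGS = 0x01000002


def _align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def iter_notes(blob):
    """Yield (owner_name, note_type, descriptor_bytes) for each note."""
    pos = 0
    while pos + NHDR.size <= len(blob):
        namesz, descsz, ntype = NHDR.unpack_from(blob, pos)
        pos += NHDR.size
        name = blob[pos:pos + namesz].rstrip(b"\0").decode("ascii", "replace")
        pos += _align4(namesz)
        desc = blob[pos:pos + descsz]
        pos += _align4(descsz)
        yield name, ntype, desc


def has_xen_crash_info(blob):
    """True if the segment holds a Xen-owned XEN_ELFNOTE_CRASH_INFO note."""
    return any(name == "Xen" and ntype == XEN_ELFNOTE_CRASH_INFO
               for name, ntype, _desc in iter_notes(blob))


def make_note(name, ntype, descsz):
    """Build one zero-filled note (illustration only)."""
    raw = name.encode("ascii") + b"\0"
    return (NHDR.pack(len(raw), descsz, ntype)
            + raw.ljust(_align4(len(raw)), b"\0")
            + bytes(_align4(descsz)))


# Synthetic segment with the same owners, types and sizes as the listing above.
SAMPLE_NOTES = (make_note("CORE", NT_PRSTATUS, 0x90)
                + make_note("Xen", XEN_ELFNOTE_CRASH_REGS, 0x10)
                + make_note("Xen", XEN_ELFNOTE_CRASH_INFO, 0x24)
                + make_note("CORE", NT_PRSTATUS, 0x90)
                + make_note("Xen", XEN_ELFNOTE_CRASH_REGS, 0x10))

if __name__ == "__main__":
    for name, ntype, desc in iter_notes(SAMPLE_NOTES):
        print("%-4s %#010x %#x" % (name, ntype, len(desc)))
    print(has_xen_crash_info(SAMPLE_NOTES))
```

The synthetic segment comes out at 0x1bc bytes, the same length readelf reports
for the working vmcore, which is consistent with the 4-byte padding assumption.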

My understanding is that those notes get set up at kexec_load time
while running in the first kernel, and should be sitting there for the
secondary kernel to export in /proc/vmcore.

Comment 22 Gerd Hoffmann 2007-06-14 10:42:21 UTC
Looks like we have to pull more xen support bits into kexec-tools.

When I compile the kexec-tools-testing tree as-is (see comment #1) and use the
resulting kexec binary, the generated vmcore actually has the notes.

Looking ...

Comment 23 Gerd Hoffmann 2007-06-14 14:35:23 UTC
Created attachment 157006 [details]
additional bits for xen support

Tested on i386, will look at x86_64 now, stay tuned ...

Comment 24 Gerd Hoffmann 2007-06-14 16:01:18 UTC
Created attachment 157015 [details]
64bit bits

Comment 25 Neil Horman 2007-06-14 20:23:50 UTC
Gerd, please don't post to bz's after they're in modified state, otherwise I
tend to lose track of them (I filter them out in my bz view).  If you could
open a new bz with these patches, I'd be happy to incorporate them.  Thanks!

Comment 26 Jarod Wilson 2007-06-14 21:07:50 UTC
New bz to track this is bug 244301. Note that we appear to still need additional
patches. Or at least my test boxes do. I can't even get a dump with the non-xen
kraxel kernel on two boxes that work fine w/the 5.0GA kernel...

Comment 29 errata-xmlrpc 2007-11-07 18:03:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0548.html