253850 – Need to backport kernel vmcore notes changes for makedumpfile

Summary: Need to backport kernel vmcore notes changes for makedumpfile

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	223632
Depends On:	439304
Blocks:	232927 249266 296431 372911 420521 422431 422441 425461

Reported:	2007-08-22 13:02 UTC by Neil Horman
Modified:	2009-06-20 00:39 UTC
CC List:	8 users
Fixed In Version:	RHBA-2008-0314
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 14:53:27 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
This is the patch in -mm that we need to backport (11.11 KB, patch) 2007-09-05 15:22 UTC, Neil Horman
backport of vmcore elf notes patch (12.77 KB, patch) 2007-09-12 18:14 UTC, Neil Horman
updated backport with fixes (13.06 KB, patch) 2007-09-18 17:25 UTC, Neil Horman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Neil Horman 2007-08-22 13:02:14 UTC

Description of problem:
We need to backport the kernel portion of this thread:
http://lists.infradead.org/pipermail/kexec/2007-August/000521.html
so that we can support the use of makedumpfile without a config file present.

This is to track the kernel portion of that work. The user space portion is
covered in another bz.

Comment 1 Neil Horman 2007-08-22 13:09:33 UTC

*** Bug 223632 has been marked as a duplicate of this bug. ***

Comment 2 Larry Troan 2007-08-30 22:53:11 UTC

Bug 223632, marked as a DUP of this bug, was Priority=URGENT and Severity=HIGH 
yet this bug is LOW/LOW. We have a hole in our tools/process. 

Per other bug, code is upstream and needs to be backported to 5.2.

RAISING TO CORRECT PRIORITY/SEVERITY.

Comment 3 Neil Horman 2007-08-30 23:40:22 UTC

This isn't urgent, Larry; this is slated for 5.2.

*** This bug has been marked as a duplicate of 253852 ***

Comment 4 Neil Horman 2007-08-31 13:19:51 UTC

Also, Larry, it should be noted that this code is not actually upstream yet.
It's been posted for review and came back with some concerns from Andrew Morton.
I have yet to see it included in -mm.  I've sent a note to Kenichi asking for
status and have not yet heard back.

Comment 6 Neil Horman 2007-09-05 15:22:41 UTC

Created attachment 187591 [details]
This is the patch in -mm that we need to backport

Comment 7 Neil Horman 2007-09-05 18:53:14 UTC

Building and testing the backport of the above patch now

Comment 8 Neil Horman 2007-09-12 18:14:59 UTC

Created attachment 193721 [details]
backport of vmcore elf notes patch

This is a backport of the upstream patch, plus some subsequent cleanup sent to
akpm.  I'll post when the makedumpfile and kexec-tools components take
advantage of this.

Comment 9 Daniel Riek 2007-09-17 17:41:17 UTC

Adjusting priority due to our new priority inclusion criteria as outlined in
http://intranet.corp.redhat.com/ic/intranet/RHELInclusionCriteria.html

Comment 10 Neil Horman 2007-09-18 17:25:49 UTC

Created attachment 198621 [details]
updated backport with fixes

Comment 12 Don Zickus 2007-11-29 17:05:34 UTC

in 2.6.18-58.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 13 Dave Anderson 2007-12-12 21:42:36 UTC

I'm not sure where the following problem lies, but this VMCOREINFO addition
may have something to do with it.

Running this combination on an i386: 

  # uname -r
  2.6.18-58.el5xen
  # rpm -qa | grep kexec-tools
  kexec-tools-1.102pre-8.el5
  #

the vmcore of a dom0 kdump has a nonsensical NOTES section, and
cannot be analyzed with the crash utility.  However, a kdump of the
2.6.18-58.el5 bare metal kernel is OK.

Here's the evidence.  Taking the bare metal kernel, note that the
NOTE section has a size of 0x4a0 bytes, taken up by the two CORE notes
plus the new VMCOREINFO note:

# readelf -a vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x00000000000004a0 0x00000000000004a0         0
  LOAD           0x00000000000005f8 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000000a05f8 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x0000000001fa05f8 0x00000000ca000000 0x000000000a000000
                 0x000000002e000000 0x000000002e000000  RWE    0
  LOAD           0x000000002ffa05f8 0xffffffffffffffff 0x0000000038000000
                 0x0000000007e8cc00 0x0000000007e8cc00  RWE    0

There is no dynamic section in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.

Notes at offset 0x00000158 with length 0x000004a0:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  VMCOREINFO            0x00000340      Unknown note type: (0x00000000)
#
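
The 0x4a0 figure can be checked by hand.  A minimal sketch, assuming the
usual ELF note layout (a 12-byte header of three 32-bit words, followed by
the NUL-terminated owner name and the descriptor, each padded to a 4-byte
boundary); the helper names are made up for illustration:

```python
# Size arithmetic for an ELF PT_NOTE segment (sketch; helper names are hypothetical).
NOTE_HEADER_SIZE = 12  # n_namesz, n_descsz, n_type: three 32-bit words


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def note_size(owner, desc_size):
    """Bytes occupied by one note: header + padded name + padded descriptor."""
    name_size = len(owner) + 1  # owner string plus its terminating NUL
    return NOTE_HEADER_SIZE + align4(name_size) + align4(desc_size)


# Owner / data-size pairs copied from the readelf output above.
bare_metal_notes = [("CORE", 0x90), ("CORE", 0x90), ("VMCOREINFO", 0x340)]
total = sum(note_size(owner, size) for owner, size in bare_metal_notes)
print(hex(total))  # 0x4a0
```

Each CORE note takes 0xa4 bytes and the VMCOREINFO note takes 0x358 bytes,
which adds up to the 0x4a0 reported in the program header.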

On the same system running the 2.6.18-58.el5xen kernel, check
this out: the NOTES section is advertised as having
0x14228e98 (?) bytes, which, as with the crash utility, causes readelf
itself to die with a segmentation violation:

# readelf -a vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x0000000014228e98 0x0000000014228e98         0
  LOAD           0x0000000014228ff0 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000142c8ff0 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x00000000161c8ff0 0x00000000ca000000 0x000000000a000000
                 0x000000002e000000 0x000000002e000000  RWE    0
  LOAD           0x00000000441c8ff0 0xffffffffffffffff 0x0000000038000000
                 0x0000000007e8c000 0x0000000007e8c000  RWE    0

There is no dynamic section in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.

Notes at offset 0x00000158 with length 0x14228e98:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
  Xen           0x00000024      Unknown note type: (0x01000001)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
Segmentation fault
#
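
A dump-analysis tool could guard against a header like this before walking
the notes.  A minimal sketch, assuming a little-endian ELF64 file as shown
above; the function name and the 1 MiB ceiling are arbitrary choices for
illustration, not anything that readelf or crash is known to implement:

```python
import struct

PT_NOTE = 4
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr, 56 bytes


def suspicious_note_segments(data, limit=1 << 20):
    """Return (offset, filesz) for each PT_NOTE segment that is larger than
    `limit` or that extends past the end of `data`."""
    fields = EHDR.unpack_from(data, 0)
    ident, e_phoff, e_phentsize, e_phnum = fields[0], fields[5], fields[9], fields[10]
    # Byte 4 is the ELF class (2 = 64-bit), byte 5 the data encoding (1 = LSB).
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    bad = []
    for i in range(e_phnum):
        p_type, _flags, p_offset, _va, _pa, p_filesz, _memsz, _align = \
            PHDR.unpack_from(data, e_phoff + i * e_phentsize)
        if p_type == PT_NOTE and (p_filesz > limit or p_offset + p_filesz > len(data)):
            bad.append((p_offset, p_filesz))
    return bad
```

With the header above, the single NOTE entry at offset 0x158 claims
0x14228e98 bytes (roughly 322 MiB) and would be flagged, whereas the 0x4a0
entry from the bare metal vmcore would pass.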

I took two separate dom0 kdumps, and both resulted in the same vmcore,
i.e., with the same NOTE size of 0x14228e98.

Again, I'm not pointing specifically at the VMCOREINFO addition,
but it's of note that the next NOTE section that should have been
displayed would be the VMCOREINFO.  But the kernel patch only seems
to address the bare-metal side, and doesn't touch the xen side.

So perhaps there could be a mis-match between the latest kexec-tools,
the VMCOREINFO patch, and xen?

I'll try the same thing on an x86_64.

Comment 14 Ken'ichi Ohmichi 2007-12-13 07:10:29 UTC

Hi Dave,

Thank you for the report.
kexec-tools gets the note size of VMCOREINFO from /sys/kernel/vmcoreinfo.
I guess that /sys/kernel/vmcoreinfo contains invalid values on dom0 kernel.

Could you please check /sys/kernel/vmcoreinfo on dom0 kernel and report it ?

Comment 15 Dave Anderson 2007-12-13 14:18:41 UTC

(In reply to comment #14)
> Hi Dave,
> 
> Thank you for the report.
> kexec-tools gets the note size of VMCOREINFO from /sys/kernel/vmcoreinfo.
> I guess that /sys/kernel/vmcoreinfo contains invalid values on dom0 kernel.
> 
> Could you please check /sys/kernel/vmcoreinfo on dom0 kernel and report it ?
> 

  # uname -r
  2.6.18-58.el5xen
  # rpm -qa | grep kexec-tools
  kexec-tools-1.102pre-8.el5
  # cat /sys/kernel/vmcoreinfo
  780e20 1000
  #

If I boot the standard kernel, it looks similar:

  # uname -r
  2.6.18-58.el5
  # cat /sys/kernel/vmcoreinfo
  79a080 1000
  #
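
For reference, both values in that file are hexadecimal.  A minimal sketch
of reading it, assuming the first field is the physical address of the
in-kernel VMCOREINFO note and the second is the size reserved for it (the
function name is hypothetical):

```python
def parse_vmcoreinfo_sysfs(text):
    """Split the one-line '<hex address> <hex size>' format into integers."""
    addr_str, size_str = text.split()
    return int(addr_str, 16), int(size_str, 16)


addr, size = parse_vmcoreinfo_sysfs("780e20 1000\n")
print(hex(addr), size)  # 0x780e20 4096
```

Both kernels report a size of 0x1000 (4096) bytes, which is nowhere near the
0x14228e98 NOTE length seen in the dom0 vmcore.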

Is there a dependency between the kernel and kexec-tools versions
with respect to VMCOREINFO handling?

> I'll try the same thing on an x86_64.

I'm also having a strange issue with this on x86_64.  I updated my
test x86_64 system to 2.6.18-58.el5, and while still using an earlier
version of kexec-tools, I was able to take a bare-metal kdump OK.

But then I updated to kexec-tools-1.102pre-8.el5, and now I'm getting
an error I cannot explain, where I'm unable to even start kdump:

  # uname -r
  2.6.18-58.el5
  # chkconfig kdump --list
  kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off
  # service kdump status
  Kdump is not operational
  # service kdump start
  Starting kdump:                                            [FAILED]
  # tail /var/log/messages
  ...
  Dec 13 04:20:43 dhcp83-53 kdump: No crashkernel parameter specified for
  running kernel
  Dec 13 04:20:43 dhcp83-53 kdump: failed to start up
  #

But the crashkernel parameter is there: 

  # grep vmlinuz-2.6.18-58.el5 /etc/grub.conf
        kernel /vmlinuz-2.6.18-58.el5 ro root=/dev/VolGroup00/LogVol00 rhgb
  quiet crashkernel=128M@16M
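
The message speaks of the running kernel, so the check presumably looks at
that kernel's own command line rather than at grub.conf.  A minimal sketch of
such a check, assuming the `crashkernel=<size>@<offset>` form shown above; the
function name is hypothetical and this is not the actual kdump init script
logic:

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_crashkernel(cmdline):
    """Return (size, offset) in bytes from a 'crashkernel=size@offset'
    argument, or None if the command line has no such argument."""
    match = re.search(r"(?:^|\s)crashkernel=(\d+)([KMG]?)@(\d+)([KMG]?)(?=\s|$)", cmdline)
    if match is None:
        return None
    size, size_unit, offset, offset_unit = match.groups()
    return int(size) * _UNITS[size_unit], int(offset) * _UNITS[offset_unit]


cmdline = "ro root=/dev/VolGroup00/LogVol00 rhgb quiet crashkernel=128M@16M"
print(parse_crashkernel(cmdline))  # (134217728, 16777216)
```

Feeding it the contents of /proc/cmdline instead of the grub.conf line would
show what the running kernel was actually booted with.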

One thing that I did was to manually set the KDUMP_KERNELVER so that I
could use it for a subsequent dom0 setup:

  # grep 2.6.18-58.el5 /etc/sysconfig/kdump
  KDUMP_KERNELVER="2.6.18-58.el5"
  #

Perhaps I'm doing something wrong (?), so I'll keep tinkering...

Comment 16 Dave Anderson 2007-12-13 15:16:34 UTC

> Perhaps I'm doing something wrong (?), so I'll keep tinkering...

What's happening is that when my system boots into the kdump kernel,
it *stays* in the kdump kernel, although /proc/vmcore is zero bytes long.
(So my attempt at "service kdump start" fails appropriately...)

Anyway, I don't understand why this is happening, as things were
working fine with the older version of kexec-tools?

Comment 17 Dave Anderson 2007-12-13 15:29:07 UTC

> Anyway, I don't understand why this is happening, as things were
> working fine with the older version of kexec-tools?

When the kdump kernel runs, I see the "Attempting to enter user-space to capture
vmcore" message quickly going by, but the kernel just continues on, and goes
through the normal boot process, bringing  up the graphical window, etc. and
just ends up staying in the kdump kernel.  Perhaps because /proc/vmcore is zero
bytes long, it isn't doing the copy to /var/crash, and rebooting back into the
standard kernel?

Comment 18 Neil Horman 2007-12-13 15:51:12 UTC

Dave:
"Is there a dependency between the kernel and kexec-tools versions
with respect to VMCOREINFO handling?"

Yes, IIRC there is.  If you use a newer kexec-tools version (1.102pre
specifically), I think kexec expects to find /sys/kernel/vmcoreinfo.  If it
doesn't, it won't load the new kernel.  I might be wrong, but that's what I remember.

I'm trying to parse what you're saying in comment 16, and I'm having a bit of
trouble.  You say that in the kdump kernel /proc/vmcore is zero bytes long.  If
that's the case, I would expect the kdump kernel to _not_ reboot, as you are
observing.  This is because the kdump init script detects the need to reboot
based on the length of /proc/vmcore.  If the file is of non-zero length, we
record the core file and reboot; otherwise we just try to insert the kdump
kernel into kernel memory as normal, which I would expect to fail under a kdump
kernel due to the lack of a crashkernel option on its kernel command line.
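
The decision described above can be sketched as follows.  This is an
illustration only, assuming the script needs nothing beyond the size of
/proc/vmcore; the function name and the returned labels are hypothetical, and
the real init script is a shell script:

```python
import os


def kdump_start_action(vmcore_path="/proc/vmcore"):
    """Pick what 'service kdump start' would do, based on the vmcore size."""
    try:
        size = os.path.getsize(vmcore_path)
    except OSError:
        size = 0  # no vmcore file at all: treat it like an empty one
    if size > 0:
        # We are in the capture kernel with a usable dump of the old kernel.
        return "save vmcore and reboot"
    # Normal boot, or a capture kernel whose vmcore is empty.
    return "load kdump kernel"
```

With a zero-length /proc/vmcore the second branch is taken even inside the
kdump kernel, which matches the failed "service kdump start" reported in
comment 15.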

Regarding your "Attempting to enter user-space to capture
vmcore" message, I'm not sure I'm familiar with that particular log message or
where it's coming from.  Can you provide a serial console log of kdump's attempt
to capture a vmcore?  Thanks!

Comment 19 Dave Anderson 2007-12-13 16:17:44 UTC

I can't.  But the error message string is found in /sbin/mkdumprd,
although I haven't checked where it puts it.

Anyway, I restored kexec-tools-1.101-194.4.el5, and kdump of the
bare-metal 2.6.18-58.el5 kernel works OK again.  Although, as expected,
there's no VMCOREINFO notes section:

  # readelf -a vmcore
  ...
  Notes at offset 0x00000158 with length 0x000002c8:
    Owner         Data size       Description
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
    CORE          0x00000150      NT_PRSTATUS (prstatus structure)
  # 

Sorry for the confusion.  So to clarify, these are my results:

i386:

 - Kernel 2.6.18-58.el5 with kexec-tools-1.102pre-8.el5, kdump works OK
   and the VMCOREINFO section shows up OK.
 - Kernel 2.6.18-58.el5xen with kexec-tools-1.102pre-8.el5, kdump creates
   the bogus vmcore with the bizarrely-large notes section. 

x86_64:

 - Kernel 2.6.18-58.el5 and kexec-tools-1.102pre-8.el5, the /proc/vmcore is
   zero bytes long, and so the kdump kernel continues to run.
 - Kernel 2.6.18-58.el5 with kexec-tools-1.101-194.4.el5, kdump works OK
   although there's no VMCOREINFO section in the vmcore.
 - Kernel 2.6.18-58.el5xen with kexec-tools-1.101-194.4.el5, kdump works OK
   although there's no VMCOREINFO section in the vmcore.

Comment 20 Neil Horman 2007-12-13 17:14:09 UTC

Ok, thank you Dave, regarding the log message, I must have just forgotten that I
added that.

Regarding your results, I think we should open separate bugs for the i386 xen
case and the x86_64 case.  The i386 case seems like something that just wasn't
tested with the original patch.  The latter actually sounds familiar, like
vmcore initialization just isn't setting the size of the vmcore file properly on
kdump kernel boot up (I think we fixed it in fedora a bit ago; I'll need to
check).

Thanks!

Comment 21 Dave Anderson 2007-12-13 17:17:50 UTC

OK -- I'll file an i386 "bizarre-notes-length" BZ, and let you
further investigate the x86_64 issue...

Thanks,
  Dave

Comment 22 Dave Anderson 2007-12-13 17:33:42 UTC

Bugzilla Bug 423731: i386 dom0 kdump vmcore file created with bogus notes section
https://bugzilla.redhat.com/show_bug.cgi?id=423731

Comment 23 Neil Horman 2007-12-13 19:58:50 UTC

Thanks, Dave.  I've recreated the zero length vmcore issue.  The size of that
file is determined in the vmcore initcall, and it will be zero if there is an
error parsing the elf headers for the vmcore.  Not sure what's going on exactly yet.

Comment 24 Neil Horman 2007-12-14 00:06:30 UTC

Interesting data point: I just retrieved and rebuilt my
private-nhorman-bz253850-branch from cvs, where I did the initial backport for
this patch, and it provides a valid /proc/vmcore file in kdump with a valid
VMCOREINFO section on my intel x86_64 system.  That was branched from
2.6.18-45.el5 (at least that's the last entry in the changelog from the spec
file).  So something has changed between -45 and -58 that caused this.  I note a
ppc change in -55 for a kexec hang that may be related, but I'm really not sure.
I figure I'll just bisect the kernels and see where things go south.
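
Bisecting between a known-good and a known-bad build takes only a handful of
test boots.  A minimal sketch, assuming every intermediate build number
exists and that a hypothetical `is_good(build)` callback reports whether kdump
produces a valid /proc/vmcore on that build:

```python
def first_bad_build(good, bad, is_good):
    """Binary search for the first failing build in (good, bad].

    Assumes `good` passes, `bad` fails, and every build after the first
    failing one fails too.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid
        else:
            bad = mid
    return bad


# Pretend the regression landed in build 52 (a made-up value for the demo).
print(first_bad_build(45, 58, lambda build: build < 52))  # 52
```

For the -45 to -58 range this needs at most four test boots.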

Also, since I've discovered that things seem to work in -45, and this bug is in
modified state, I'm going to open a new bug for this, to help me track it.

Comment 25 Neil Horman 2007-12-14 00:09:19 UTC

I opened bz 424511 to track this

Comment 27 John Poelstra 2008-03-21 03:58:03 UTC

Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot1--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you

Comment 28 John Poelstra 2008-04-02 21:39:06 UTC

Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot3--available now on partners.redhat.com.  

Comment 29 Ken'ichi Ohmichi 2008-04-04 08:41:54 UTC

We tested RHEL5.2GA-Snapshot3, and confirmed this feature works fine
on both i386 and ia64. But this feature doesn't work on x86_64 due to
BZ#439304.

Comment 30 John Poelstra 2008-04-09 22:44:47 UTC

Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot4--available now on partners.redhat.com.  

Comment 31 John Poelstra 2008-04-23 17:40:00 UTC

Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot6--available now on partners.redhat.com.  

We are nearing GA for 5.2 so please test and confirm that your issue is fixed ASAP.

Comment 32 John Poelstra 2008-05-01 16:50:07 UTC

Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot7--available now on partners.redhat.com.  

We are nearing GA for 5.2--this is the last opportunity to test and confirm that
your issue is fixed.

Comment 34 errata-xmlrpc 2008-05-21 14:53:27 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
