Description of problem: Need to backport the kernel portion of this thread: http://lists.infradead.org/pipermail/kexec/2007-August/000521.html so that we can support the use of makedumpfile without a config file present. This bug tracks the kernel portion of that work. The user space portion is covered in another bz.
*** Bug 223632 has been marked as a duplicate of this bug. ***
Bug 223632, marked as a DUP of this bug, was Priority=URGENT and Severity=HIGH yet this bug is LOW/LOW. We have a hole in our tools/process. Per other bug, code is upstream and needs to be backported to 5.2. RAISING TO CORRECT PRIORITY/SEVERITY.
this isn't urgent, larry, this is slated for 5.2 *** This bug has been marked as a duplicate of 253852 ***
Also, larry, it should be noted that this code is not actually upstream yet. It's been posted for review and came back with some concerns from Andrew Morton. I have yet to see it included in -mm. I've sent a note to Kenichi asking for status and have not yet heard back.
Created attachment 187591 [details] This is the patch in -mm that we need to backport
Building and testing the backport of the above patch now
Created attachment 193721 [details] backport of vmcore elf notes patch This is a backport of the upstream patch, plus some subsequent cleanup sent to akpm. I'll post when the makedumpfile and kexec-tools components take advantage of this.
Adjusting priority due to our new priority inclusion criteria as outlined in http://intranet.corp.redhat.com/ic/intranet/RHELInclusionCriteria.html
Created attachment 198621 [details] updated backport with fixes
in 2.6.18-58.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
I'm not sure where the following problem lies, but this VMCOREINFO addition may have something to do with it. Running this combination on an i386:

# uname -r
2.6.18-58.el5xen
# rpm -qa | grep kexec-tools
kexec-tools-1.102pre-8.el5
#

the vmcore of a dom0 kdump has a nonsensical NOTES section, and cannot be analyzed with the crash utility. However, a kdump of the 2.6.18-58.el5 bare metal kernel is OK.

Here's the evidence. Taking the bare metal kernel, note that the NOTE section has a size of 0x4a0 bytes, taken up by the two CORE notes plus the new VMCOREINFO note:

# readelf -a vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x00000000000004a0 0x00000000000004a0         0
  LOAD           0x00000000000005f8 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000000a05f8 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x0000000001fa05f8 0x00000000ca000000 0x000000000a000000
                 0x000000002e000000 0x000000002e000000  RWE    0
  LOAD           0x000000002ffa05f8 0xffffffffffffffff 0x0000000038000000
                 0x0000000007e8cc00 0x0000000007e8cc00  RWE    0

There is no dynamic section in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.

Notes at offset 0x00000158 with length 0x000004a0:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  VMCOREINFO    0x00000340      Unknown note type: (0x00000000)
#

On the same system running the 2.6.18-58.el5xen kernel, check this out: the NOTES section is advertised as having 0x14228e98 (?) bytes, and, as with the crash utility, this causes readelf itself to die with a segmentation violation:

# readelf -a vmcore
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000158 0x0000000000000000 0x0000000000000000
                 0x0000000014228e98 0x0000000014228e98         0
  LOAD           0x0000000014228ff0 0x00000000c0000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000  RWE    0
  LOAD           0x00000000142c8ff0 0x00000000c0100000 0x0000000000100000
                 0x0000000001f00000 0x0000000001f00000  RWE    0
  LOAD           0x00000000161c8ff0 0x00000000ca000000 0x000000000a000000
                 0x000000002e000000 0x000000002e000000  RWE    0
  LOAD           0x00000000441c8ff0 0xffffffffffffffff 0x0000000038000000
                 0x0000000007e8c000 0x0000000007e8c000  RWE    0

There is no dynamic section in this file.

There are no relocations in this file.

There are no unwind sections in this file.

No version information found in this file.

Notes at offset 0x00000158 with length 0x14228e98:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
  Xen           0x00000024      Unknown note type: (0x01000001)
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  Xen           0x00000010      Unknown note type: (0x01000002)
Segmentation fault
#

I took two separate dom0 kdumps, and both resulted in the same vmcore, i.e., with the same NOTE size of 0x14228e98. Again, I'm not pointing specifically at the VMCOREINFO addition, but it's of note that the next NOTE section that should have been displayed would be the VMCOREINFO. But the kernel patch only seems to address the bare-metal side, and doesn't touch the xen side. So perhaps there could be a mismatch between the latest kexec-tools, the VMCOREINFO patch, and xen? I'll try the same thing on an x86_64.
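As a cross-check of the sizes quoted from readelf, here is a minimal sketch of the note arithmetic. It assumes the usual ELF note layout (a 12-byte header, then the NUL-terminated owner name and the descriptor, each padded to a 4-byte boundary); the helper names `align4` and `note_size` are hypothetical and exist only for this illustration.

```python
def align4(n: int) -> int:
    """Round n up to the next multiple of 4 (assumed ELF note padding)."""
    return (n + 3) & ~3


def note_size(owner: str, desc_size: int) -> int:
    """Bytes taken by one note: 12-byte header + padded name + padded descriptor."""
    name_size = len(owner) + 1  # owner string plus its terminating NUL
    return 12 + align4(name_size) + align4(desc_size)


# Bare-metal vmcore: two CORE notes and one VMCOREINFO note (data sizes from readelf).
bare_metal_notes = 2 * note_size("CORE", 0x90) + note_size("VMCOREINFO", 0x340)

# The ELF header is 64 bytes and is followed by five 56-byte program headers.
note_offset = 64 + 5 * 56

# In both files the first LOAD segment is placed right after the NOTE segment.
first_load_bare_metal = note_offset + 0x4A0
first_load_xen = note_offset + 0x14228E98

print(hex(bare_metal_notes), hex(note_offset))
print(hex(first_load_bare_metal), hex(first_load_xen))
```

Under those assumptions the bare-metal notes add up to exactly the advertised 0x4a0 bytes, and in both files the first LOAD offset equals the NOTE offset plus the advertised NOTE size. So the xen vmcore is laid out consistently with the 0x14228e98 value; it is the value itself that looks wrong.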
Hi Dave, Thank you for the report. kexec-tools gets the note size of VMCOREINFO from /sys/kernel/vmcoreinfo. I guess that /sys/kernel/vmcoreinfo contains invalid values on dom0 kernel. Could you please check /sys/kernel/vmcoreinfo on dom0 kernel and report it ?
(In reply to comment #14)
> Hi Dave,
>
> Thank you for the report.
> kexec-tools gets the note size of VMCOREINFO from /sys/kernel/vmcoreinfo.
> I guess that /sys/kernel/vmcoreinfo contains invalid values on dom0 kernel.
>
> Could you please check /sys/kernel/vmcoreinfo on dom0 kernel and report it ?
>

# uname -r
2.6.18-58.el5xen
# rpm -qa | grep kexec-tools
kexec-tools-1.102pre-8.el5
# cat /sys/kernel/vmcoreinfo
780e20 1000
#

If I boot the standard kernel, it looks similar:

# uname -r
2.6.18-58.el5
# cat /sys/kernel/vmcoreinfo
79a080 1000
#

Is there a dependency between the kernel and kexec-tools versions with respect to VMCOREINFO handling?

> I'll try the same thing on an x86_64.

I'm also having a strange issue with this on x86_64. I updated my test x86_64 system to 2.6.18-58.el5, and while still using an earlier version of kexec-tools, I was able to take a bare-metal kdump OK. But then I updated to kexec-tools-1.102pre-8.el5, and now I'm getting an error I cannot explain, where I'm unable to even start kdump:

# uname -r
2.6.18-58.el5
# chkconfig kdump --list
kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off
# service kdump status
Kdump is not operational
# service kdump start
Starting kdump:                                            [FAILED]
# tail /var/log/messages
...
Dec 13 04:20:43 dhcp83-53 kdump: No crashkernel parameter specified for running kernel
Dec 13 04:20:43 dhcp83-53 kdump: failed to start up
#

But the crashkernel parameter is there:

# grep vmlinuz-2.6.18-58.el5 /etc/grub.conf
        kernel /vmlinuz-2.6.18-58.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet crashkernel=128M@16M

One thing that I did was to manually set the KDUMP_KERNELVER so that I could use it for a subsequent dom0 setup:

# grep 2.6.18-58.el5 /etc/sysconfig/kdump
KDUMP_KERNELVER="2.6.18-58.el5"
#

Perhaps I'm doing something wrong (?), so I'll keep tinkering...
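For reference, a small sketch of how the two /sys/kernel/vmcoreinfo values could be read. It assumes the file holds two hexadecimal fields, the address of the VMCOREINFO note followed by its size; that interpretation is an assumption based on the discussion here, and `parse_vmcoreinfo_sysfs` is a hypothetical helper, not a kexec-tools function.

```python
from typing import Tuple


def parse_vmcoreinfo_sysfs(text: str) -> Tuple[int, int]:
    """Split '<address> <size>' and read both fields as hex (assumed format)."""
    addr_str, size_str = text.split()
    return int(addr_str, 16), int(size_str, 16)


# Values copied from the two `cat /sys/kernel/vmcoreinfo` runs above.
samples = {
    "2.6.18-58.el5xen": "780e20 1000\n",
    "2.6.18-58.el5": "79a080 1000\n",
}

for kernel, raw in samples.items():
    addr, size = parse_vmcoreinfo_sysfs(raw)
    print(f"{kernel}: address {addr:#x}, size {size:#x} ({size} bytes)")
```

Read this way, both kernels report the same 0x1000-byte size, and neither value resembles the 0x14228e98 NOTE length seen in the dom0 vmcore.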
> Perhaps I'm doing something wrong (?), so I'll keep tinkering... What's happening is that when my system boots into the kdump kernel, it *stays* in the kdump kernel, although /proc/vmcore is zero bytes long. (So my attempt at "service kdump start" fails appropriately...) Anyway, I don't understand why this is happening, as things were working fine with the older version of kexec-tools?
> Anyway, I don't understand why this is happening, as things were > working fine with the older version of kexec-tools? When the kdump kernel runs, I see the "Attempting to enter user-space to capture vmcore" message quickly going by, but the kernel just continues on, and goes through the normal boot process, bringing up the graphical window, etc. and just ends up staying in the kdump kernel. Perhaps because /proc/vmcore is zero bytes long, it isn't doing the copy to /var/crash, and rebooting back into the standard kernel?
Dave: "Is there a dependency between the kernel and kexec-tools versions with respect to VMCOREINFO handling?" Yes, IIRC there is. If you use a newer kexec-tools version (1.102pre specifically), I think kexec expects to find /sys/kernel/vmcoreinfo. If it doesn't, it won't load the new kernel. I might be wrong, but that's what I remember. I'm trying to parse what you're saying in comment 16, and I'm having a bit of trouble. You say that in the kdump kernel /proc/vmcore is zero bytes long. If that's the case, I would expect the kdump kernel to _not_ reboot, as you are observing. This is because the kdump init script detects the need to reboot based on the length of /proc/vmcore. If the file is of non-zero length, we record the core file and reboot; otherwise we just try to insert the kdump kernel into kernel memory as normal, which I would expect to fail under a kdump kernel due to the lack of a crashkernel option on its kernel command line. Regarding your "Attempting to enter user-space to capture vmcore" message, I'm not sure I'm familiar with that particular log message, and I'm not sure where it's coming from. Can you provide a serial console log of kdump's attempt to capture a vmcore? Thanks!
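The init script decision described above can be sketched roughly as follows. This is an illustration only, not the actual kdump init script: `kdump_action` and the strings it returns are hypothetical, and the only rule taken from the comment is "non-zero /proc/vmcore means save and reboot, otherwise try to load the kdump kernel".

```python
def kdump_action(vmcore_size: int) -> str:
    """Pick the init script's next step from the size of /proc/vmcore."""
    if vmcore_size > 0:
        # We are in the capture kernel with a usable vmcore: record it, then reboot.
        return "save vmcore and reboot"
    # Zero length: behave like a normal boot and try to load the kdump kernel,
    # which is expected to fail when no crashkernel= option is on the command line.
    return "load kdump kernel"


print(kdump_action(0))
print(kdump_action(0x7E8CC00))
```

On a live system the size would come from something like `os.path.getsize("/proc/vmcore")`. With a zero-length file the sketch takes the second branch, which matches the "No crashkernel parameter specified for running kernel" failure reported from inside the kdump kernel.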
I can't. But the error message string is found in /sbin/mkdumprd, although I haven't checked where it puts it. Anyway, I restored kexec-tools-1.101-194.4.el5, and kdump of the bare-metal 2.6.18-58.el5 kernel works OK again. Although, as expected, there's no VMCOREINFO notes section:

# readelf -a vmcore
...
Notes at offset 0x00000158 with length 0x000002c8:
  Owner         Data size       Description
  CORE          0x00000150      NT_PRSTATUS (prstatus structure)
  CORE          0x00000150      NT_PRSTATUS (prstatus structure)
#

Sorry for the confusion. So to clarify, these are my results:

i386:
- Kernel 2.6.18-58.el5 with kexec-tools-1.102pre-8.el5, kdump works OK and the VMCOREINFO section shows up OK.
- Kernel 2.6.18-58.el5xen with kexec-tools-1.102pre-8.el5, kdump creates the bogus vmcore with the bizarrely-large notes section.

x86_64:
- Kernel 2.6.18-58.el5 and kexec-tools-1.102pre-8.el5, the /proc/vmcore is zero bytes long, and so the kdump kernel continues to run.
- Kernel 2.6.18-58.el5 with kexec-tools-1.101-194.4.el5, kdump works OK although there's no VMCOREINFO section in the vmcore.
- Kernel 2.6.18-58.el5xen with kexec-tools-1.101-194.4.el5, kdump works OK although there's no VMCOREINFO section in the vmcore.
Ok, thank you Dave. Regarding the log message, I must have just forgotten that I added that. Regarding your results, I think we should open separate bugs for the i386 xen case and the x86_64 case. The i386 case seems like something that just wasn't tested with the original patch, and the latter actually sounds familiar, like vmcore initialization just isn't setting the size of the vmcore file properly on kdump kernel boot up (I think we fixed it in Fedora a bit ago; I'll need to check). Thanks!
OK -- I'll file an i386 "bizarre-notes-length" BZ, and let you further investigate the x86_64 issue... Thanks, Dave
Bugzilla Bug 423731: i386 dom0 kdump vmcore file created with bogus notes section https://bugzilla.redhat.com/show_bug.cgi?id=423731
Thanks, Dave. I've recreated the zero length vmcore issue. The size of that file is determined in the vmcore initcall, and it will be zero if there is an error parsing the ELF headers for the vmcore. Not sure what's going on exactly yet.
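To illustrate the "zero on parse error" behaviour in user-space terms, here is a rough sketch. It is not the kernel's vmcore initcall; it is a hypothetical analogue that assumes little-endian ELF64 headers (as in the readelf output earlier in this bug) and simply adds the header bytes to the file sizes of the NOTE and LOAD segments. The name `vmcore_size` is made up for this example.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header, 56 bytes
PT_LOAD, PT_NOTE = 1, 4


def vmcore_size(headers: bytes) -> int:
    """Return header bytes plus segment bytes, or 0 when the headers do not parse."""
    if len(headers) < EHDR.size:
        return 0
    fields = EHDR.unpack_from(headers, 0)
    ident, phoff, phentsize, phnum = fields[0], fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF" or phentsize != PHDR.size or phnum == 0:
        return 0
    end_of_headers = phoff + phnum * phentsize
    if end_of_headers > len(headers):
        return 0  # program header table is truncated
    size = end_of_headers
    for i in range(phnum):
        entry = PHDR.unpack_from(headers, phoff + i * phentsize)
        p_type, p_filesz = entry[0], entry[5]
        if p_type in (PT_LOAD, PT_NOTE):
            size += p_filesz
    return size


# A buffer of zero bytes has no ELF magic, so the reported size is 0.
print(vmcore_size(bytes(64)))
```

The point of the sketch is only that any failed header check leaves the size at zero, which is what a zero-length /proc/vmcore would look like from user space.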
Interesting data point: I just retrieved and rebuilt my private-nhorman-bz253850-branch from cvs, where I did the initial backport for this patch, and it provides a valid /proc/vmcore file in kdump with a valid VMCOREINFO section on my Intel x86_64 system. That was branched from 2.6.18-45.el5 (at least that's the last entry in the changelog from the spec file). So something has changed between -45 and -58 that caused this. I note a ppc change in -55 for a kexec hang that may be related, but I'm really not sure. I figure I'll just bisect the kernels and see where things go south. Also, since I've discovered that things seem to work in -45, and this bug is in modified state, I'm going to open a new bug for this, to help me track it.
I opened bz 424511 to track this
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3--available now on partners.redhat.com. Please test and confirm that your issue is fixed.
We tested RHEL5.2GA-Snapshot3, and confirmed this feature works fine on both i386 and ia64. But this feature doesn't work on x86_64 due to BZ#439304.
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot4--available now on partners.redhat.com. Please test and confirm that your issue is fixed.
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot6--available now on partners.redhat.com. We are nearing GA for 5.2 so please test and confirm that your issue is fixed ASAP.
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot7--available now on partners.redhat.com. We are nearing GA for 5.2--this is the last opportunity to test and confirm that your issue is fixed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html