Description of problem:
kdump will not start on a ppc64 server with 256G of memory enabled. yaboot.conf is configured with crashkernel=256M@32M. I have gone as high as 1024M@32M for the crashkernel setting with no luck.
Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-96.el5_5.2
kernel-kdump-2.6.18-194.11.1.el5
kernel-2.6.18-194.11.1.el5
How reproducible:
Enable 256G of memory in a ppc64 (P575) server and kdump will not start. Reduce configured memory to 112G and kdump works.
Steps to Reproduce:
1. Configure crashkernel=256M@32M in yaboot.conf
2. Enable 256G of memory for the ppc64 server
3. Boot server
4. kdump will fail to start
Actual results:
Excerpt from messages:
kdump: get memory ranges:1
Modified cmdline:root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M
kdump: kexec: failed to load kdump kernel
kdump: failed to start up
Expected results:
kdump should start on boot
Additional info:
If I deconfigure memory down to 112G the kdump service starts and works fine.
What is your /proc/iomem?
cat /proc/iomem
280000000000-28007fffffff: /pci@800000020000250
28007efa0000-28007efbffff: 0002:00:01.0
28007efc0000-28007effffff: 0002:00:01.0
28007efc0000-28007effffff: ipr
28007f000000-28007fffffff: 0002:00:01.0
28007f000000-28007fffffff: ipr
280400000000-28047fefffff: /pci@800000020000254
280480000000-2804ffefffff: /pci@800000020000255
300400000000-30047fefffff: /pci@800000020000234
30047fe76000-30047fe76fff: 0000:01:00.0
30047fe76000-30047fe76fff: lpfc
30047fe77000-30047fe77fff: 0000:01:00.1
30047fe77000-30047fe77fff: lpfc
30047fe78000-30047fe7bfff: 0000:01:00.0
30047fe78000-30047fe7bfff: lpfc
30047fe7c000-30047fe7ffff: 0000:01:00.1
30047fe7c000-30047fe7ffff: lpfc
30047fe80000-30047febffff: 0000:01:00.0
30047fec0000-30047fefffff: 0000:01:00.1
300480000000-3004ffefffff: /pci@800000020000235
3004ffe76000-3004ffe76fff: 0001:01:00.0
3004ffe76000-3004ffe76fff: lpfc
3004ffe77000-3004ffe77fff: 0001:01:00.1
3004ffe77000-3004ffe77fff: lpfc
3004ffe78000-3004ffe7bfff: 0001:01:00.0
3004ffe78000-3004ffe7bfff: lpfc
3004ffe7c000-3004ffe7ffff: 0001:01:00.1
3004ffe7c000-3004ffe7ffff: lpfc
3004ffe80000-3004ffebffff: 0001:01:00.0
3004ffec0000-3004ffefffff: 0001:01:00.1
Try leaving the @32M off the crashkernel line. I expect that as the amount of installed memory goes up, the kernel uses more memory to keep track of it. As a result we can't allocate the memory we need at the offset requested. If you eliminate the requested offset, the kernel will try to satisfy the allocation request at a more appropriate location.
Ah, ppc64 doesn't list crash memory in /proc/iomem, so please follow Neil's comments and also try: dmesg | grep -i crashkernel. Thanks!
I modified the crashkernel parameter in various ways and pasted the results:
1. dmesg | grep -i crash
Kernel command line: root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet crashkernel=256M@32M
kdump: get memory ranges:1
Modified cmdline:root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M
kdump: kexec: failed to load kdump kernel
kdump: failed to start up
2. dmesg | grep -i crash
Kernel command line: root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet crashkernel=256M
kdump: No crashkernel parameter specified for running kdump: failed to start up
3. dmesg | grep -i crash
Crash kernel location must be 0x2000000
Kernel command line: root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet crashkernel=256M@256M
kdump: get memory ranges:1
Modified cmdline:root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M
kdump: kexec: failed to load kdump kernel
kdump: failed to start up
4. dmesg | grep -i crash
Kernel command line: root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet crashkernel=4096M@32M
kdump: get memory ranges:1
Modified cmdline:root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M
kdump: kexec: failed to load kdump kernel
kdump: failed to start up
Created attachment 440755 [details] example kdump script mods Even though crashkernel=256M or crashkernel=256M@0 are valid options, the /etc/init.d/kdump script is looking for a @[0-9]\+[MmGgKk] format. The grep and sed lines need to be adjusted a bit to allow for alternate offset specifications. The attached modifications to the script allow kdump to start on my test system when using a crashkernel=256M format.
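For readers without access to the attachment, the following is a minimal sketch of a check that accepts crashkernel= both with and without an @offset. It is not the code from /etc/init.d/kdump or from the attached patch; the function name parse_crashkernel and its output format are made up for illustration.

```shell
# Hypothetical helper, not taken from /etc/init.d/kdump.
# Takes a kernel command line as $1 and prints "size=... offset=..."
# for the first crashkernel= argument; offset is "none" when no
# @offset is given. Returns 1 if there is no crashkernel= argument.
parse_crashkernel() {
    for arg in $1; do                 # split the command line on whitespace
        case "$arg" in
            crashkernel=*)
                value=${arg#crashkernel=}
                size=${value%%@*}     # text before the first "@"
                case "$value" in
                    *@*) offset=${value#*@} ;;
                    *)   offset=none ;;
                esac
                echo "size=$size offset=$offset"
                return 0
                ;;
        esac
    done
    return 1
}
```

Called as parse_crashkernel "$(cat /proc/cmdline)", this would treat crashkernel=256M, crashkernel=256M@0 and crashkernel=256M@32M as present, instead of requiring the @[0-9]\+[MmGgKk] suffix.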
The patch did allow the removal of the @32M. Still no luck with kdump starting. I tried using a crashkernel of 256M, 512M, 4096M and 8192M without success.
Hmm, the RHEL5 kernel doesn't support the new syntax, nor does it give any information in the log on ppc. So I don't know whether the kernel fails to reserve memory for kdump or kexec fails to load the kernel into the reserved memory... Is it possible for you to enable DEBUG and rebuild the srpm of kexec-tools (appending -DDEBUG to CFLAGS in the makefile)? Thanks.
I downloaded and installed the src rpm for kexec-tools and modified the CFLAGS line in the spec file as requested. Was that the correct place to make the edit? I ran "rpmbuild -bb", uninstalled the kexec-tools rpm, and reinstalled the one created from the src files. I also applied the patch to /etc/init.d/kdump. Rebooting the server resulted in the same messages as before. Is there something I need to do to increase debugging messages?
Update to comment 11. The error messages I received are below:
kdump: get memory ranges:1
Modified cmdline:root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M
kdump: kexec: failed to load kdump kernel
kdump: failed to start up
Is there any more info that I can provide?
Sorry for the delay, Joe. Here is what I did for debugging RHEL6 kexec-tools:
1) Download the srpm of kexec-tools.
2) Install this srpm.
3) Build the srpm with 'rpmbuild -bp' to get the patched source.
4) Enter the source code directory and modify makefile.in, appending '-DDEBUG'.
5) Compile the source code by hand: './configure && make && make install'.
6) Copy the new kexec into the /sbin directory; it is installed into /usr/local/sbin.
7) Run /sbin/kexec, then you should see the debugging messages.
It should not be so different for RHEL5; I am trying to find a RHEL5 machine to check this. BTW, what does your /proc/device-tree directory contain?
I will post the results of the debugging shortly. Thanks for the instructions. Contents of /proc/device-tree are below. I truncated the list of memory entries.
ls /proc/device-tree
#address-cells aliases chosen clock-frequency compatible cpus device_type event-sources ibm,aix-diagnostics ibm,bsr@3fbfff000000 ibm,converged-loc-codes ibm,drc-indexes ibm,drc-names ibm,drc-power-domains ibm,drc-types ibm,eeh-default ibm,enable-ci64-capable ibm,extended-address ibm,extended-clock-frequency ibm,fault-behavior ibm,fru-9006-deactivate ibm,fw-bytes-per-boot-device ibm,fw-net-compatibility ibm,fw-net-version ibm,ignore-hp-po-fails-for-dlpar ibm,lpar-capable ibm,max-boot-devices ibm,max-vios-function-level ibm,migratable-partition ibm,model-class ibm,partition-name ibm,partition-no ibm,partition-performance-parameters-level ibm,pci-full-cfg ibm,phandle ibm,platform-hardware-notification ibm,plat-res-int-priorities ibm,serial interrupt-controller@0 interrupt-controller@800000025000234 interrupt-controller@800000025000235 interrupt-controller@800000025000250 interrupt-controller@800000025000254 interrupt-controller@800000025000255 lhca@23001500 lhea@23c0010c lhea@23c00114 linux,phandle memory@0 memory@10000000 memory@100000000 memory@1000000000 memory@1010000000 memory@1020000000 memory@ef0000000 memory@f0000000 memory@f00000000 memory@f10000000 memory@f20000000 memory@f30000000 memory@f40000000 memory@f50000000 memory@f60000000 memory@f70000000 memory@f80000000 memory@f90000000 memory@fa0000000 memory@fb0000000 memory@fc0000000 memory@fd0000000 memory@fe0000000 memory@ff0000000 model name openprom options packages pci@800000020000234 pci@800000020000235 pci@800000020000250 pci@800000020000254 pci@800000020000255 rtas #size-cells system-id vdevice
FYI. We had also asked Joe to run the kdump startup script with strace for some additional data. In two different runs (one with strace and one with strace64), kexec was dying with a segmentation fault while trying to parse the system's memory (always at memory@2130000000). Here are some scans of the last dying gasps of kexec:
26098 lstat("/proc/device-tree//memory@2130000000/reg", {...}) = 0
26098 open("/proc/device-tree//memory@2130000000/reg", O_RDONLY) = 74
26098 read(74, "\0\0\0!0\0\0\0\0\0\0\0\20\0\0\0", 16) = 16
26098 lseek(74, 0, SEEK_SET) = 0
26098 read(74, 0xffb2a4683 16) = 16
26098 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
26097 <... waitpid resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0) = 26098
26151 lstat("/proc/device-tree//memory@2130000000/reg", {st_mode=S_IFREG|0444, st_size=16, ...}) = 0
26151 open("/proc/device-tree//memory@2130000000/reg", O_RDONLY) = 74
26151 read(74, "\0\0\0!0\0\0\0\0\0\0\0\20\0\0\0", 16) = 16
26151 lseek(74, 0, SEEK_SET) = 0
26151 read(74, "\0\0\0!0\0\0\0\0\0\0\0\20\0\0\0", 16) = 16
26151 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
26150 <... waitpid resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0) = 26151
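As a reading aid for the strace output above, here is a small sketch that decodes the 16 bytes returned by the read of the reg property. It assumes the property holds two big-endian 64-bit cells, a base address followed by a size, which matches the node name memory@2130000000 but is not stated in the thread. The function name decode_reg is hypothetical.

```shell
# Hypothetical helper: decode a 16-byte "reg" property read from stdin.
# Assumption: two big-endian 64-bit cells (base address, then size).
decode_reg() {
    # Hex-dump the bytes and strip the spaces and newlines od adds.
    hex=$(od -An -v -tx1 | tr -d ' \n')
    # First 16 hex digits are the base, last 16 are the size.
    echo "base=0x${hex%????????????????} size=0x${hex#????????????????}"
}

# The bytes from the strace line, written as octal escapes
# ("!" is \041, "0" is \060, "\20" is \020).
printf '\000\000\000\041\060\000\000\000\000\000\000\000\020\000\000\000' | decode_reg
```

Under that assumption the node describes a 0x10000000-byte (256 MiB) region starting at 0x2130000000, which is about 132.75 GiB into physical memory. On the live system the same function could be fed with decode_reg < /proc/device-tree/memory@2130000000/reg.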
Created attachment 442067 [details] kexec debug 1
Created attachment 442068 [details] kexec debug 2
The attached kexec debug files are just an excerpt from the pages of debug that was produced. Let me know if you need to see more.
Hi, Joe, I can't find any useful info from your attachments, could you please provide the full debug info you got? Also, according to Kevin, it seems kexec segfaulted, it would be very helpful if you can catch the core file of kexec, and show us the backtrace with gdb.
I am not getting a core file. /etc/security/limits.conf has:
* soft core 0
* hard core 10000
How do I get kexec to dump core? And once I get it, what is the gdb command syntax you want me to run?
Adding sbest to see if we can get hardware to reproduce this. Steve, can you grant us access to any ppc64 hardware with 256GB of RAM so we can reproduce this problem?
Joe, to dump core, you need to do the following:
1) Edit /etc/init.d/kdump and place this line at the top of the load_kdump function:
set -x
2) Execute this:
service kdump start
It will fail; that's OK. It will also dump lots of information to the console. In that dump you will see a call to /sbin/kexec with lots of arguments following it; that's what you need.
3) Open a second console, su to root, and enter this command:
ulimit -c unlimited
4) Copy and paste the entire /sbin/kexec command from (2) into the console you opened in (3). You can also remove the set -x that you added in (1) now if you like.
5) Execute the kexec command in the console from step (3). It should segfault and dump core to the working directory.
6) Please attach that core file here, along with the rpm version and release of the kexec utility that you have installed. We'll be able to analyze the core from that point and determine why the segfault occurred.
Thanks!
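Before running the kexec command by hand, it can help to confirm that the shell will actually write a core file. The sketch below only interprets the value printed by ulimit -c; the function name core_limit_status and its output strings are invented for this example.

```shell
# Hypothetical helper: classify a soft core-file limit as printed by
# `ulimit -c`. With no argument it inspects the current shell.
core_limit_status() {
    limit=${1:-$(ulimit -c)}
    case "$limit" in
        0)         echo "disabled" ;;        # no core file will be written
        unlimited) echo "unlimited" ;;       # what step 3 aims for
        *)         echo "limited:$limit" ;;  # cores larger than this are truncated
    esac
}

# Typical use in the second console from step 3:
#   core_limit_status        # "disabled" while the soft limit is 0
#   ulimit -c unlimited
#   core_limit_status        # should now print "unlimited"
```

A soft limit of 0, as in the limits.conf lines quoted earlier, suppresses core files until the limit is raised in the shell that runs kexec.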
The rpm version of kexec is 1.102pre-96.el5_5.2. I will not be able to send the core file because this is a closed site and there is no way to make sure the core file is sanitized. Kevin Rudd sent me instructions for getting a backtrace using gdb, and if the backtrace is clean I will be able to send it. I will download the debug rpms this evening and send that info on Tuesday, as I will be out until then. I can, however, send in the 74 pages of kdump output generated in /var/log/messages, and I will send this tonight.
OK. It might take more than the backtrace to determine the problem, but it's a good start.
(In reply to comment #22)
> adding sbest to see if we can get hardware to reproduce this. Steve, can you
> grant us access to any ppc64 hardware with 256GB of RAM so we can reproduce
> this problem.
Neil, I don't have a system at Red Hat with this amount of memory; I will see if I can get access to one at IBM. -Steve
Created attachment 443218 [details] kdump debug pg 1-9
Created attachment 443220 [details] kdump debug pg 10-19
Created attachment 443221 [details] kdump debug pg 20-29
Created attachment 443283 [details] kdump debug pg 30-39
Created attachment 443284 [details] kdump debug pg 40-49
Created attachment 443288 [details] kdump debug pg 50-59
Created attachment 443289 [details] kdump debug pg 60-69
Created attachment 443291 [details] kdump debug pg 70-73
Can you tell me which one of these pdfs contains a backtrace?
Backtrace is below:
[root@host1 ~]# gdb -c core.427 /sbin/kexec
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu".
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sbin/kexec...done.
Reading symbols from /lib64/power6x/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/power6x/libc.so.6
Reading symbols from /lib64/ld64.so.1...Reading symbols from /usr/lib/debug/lib64/ld-2.5.so.debug...done.
done.
Loaded symbols for /lib64/ld64.so.1
Core was generated by '/sbin/kexec -p --command-line=root=/dev/vg_main/lv_root ro console=hvc0 rhgb qu'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000010009064 in putprops () at kexec/arch/ppc64/fs2dt.c:174
174 *dt++ = rlen;
(gdb) bt
#0 0x0000000010009064 in putprops () at kexec/arch/ppc64/fs2dt.c:174
#1 putnode () at kexec/arch/ppc64/fs2dt.c:344
#2 0x0000000010008b5c in putnode () at kexec/arch/ppc64/fs2dt.c:390
#3 0x00000000100091dc in create_flatten_tree (info=<value optimized out>, bufp=0xfffffb4f078, sizep=0xfffffb4f080, cmdline=0x100a7690 "root=/dev/vg_main/lv_root ro console=hvc0 rhgb quiet irqpoll maxcpus=1 noirqdistrib reset_devices elfcorehdr=39936K savemaxmem=254208M") at kexec/arch/ppc64/fs2dt.c:416
#4 0x0000000010009884 in elf_ppc64_load (argc=5, argv=<value optimized out>, buf=<value optimized out>, len=6399688, info=0xfffffb4f2f0) at kexec/arch/ppc64/kexec-elf-ppc64.c:250
#5 0x0000000010003238 in my_load (argc=5, argv=0xfffffb4f9b8) at kexec/kexec.c:640
#6 main (argc=5, argv=0xfffffb4f9b8) at kexec/kexec.c:909
Think I see the problem. Looks like we statically allocate a buffer to hold the flattened device tree and we're running out of space. I'll attach a patch for you to try
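To get a feel for why more installed memory makes the flattened tree larger, here is a rough sketch that estimates a lower bound on its size from a device-tree directory. It is not how kexec computes anything. It assumes each property costs a 12-byte header plus its value padded to 4 bytes, and it ignores node headers and the string table. The function names and the capacity argument are hypothetical; the thread does not state the size of the static buffer.

```shell
# Rough lower bound on the flattened size of a device-tree directory ($1).
# Assumption: each property costs a 12-byte header (token, length, name
# offset) plus its value padded to a multiple of 4 bytes. Node headers and
# the string table are ignored, so the real tree is larger.
estimate_fdt_bytes() {
    total=0
    # Names in the /proc/device-tree listing contain no whitespace, so
    # word splitting on the find output is acceptable for this sketch.
    for prop in $(find "$1" -type f); do
        size=$(wc -c < "$prop")
        total=$((total + 12 + (size + 3) / 4 * 4))
    done
    echo "$total"
}

# Succeeds when the estimate for directory $1 fits in a buffer of $2 bytes.
fits_in_buffer() {
    [ "$(estimate_fdt_bytes "$1")" -le "$2" ]
}
```

For example, estimate_fdt_bytes /proc/device-tree could be compared on the 112G and 256G configurations. If every memory node covers 256 MiB, as the reg value in the strace output suggests, 256G of memory means roughly a thousand memory@ nodes, each adding its own properties to the tree.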
Created attachment 446071 [details] patch to double size of device tree space Here, it's not a perfect fix, but it will tell us if we are on the right track or not. Please build that patch into a kexec-tools tree and see if the problem is fixed. Thanks!
How do I apply this patch? Do I need to download the kexec-tools source? What is the patch command syntax for patching the tree?
Created attachment 446267 [details] kexec-tools rpms for testing here you go, I built it for you.
I applied the patch to the source tree and the kdump service was able to start. I forced a panic and a vmcore was dumped successfully. I just downloaded the new rpm; I will try that now and let you know.
The new rpm works great with 256G of memory and I was able to dump a vmcore file successfully. Will there be an official errata rpm released for kexec-tools?
There will be once this bug is approved for inclusion, yes.
I tested the rpm on my other servers that are running the 2.6.18-164.11.1 kernel and it works there as well. Can I use the rpm you supplied in the interim, or should I wait for the official patch release? I would be installing that rpm on production systems, so if there are any issues please let me know. Thanks again for all of the help.
That's really up to you. What I gave you is not officially supported, so rolling it out to production servers is done at your own risk.
Understood. Any idea what kind of timeframe we are looking at for the decision to release a patched version of kexec?
No, it really depends on several factors out of my control. I'll be able to give you a better idea when this gets approved. If you need more accurate timeframes, you should open a ticket with the support organization, and it can be more accurately tracked for you.
Fixed. Thanks!
Hello, I'm facing the same problem with RHEL 5.4 on ppc64 using 136GB of memory... Where is the fix please....
Official fix is currently ON_QA. https://bugzilla.redhat.com/show_bug.cgi?id=639303 ~rp
Hello, does the partner have any update about the verification of the official fix? This is an OtherQA bug and I'm waiting for the result for the 5.5.z EUS bz639303. Thanks.
Due to the systems now being in production, I have to schedule down time for any testing. The earliest I can test the new rpm is this Friday, 10/15. I will post the results as soon as I test the new rpm. Thanks again.
I set the sanityonly flag since the ON_QA package passed sanity testing.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0061.html