Description of problem:

A 1-blade prototype system with 128 GB of memory was installed with RHEL 6.5 snapshot 1, and a dump was induced via an NMI from the OA (like an iLO NMI). Even though /etc/kdump.conf specifies a dump level of 31, makedumpfile did a full dump. My /var/crash partition ran out of disk space, leaving an incomplete vmcore, and the dump process halted. When I rebooted and looked at vmcore-incomplete with crash -d8, it showed:

dump_level: 0 (0x0)

/etc/kdump.conf says:

#raw /dev/sda5
#ext4 /dev/sda3
#ext4 LABEL=/boot
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#net my.server.com:/export/tmp
#net user.com
path /var/crash
core_collector makedumpfile -c --message-level 23 -d 31
#core_collector scp
#core_collector cp --sparse=always
#extra_bins /bin/cp
#link_delay 60
#kdump_post /var/crash/scripts/kdump-post.sh
#extra_bins /usr/bin/lftp
#disk_timeout 30
#extra_modules gfs2
#options modulename options
default shell
#debug_mem_level 0
force_rebuild 1
#sshkey /root/.ssh/kdump_id_rsa

The console log from the dump showed it started to exclude pages, but then started copying:

Creating block d sdi:evice sdi sdi1
Creating block device sdj
 sdj: sdj1 sdj2 sdj3 sdj4
Found device with scsi_ids: 3600c0ff000149dfee853785003000000
Creating Remain Block Devices
Creating multipath devices
Saving to the local filesystem UUID=b0ed190f-8fc2-4f35-8215-7470ecf879be
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/mpathcp1: recovering journal
/dev/mapper/mpathcp1: clean, 12/4399104 files, 2EXT4-fs (dm-8): 034716/17578103 mounted filesystem with ordered data mode.
Opts: blocks
Free memory/Total memory (free %): 419404 / 496476 ( 84.4762 )
Saving vmcore-dmesg.txt
Saved vmcore-dmesg.txt
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
qla2xxx [0000:14:00.1]-8038:3: Cable is unplugged...
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [ 99 %]
Excluding unnecessary pages : [100 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [ 0 %]
Excluding unnecessary pages : [100 %]
Copying data : [ 0 %]
Copying data : [ 1 %]
[lager: Log time: Fri Oct 11 12:55:00 MDT 2013]
Copying data : [ 2 %]
Copying data : [ 3 %]
Copying data : [ 4 %]
Copying data : [ 5 %]
Copying data : [ 6 %]
Copying data : [ 7 %]
Copying data : [ 8 %]
Copying data : [ 9 %]
Copying data : [ 10 %]
Copying data : [ 11 %]
Copying data : [ 12 %]
Copying data : [ 13 %]
Copying data : [ 14 %]
Copying data : [ 15 %]
Copying data : [ 16 %]
Copying data : [ 17 %]
Copying data : [ 18 %]
Copying data : [ 19 %]
Copying data : [ 20 %]
Copying data : [ 21 %]
Copying data : [ 22 %]
Copying data : [ 23 %]
Copying data : [ 24 %]
Copying data : [ 25 %]
[lager: Log time: Fri Oct 11 13:00:00 MDT 2013]
Copying data : [ 26 %]
Copying data : [ 27 %]
Copying data : [ 28 %]
Copying data : [ 29 %]
Copying data : [ 30 %]
Copying data : [ 31 %]
Copying data : [ 32 %]
Copying data : [ 33 %]
Copying data : [ 34 %]
Copying data : [ 35 %]
Copying data : [ 36 %]
Copying data : [ 37 %]
Copying data : [ 38 %]
Copying data : [ 39 %]
Copying data : [ 40 %]
Copying data : [ 41 %]
Copying data : [ 42 %]
Copying data : [ 43 %]
Copying data : [ 44 %]
Copying data : [ 45 %]
dropping to initramfs shell
exiting this shell will reboot your system
/sys/block # [lager: Log time: Fri Oct 11 13:05:00 MDT 2013] mount
rootfs on / type rootfs (rw)
/proc on /proc type proc (rw,relatime)
/sys on /sys type sysfs (rw,relatime)
/dev on /dev type tmpfs (rw,relatime,mode=755)
/dev/pts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
/sys/block # exit 0
kvm: exiting hardware virtualization
sd 2:0:1:3: [sdj] Synchronizing SCSI cache
sd 2:0:1:2: [sdi] Synchronizing SCSI cache
sd 2:0:1:1: [sdh] Synchronizing SCSI cache
sd 2:0:1:0: [sdg] Synchronizing SCSI cache
sd 2:0:0:3: [sdf] Synchronizing SCSI cache
sd 2:0:0:2: [sde] Synchronizing SCSI cache
sd 2:0:0:1: [sdd] Synchronizing SCSI cache
sd 2:0:0:0: [sdc] Synchronizing SCSI cache
ixgbe 0000:03:00.1: PME# enabled
ixgbe 0000:03:00.1: PCI INT B disabled
ixgbe 0000:03:00.0: PME# enabled
ixgbe 0000:03:00.0: PCI INT A disabled
ixgbe 0000:01:00.1: PME# enabled
ixgbe 0000:01:00.1: PCI INT B disabled
ixgbe 0000:01:00.0: PME# enabled
ixgbe 0000:01:00.0: PCI INT A disabled
Restarting system.
machine restart

Version-Release number of selected component (if applicable):
RHEL 6.5 snapshot 1

How reproducible:
Not sure.

Steps to Reproduce:
1. Install the system with RHEL 6.5 s1 (in this case our 1-blade prototype with 128 GB of memory).
2. In /etc/kdump.conf, specify: core_collector makedumpfile -c --message-level 23 -d 31
3. Induce a crash.

Actual results:
A full dump is taken, apparently with no compression either.

Expected results:
Dump level 31 filtering should occur, producing a much smaller dump.

Additional info:
We had another system with RHEL 6.5s1, an 8-blade system with 2 TB of memory and many modifications to the RHEL 6.5a1 makedumpfile source, on which the dump did succeed. I collected a dump_level=31 dump (confirmed from the crash help -D output), and the dump size was only 7.1 GB. So this full dump is not happening all the time, but I don't know what triggers this behavior on the smaller prototype.
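For readers comparing the configured "-d 31" with the observed "dump_level: 0", the -d value is a bitmask of page types to exclude. Below is a minimal Python sketch that decodes it, assuming the bit meanings listed in the makedumpfile man page. The names DUMP_LEVEL_BITS and excluded_page_types are made up for this illustration and are not part of makedumpfile.

```python
# Bit meanings for makedumpfile's -d option, as listed in its man page.
# The dictionary and function names are hypothetical, for illustration only.
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "non-private cache pages",
    4: "private cache pages",
    8: "user data pages",
    16: "free pages",
}


def excluded_page_types(dump_level):
    """Return the page types that the given dump level asks to exclude."""
    if not 0 <= dump_level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS.items() if dump_level & bit]


if __name__ == "__main__":
    for level in (31, 0):
        types = excluded_page_types(level)
        print(level, types if types else "nothing excluded (full dump)")
```

Level 31 sets all five bits, so every excludable page type is dropped. Level 0 sets none, which matches the full dump seen here.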
More details on the full dump issue on the 128 GB system: my crashkernel size was 512M, specified on the kernel command line as crashkernel=512M, NOT crashkernel=auto. I also made sure the dump initrd was rebuilt from the kdump.conf file specifying dump level 31, using service kdump restart.
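One way to double-check which dump level the configuration actually requests is to read it back from the active core_collector line. The following is a rough Python sketch, not a kdump tool: configured_dump_level is a hypothetical helper, and it only understands the "-d N" and "-dN" spellings, not flags combined into one word.

```python
import shlex


def configured_dump_level(conf_text):
    """Return the -d value from the last active core_collector line, or None.

    Hypothetical helper for illustration; commented-out lines are ignored.
    """
    level = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = shlex.split(line)
        if words[0] != "core_collector":
            continue
        args = words[1:]
        level = None  # a later core_collector line replaces an earlier one
        for i, arg in enumerate(args):
            if arg == "-d" and i + 1 < len(args):
                level = int(args[i + 1])
            elif arg.startswith("-d") and arg[2:].isdigit():
                level = int(arg[2:])
    return level


if __name__ == "__main__":
    sample = (
        "#core_collector scp\n"
        "path /var/crash\n"
        "core_collector makedumpfile -c --message-level 23 -d 31\n"
        "default shell\n"
    )
    print(configured_dump_level(sample))
```

In practice the text would come from reading /etc/kdump.conf. The sample string above reuses lines from the configuration quoted in this report.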
Another correction: my good dump on the very large 8-blade system was on RHEL 6.5 beta 1, the -419 kernel. I have yet to get a dump level 31 dump on RHEL 6.5 snapshot 1.
I just reproduced this on a DL980 with 64 GB of memory, on RHEL 6.5 snapshot 1. The dump proceeded very slowly in the copy phase, and I reset the system before the dump was 30% done. When I rebooted, the incomplete vmcore was 25 GB. The dump I took showed dump_level: 0 (0x0), and this was after a fresh RHEL 6.5 snapshot 1 install, setting crashkernel=128M, and rebooting, with core_collector makedumpfile -c --message-level 1 -d 31 in /etc/kdump.conf. So this looks like a generic problem for RHEL 6.5 snapshot 1, on shipping ProLiant platforms as well.
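The check described above (looking for the dump_level line in the crash -d8 output) can be automated on saved output. This is a small Python sketch that assumes the line has the form quoted in this report, "dump_level: 0 (0x0)". Other crash versions may print it differently, and dump_level_from_crash_output is a hypothetical name.

```python
import re

# Pattern based on the line quoted in this report: "dump_level: 0 (0x0)".
DUMP_LEVEL_RE = re.compile(r"dump_level:\s*(\d+)\s*\(0x[0-9a-fA-F]+\)")


def dump_level_from_crash_output(text):
    """Return the dump level found in saved crash debug output, or None."""
    match = DUMP_LEVEL_RE.search(text)
    if match is None:
        return None
    return int(match.group(1))


if __name__ == "__main__":
    level = dump_level_from_crash_output("dump_level: 0 (0x0)")
    if level == 0:
        print("vmcore was written without page filtering")
    else:
        print("dump level:", level)
```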
Hi Lisa,

It's a known problem, and it has been fixed. We have delivered a new release, kexec-tools-2.0.0-270.el6; please help test it. If you can't get the latest release, I have put a copy at the link below:

http://people.redhat.com/~bhe/.2112e0f1a4a02ad7917802fcf2d43426/

Thanks
Baoquan

*** This bug has been marked as a duplicate of bug 1015764 ***
Thanks, can HP have access to 1015764?
I just tested kexec-tools-2.0.0-270.el6 on the DL980 that failed previously, just doing an rpm -Uvh on top of the snapshot 1 version, and it worked great! Thanks! I got a dump_level=31 dump, fully compressed, of only 274 MB for a 64 GB system, whereas before I had stopped the snapshot 1 dump with an incomplete vmcore of 25 GB. So it looks like you got it fixed. I would still like access to 1015764, though: we are patching other fixes into makedumpfile to address another problem, and we would like to see what was fixed and what caused this regression.
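To tell whether a given machine already has the fixed package, the installed package name can be compared against kexec-tools-2.0.0-270.el6. This is a hedged Python sketch: it assumes that, within the 2.0.0 el6 series, a release number of 270 or higher carries the fix, which this thread does not state explicitly. FIXED_RELEASE and has_dump_level_fix are illustrative names.

```python
import re

# Release named in this report as containing the fix.
FIXED_VERSION = "2.0.0"
FIXED_RELEASE = 270

# Matches names such as "kexec-tools-2.0.0-270.el6" (optional arch suffix allowed).
NVR_RE = re.compile(r"kexec-tools-([\d.]+)-(\d+)(?:\.\S+)?")


def has_dump_level_fix(package_name):
    """Guess whether a kexec-tools package includes the fix (assumption-based)."""
    match = NVR_RE.fullmatch(package_name)
    if match is None:
        raise ValueError("not a kexec-tools package name: %r" % package_name)
    version, release = match.group(1), int(match.group(2))
    # Assumption: only compare releases within the same upstream version.
    return version == FIXED_VERSION and release >= FIXED_RELEASE


if __name__ == "__main__":
    print(has_dump_level_fix("kexec-tools-2.0.0-270.el6"))
```

The package name to test would typically come from the output of rpm -q kexec-tools.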
Hi Lisa,

I am asking the reporter of that bug whether HP can access it. Meanwhile, I have put the patch at the link below, where you can see the cause and the fix:

http://post-office.corp.redhat.com/archives/kexec-kdump-list/2013-October/msg00019.html

Thanks
Baoquan
I can't access the above link. I get a DNS server error, as if we can't reach post-office.corp.redhat.com. I'll see if Nigel can access it for me.
Nigel got me the info, so I don't need access any more.