Bug 1018397 - [HP BCS 6.5 Bug] Full dump is done even if dump level 31 is specified
Summary: [HP BCS 6.5 Bug] Full dump is done even if dump level 31 is specified
Keywords:
Status: CLOSED DUPLICATE of bug 1015764
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kexec-tools
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Baoquan He
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-10-11 21:19 UTC by Lisa Mitchell
Modified: 2013-10-14 15:01 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-12 03:02:29 UTC
Target Upstream Version:



Description Lisa Mitchell 2013-10-11 21:19:43 UTC
Description of problem: A 1-blade prototype system with 128 GB of memory was installed with RHEL 6.5 snapshot 1, and a dump was induced via an NMI from the OA (like an iLO NMI). Even though /etc/kdump.conf specifies a dump level of 31 (-d 31), makedumpfile did a full dump. My /var/crash partition ran out of disk space, leaving an incomplete vmcore, and the dump process halted.

When I rebooted and looked at the vmcore-incomplete file with crash -d8, it showed dump_level: 0 (0x0).
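
For readers unfamiliar with the field: the dump level is a bitmask, so a header value of 0 means no page type was filtered out. The sketch below (Python, not part of the original report) pulls the level out of a line shaped like the one quoted above and lists which page types it would exclude. The function names are hypothetical, and the bit labels are paraphrased from the makedumpfile man page rather than taken from this bug.

```python
import re

# Bit labels paraphrased from the makedumpfile man page (assumption: the
# man page also notes that level 4 excludes non-private cache pages too).
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "cache pages without private data",
    4: "cache pages with private data",
    8: "user data pages",
    16: "free pages",
}


def parse_dump_level(header_text):
    """Return the integer from a 'dump_level: N (0xN)' line, or None."""
    match = re.search(r"dump_level:\s*(\d+)\s*\(0x[0-9a-fA-F]+\)", header_text)
    if match is None:
        return None
    return int(match.group(1))


def excluded_page_types(level):
    """List the page types whose bit is set in the given dump level."""
    return [name for bit, name in DUMP_LEVEL_BITS.items() if level & bit]


if __name__ == "__main__":
    level = parse_dump_level("dump_level: 0 (0x0)")
    print(level, excluded_page_types(level))
    print(31, excluded_page_types(31))
```

With the value reported here, the list of excluded page types is empty, which matches the full dump that was observed; level 31 sets all five bits.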

/etc/kdump.conf says:

#raw /dev/sda5
#ext4 /dev/sda3
#ext4 LABEL=/boot
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#net my.server.com:/export/tmp
#net user.com
path /var/crash
core_collector makedumpfile -c --message-level 23 -d 31
#core_collector scp
#core_collector cp --sparse=always
#extra_bins /bin/cp
#link_delay 60
#kdump_post /var/crash/scripts/kdump-post.sh
#extra_bins /usr/bin/lftp
#disk_timeout 30
#extra_modules gfs2
#options modulename options
default shell
#debug_mem_level 0
force_rebuild 1
#sshkey /root/.ssh/kdump_id_rsa
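
As a cross-check on the configuration side, here is a minimal sketch (Python, added for illustration) that reads kdump.conf text and reports which dump level and compression flag the active core_collector line requests. The function name is hypothetical. It only handles a separate "-d N" pair with a single integer and a standalone "-c", which is the form used in the file above; it says nothing about what makedumpfile actually did at dump time.

```python
import shlex


def core_collector_options(conf_text):
    """Return (dump_level, compressed) for the first active makedumpfile
    core_collector line, or None if there is no such line.

    dump_level is None when no "-d N" pair is present.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out directives
        tokens = shlex.split(line)
        if len(tokens) < 2 or tokens[0] != "core_collector":
            continue
        if not tokens[1].endswith("makedumpfile"):
            continue  # some other collector, e.g. scp or cp
        args = tokens[2:]
        dump_level = None
        for i, arg in enumerate(args[:-1]):
            if arg == "-d":
                dump_level = int(args[i + 1])
        compressed = "-c" in args
        return dump_level, compressed
    return None


if __name__ == "__main__":
    with open("/etc/kdump.conf") as conf_file:
        print(core_collector_options(conf_file.read()))
```

For the file shown above, the requested level is 31 with compression enabled, so the mismatch with the dump_level: 0 seen in the vmcore header is not explained by the configuration itself.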


The console log from the dump showed it started to exclude pages, but then started copying:
Creating block device sdi
 sdi: sdi1
Creating block device sdj
 sdj: sdj1 sdj2 sdj3 sdj4
Found device with scsi_ids: 3600c0ff000149dfee853785003000000
Creating Remain Block Devices
Creating multipath devices
Saving to the local filesystem UUID=b0ed190f-8fc2-4f35-8215-7470ecf879be
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/mpathcp1: recovering journal
/dev/mapper/mpathcp1: clean, 12/4399104 files, 2034716/17578103 blocks
EXT4-fs (dm-8): mounted filesystem with ordered data mode. Opts: 

Free memory/Total memory (free %): 419404 / 496476 ( 84.4762 )
Saving vmcore-dmesg.txt
Saved vmcore-dmesg.txt

Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] qla2xxx [0000:14:00.1]-8038:3: Cable is unplugged...

Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [ 99 %] 
Excluding unnecessary pages        : [100 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [  0 %] 
Excluding unnecessary pages        : [100 %] 
Copying data                       : [  0 %] 
Copying data                       : [  1 %] [lager: Log time: Fri Oct 11 12:55:00 MDT 2013]

Copying data                       : [  2 %] 
Copying data                       : [  3 %] 
Copying data                       : [  4 %] 
Copying data                       : [  5 %] 
Copying data                       : [  6 %] 
Copying data                       : [  7 %] 
Copying data                       : [  8 %] 
Copying data                       : [  9 %] 
Copying data                       : [ 10 %] 
Copying data                       : [ 11 %] 
Copying data                       : [ 12 %] 
Copying data                       : [ 13 %] 
Copying data                       : [ 14 %] 
Copying data                       : [ 15 %] 
Copying data                       : [ 16 %] 
Copying data                       : [ 17 %] 
Copying data                       : [ 18 %] 
Copying data                       : [ 19 %] 
Copying data                       : [ 20 %] 
Copying data                       : [ 21 %] 
Copying data                       : [ 22 %] 
Copying data                       : [ 23 %] 
Copying data                       : [ 24 %] 
Copying data                       : [ 25 %] [lager: Log time: Fri Oct 11 13:00:00 MDT 2013]

Copying data                       : [ 26 %] 
Copying data                       : [ 27 %] 
Copying data                       : [ 28 %] 
Copying data                       : [ 29 %] 
Copying data                       : [ 30 %] 
Copying data                       : [ 31 %] 
Copying data                       : [ 32 %] 
Copying data                       : [ 33 %] 
Copying data                       : [ 34 %] 
Copying data                       : [ 35 %] 
Copying data                       : [ 36 %] 
Copying data                       : [ 37 %] 
Copying data                       : [ 38 %] 
Copying data                       : [ 39 %] 
Copying data                       : [ 40 %] 
Copying data                       : [ 41 %] 
Copying data                       : [ 42 %] 
Copying data                       : [ 43 %] 
Copying data                       : [ 44 %] 
Copying data                       : [ 45 %] dropping to initramfs shell
exiting this shell will reboot your system
/sys/block # [lager: Log time: Fri Oct 11 13:05:00 MDT 2013]
mount
rootfs on / type rootfs (rw)
/proc on /proc type proc (rw,relatime)
/sys on /sys type sysfs (rw,relatime)
/dev on /dev type tmpfs (rw,relatime,mode=755)
/dev/pts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
/sys/block # exit 0
kvm: exiting hardware virtualization
sd 2:0:1:3: [sdj] Synchronizing SCSI cache
sd 2:0:1:2: [sdi] Synchronizing SCSI cache
sd 2:0:1:1: [sdh] Synchronizing SCSI cache
sd 2:0:1:0: [sdg] Synchronizing SCSI cache
sd 2:0:0:3: [sdf] Synchronizing SCSI cache
sd 2:0:0:2: [sde] Synchronizing SCSI cache
sd 2:0:0:1: [sdd] Synchronizing SCSI cache
sd 2:0:0:0: [sdc] Synchronizing SCSI cache
ixgbe 0000:03:00.1: PME# enabled
ixgbe 0000:03:00.1: PCI INT B disabled
ixgbe 0000:03:00.0: PME# enabled
ixgbe 0000:03:00.0: PCI INT A disabled
ixgbe 0000:01:00.1: PME# enabled
ixgbe 0000:01:00.1: PCI INT B disabled
ixgbe 0000:01:00.0: PME# enabled
ixgbe 0000:01:00.0: PCI INT A disabled
Restarting system.
machine restart
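
A long console capture like the one above is easier to summarize mechanically. The following is a small sketch (Python, added for illustration; the function name is hypothetical) that scans a log for the two makedumpfile progress messages seen here and keeps the last percentage reported for each phase. It assumes the message format matches the lines in this log exactly.

```python
import re

# Matches lines such as "Copying data                       : [ 45 %]".
PROGRESS_RE = re.compile(
    r"(Excluding unnecessary pages|Copying data)\s*: \[\s*(\d+) %\]"
)


def last_progress(console_text):
    """Map each phase seen in the log to the last percentage it reported."""
    progress = {}
    for match in PROGRESS_RE.finditer(console_text):
        progress[match.group(1)] = int(match.group(2))
    return progress


if __name__ == "__main__":
    sample = (
        "Excluding unnecessary pages        : [100 %] \n"
        "Copying data                       : [  0 %] \n"
        "Copying data                       : [ 45 %] dropping to initramfs shell\n"
    )
    print(last_progress(sample))
```

Applied to this log, the copy phase ends at 45 percent, at the point where kdump drops to the initramfs shell.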




Version-Release number of selected component (if applicable): RHEL 6.5 snapshot 1


How reproducible:  Not sure.
 

Steps to Reproduce:
1. Install a system with RHEL 6.5 snapshot 1 (in this case our 1-blade prototype with 128 GB of memory).

2. In /etc/kdump.conf, specify: core_collector makedumpfile -c --message-level 23 -d 31

3. Induce a crash.

Actual results: A full dump is taken, apparently without compression either.


Expected results: Dump level 31 filtering should be applied, producing a much smaller dump.


Additional info:


We had another system with RHEL 6.5 s1, an 8-blade system with 2 TB of memory and many modifications to the RHEL 6.5a1 makedumpfile source, on which the dump did succeed. I collected a dump_level=31 dump (confirmed from the crash help -D output), and the dump size was only 7.1 GB. So this full dump does not happen every time, but I don't know what triggers the behavior on this smaller prototype.

Comment 2 Lisa Mitchell 2013-10-11 21:35:31 UTC
More details on the full dump issue on the 128 GB system: my crashkernel size was 512M, specified on the kernel command line as crashkernel=512M, NOT crashkernel=auto.

And I made sure the dump initrd was rebuilt from the kdump.conf file specifying dump level 31, using service kdump restart.
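
To confirm which crashkernel setting a booted kernel actually received, one option is to read it back from /proc/cmdline. The sketch below (Python, added for illustration; the function name is hypothetical) only converts the simple size form used in this report, such as 512M or 128M. Other forms, such as auto, are returned unparsed.

```python
import re

UNIT_BYTES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_crashkernel(cmdline):
    """Return (raw_value, size_in_bytes) for the crashkernel= parameter.

    Returns None if the parameter is absent. size_in_bytes is None when the
    value is not a plain size such as 512M (for example "auto").
    """
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    if match is None:
        return None
    raw_value = match.group(1)
    simple = re.fullmatch(r"(\d+)([KMG])", raw_value)
    if simple is None:
        return raw_value, None
    return raw_value, int(simple.group(1)) * UNIT_BYTES[simple.group(2)]


if __name__ == "__main__":
    with open("/proc/cmdline") as cmdline_file:
        print(parse_crashkernel(cmdline_file.read()))
```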

Comment 3 Lisa Mitchell 2013-10-11 21:45:09 UTC
Another correction: my good dump on the very large 8-blade system was on RHEL 6.5 beta 1, the -419 kernel. I have yet to get a dump_level=31 dump on RHEL 6.5 snapshot 1.

Comment 4 Lisa Mitchell 2013-10-11 23:00:57 UTC
I just reproduced this on a DL980 on RHEL 6.5 snapshot 1,  with 64 GB of memory.

The dump proceeded very slowly in the copy phase, and I reset the system before the dump was 30% done. When I rebooted, the incomplete vmcore was 25 GB.

The dump I took showed dump_level: 0 (0x0), and this was after a fresh RHEL 6.5 snapshot 1 install, setting crashkernel=128M, and rebooting.

  core_collector makedumpfile -c --message-level 1 -d 31

was in /etc/kdump.conf.

So this looks like a generic problem for RHEL 6.5 snapshot 1, on shipping ProLiant platforms as well.

Comment 5 Baoquan He 2013-10-12 03:02:29 UTC
Hi Lisa,

It's a known problem, and it has been fixed.

We have delivered a new release, kexec-tools-2.0.0-270.el6; please help test it.
If you can't get the latest release, I have put a copy at the link below.

http://people.redhat.com/~bhe/.2112e0f1a4a02ad7917802fcf2d43426/

Baoquan
Thanks

*** This bug has been marked as a duplicate of bug 1015764 ***

Comment 6 Lisa Mitchell 2013-10-12 17:00:51 UTC
Thanks. Can HP have access to bug 1015764?

Comment 7 Lisa Mitchell 2013-10-13 18:13:57 UTC
I just tested kexec-tools-2.0.0-270.el6  on the DL980 that failed previously, just doing an rpm -Uvh on top of the snapshot 1 version,  and it worked great!

Thanks! I got a dump_level=31 dump, compressed, of only 274 MB for a 64 GB system, where before I had stopped the snapshot 1 dump with an incomplete vmcore of 25 GB.
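
To put those sizes in proportion, here is a short worked calculation (Python, added for illustration). It assumes binary units, that is, 1 GB = 1024 MB, which the report does not state; the function names are hypothetical.

```python
MIB_PER_GIB = 1024  # assumption: sizes in the report are binary units


def dump_fraction_of_ram(dump_mib, ram_gib):
    """Fraction of system memory that ended up in the dump file."""
    return dump_mib / (ram_gib * MIB_PER_GIB)


def size_ratio(large_gib, small_mib):
    """How many times larger the first file is than the second."""
    return (large_gib * MIB_PER_GIB) / small_mib


if __name__ == "__main__":
    # Filtered, compressed dump: 274 MB on a 64 GB system.
    print(f"{dump_fraction_of_ram(274, 64):.2%} of RAM")
    # Incomplete unfiltered vmcore (25 GB) versus the filtered dump (274 MB).
    print(f"{size_ratio(25, 274):.0f}x larger")
```

Under that assumption the filtered dump is roughly 0.4 percent of RAM, and the incomplete 25 GB vmcore was already more than 90 times larger when it was stopped.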

So it looks like you got it fixed.

I would still like access to 1015764, though, as we are patching other fixes into makedumpfile to fix another problem, and we would like to see what was fixed and what caused this regression.

Comment 8 Baoquan He 2013-10-14 07:19:43 UTC
Hi Lisa,

I am asking the reporter of that bug whether HP can access it. Meanwhile, I have put the patch at the link below, where you can see the cause and the fix.

http://post-office.corp.redhat.com/archives/kexec-kdump-list/2013-October/msg00019.html

Baoquan
Thanks

Comment 9 Lisa Mitchell 2013-10-14 14:23:11 UTC
I can't access the above link. I get a DNS server error, as we can't reach post-office.corp.redhat.com. I'll see if Nigel can access it for me.

Comment 10 Lisa Mitchell 2013-10-14 15:01:15 UTC
Nigel got me the info, so I don't need access any more.

