Bug 1346327 - EET: RHEL7.2Z 24TB RAM 768CPU HP Integrity Superdome X
Summary: EET: RHEL7.2Z 24TB RAM 768CPU HP Integrity Superdome X
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Extended Engineering Testing
Classification: Red Hat
Component: Limits-Testing
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: PaulB
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1356807
 
Reported: 2016-06-14 14:39 UTC by PaulB
Modified: 2016-10-20 13:01 UTC (History)
25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1356807 (view as bug list)
Environment:
Last Closed: 2016-07-20 20:29:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description PaulB 2016-06-14 14:39:40 UTC
System Under Test "SUT" Hardware Description:
1. Brief description of hardware
A) SUT Info:
   system name=
   vendor=
   model=
B) CPU Info:
   family=
   model=
   model name=
   total cpu count=
C) Memory Info:
   type=
   dimm part#'s=
   memory amount=

2. Link to the Hardware Certifications for existing system
A) Base Certification
B) Supplemental Certification

3. List known issues
A) Existing BZ's
B) Existing Hardware Errata
C) Existing KBase articles

4. Memory specifications
Please provide a brief description for the following:
A) What is the expected bandwidth of the memory subsystem system wide?
   (If we run many instances of memory intensive applications where
   each application does not cross NUMA boundaries, how much
   aggregate bandwidth might we expect on the server?)
B) Does the memory subsystem support NORMAL -vs- PERFORMANCE
   mode at the management/BIOS layer? If so what is it set to?
C) How many memory channels per socket for specific CPU?
D) How many channels per socket are actually populated on the SUT?

Comment 1 PaulB 2016-06-14 14:47:53 UTC
All,
This BZ was opened following the failure of:
 Bug 1311226 - EET: RHEL7.2 24TB RAM 768CPU HP Integrity Superdome X - BL920s Gen9 System 
 https://bugzilla.redhat.com/show_bug.cgi?id=1311226


There was an issue found during the performance stage of testing in BZ1311226:
 https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c37

We have opened this BZ to rerun the EET testing with kernel-3.10.0-327.18.2.el7.

Best,
-pbunyan

Comment 2 PaulB 2016-06-14 14:48:46 UTC
Nigel Croxon 2016-02-23 11:08:55 EST
https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c0

System Under Test "SUT" Hardware Description:
1. Brief description of hardware
A) HP Integrity Superdome X - BL920s Gen9 System

B) Broadwell EX -E7-8890v4 2.60GHz, CPU count 24

C) DDR4-2133 LRDimm, 24TB or 12TB 


2. Link to the Hardware Certification for existing system:
A) Base Certification
See comment below for Base Certification

B) Supplemental Certification


3. List of known issues:
A) Existing BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1293436

B) Existing Hardware Errata

C) Existing KBase article - https://access.redhat.com/articles/1979103


4. Memory specifications:
A) What is the expected bandwidth of the memory subsystem system wide?
(If we run many instances of memory intensive applications where
each application does not cross NUMA boundaries, how much
aggregate bandwidth might we expect on the server?)

~762 GB/s at 16 sockets or ~48 GB/s per socket of memory bandwidth (read only) with RAS features enabled.
~1200GB/s at 16 sockets or ~75GB/s per socket of memory bandwidth (read only) with RAS features disabled.

B) Does the memory subsystem support NORMAL -vs- PERFORMANCE
mode at the management/BIOS layer? Yes 
If so what is it set to?
Default is DDDC mode = performance mode

C) How many memory channels per socket for specific CPU?
The Integrity Superdome X contains 8 BL920s Gen9 blades
  Each of the 8 blades has 2 CPU sockets.
  Each CPU socket has 2 memory channels each connecting to 2 memory controllers that contain 6 Dimms each.
  Each CPU socket has 24 Dimms
  Each blade has 48 Dimms
  Total system Dimm capacity is 384 Dimms
  384 x 32GB DDR4-2133 LRDimm = 12288GB (12TB) of system memory installed

D) How many channels per socket are actually populated on the test
system?
Each of the 16 CPU sockets has all memory slots populated - 24 x 32GB DDR4-2133 LRDimms = 768GB per CPU socket
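The topology arithmetic above can be cross-checked with simple shell arithmetic (figures taken from the description; note that 32GB parts in all 384 slots work out to 12TB, so the 24TB configuration named in the bug title would imply denser DIMMs in the same slots):

```shell
# Sanity-check of the DIMM arithmetic in the description above
# (an editor's cross-check, not part of the original report).
blades=8
sockets_per_blade=2
dimms_per_socket=24
dimm_gb=32

sockets=$((blades * sockets_per_blade))        # 16 CPU sockets
dimms=$((sockets * dimms_per_socket))          # 384 DIMMs system-wide
per_socket_gb=$((dimms_per_socket * dimm_gb))  # 768 GB per socket
total_gb=$((dimms * dimm_gb))                  # 12288 GB, i.e. 12 TB

echo "$sockets sockets, $dimms DIMMs, ${per_socket_gb}GB/socket, ${total_gb}GB total"
```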

-End

Comment 3 PaulB 2016-06-14 14:51:26 UTC
All,
The following Extended Engineering Testing (EET) is in progress:
 EET: RHEL7.2Z HP Integrity Superdome X 

This EET testing requires a Zstream kernel:
 3.10.0-327.18.2.el7.x86_64

======================================
TARGET HOST DETAILS:
======================================
Hostname = hawk604a.local
Arch = x86_64
Distro = RHEL-7.2Z
Kernel = 3.10.0-327.18.2.el7.x86_64
CPU count =  768
CPU model name = Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz
BIOS Information = 
 Vendor: HP
 Version: Bundle: 008.002.042 SFW: 041.119.000
 Release Date: 04/30/2016

MemTotal = 25364774656 kB

There are three stages of EET testing:
[] Fundamentals (PBunyan pbunyan)
[] Performance  (BMarson bmarson)
[] Lload        (LWoodman lwoodman)


Best,
-pbunyan

Comment 5 PaulB 2016-06-14 15:00:35 UTC
All,
Current testing status...

RHEL7.2Z 24TB RAM 768CPU 
kernel-3.10.0-327.18.2.el7.x86_64

======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - scheduled
EET x86_64 Xen - N/A
EET x86_64 KVM -       scheduled
EET x86_64 Kdump -     scheduled


======================================
PERFORMANCE: BMarson
======================================
x86_64 Linpack - pending review...
x86_64 Stream  - pending review...

Barry - please provide a comment with your testing results


======================================
LLOAD: LWoodman
======================================
x86_64 Lload - scheduled


Best,
-pbunyan

Comment 6 Barry Marson 2016-06-14 20:31:26 UTC
I just finished looking over the linpack and streams runs.  With our C-based versions of these tests, compiled with gcc, we demonstrated:

Linpack single precision (making use of on chip cache more)
---------------------------------------------------------
Performance peaked at 800 Gflops when 288 instances (18 per NUMA node) were run in parallel.

Linpack double precision (accesses more of main memory)
-------------------------------------------------------
Performance peaked at 335 Gflops when 144 instances (9 per NUMA node) were run in parallel.


Streams (main memory exerciser)
-------------------------------
Typically performance peaks with our affinity testing (using taskset), but in this testbed, NUMA pinning performed even better.  We migrated to the errata kernel with some scheduler fixes/enhancements because the GA kernel was showing too large a standard deviation between individual iterations as we increased the workload.  This kernel is behaving far better.

Performance peaked at 489 GB/sec when 224 instances (14 per NUMA node) were run in parallel.  Based on the memory bandwidth information documented above, this seems a little low.

I tested without hyperthreads and there was little difference.  I did this to make sure we weren't accidentally thinking an LCPU was actually a core's other hyper-thread.

The tests were run with the tuned profile latency-performance, which essentially forces all cores to stay at cstate C1.  This limits CPU frequency (and in the past has prevented turbo mode from running) so my scaling data (1 core per socket vs many) is more predictable.
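The taskset-vs-NUMA-pinning distinction above can be sketched as follows. This is a minimal illustration with a trivial payload, not the actual EET harness; the real runs used the stream benchmark, and numactl is assumed to be installed:

```shell
# CPU-affinity-only pinning (taskset): the payload runs on CPU 0, but its
# memory may still be allocated on a remote NUMA node.
taskset -c 0 grep Cpus_allowed_list /proc/self/status

# NUMA pinning (numactl): binds both the CPUs *and* the memory of the
# instance to node 0, so no allocation crosses a NUMA boundary -- the
# mode that performed best on this testbed.
command -v numactl >/dev/null &&
    numactl --cpunodebind=0 --membind=0 grep Cpus_allowed_list /proc/self/status
```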

I've asked Nigel for more clarification of the memory bandwidth numbers and configuration.


Barry

Comment 7 Nigel Croxon 2016-06-15 17:23:41 UTC
To connect to the Partner lab:

Connect x2goclient to host address 10.16.46.165 with a "Session type = gnome"
Available user names are pbunyan, bmarson, lwoodman, passwd: 100yard-

Start the Firefox browser and connect to: 
https://bpe1-ssl.houston.hpe.com/dana-na/auth/url_default/welcome.cgi
Do Not fill in a "User ID" or a "Passcode".
On the "Token" pull-down, Select "BPIA Certificate".
Click on "Sign In".
A "User Identification Request" window will appear, click on "OK".
The browser window will show "Network Connect" Line with a "Start" button.
Click on "Start"
A window appears, asking are you sure you want to run this application (Network Connect Launcher).  Click on "Yes".
A new window should appear in the top left.  Showing the connection (Assigned IP address).
At this point, once this window appears, you have VPN into the partner lab.

Open a "Terminal" window and ssh into the jump station.
Available user names are pbunyan, bmarson, lwoodman with personal passwords.
for example,  "ssh bmarson@jump1"   The jump station IP address is 15.252.158.21.

Once on the Jump station.  You can ssh to the Onboard Administrator (OA).
ssh Administrator.14.1    Password:  Acme
At the "Hawk604-oa1>" prompt, one can type "co 1" to connect to the console serial line.
Ctrl-B  is to exit.

or on the jump station, ssh to the RHEL OS running.
ssh root.14.30   Password: 100yard-

Comment 8 Barry Marson 2016-06-16 14:39:03 UTC
Just an update... not sure where my mind was when I wrote this.  In comment #6 I wrote:

> I tested without hyperthreads and there was little difference.  I did
> this to make sure we weren't accidentally thinking an LCPU was actually
> a core's other hyper-thread.

This is totally incorrect.  Removing HT at the BIOS level made the tests run properly in the NUMA pinning mode with the GA kernel.  So the errata kernel did improve the behavior with the hyper-threads present.

Barry

Comment 9 PaulB 2016-06-16 18:53:34 UTC
Barry,
Have you determined if the performance testing is a pass -or- a fail with 
kernel-3.10.0-327.18.2.el7.x86_64?

Best,
-pbunyan

Comment 10 Barry Marson 2016-06-17 17:59:29 UTC
Paul, 

I'm still waiting for information about memory bandwidth and configuration for this specific system.

Barry

Comment 11 PaulB 2016-06-17 18:22:19 UTC
(In reply to Barry Marson from comment #10)
> Paul, 
> 
> Im still waiting for information about memory bandwidth and configuration
> for this specific system.
> 
> Barry

Nigel, 
Please provide BarryM with the required information.

Best,
-pbunyan

Comment 12 Tom Vaden 2016-06-17 21:04:38 UTC
(In reply to PaulB from comment #11)
> (In reply to Barry Marson from comment #10)
> > Paul, 
> > 
> > Im still waiting for information about memory bandwidth and configuration
> > for this specific system.
> > 
> > Barry
> 
> Nigel, 
> Please provide BarryM with the required information.
> 
> Best,
> -pbunyan

We're working on speeds. Hope to have something soon.

However, to be totally in sync with terminology for the description: for SDx there is RAS mode and Perf mode.

The default is RAS (not Perf) mode. RAS mode, for the Broadwell-EX-based Superdome X, is:
"enhanced DDDC+1 with DRAM bank sparing and DDR4 command/address parity error retry"
(bank sparing and parity error retry are adders to the previous SDx feature set).

Comment 13 Barry Marson 2016-06-20 13:55:12 UTC
Tom,

Is this test system in default mode or was it switched to perf mode ?  Who can answer that ?

Thanks
Barry

Comment 14 Nigel Croxon 2016-06-20 14:22:42 UTC
Following on to Comment #7
https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c7

At the "Hawk604-oa1>" prompt, one can type these commands

"co 1" to connect to the console serial line
"show livelogs" to show the current running BIOS/Firmware information messages.
"^b" a Ctrl-B  is to exit.

"poweron partition 1" to power on the partition
"poweron partition 1 force" to force a stuck system to power on the partition

"poweroff partition 1" to power off the partition
"poweroff partition 1 force" to force a stuck system to power off the partition

Comment 15 PaulB 2016-06-21 15:52:47 UTC
(In reply to PaulB from comment #3)
> All,
> The following Extended Engineering Testing (EET) is in progress:
>  EET: RHEL7.2Z HP Integrity Superdome X 
> 
> This EET testing equires a Zstream kernel:
>  3.10.0-327.18.2.el7.x86_64
> 
> ======================================
> TARGET HOST DETAILS:
> ======================================
> Hostname = hawk604a.local
> Arch = x86_64
> Distro = RHEL-7.2Z

*****
Apologies - small correction needed here.
The distro did _not_ change, only the kernel was 
changed to a Zstream kernel for this EET test run.
This should read: Distro = RHEL-7.2
*****

> Kernel = 3.10.0-327.18.2.el7.x86_64
> CPU count =  768
> CPU model name = Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz
> BIOS Information = 
>  Vendor: HP
>  Version: Bundle: 008.002.042 SFW: 041.119.000
>  Release Date: 04/30/2016
> 
> MemTotal = 25364774656 kB
> 
> There are three stages of EET testing:
> [] Fundamentals (PBunyan pbunyan)
> [] Performance  (BMarson bmarson)
> [] Lload        (LWoodman lwoodman)
> 
> 
> Best,
> -pbunyan


All,
Current testing status...

kernel-3.10.0-327.18.2.el7.x86_64
======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - ** PASSED **
EET x86_64 Xen - N/A
EET x86_64 KVM -   in progress...
EET x86_64 Kdump - scheduled.


======================================
PERFORMANCE: BMarson
======================================
x86_64 Linpack - results under review...
x86_64 Stream  - results under review...

Barry - Please provide a short summary/update of the performance 
testing results with kernel-3.10.0-327.18.2.el7.x86_64.


======================================
LLOAD: LWoodman
======================================
x86_64 Lload - scheduled


Best,
-pbunyan

Comment 16 Tom Vaden 2016-06-21 16:00:37 UTC
(In reply to Barry Marson from comment #13)
> Tom,
> 
> Is this test system in default mode or was it switched to perf mode ?  Who
> can answer that ?
> 
> Thanks
> Barry

Barry:

The machine is in RAS mode.

fyi,
tom

Comment 17 Barry Marson 2016-06-21 17:06:15 UTC
OK, this explains why the performance of streams was lower than I expected.

The performance tests PASS.

Barry

Comment 18 PaulB 2016-06-21 18:34:05 UTC
(In reply to Tom Vaden from comment #16)
> (In reply to Barry Marson from comment #13)
> > Tom,
> > 
> > Is this test system in default mode or was it switched to perf mode ?  Who
> > can answer that ?
> > 
> > Thanks
> > Barry
> 
> Barry:
> 
> The machine is in RAS mode.
> 
> fyi,
> tom
Nigel,

https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c2
---<-snip->---
B) Does the memory subsystem support NORMAL -vs- PERFORMANCE
mode at the management/BIOS layer? Yes 
If so what is it set to?
Default is DDDC mode = performance mode
---<-snip->---


Is the system bios in NORMAL -or- PERFORMANCE mode?


Best,
-pbunyan

Comment 19 Nigel Croxon 2016-06-21 19:19:16 UTC
The system is in NORMAL mode.

Comment 20 PaulB 2016-06-23 02:32:56 UTC
ruyang, 
Your kdump expertise would be greatly appreciated....


============================================
Issue: system failing kdump testing
============================================
-------------------------------------------------
This is the issue seen on console following
triggering a crash (echo c > /proc/sysrq-trigger):
--------------------------------------------------
---<snip->---
[ 113.883697] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 113.895119] sdb: sdb1 sdb2 sdb3
[ 113.899318] sd 0:0:0:1: [sdb] Attached SCSI disk
[ 114.023535] device-mapper: multipath service-time: version 0.2.0 loaded
[ TIME ] Timed out waiting for device dev-mapper-mpathc1.device.
[DEPEND] Dependency failed for /kdumproot/mnt/hpstorage.
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
[DEPEND] Dependency failed for File System Check on /dev/mapper/mpathc1.
[ OK ] Stopped Kdump Vmcore Save Service.
[ OK ] Stopped dracut pre-pivot and cleanup hook.
[ OK ] Stopped target Initrd Default Target.
[ OK ] Reached target Initrd File Systems.
[ OK ] Stopped dracut mount hook.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target System Initialization.
Starting Setup Virtual Console...
StartinFailed to start kdump-error-handler.service: Transaction is destructive.
[FAILED] Failed to start Kdump Emergency.
See 'systemctl status emergency.service' for details.
[DEPEND] Dependency failed for Emergency Mode.
[ OK ] Started Setup Virtual Console.
[ OK ] Found device /dev/disk/by-uuid/e1da41b9-bd30-4079-a3c7-0bf2ded9b31c.
[ OK ] Found device /dev/disk/by-uuid/3EBF-1CBC.
[ OK ] Found device /dev/disk/by-uuid/6e0da585-542f-4556-9477-ab84407a32e9.
[ OK ] Found device /dev/mapper/rhel_hawk604a-root.
Starting File System Check on /dev/mapper/rhel_hawk604a-root...
[ OK ] Started File System Check on /dev/mapper/rhel_hawk604a-root.
Starting File System Check on /dev/mapper/mpathc1...
[ 102.675298] systemd-fsck[852]: /sbin/fsck.xfs: XFS file system.
[ OK ] Started dracut initqueue hook.
[ OK ] Started File System Check on /dev/mapper/mpathc1.
Mounting /kdumproot/mnt/hpstorage...
Mounting /sysroot...
[ OK ] Reached target Remote File Systems (Pre).
[ 115.004581] SGI XFS with ACLs, security attributes, no debug enabled
[ 115.013609] XFS (dm-2): Mounting V4 Filesystem
[ 115.013702] XFS (dm-7): Mounting V4 Filesystem
[ OK ] Reached target Remote File Systems.
[ 115.099991] XFS (dm-2): Starting recovery (logdev: internal)
[ 115.128176] XFS (dm-7): Starting recovery (logdev: internal)
[ 115.215637] XFS (dm-7): Ending recovery (logdev: internal)
[ OK ] Mounted /sysroot.
[ 115.275986] XFS (dm-2): Ending recovery (logdev: internal)
[ OK ] Mounted /kdumproot/mnt/hpstorage.
---<snip->---


============================================
These are the relevant system kdump configs:
============================================
------------------------------------------
cat /proc/cmdline
------------------------------------------
[root@hawk604a ~]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 root=/dev/mapper/rhel_hawk604a-root ro crashkernel=512M,high rd.lvm.lv=rhel_hawk604a/root rd.lvm.lv=rhel_hawk604a/swap console=ttyS0,115200n81
[root@hawk604a ~]#
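A quick way to confirm that the crashkernel=512M,high reservation on this command line actually took effect (standard RHEL7 interfaces; these checks are not part of the original log):

```shell
# Bytes reserved for the crash kernel; 0 would mean the crashkernel=
# parameter was ignored or the reservation failed.
size=$(cat /sys/kernel/kexec_crash_size 2>/dev/null || echo 0)
echo "crash kernel reservation: $size bytes"

# The reserved region also shows up in the physical memory map.
grep -i 'crash kernel' /proc/iomem || echo "no Crash kernel region listed"
```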


------------------------------------------
/etc/sysconfig/kdump
------------------------------------------
[root@hawk604a ~]# cat /etc/sysconfig/kdump
---<snip->---
#raw /dev/vg/lv_kdump
#ext4 /dev/vg/lv_kdump
#ext4 LABEL=/boot
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#nfs my.server.com:/export/tmp
#ssh user.com
#sshkey /root/.ssh/kdump_id_rsa
#path /var/crash
xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
path /dumpit/here
core_collector makedumpfile -l --message-level 1 -c -d 31
#core_collector scp
#kdump_post /var/crash/scripts/kdump-post.sh
#kdump_pre /var/crash/scripts/kdump-pre.sh
#extra_bins /usr/bin/lftp
#extra_modules gfs2
#default shell
#force_rebuild 1
#dracut_args --omit-drivers "cfg80211 snd" --add-drivers "ext2 ext3"
#fence_kdump_args -p 7410 -f auto -c 0 -i 10
#fence_kdump_nodes node1 node2
---<snip->---

------------------------------------------
cat /etc/kdump.conf
------------------------------------------
[root@hawk604a ~]# cat /etc/kdump.conf
---<snip->---
#

#raw /dev/vg/lv_kdump
#ext4 /dev/vg/lv_kdump
#ext4 LABEL=/boot
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#nfs my.server.com:/export/tmp
#ssh user.com
#sshkey /root/.ssh/kdump_id_rsa
#path /var/crash
xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
path /dumpit/here
core_collector makedumpfile -l --message-level 1 -c -d 31
#core_collector scp
#kdump_post /var/crash/scripts/kdump-post.sh
#kdump_pre /var/crash/scripts/kdump-pre.sh
#extra_bins /usr/bin/lftp
#extra_modules gfs2
#default shell
#force_rebuild 1
#dracut_args --omit-drivers "cfg80211 snd" --add-drivers "ext2 ext3"
#fence_kdump_args -p 7410 -f auto -c 0 -i 10
#fence_kdump_nodes node1 node2
[root@hawk604a ~]#
---<snip->---

=========
NOTE:
=========
We had similar issue in previous testing:
 https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c18

Seems rd.retry=300 in /etc/sysconfig/kdump KDUMP_COMMANDLINE_APPEND
is not working with our test kernel-3.10.0-327.18.2.el7.x86_64

Other than the kernel under test (kernel-3.10.0-327.18.2.el7.x86_64), the system is
configured exactly the same as described here:
 https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c18


Thank you for your time, ruyang.

Comment 21 Dave Young 2016-06-23 02:41:28 UTC
Hi, Paul

It seems the /etc/sysconfig/kdump content you pasted is wrong; it is the content of /etc/kdump.conf instead.

Could you try a longer rd.retry? The unit is seconds, e.g. rd.retry=600.

Thanks
Dave
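Dave's suggestion can be sketched as follows, against a scratch copy of the file; on the SUT the edit would target /etc/sysconfig/kdump and be followed by kdumpctl restart to rebuild the kdump initramfs:

```shell
# Scratch copy standing in for /etc/sysconfig/kdump (the append line is
# abbreviated here; the full value appears in the comments below).
cfg=$(mktemp)
echo 'KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices panic=10 rd.retry=300"' > "$cfg"

# Raise the dracut device-retry window from 300s to 600s.
sed -i 's/rd\.retry=300/rd.retry=600/' "$cfg"
grep -o 'rd\.retry=[0-9]*' "$cfg"    # prints: rd.retry=600

# On the real system, follow with:  kdumpctl restart
```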

Comment 22 PaulB 2016-06-23 02:48:28 UTC
(In reply to Dave Young from comment #21)
> Hi, Paul
> 
> Seems cat /etc/sysconfig/kdump content is wrong, it is /etc/kdump.conf
> instead.
> 
> Could you try longer rd.retry? the unit is second. ie. rd.retry=600
> 
> Thanks
> Dave

ruyang, 
Apologies for the cut/paste mistake :/
Below is the current /etc/sysconfig/kdump,
I will try rd.retry=600 - as suggested...

------------------------------------------
/etc/sysconfig/kdump
------------------------------------------
[root@hawk604a ~]# cat /etc/sysconfig/kdump 
# Kernel Version string for the -kdump kernel, such as 2.6.13-1544.FC5kdump
# If no version is specified, then the init script will try to find a
# kdump kernel with the same version number as the running kernel.
KDUMP_KERNELVER=""

# The kdump commandline is the command line that needs to be passed off to
# the kdump kernel.  This will likely match the contents of the grub kernel
# line.  For example:
#   KDUMP_COMMANDLINE="ro root=LABEL=/"
# Dracut depends on proper root= options, so please make sure that appropriate
# root= options are copied from /proc/cmdline. In general it is best to append
# command line options using "KDUMP_COMMANDLINE_APPEND=".
# If a command line is not specified, the default will be taken from
# /proc/cmdline
KDUMP_COMMANDLINE=""

# This variable lets us append arguments to the current kdump commandline
# As taken from either KDUMP_COMMANDLINE above, or from /proc/cmdline
#KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never"
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=300"

# Any additional kexec arguments required.  In most situations, this should
# be left empty
#
# Example:
#   KEXEC_ARGS="--elf32-core-headers"
KEXEC_ARGS=""

#Where to find the boot image
#KDUMP_BOOTDIR="/boot"

#What is the image type used for kdump
KDUMP_IMG="vmlinuz"

#What is the images extension.  Relocatable kernels don't have one
KDUMP_IMG_EXT=""
[root@hawk604a ~]# 

-End

best,
-pbunyan

Comment 23 PaulB 2016-06-23 04:22:40 UTC
ruyang, 
Testing with  rd.retry=600  did not resolve the issue.

----------------------
Here is a console log:
----------------------
---<-snip->---
[  112.592260] sd 0:0:0:0: [sda] 195305472 512-byte logical blocks: (99.9 GB/93.1 GiB)
[  112.600780] sd 0:0:0:1: [sdb] 205070336 512-byte logical blocks: (104 GB/97.7 GiB)
[  112.600835] sd 0:0:1:1: [sde] 205070336 512-byte logical blocks: (104 GB/97.7 GiB)
[  112.600862] sd 0:0:1:2: [sdf] 49218740224 512-byte logical blocks: (25.1 TB/22.9 TiB)
[  112.600879] sd 0:0:1:0: [sdd] 195305472 512-byte logical blocks: (99.9 GB/93.1 GiB)
[  112.600979] sd 0:0:0:2: [sdc] 49218740224 512-byte logical blocks: (25.1 TB/22.9 TiB)
[  112.643151] sd 14:0:0:0: [sdg] 195305472 512-byte logical blocks: (99.9 GB/93.1 GiB)
[  112.643166] sd 14:0:1:1: [sdk] 205070336 512-byte logical blocks: (104 GB/97.7 GiB)
[  112.643175] sd 14:0:0:2: [sdi] 49218740224 512-byte logical blocks: (25.1 TB/22.9 TiB)
[  112.643321] sd 14:0:1:2: [sdl] 49218740224 512-byte logical blocks: (25.1 TB/22.9 TiB)
[  112.643358] sd 14:0:1:0: [sdj] 195305472 512-byte logical blocks: (99.9 GB/93.1 GiB)
[  112.643515] sd 0:0:1:0: [sdd] Write Protect is off
[  112.643603] sd 0:0:0:2: [sdc] Write Protect is off
[  112.643712] sd 14:0:0:2: [sdi] Write Protect is off
[  112.643730] sd 0:0:1:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.643736] sd 0:0:0:1: [sdb] Write Protect is off
[  112.643827] sd 0:0:1:1: [sde] Write Protect is off
[  112.643840] sd 0:0:0:2: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.643912] sd 14:0:0:2: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.643929] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.643982] sd 14:0:1:2: [sdl] Write Protect is off
[  112.643991] sd 14:0:1:0: [sdj] Write Protect is off
[  112.644053] sd 0:0:0:0: [sda] Write Protect is off
[  112.644117] sd 0:0:1:1: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.644132] sd 14:0:1:1: [sdk] Write Protect is off
[  112.644141] sd 0:0:1:2: [sdf] Write Protect is off
[  112.644235] sd 14:0:1:0: [sdj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.644246] sd 14:0:1:2: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.644249] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.644364] sd 14:0:1:1: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.644367] sd 0:0:1:2: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.645397] sd 14:0:0:1: [sdh] 205070336 512-byte logical blocks: (104 GB/97.7 GiB)
[  112.646193] sd 14:0:0:1: [sdh] Write Protect is off
[  112.646370] sd 14:0:0:1: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.683209]  sdk: sdk1 sdk2 sdk3
[  112.683281]  sdb: sdb1 sdb2 sdb3
[  112.683317]  sde: sde1 sde2 sde3
[  112.683404]  sdh: sdh1 sdh2 sdh3
[  112.683951] sd 14:0:0:1: [sdh] Attached SCSI disk
[  112.688576] sd 0:0:1:1: [sde] Attached SCSI disk
[  112.689087]  sdj: sdj1 sdj2 sdj3
[  112.689128]  sda: sda1 sda2 sda3
[  112.689189]  sdd: sdd1 sdd2 sdd3
[  112.690010] sd 0:0:0:0: [sda] Attached SCSI disk
[  112.690022] sd 0:0:1:0: [sdd] Attached SCSI disk
[  112.690101] sd 14:0:1:0: [sdj] Attached SCSI disk
[  112.715975]  sdc: sdc1
[  112.716033]  sdi: sdi1
[  112.716135]  sdl: sdl1
[  112.716178]  sdf: sdf1
[  112.716831] sd 14:0:0:2: [sdi] Attached SCSI disk
[  112.716911] sd 14:0:1:2: [sdl] Attached SCSI disk
[  112.716955] sd 0:0:1:2: [sdf] Attached SCSI disk
[  112.716983] sd 0:0:0:2: [sdc] Attached SCSI disk
[  112.944656] sd 14:0:1:1: [sdk] Attached SCSI disk
[  112.945268] sd 0:0:0:1: [sdb] Attached SCSI disk
[  112.946796] sd 14:0:0:0: [sdg] Write Protect is off
[  112.946908] sd 14:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.971532]  sdg: sdg1 sdg2 sdg3
[  112.975682] sd 14:0:0:0: [sdg] Attached SCSI disk
[  113.088242] device-mapper: multipath service-time: version 0.2.0 loaded
[  OK  ] Found device /dev/disk/by-uuid/3EBF-1CBC.
[  OK  ] Found device /dev/disk/by-uuid/6e0da585-542f-4556-9477-ab84407a32e9.
[  OK  ] Found device /dev/mapper/rhel_hawk604a-root.
         Starting File System Check on /dev/mapper/rhel_hawk604a-root...
[  OK  ] Started File System Check on /dev/mapper/rhel_hawk604a-root.
[ TIME ] Timed out waiting for device dev-mapper-mpathc1.device.
[DEPEND] Dependency failed for /kdumproot/mnt/hpstorage.
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
[DEPEND] Dependency failed for File System Check on /dev/mapper/mpathc1.
[  OK  ] Found device /dev/disk/by-uuid/e1da41b9-bd30-4079-a3c7-0bf2ded9b31c.
         Starting File System Check on /dev/mapper/mpathc1...
[  101.892176]          Starting Setup Virtual Console...
[  OK  ] Started dracut initqueue hook.
[  OK  ] Started File System Check on /dev/mapper/mpathc1.
Failed to start kdump-error-handler.service: Transaction is destructive.
[  114.436096] SGI XFS with ACLs, security attributes, no debug enabled
[  114.438705] XFS (dm-7): Mounting V4 Filesystem
         Mounting /kdumproot/mnt/hpstorage...
         Mounting /sysroot...
[  114.468647] XFS (dm-5): Mounting V4 Filesystem
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
[FAILED] Failed to start Kdump Emergency.
See 'systemctl status emergency.service' for details.
[DEPEND] Dependency failed for Emergency Mode.
[  OK  ] Started Setup Virtual Console.
[  114.554311] XFS (dm-5): Starting recovery (logdev: internal)
[  114.613881] XFS (dm-7): Starting recovery (logdev: internal)
[  114.634916] XFS (dm-5): Ending recovery (logdev: internal)
[  OK  ] Mounted /sysroot.
[  114.792887] XFS (dm-7): Ending recovery (logdev: internal)
[  OK  ] Mounted /kdumproot/mnt/hpstorage.
---<-snip->---



As I stated previously, all configs were the same.
We updated the kernel from 3.10.0-327.el7.x86_64 
to 3.10.0-327.18.2.el7.x86_64 for this test run.

I will have to dig into the kernel changelog to see if something
jumps out at me...

ruyang - any insight you may have regarding this time sensitive 
EET (Extended Engineering Testing) test run would be appreciated.

Best,
-pbunyan

Comment 24 Dave Young 2016-06-23 07:05:24 UTC
It seems systemd threw an error, maybe because of:
[DEPEND] Dependency failed for File System Check on /dev/mapper/mpathc1

Then the kdump error handler service tried to start, but it failed to start.

But the target device mpathc1 was up after a while; I'm not sure why the emergency service was started without waiting for mpathc1.

Harald, do you have any thoughts about this issue?

Thanks
Dave

Comment 26 Harald Hoyer 2016-06-24 08:26:11 UTC
quick guess: "udev.children-max=2"

because of that, mpathc1 doesn't get the udev SYSTEMD_READY flag set in time.

Why "udev.children-max=2" ? Any reason for this strange restriction?

If there are issues with udev and cpus, maybe comment out:
# CPU hotadd request
# SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"

in /lib/udev/rules.d/40-redhat.rules

Comment 28 Dave Young 2016-06-26 10:10:36 UTC
udev.children-max=2 is for avoiding OOM. To confirm Harald's concern, Paul, can you test again after removing udev.children-max=2 from the kdump sysconfig file?

Comment 29 PaulB 2016-06-27 19:41:14 UTC
(In reply to Dave Young from comment #28)
> udev.children-max=2 is for avoiding oom, to confirm Harald's concern, Paul,
> can you test again by removing udev.children-max=2 in kdump sysconfig file?


All,
[] I retested with kernel-3.10.0-327.el7.x86_64 and was 
   able to successfully crash and capture a vmcorefile.
   hmmmm...

[] I then retested with kernel-3.10.0-327.18.2.el7.x86_64 and 
   kexec-tools-2.0.7-38.el7_2.1.x86_64 - the system failed 
   in the same manner :/

[] Then, as suggested...
   I removed the  udev.children-max=2  from the
  "KDUMP_COMMANDLINE_APPEND=" setting in /etc/sysconfig/kdump.

  I retested with kernel-3.10.0-327.18.2.el7.x86_64 and 
  kexec-tools-2.0.7-38.el7.x86_64 - success!!
  I was able to successfully crash the system and capture/analyse
  the vmcore file.
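The change that made the capture succeed can be sketched as follows, again against a scratch copy; on the SUT the edit targets KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump and is followed by kdumpctl restart:

```shell
# Scratch copy standing in for /etc/sysconfig/kdump (append line abbreviated).
cfg=$(mktemp)
echo 'KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 udev.children-max=2 panic=10 rd.retry=300"' > "$cfg"

# Drop the udev.children-max=2 restriction that, per Harald's diagnosis,
# kept mpathc1 from getting its SYSTEMD_READY flag set in time.
sed -i 's/ *udev\.children-max=2//' "$cfg"

grep 'udev.children-max' "$cfg" || echo "option removed"   # prints: option removed

# On the real system, follow with:  kdumpctl restart
```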


Best,
-pbunyan

Comment 30 PaulB 2016-06-27 19:56:29 UTC
All,
Current testing status...

RHEL7.2Z 24TB RAM 768CPU 
kernel-3.10.0-327.18.2.el7.x86_64

======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - ** PASS **
EET x86_64 Xen - N/A
EET x86_64 KVM -       ** PASS **
EET x86_64 Kdump -     ** PASS - KBASE REQUIRED **


======================================
PERFORMANCE: BMarson
======================================
x86_64 Linpack - ** PASS **
x86_64 Stream  - ** PASS **

https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c17


======================================
LLOAD: LWoodman
======================================
x86_64 Lload - in progress...



Best,
-pbunyan

Comment 31 Dave Young 2016-06-27 20:37:46 UTC
(In reply to PaulB from comment #29)
> (In reply to Dave Young from comment #28)
> > udev.children-max=2 is for avoiding oom, to confirm Harald's concern, Paul,
> > can you test again by removing udev.children-max=2 in kdump sysconfig file?
> 
> 
> All,
> [] I retested with kernel-3.10.0-327.el7.x86_64 and was 
>    able to successfully crash and capture a vmcorefile.
>    hmmmm...
> 
> [] I then retested with kernel-3.10.0-327.18.2.el7.x86 and 
>    kexec-tools-2.0.7-38.el7_2.1.x86_64 - the system failed 
>    in the same manner :/
> 
> [] Then, as suggested...
>    I removed the  udev.children-max=2  from the
>   "KDUMP_COMMANDLINE_APPEND=" setting in /etc/sysconfig/kdump.
> 
>   I retested with kernel-3.10.0-327.18.2.el7.x86_64 and 
>   kexec-tools-2.0.7-38.el7.x86_64 - success!!
>   I was able to successfully crash the system and capture/analyse
>   the vmcore file.
> 

Paul, glad to know it works, though I still need to figure out why udev.children-max=2 can cause the failure. It really should wait, no matter how many udev threads are being used.

Thanks
Dave

Comment 32 PaulB 2016-06-27 20:53:41 UTC
(In reply to Dave Young from comment #31)

> 
> Paul, glad to know it works though I still need to figure out why
> udev.children-max=2 can cause the failure. It really should wait no matter
> how many udev threads being used.
> 
> Thanks
> Dave


All,
That being said...

As long as a KBASE article can be approved and written for removing udev.children-max=2  from the "KDUMP_COMMANDLINE_APPEND=" setting in /etc/sysconfig/kdump, I can PASS the Fundamentals stage of EET testing.

If a KBASE article cannot be written - the system fails EET.

Adding Gary Case for KBASE blessing. 


Best,
-pbunyan

Comment 33 Dave Young 2016-06-27 21:07:21 UTC
(In reply to Harald Hoyer from comment #26)
> quick guess: "udev.children-max=2"
> 
> because of that, mpathc1 doesn't get the udev SYSTEMD_READY flag set in time.

Harald, Paul has confirmed that dropping udev.children-max=2 works, but why does it not wait for SYSTEMD_READY? Can you give some hints on where the timeout value is and how we can connect it to rd.retry?

Thanks
Dave

Comment 34 Gary Case 2016-06-27 21:51:02 UTC
We would need to get Support's take on this as well. Without the change you won't have a functional kdump, and that directly impacts their ability to support customers. It would also be nice to know why this option causes kdump to fail.

Comment 36 Dave Young 2016-06-28 13:48:44 UTC
The systemd.mount manpage documents the following option:
x-systemd.device-timeout=

Paul, could you try again with the settings below?
* keep udev.children-max=2 in sysconfig
* add rd.retry=600
* add x-systemd.device-timeout=600s

If this works, we may set x-systemd.device-timeout equal to rd.retry by default in case the user does not specify it in fstab.

Thanks
Dave

Comment 37 Dave Young 2016-06-28 13:50:50 UTC
rd.retry should be added to sysconfig

x-systemd.device-timeout=600s should be added to /etc/fstab in the mount options of the multipath device for kdump
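For reference, the two edits above can be sketched as follows (device names and values are the ones used in this bug's test setup; substitute your own):

```shell
# Sketch only -- UUID, mount point, and timeouts are this SUT's values.

# 1. rd.retry goes on the kdump kernel command line, via /etc/sysconfig/kdump:
#    KDUMP_COMMANDLINE_APPEND="... udev.children-max=2 ... rd.retry=600"

# 2. x-systemd.device-timeout goes in the mount options of the kdump dump
#    target in /etc/fstab:
#    UUID=<dump-target-uuid>  /mnt/hpstorage  xfs  defaults,x-systemd.device-timeout=600s  0 0

# 3. Rebuild the kdump initramfs so the new fstab options are picked up:
touch /etc/kdump.conf
kdumpctl restart
```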

Comment 38 Nigel Croxon 2016-06-29 13:37:45 UTC
Larry, how did your testing go?   Pass/Fail?

Comment 39 PaulB 2016-06-29 17:08:02 UTC
Nigel,
Once LarryW has finished his testing, it seems we need to debug the
kdump issue further. 

Will there be time in HP's schedule to look into the kdump issue?

Best,
-pbunyan

Comment 40 Nigel Croxon 2016-06-29 18:42:03 UTC
How much time do you need?  Is it something that can be completed today?

If no, we will have to reschedule time with the 24TB system in the future. 

-Nigel

Comment 41 PaulB 2016-06-30 15:19:21 UTC
(In reply to Dave Young from comment #36)
> systemd.mount manpage says about below option:
> x-systemd.device-timeout=
> 
> Paul, could you give another try below?
> * keep the udev.children_max=2 in sysconfig
> * add rd.retry=600
> * add x-systemd.device-timeout=600s
> 
> If this works we may add x-systemd.device-timeout equal to rd.retry by
> default in case user does not specify it in fstab.
> 
> Thanks
> Dave

Dave,
Testing with the requested configuration fails in the same manner as noted here:
 https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c23


=======================
Note: This config fails
=======================
---------------------
/etc/sysconfig/kdump:
---------------------
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=600 x-systemd.device-timeout=600s"

------------------
cat /proc/cmdline:
------------------
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 root=/dev/mapper/rhel_hawk604a-root ro crashkernel=512M,high rd.lvm.lv=rhel_hawk604a/root rd.lvm.lv=rhel_hawk604a/swap console=ttyS0,115200n81

----------------
/etc/kdump.conf:
----------------
xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
path /dumpit/here
core_collector makedumpfile -l --message-level 1 -c -d 31

-----------
/etc/fstab:
-----------
UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/hpstorage  xfs  defaults 0 0

Best,
-pbunyan

Comment 42 PaulB 2016-06-30 15:23:56 UTC
Nigel,
As a KBASE article cannot be used to justify the kdump configuration that includes removing udev.children-max=2 from the KDUMP_COMMANDLINE_APPEND line in /etc/sysconfig/kdump, kdump testing is considered a FAIL. 

Therefore, the Fundamentals stage of EET testing has FAILED.


Best,
-pbunyan

Comment 43 Dave Young 2016-06-30 15:56:02 UTC
Paul,

The x-systemd.device-timeout= parameter belongs in the /etc/fstab mount options; adding it to sysconfig does not help.

Do you still have the machine on hand? Can we use it to test again?

Thanks
Dave

Comment 44 Tom Vaden 2016-06-30 21:09:18 UTC
(In reply to PaulB from comment #42)
> Nigel,
> As a KBASE article cannot be used to justify the kdump configuration that
> includes removing udev.children-max=2 from KDUMP_COMMANDLINE_APPEND line in
> /etc/sysconfig/kdump, kdump testing is considered a FAIL. 
> 
> Therefore, the Fundamentals stage of EET testing has FAILED.
> 
> 
> Best,
> -pbunyan

Paul:

It looks like a systemd deficiency for which there is a workaround. 
So what's our recourse if not a kbase?

thanks,
tom

Comment 45 Xunlei Pang 2016-06-30 21:57:13 UTC
We can use the systemd "x-systemd.device-timeout=" parameter to address the issue. We just tested it on Paul's machine, and it works.

As an example, add 700s timeout in /etc/fstab:
/dev/mapper/rhel-root  /  xfs  defaults,x-systemd.device-timeout=700s  0 0

Then run "touch /etc/kdump.conf" and "kdumpctl restart". After this, kdump will use the new fstab options, so in the kdump kernel systemd will find "x-systemd.device-timeout" and wait the specified timeout for the target to become ready.


We suggest it as a solution.

Comment 46 PaulB 2016-07-01 13:56:12 UTC
(In reply to Xunlei Pang from comment #45)
> We can utilize systemd "x-systemd.device-timeout=" parameter to address the
> issue. We just tested on Paul's machine, and it works.
> 
> As an example, add 700s timeout in /etc/fstab:
> /dev/mapper/rhel-root  /  xfs  defaults,x-systemd.device-timeout=700s  0 0
> 
> Then "touch /etc/kdump.conf" and "kdumpctl restart", after this kdump will
> use the new fstab options, so in kdump kernel, systemd will find
> "x-systemd.device-timeout" and use the timeout specified to wait the target
> ready.
> 
> 
> We suggest it as a solution.

Xunlei Pang,
Actually, the fstab entry was a bit more detailed.

We set the fstab entry, as follows:
[root@hawk604a here]# cat /etc/fstab 
UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/hpstorage  xfs  rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s 0 0


Why? - As you explained to me, kdump uses data from here:
[root@hawk604a here]# findmnt --fstab
TARGET         SOURCE                                    FSTYPE OPTIONS
/              /dev/mapper/rhel_hawk604a-root            xfs    defaults
/boot          UUID=6e0da585-542f-4556-9477-ab84407a32e9 xfs    defaults
/boot/efi      UUID=3EBF-1CBC                            vfat   umask=0077,shortname=winnt
/home          /dev/mapper/rhel_hawk604a-home            xfs    defaults
swap           /dev/mapper/rhel_hawk604a-swap            swap   defaults
/mnt/hpstorage UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c xfs    rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s


===========================================
Note: Kdump WORKS with the following config
===========================================
---------------------------------
First - remember this known issue
---------------------------------
Speaking with Nigel about this and previous testing, I was made aware of the following known issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c27

Bug 1123039 - [HP HPS 7.1 Bug] Crashkernel boot failure, out of memory, when 
crashkernel=512M,high
---<-snip->---
-The following setting was required:
 kernel command line: crashkernel=512M,high
-/etc/sysconfig/kdump  KDUMP_COMMANDLINE_APPEND: s/nr_cpus=1/nr_cpus=4
---<-snip->---

---------------------
/etc/sysconfig/kdump:
---------------------
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=600"

------------------
cat /proc/cmdline:
------------------
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 root=/dev/mapper/rhel_hawk604a-root ro crashkernel=512M,high rd.lvm.lv=rhel_hawk604a/root rd.lvm.lv=rhel_hawk604a/swap console=ttyS0,115200n81

----------------
/etc/kdump.conf:
----------------
xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
path /dumpit/here
core_collector makedumpfile -l --message-level 1 -c -d 31

-----------
/etc/fstab:
-----------
UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/hpstorage  xfs  rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s 0 0


At this point the status of the Fundamentals stage of testing depends on the following: 
Is this configuration/workaround acceptable to PM, and can a KBASE article be written and approved?

Adding needinfo from Gary Case for KBASE requirement.

Best,
-pbunyan

Comment 47 PaulB 2016-07-01 13:57:42 UTC
(In reply to Tom Vaden from comment #44)
> (In reply to PaulB from comment #42)
> > Nigel,
> > As a KBASE article cannot be used to justify the kdump configuration that
> > includes removing udev.children-max=2 from KDUMP_COMMANDLINE_APPEND line in
> > /etc/sysconfig/kdump, kdump testing is considered a FAIL. 
> > 
> > Therefore, the Fundamentals stage of EET testing has FAILED.
> > 
> > 
> > Best,
> > -pbunyan
> 
> Paul:
> 
> It looks like a systemd deficiency for which there is a workaround. 
> So what's our recourse if not a kbase?
> 
> thanks,
> tom

Tom,
A KBASE will be required.
We will need to await Gary Case's reply.

Best,
-pbunyan

Comment 48 Xunlei Pang 2016-07-01 17:06:27 UTC
(In reply to PaulB from comment #46)
> (In reply to Xunlei Pang from comment #45)
> > We can utilize systemd "x-systemd.device-timeout=" parameter to address the
> > issue. We just tested on Paul's machine, and it works.
> > 
> > As an example, add 700s timeout in /etc/fstab:
> > /dev/mapper/rhel-root  /  xfs  defaults,x-systemd.device-timeout=700s  0 0
> > 
> > Then "touch /etc/kdump.conf" and "kdumpctl restart", after this kdump will
> > use the new fstab options, so in kdump kernel, systemd will find
> > "x-systemd.device-timeout" and use the timeout specified to wait the target
> > ready.
> > 
> > 
> > We suggest it as a solution.
> 
> Xunlei Pang,
> Actually, the fstab entry was a bit more detailed.
> 
> We set the fstab entry, as follows:
> [root@hawk604a here]# cat /etc/fstab 
> UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/hpstorage  xfs 
> rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s 0 0
> 

Hi Paul,

The detailed "rw,relatime,seclabel,attr2,inode64,noquota" options were actually copied from the output of the previous findmnt. I think it is OK to use "defaults" instead in /etc/fstab for most cases, that is, "defaults,x-systemd.device-timeout=700s".

Regards,
Xunlei

> 
> Why? - As you explained to me, kdump uses data from here:
> [root@hawk604a here]# findmnt --fstab
> TARGET         SOURCE                                    FSTYPE OPTIONS
> /              /dev/mapper/rhel_hawk604a-root            xfs    defaults
> /boot          UUID=6e0da585-542f-4556-9477-ab84407a32e9 xfs    defaults
> /boot/efi      UUID=3EBF-1CBC                            vfat  
> umask=0077,shortname=winnt
> /home          /dev/mapper/rhel_hawk604a-home            xfs    defaults
> swap           /dev/mapper/rhel_hawk604a-swap            swap   defaults
> /mnt/hpstorage UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c xfs   
> rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s
> 
> 
> ===========================================
> Note: Kdump WORKS with the following config
> ===========================================
> ---------------------------------
> First - remember this known issue
> ---------------------------------
> Speaking with Nigel on this an previous testing, I was made aware of the
> following known issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c27
> 
> Bug 1123039 - [HP HPS 7.1 Bug] Crashkernel boot failure, out of memory, when 
> crashkernel=512M,high
> ---<-snip->---
> -The following setting was required:
>  kernel command line: crashkernel=512M,high
> -/etc/sysconfig/kdump  KDUMP_COMMANDLINE_APPEND: s/nr_cpus=1/nr_cpus=4
> ---<-snip->---
> 
> ---------------------
> /etc/sysconfig/kdump:
> ---------------------
> KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices
> cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10
> rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=600"
> 
> ------------------
> cat /proc/cmdline:
> ------------------
> BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64
> root=/dev/mapper/rhel_hawk604a-root ro crashkernel=512M,high
> rd.lvm.lv=rhel_hawk604a/root rd.lvm.lv=rhel_hawk604a/swap
> console=ttyS0,115200n81
> 
> ----------------
> /etc/kdump.conf:
> ----------------
> xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
> path /dumpit/here
> core_collector makedumpfile -l --message-level 1 -c -d 31
> 
> -----------
> /etc/fstab:
> -----------
> UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/hpstorage  xfs 
> rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s 0 0
> 
> 
> At this point the status of the Fundamentals stage of testing is dependent
> on the following: 
> Is this configuration/workaround acceptable by PM and can a KBASE article be
> written and approved?
> 
> Adding needinfo from Gary Case for KBASE requirement.
> 
> Best,
> -pbunyan

Comment 49 Nigel Croxon 2016-07-01 17:45:36 UTC
I am not giving the official answer.
But I just got a text message from Larry (who is at Red Hat Summit).
We passed LLoad testing.

-Nigel

Comment 51 Joseph Kachuck 2016-07-12 19:12:21 UTC
Hello,
This is the current kbase. Please let me know if you would like anything added or changed.

########
Subject:
HP Integrity Superdome X kdump option required in large configurations

Environment

    HP Integrity Superdome X Large configuration
    RHEL 7.2.z

Issue

    On an HP Integrity Superdome X with RHEL 7.2.z, 24TB RAM, and 768 CPUs, kdump requires udev.children-max=2 in /etc/sysconfig/kdump.
    udev.children-max=2 was added to the default kdump configuration in February 2013. This option limits udev to 2 worker threads.

Resolution

    Edit /etc/sysconfig/kdump and confirm udev.children-max=2 is listed in KDUMP_COMMANDLINE_APPEND

KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=300"
#########

Thank You
Joe Kachuck

Comment 52 Xunlei Pang 2016-07-13 08:07:43 UTC
(In reply to Joseph Kachuck from comment #51)
> Hello,
> This is current kbase. Please let me know if you would like anything added
> or changed for this?

Hi Joseph,

I think you must be missing something: we now rely on systemd's "x-systemd.device-timeout" parameter in /etc/fstab in the 1st kernel. That is:
Add an entry to "/etc/fstab" for the dump target, specify an extra "x-systemd.device-timeout=700" mount option for this entry, then rebuild the kdump initramfs.

Regards,
Xunlei

> 
> ########
> Subject:
> HP Integrity Superdome X kdump option required in large configurations
> 
> Environment
> 
>     HP Integrity Superdome X Large configuration
>     RHEL 7.2.z
> 
> Issue
> 
>     In a HP Integrity Superdome X with RHEL7.2.Z, 24TB RAM, and 768CPU kdump
> requires udev.children-max=2 in /etc/sysconfig/kdump.
>     This udev.children-max=2 was added to the default kdump February 2013.
> This option limits the udev threads to 2.
> 
> Resolution
> 
>     Edit /etc/sysconfig/kdump and confirm udev.children-max=2 is listed in
> KDUMP_COMMANDLINE_APPEND
> 
> KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices
> cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10
> rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=300"
> #########
> 
> Thank You
> Joe Kachuck

Comment 53 Joseph Kachuck 2016-07-13 16:14:20 UTC
Hello,
From looking at comment 45, would this be the preferred kbase?  

Please confirm if this is correct.

Thank You
Joe Kachuck

Comment 54 PaulB 2016-07-13 19:06:06 UTC
(In reply to Joseph Kachuck from comment #53)
> Hello,
> From looking at comment 45 would this be a preferred kbase?  
> 
> Please confirm if this is correct.
> 
> Thank You
> Joe Kachuck

Joe,
No.
Please use the following comment, as a reference for writing the kbase:
 https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c46

Best,
-pbunyan

Comment 55 Joseph Kachuck 2016-07-13 19:44:00 UTC
Hello Paul,
This is the new kbase:
Since the udev.children-max=2 option appeared to be included in comment 46, I have left it in the kbase. Please let me know if this looks better.

########
Subject:
HP Integrity Superdome X kdump options required in large configurations

Environment

    HP Integrity Superdome X Large configuration
    RHEL 7.2.z

Issue

    On an HP Integrity Superdome X with RHEL 7.2.z, 24TB RAM, and 768 CPUs, kdump requires additional options to work correctly.

    udev.children-max=2 should be added to /etc/sysconfig/kdump.
    udev.children-max=2 was added to the default kdump configuration in February 2013. This option limits udev to 2 worker threads.

    x-systemd.device-timeout=700s should be added to the mount options of the kdump dump target.


Resolution

    Edit /etc/sysconfig/kdump and confirm udev.children-max=2 is listed in KDUMP_COMMANDLINE_APPEND

KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=4 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never rd.retry=300"

    Edit /etc/kdump.conf and add the correct dump target and path:
xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
path /dumpit/here
core_collector makedumpfile -l --message-level 1 -c -d 31

    Edit /etc/fstab and add x-systemd.device-timeout=700s to the dump target's mount options:
UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c  /mnt/storage  xfs  rw,relatime,seclabel,attr2,inode64,noquota,x-systemd.device-timeout=700s 0 0

    Then run the command "kdumpctl restart"

#########


Thank You
Joe Kachuck

Comment 56 PaulB 2016-07-14 13:24:39 UTC
HP / Nigel,
We are discussing the KBASE article internally.
Once approved JoeK will add comment.

Best,
-pbunyan

Comment 58 Nigel Croxon 2016-07-14 13:35:04 UTC
Thank you Paul and Red Hat.   We await your posting.

-Nigel

Comment 59 Joseph Kachuck 2016-07-14 16:35:58 UTC
Hello,
The kbase for this issue has now been published:
https://access.redhat.com/solutions/2438911

Please let me know in email if this needs any changes.

Thank You
Joe Kachuck

Comment 60 Larry Woodman 2016-07-14 17:16:12 UTC
I have finished my EET testing of the 24TB RAM 768CPU HP Integrity Superdome X running RHEL7.2.z.  I was able to successfully consume and even over-commit all the RAM on every CPU, invoking a storm of OOM kills on many CPUs at the same time.  The system successfully killed the necessary processes to continue running without hanging or pausing for an excessive amount of time.  This even worked OK when all or most of the memory was allocated on a different NUMA node than the node it was executing on, a stress test that has been problematic in the past on large systems.  In addition, I was able to consume all the memory in the pagecache and then apply a heavy anonymous workload.  The system successfully reclaimed all or most of the pagecache memory even when the underlying files were mmap()'d into the address space of active processes, once again a stress test that has proved problematic in the past on large HP and other systems.  

At this point I would say that Red Hat can officially support this system running RHEL7.2.z.  The only thing I would probably say in a release note is that reclaiming lots of pagecache memory for several very large anonymous memory regions backed by Transparent Huge Pages (THP) can cause the system to pause for several seconds and even encounter soft lockups on a system this large.  If this happens and the resulting pauses and/or soft lockups are problematic, disabling THP will eliminate them.  The reason for the pauses/lockups is: 1.) THP 2MB pages allow the memory demand to be up to 512 times greater than with 4KB small pages.  2.) The page reclaim code must defragment memory zones and break the 2MB pages into 512 individual 4KB pages in order to reclaim them.
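For illustration, THP can be disabled at runtime via the standard sysfs knob (a sketch, not a tested recipe for this SUT; the kdump configs earlier in this bug use the equivalent boot-time parameter):

```shell
# Runtime toggle (requires root; affects new allocations):
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Persistent alternative: boot with the kernel command-line parameter already
# present in this bug's KDUMP_COMMANDLINE_APPEND:
#   transparent_hugepage=never
```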


Larry Woodman

Comment 61 PaulB 2016-07-15 01:14:24 UTC
Joe,
The Lload testing also requires a KBASE:
 https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c60

Best,
-pbunyan

Comment 62 Tom Vaden 2016-07-15 01:33:07 UTC
(In reply to PaulB from comment #61)
> Joe,
> The Lload testing also requires a KBASE:
>  https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c60
> 
> Best,
> -pbunyan

Paul:

Can we enlarge or reuse the previous kbase that covered similar behavior in the RHEL7.1 EET?

It is at:
https://access.redhat.com/articles/1979103

just a thought,
tom

Comment 63 Joseph Kachuck 2016-07-19 18:40:47 UTC
Hello Paul,
Would it be acceptable to update kbase 1979103 as noted in comment 62?

Thank You
Joe Kachuck

Comment 64 PaulB 2016-07-19 20:40:47 UTC
(In reply to Joseph Kachuck from comment #63)
> Hello Paul,
> Would it be acceptable to update kbase 1979103 as noted in comment 62?
> 
> Thank You
> Joe Kachuck

Joe,
There are three stages of EET testing:
[] Fundamentals (PBunyan pbunyan)
[] Performance  (BMarson bmarson)
[] Lload        (LWoodman lwoodman)

That is a KBASE issue for Larry Woodman's Lload testing stage.
I would prefer that Larry Woodman answer your question.

Adding needinfo from lwoodman.

Best,
-pbunyan

Comment 65 Larry Woodman 2016-07-19 21:12:36 UTC
In regard to comment #62:
------------------------------------------------------------------------------
Paul:

Can we enlarge or use the previous kbase that evoked similar behavior in the RHEL7.1 EET?

It is at:
https://access.redhat.com/articles/1979103

just a thought,
tom
-------------------------------------------------------------------------------

Yes, please include this system in the scope of "https://access.redhat.com/articles/1979103".  No sense in writing the exact same release note for this system.

Larry

Comment 66 Joseph Kachuck 2016-07-20 17:01:39 UTC
Hello,
The kbase has been updated:
https://access.redhat.com/articles/1979103

Thank You
Joe Kachuck

Comment 67 Trinh Dao 2016-07-20 17:58:50 UTC
Hi Joe,
since the RHEL7.2 EET kbase is posted, can you please add the RHEL7.2 kbase 'Hardware update requires updated version of RHEL to RHEL 7.2' to the BL920s Gen9 entry on the RH HCL: https://access.redhat.com/ecosystem/hardware/2165921 

thank you,
trinh

Comment 68 PaulB 2016-07-20 18:05:05 UTC
All,
EET testing has completed successfully:
 
EET: RHEL7.2Z 24TB RAM 768CPU HP Integrity Superdome X
kernel-3.10.0-327.18.2.el7.x86_64

======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - ** PASS **
EET x86_64 Xen       - N/A
EET x86_64 KVM       - ** PASS **
EET x86_64 Kdump     - ** PASS - KBASE REQUIRED **

KBASE: https://access.redhat.com/solutions/2438911


======================================
PERFORMANCE: BMarson
======================================
x86_64 Linpack - ** PASS **
x86_64 Stream  - ** PASS **

https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c17


======================================
LLOAD: LWoodman
======================================
x86_64 Lload - ** PASS - KBASE REQUIRED **

https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c60
KBASE: https://access.redhat.com/articles/1979103


Best,
-pbunyan

Comment 69 Joseph Kachuck 2016-07-20 18:30:50 UTC
Hello,
Kbases have been added.

Thank You
Joe Kachuck

Comment 70 Nigel Croxon 2016-07-20 19:32:08 UTC
Thank you Paul for all of your efforts here.

Comment 71 Tom Vaden 2016-07-20 19:55:32 UTC
(In reply to Nigel Croxon from comment #70)
> Thank you Paul for all of your efforts here.

ditto

Comment 72 Harald Hoyer 2016-08-11 10:20:18 UTC
(In reply to Dave Young from comment #28)
> udev.children-max=2 is for avoiding oom, to confirm Harald's concern, Paul,
> can you test again by removing udev.children-max=2 in kdump sysconfig file?

Right, I removed the RAM check back in 2013. So we now have the default value of:

children-max = 8 + CPU_COUNT * 2

Do you have a suggestion for how to shrink this according to the available RAM?

I am thinking of:

cpu_max = 8 + CPU_COUNT * 2
ram_max = ...
children-max = MIN(cpu_max, ram_max)

We used to calculate ram_max with:
ram_max = memsize_mb / 8

Any suggestion on how to calculate ram_max ?

How much memory is available in the kdump environment?
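Harald's proposal can be sketched as a small shell function (the /8 divisor is the historical ram_max formula he quotes; the right divisor is exactly what is in question here):

```shell
# Sketch of the proposed calculation, written as a function so it can be tried
# with different inputs.  Usage: children_max CPU_COUNT MEMSIZE_MB
children_max() {
    local cpu_max=$(( 8 + $1 * 2 ))   # current default: 8 + CPU_COUNT * 2
    local ram_max=$(( $2 / 8 ))       # historical RAM check: memsize_mb / 8
    # children-max = MIN(cpu_max, ram_max)
    if [ "$cpu_max" -lt "$ram_max" ]; then
        echo "$cpu_max"
    else
        echo "$ram_max"
    fi
}

children_max 768 524288   # 768 CPUs, plenty of RAM: CPU term wins -> 1544
children_max 768 512      # 768 CPUs but a 512 MB kdump kernel: RAM caps it -> 64
```

With a small crash kernel the RAM term dominates, which is the effect udev.children-max=2 was approximating by hand.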

Comment 75 PaulB 2016-08-11 13:11:39 UTC
HaraldH/DaveY,
This EET BZ is closed. The resolution for kdump was suggested/approved by the 
kdump team and the KBASE was completed.

I would suggest opening a new BZ to troubleshoot/investigate.
Add comments #71-73 to the new BZ and reference this BZ.

Best,
-pbunyan

Comment 76 Dave Young 2016-08-15 01:52:11 UTC
(In reply to Harald Hoyer from comment #72)
> (In reply to Dave Young from comment #28)
> > udev.children-max=2 is for avoiding oom, to confirm Harald's concern, Paul,
> > can you test again by removing udev.children-max=2 in kdump sysconfig file?
> 
> Right, I removed the RAM check back in 2013. So we now have the default
> value of:
> 
> children-max = 8 + CPU_COUNT * 2
> 
> Do you have a suggestion how to shrink this according to the RAM available?
> 
> I am thinking of:
> 
> cpu_max = 8 + CPU_COUNT * 2
> ram_max = ...
> children-max = MIN(cpu_max, ram_max)
> 
> We used to calculate ram_max with:
> ram_max = memsize_mb / 8
>
> Any suggestion on how to calculate ram_max ?

The original value means udev may run up to one worker thread per 8M of memory. I suspect that is too aggressive; during our test some processes like dhclient used a lot of memory, so maybe memsize_mb/128 is a more reasonable value. OTOH, even with /128, ram_max for 24T would be 196608, which is far too many; maybe there should also be a cap on ram_max. 

> 
> How much memory is available in the kdump environment?

Usually it is 160M + 64M/TB for crashkernel=auto (x86), but one can use a specific value such as crashkernel=512M on the kernel cmdline, as in this bug.
For ppc64, which uses 64K pages, more memory is needed in the kdump kernel.

Thanks
Dave
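As a hedged illustration, the 160M + 64M/TB rule of thumb works out as follows (actual crashkernel=auto sizing varies by RHEL version; this bug's SUT used an explicit crashkernel=512M,high instead):

```shell
# Approximate x86 crashkernel=auto reservation: 160M base + 64M per TB of RAM.
# Usage: crashkernel_auto_mb RAM_IN_TB
crashkernel_auto_mb() {
    echo $(( 160 + 64 * $1 ))
}

crashkernel_auto_mb 1    # 1 TB system
crashkernel_auto_mb 24   # this bug's 24 TB SUT
```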

Comment 77 Pingfan Liu 2016-10-20 02:48:07 UTC
Hi PaulB,

Can you set "DefaultTimeoutStartSec=700s" in /etc/systemd/system.conf and then run a test? It affects every service's timeout. Hopefully the system can then survive the "Time out waiting for device dev-mapper-mpathc1.device" error, which causes the failure of the kdump service.

Thx,
Pingfan


(In reply to PaulB from comment #20)

> [ TIME ] Timed out wa[ 114.023535] device-mapper: multipath service-time:
> version 0.2.0 loaded
> iting for device dev-mapper-mpathc1.device
> [DEPEND] Dependency failed for /kdumproot/mnt/hpstorage.
> [DEPEND] Dependency failed for Initrd Root File System.
> [DEPEND] Dependency failed for Reload Configuration from the Real Root.
> [DEPEND] Dependency failed for File System Check on /dev/mapper/mpathc1.
> [ OK ] Stopped Kdump Vmcore Save Service.
> [ OK ] Stopped dracut pre-pivot and cleanup hook.
> [ OK ] Stopped target Initrd Default Target.
> [ OK ] Reached target Initrd File Systems.
> [ OK ] Stopped dracut mount hook.
> [ OK ] Stopped target Basic System.
> [ OK ] Stopped target System Initialization.
> Starting Setup Virtual Console...
> StartinFailed to start kdump-error-handler.service: Transaction is
> destructive.
> [FAILED] Failed to start Kdump Emergency.
> See 'systemctl status emergency.service' for details.
> [DEPEND] Dependency failed for Emergency Mode.
> [ OK ] Started Setup Virtual Console.
> [ OK ] Found device /dev/disk/by-uuid/e1da41b9-bd30-4079-a3c7-0bf2ded9b31c.
> [ OK ] Found device /dev/disk/by-uuid/3EBF-1CBC.
> [ OK ] Found device /dev/disk/by-uuid/6e0da585-542f-4556-9477-ab84407a32e9.
> [ OK ] Found device /dev/mapper/rhel_hawk604a-root.
> Starting File System Check on /dev/mapper/rhel_hawk604a-root...
> [ OK ] Started File System Check on /dev/mapper/rhel_hawk604a-root.
> Starting File System Check on /dev/mapper/mpathc1...
> [ 102.675298] systemd-fsck[852]: /sbin/fsck.xfs: XFS file system.
> [ OK ] Started dracut initqueue hook.
> [ OK ] Started File System Check on /dev/mapper/mpathc1.
> Mounting /kdumproot/mnt/hpstorage...
> Mounting /sysroot...
> [ OK ] Reached target Remote File Systems (Pre).
> [ OK [ 115.004581] SGI XFS with ACLs, security attributes, no debug enabled
> ] Reached target Remote File Sys[ 115.013609] XFS (dm-2): Mounting V4
> Filesystem
> [ 115.013702] XFS (dm-7): Mounting V4 Filesystem
> tems.
> [ 115.099991] XFS (dm-2): Starting recovery (logdev: internal)
> [ 115.128176] XFS (dm-7): Starting recovery (logdev: internal)
> [ 115.215637] XFS (dm-7): Ending recovery (logdev: internal)
> [ OK ] Mounted /sysroot.
> [ 115.275986] XFS (dm-2): Ending recovery (logdev: internal)
> [ OK ] Mounted /kdumproot/mnt/hpstorage.
> ---<snip->---
> 
> 
> ============================================
> These are the relevant system kdump configs:
> ============================================
> ------------------------------------------
> cat /proc/cmdline
> ------------------------------------------
> [root@hawk604a ~]# cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64
> root=/dev/mapper/rhel_hawk604a-root ro crashkernel=512M,high
> rd.lvm.lv=rhel_hawk604a/root rd.lvm.lv=rhel_hawk604a/swap
> console=ttyS0,115200n81
> [root@hawk604a ~]#
> 
> 
> ------------------------------------------
> /etc/sysconfig/kdump
> ------------------------------------------
> [root@hawk604a ~]# cat /etc/sysconfig/kdump
> ---<snip->---
> #raw /dev/vg/lv_kdump
> #ext4 /dev/vg/lv_kdump
> #ext4 LABEL=/boot
> #ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
> #nfs my.server.com:/export/tmp
> #ssh user.com
> #sshkey /root/.ssh/kdump_id_rsa
> #path /var/crash
> xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
> path /dumpit/here
> core_collector makedumpfile -l --message-level 1 -c -d 31
> #core_collector scp
> #kdump_post /var/crash/scripts/kdump-post.sh
> #kdump_pre /var/crash/scripts/kdump-pre.sh
> #extra_bins /usr/bin/lftp
> #extra_modules gfs2
> #default shell
> #force_rebuild 1
> #dracut_args --omit-drivers "cfg80211 snd" --add-drivers "ext2 ext3"
> #fence_kdump_args -p 7410 -f auto -c 0 -i 10
> #fence_kdump_nodes node1 node2
> ---<snip->---
> 
> ------------------------------------------
> cat /etc/kdump.conf
> ------------------------------------------
> [root@hawk604a ~]# cat /etc/kdump.conf
> ---<snip->---
> #
> 
> #raw /dev/vg/lv_kdump
> #ext4 /dev/vg/lv_kdump
> #ext4 LABEL=/boot
> #ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
> #nfs my.server.com:/export/tmp
> #ssh user.com
> #sshkey /root/.ssh/kdump_id_rsa
> #path /var/crash
> xfs UUID=e1da41b9-bd30-4079-a3c7-0bf2ded9b31c
> path /dumpit/here
> core_collector makedumpfile -l --message-level 1 -c -d 31
> #core_collector scp
> #kdump_post /var/crash/scripts/kdump-post.sh
> #kdump_pre /var/crash/scripts/kdump-pre.sh
> #extra_bins /usr/bin/lftp
> #extra_modules gfs2
> #default shell
> #force_rebuild 1
> #dracut_args --omit-drivers "cfg80211 snd" --add-drivers "ext2 ext3"
> #fence_kdump_args -p 7410 -f auto -c 0 -i 10
> #fence_kdump_nodes node1 node2
> [root@hawk604a ~]#
> ---<snip->---
> 
> =========
> NOTE:
> =========
> We had similar issue in previous testing:
>  https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c18
> 
> Seems rd.retry=300 in /etc/sysconfig/kdump KDUMP_COMMANDLINE_APPEND
> is not working with our test kernel-3.10.0-327.18.2.el7.x86_64
> 
> Other than we are testing  kernel-3.10.0-327.18.2.el7.x86_64, the system is 
> configured exactly the same as described here:
>  https://bugzilla.redhat.com/show_bug.cgi?id=1311226#c18
> 
> 
> Thank you for your time, ruyang.

Comment 78 PaulB 2016-10-20 13:01:56 UTC
(In reply to Pingfan Liu from comment #77)
> Hi PaulB,
> 
> Can you set the value of "DefaultTimeoutStartSec=700s" in
> /etc/systemd/system.conf, then have a test? It affects all the service's
> timeout. If fortunately, I hope it can survive from "Time out waiting for
> device dev-mapper-mpathc1.device", which cause the failure of kdump service.
> 
> Thx,
> Pingfan
> 
> 
> (In reply to PaulB from comment #20)

Pingfan,
This system is no longer available to us. 
This was a remote EET (Extended Engineering Testing).

RE: https://bugzilla.redhat.com/show_bug.cgi?id=1346327#c75
---<-snip>---
This EET BZ is closed. The resolution for kdump was suggested/approved by the 
kdump team and KBASE was completed.

I would suggest opening a new BZ to troubleshoot/investigate.
Add comments #71-73 to the new BZ and reference this BZ.
---<-snip>---

Best,
-pbunyan

