Bug 1277234 - [HPS CERT] EET Testing: RHEL7.1 HP Integrity Superdome X - BL920s Gen9 System
Status: CLOSED WORKSFORME
Product: Extended Engineering Testing
Classification: Red Hat
Component: Limits-Testing
Assigned To: PaulB
Depends On:
Blocks: 1105196
Reported: 2015-11-02 13:26 EST by PaulB
Modified: 2016-09-06 23:25 EDT (History)
CC: 19 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-06 23:25:53 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description PaulB 2015-11-02 13:26:02 EST
System Under Test "SUT" Hardware Description:
1. Brief description of hardware
A) HP Integrity Superdome X - BL920s Gen9 System
B) E7-8890v3 2.50GHz, CPU count 16
C) Memory type, memory amount
   DDR4-2133 LRDimm, 12288TB 

2. Link to the Hardware Certification for existing system:
A) Base Certification
   RHEL7.1 submission - https://hardware.redhat.com/show.cgi?id=1257348
   RHEL6.6 submission - https://hardware.redhat.com/show.cgi?id=1255438

B) Supplemental Certification

3. List of known issues:
A) Existing BZ's - n/a
B) Existing Hardware Errata - n/a
C) Existing KBase articles - n/a

4. Memory specifications:
A) What is the expected bandwidth of the memory subsystem system wide?
(If we run many instances of memory intensive applications where
each application does not cross NUMA boundaries, how much
aggregate bandwidth might we expect on the server?)

~762 GB/s at 16 sockets or ~48 GB/s per socket of memory bandwidth (read only) with RAS features enabled.
~1200 GB/s at 16 sockets or ~75 GB/s per socket of memory bandwidth (read only) with RAS features disabled.

B) Does the memory subsystem support NORMAL -vs- PERFORMANCE
mode at the management/BIOS layer? Yes 
If so what is it set to?
Default is DDDC mode = performance mode

C) How many memory channels per socket for specific CPU?
The Integrity Superdome X contains 8 BL920s Gen9 blades
  Each of the 8 blades has 2 CPU sockets.
  Each CPU socket has 2 memory channels each connecting to 2 memory controllers that contain 6 Dimms each.
  Each CPU socket has 24 Dimms
  Each blade has 48 Dimms
  Total system Dimm capacity is 384 Dimms
  384 x 32GB DDR4-2133 LRDimm = 12288TB of system memory installed

D) How many channels per socket are actually populated on the test
system?
Each of the 16 CPU sockets has all memory slots populated - 24 x 32GB DDR4-2133 LRDimms = 768GB per CPU socket (example commands to verify this layout are shown below).
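
For reference, the DIMM population and per-node totals described above can be confirmed from the running OS with standard tools. A minimal sketch (output formats vary by firmware and distro; these are illustrative commands, not the ones captured during this certification run):

 # list the size reported for every DIMM slot (SMBIOS memory devices)
 dmidecode -t memory | grep "Size:"

 # per-NUMA-node CPU and memory layout; 16 nodes are expected on this system
 numactl --hardware

 # total memory as seen by the kernel
 grep MemTotal /proc/meminfo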
Comment 1 PaulB 2015-11-02 13:32:25 EST
All,

Additional requirements for “Remote” EET

• Testing remotely is approximately 3 weeks per Arch/RHEL.

• RHEL/Fedora VPN client instructions and individual credentials.

• Second system to be used as an NFS server and an HTTP yum repository server.

• Access to the "System Under Test" (example access commands are sketched below)
   - serial console (ipmi)
   - tty console (ilo)
   - reboot capabilities (ipmi)

• Access to the RHEL ISO for the "System Under Test"

• "System Under Test" must meet required minimums
   - e.g. x86_64: 1 GB minimum / 1 GB per logical CPU

• "System Under Test" to be at production BIOS
   - BIOS in default shipping mode.

• Network diagram of SUT configuration.
   - DHCP or Static addresses for virtualization testing.

• Lload testing requires storage equal to 2X the installed RAM, for thorough investigation. (If the system has 12TB of RAM we should have at least 24TB of storage available.)
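
As referenced in the console-access item above, out-of-band access is typically exercised with standard ipmitool commands along these lines (hostname and credentials are placeholders, not the actual lab values):

 # serial-over-LAN console to the SUT's BMC
 ipmitool -I lanplus -H sut-bmc.example.com -U admin -P <password> sol activate

 # remote power control for reboot testing
 ipmitool -I lanplus -H sut-bmc.example.com -U admin -P <password> chassis power cycle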

Please let me know if you have any questions or comments.
Best,
-pbunyan
Comment 2 PaulB 2015-12-01 13:53:04 EST
All, 
Extended Engineering Testing (EET) testing:
 RHEL-7.1 Extended Engineering Testing (EET) 
 24TB 576CPU HP Integrity Superdome X - BL920s Gen9 System
 (aka Griffin Hawk) 

======================================
TARGET HOST DETAILS:
======================================
Hostname = hawk604a.local
           HP Integrity Superdome X - BL920s Gen9
Arch = x86_64
Distro = RHEL-7.1
Kernel = 3.10.0-229.el7.x86_64
CPU count =  576
CPU model name = Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz
BIOS Information = 
 Vendor: HP
 Version: Bundle: 007.006.000 SFW: 033.162.000
 Release Date: 10/30/2015
 ROM Size: 12288 kB
MemTotal = 25364979524 kB
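
(The host details above are the sort of data normally gathered with standard commands; a hedged sketch, not the exact collection script used:)

 uname -m; uname -r                        # Arch / Kernel
 cat /etc/redhat-release                   # Distro
 lscpu | grep -E 'Model name|^CPU\(s\)'    # CPU model name and count
 dmidecode -t bios                         # BIOS vendor, version, release date, ROM size
 grep MemTotal /proc/meminfo               # MemTotal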

There are three stages of EET testing:
[] Fundamentals (PBunyan pbunyan@redhat.com)
[] Performance  (BMarson bmarson@redhat.com)
[] Lload        (LWoodman lwoodman@redhat.com)


Currently we are running the Fundamentals stage of testing:
======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - ~80% complete
EET x86_64 Xen -       N/A
EET x86_64 KVM -       scheduled
EET x86_64 Kdump -     scheduled

Best,
-pbunyan
Comment 3 PaulB 2015-12-01 13:58:17 EST
BarryM,
I ran the linpack and stream performance testing.
I have posted the results for your review:
 http://file.bos.redhat.com/pbunyan/EET/HP/HP_GriffinHawk_12TB_576CPU/RHEL-7.1/Performance/

Please provide a short summary of the performance testing results.

======================================
PERFORMANCE: BMarson
======================================
x86_64 Linpack - results posted for review...
x86_64 Stream -  results posted for review...


Best,
-pbunyan
Comment 4 PaulB 2015-12-01 14:10:08 EST
(In reply to PaulB from comment #3)
> BarryM,
> I ran the linpack and stream performance testing.
> I have posted the results for your review:
>  http://file.bos.redhat.com/pbunyan/EET/HP/HP_GriffinHawk_12TB_576CPU/RHEL-7.
> 1/Performance/

Barry,
This is a 24TB system. I have moved the location of the performance testing results to the following path:
 http://file.bos.redhat.com/pbunyan/EET/HP/HP_GriffinHawk_24TB_576CPU/RHEL-7.1/Performance/

Best,
-pbunyan

> 
> Please provide a short summary of the performance testing results.
> 
> ======================================
> PERFORMANCE: BMarson
> ======================================
> x86_64 Linpack - results posted for review...
> x86_64 Stream -  results posted for review...
> 
> 
> Best,
> -pbunyan
Comment 5 Nigel Croxon 2015-12-03 11:12:48 EST
Just to note, in the description of this bz the memory size is wrong.
Every other reference in the bz is correct: 24TB of memory, not 12TB. 


System Under Test "SUT" Hardware Description:
1. Brief description of hardware
A) HP Integrity Superdome X - BL920s Gen9 System
B) E7-8890v3 2.50GHz, CPU count 16
C) Memory type, memory amount
   DDR4-2133 LRDimm, 12288TB
Comment 6 PaulB 2015-12-03 13:48:37 EST
(In reply to Nigel Croxon from comment #5)
> Just to note, in the description of this bz the memory size is wrong.
> Every other reference in the bz is correct: 24TB of memory, not 12TB. 
> 
> 
> System Under Test "SUT" Hardware Description:
> 1. Brief description of hardware
> A) HP Integrity Superdome X - BL920s Gen9 System
> B) E7-8890v3 2.50GHz, CPU count 16
> C) Memory type, memory amount
>    DDR4-2133 LRDimm, 12288TB

Nigel,
Agreed. Thanks for the comment.
This detail is also mentioned here:
 https://bugzilla.redhat.com/show_bug.cgi?id=1277234#c2

Best,
-pbunyan
Comment 7 PaulB 2015-12-04 15:26:50 EST
All,
EET testing status update...

Currently I am running the Fundamentals stage of testing:
======================================
FUNDAMENTALS: PBunyan
======================================
EET x86_64 Baremetal - ** PASSED **
EET x86_64 Xen -       N/A
EET x86_64 KVM -       ~70% complete
EET x86_64 Kdump -     ** PASSED **

I just wanted to make note of a few known issues that I experienced during
the baremetal portion of the fundamentals stage of testing:
[] Bug 1276398 - parallel memory allocation with numa balancing enabled on 
   large systems stalls 
   https://bugzilla.redhat.com/show_bug.cgi?id=1276398

[] Bug 1261799 - ltp/oom1 cause the system hang
   https://bugzilla.redhat.com/show_bug.cgi?id=1261799 

[] Bug 1024017 - ERROR: Skipped too many probes, check MAXSKIPPED or try again 
   with stap -t for more details.
   https://bugzilla.redhat.com/show_bug.cgi?id=1024017


======================================
PERFORMANCE: BMarson
======================================
The performance testing results were posted for BarryM to review:
 https://bugzilla.redhat.com/show_bug.cgi?id=1277234#c4


======================================
Lload: LWoodman
======================================
I have spoken with LarryW. 
Lload testing is scheduled to begin on Monday Dec07.


Best,
-pbunyan
Comment 8 Larry Woodman 2015-12-11 12:13:46 EST
I've finished testing the 24TB/16Node/576CPU HP Integrity Superdome X - BL920s Gen9 System running RHEL-7.1.  At this point I would say that Red Hat's support of this system is marginal.  We are probably at the system size limit where we will need to make some significant kernel changes in order to properly support systems this large as well as the workloads they can support.  At a minimum we do need to provide a release note and/or a kbase article describing the problems we encountered, how to tune around them, and the limitations/expectations.

My problems consisted of consuming all 24TB of memory in the pagecache by simply creating millions of files, consuming the RAM, and then re-reading them back in.  After all memory was consumed in the pagecache I applied an anonymous memory load on every CPU that collectively consumed the entire 24TB but should not overcommit and cause swapping.  I expected the system to quickly reclaim the pagecache memory for the anonymous memory load.  Instead the system experienced a lot of difficulty reclaiming the pagecache memory and eventually hung with spinlock timeout messages on the majority of the 576 CPUs on this system.

After debugging the kernel code to determine the cause of this problem I discovered that the pagecache memory was not evenly distributed between the 16 numa nodes on this system because I read the files into the pagecache on a small subset of the CPUs.  This resulted in total memory exhaustion on some numa nodes and no memory usage on other nodes.  I also realized that /proc/sys/vm/zone_reclaim_mode is set to zero even though this system has 16 nodes and numa factors equal to 30 in the table displayed with "numactl --hardware".  This caused the CPUs running on the exhausted nodes to skip over to the next node with available memory and attempt the memory allocation there.  Eventually, when all memory was exhausted, most of the 576 CPUs entered direct reclaim on just one or two numa nodes, resulting in excessive spinlock contention on the zone->lru_lock for those nodes.  Eventually the system
hung as described above.

There are a few ways of reducing and even eliminating this problem.  The first solution to prevent this hang is to set /proc/sys/vm/zone_reclaim_mode to one.  This causes the system to reclaim the pagecache memory on the exhausted node rather than skipping over to the node with available memory.  So, I recommend we set /proc/sys/vm/zone_reclaim_mode to one on all systems with this amount of memory, CPUs and numa nodes.  This will help prevent the excessive spinlock contention that occurs when the system is exposed to total memory exhaustion and enters direct reclaim on all CPUs.  The second solution is to disable Transparent Huge Pages.  This does a couple of things: 1.) throttles the page allocation so the memory reclaim code can keep up and 2.) eliminates the khugepaged splitting up of 2MB pages that happens as part of page reclaim while holding spinlocks.

Finally, the boot time initialization code that sets /proc/sys/vm/zone_reclaim_mode was changed upstream and backported to RHEL7.1.  Before this change /proc/sys/vm/zone_reclaim_mode was set to one whenever the largest numa factor exceeded 20; the change raised that threshold so it is only set to one when the largest numa factor exceeds 30.  This is one reason the 24TB/16Node/576CPU HP Integrity Superdome X - BL920s Gen9 System is experiencing this problem and therefore needs the tuning of /proc/sys/vm/zone_reclaim_mode.  The other reason is that THP allocation is so much faster than 4KB page allocation and therefore applies much more memory pressure.
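
For reference, a minimal shell sketch of checking the NUMA factor and applying the two workarounds described above at runtime (standard RHEL 7 / upstream paths; shown as an illustration rather than the exact commands used during testing):

 # inspect the NUMA distance table; the largest factor on this system is 30
 numactl --hardware

 # workaround 1: reclaim pagecache within the exhausted node instead of
 # spilling allocations over to other nodes
 echo 1 > /proc/sys/vm/zone_reclaim_mode

 # workaround 2: disable Transparent Huge Pages
 echo never > /sys/kernel/mm/transparent_hugepage/enabled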
Comment 9 PaulB 2015-12-14 10:01:42 EST
BarryM,
The system is all yours for Performance testing investigation.
Please provide a short summary of the testing results, when you have completed your review.

Best,
-pbunyan
Comment 10 Barry Marson 2015-12-17 11:26:52 EST
I finally got on the server and was able to understand a concern I had with PaulB's linpack/streams results from several weeks back.  Essentially, my test didn't think NUMA was enabled and thus didn't run the NUMA-specific tests.  In reality NUMA was enabled, but the numactl rpm, which is a dependency for my tests, was not installed.  This oversight was due to a few things, but mostly related to manual testing, which doesn't install beaker dependencies automatically.

I had no issues running these tests with HT enabled.

The newer version of linpack/streams, which has better granularity for scale testing, was rerun, and here are the results.

***

The C based linpack achieved 608Gflops for single precision (more of an L2 cache test) and 247Gflops with double precision (interacts more with main memory).

The C based streams achieved 482GB/sec running instances on all cores of each socket.  These results were with memory/cpu pinning by NUMA node.  Default scheduling may show a higher bandwidth, but that is an artifact of skewed scheduling of the test instances, which reduces memory contention and stretches out the run time.
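
The "memory/cpu pinning by NUMA node" mentioned above is typically done along these lines (a sketch only; the stream binary name, output paths and the 16-node count are assumptions, not the actual test harness):

 # run one stream instance bound to each NUMA node, for both CPUs and memory
 for node in $(seq 0 15); do
     numactl --cpunodebind=$node --membind=$node ./stream > stream.node$node.out &
 done
 wait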

Both of these results are a nice bump over the dragonhawk results from last spring/summer.

The tests pass.

Barry
Comment 11 PaulB 2015-12-17 16:23:25 EST
(In reply to Larry Woodman from comment #8)
> I've finished testing the 24TB/16Node/576CPU HP Integrity Superdome X -
> BL920s Gen9 System running RHEL-7.1.  At this point I would say that
> Red Hat's support of this system is marginal.  We are probably at the
> system size limit where we will need to make some significant kernel changes
> in order to properly support systems this large as well as the workloads
> they can support.  At a minimum we do need to provide a release note and/or
> a kbase article describing the problems we encountered, how to tune around
> them, and the limitations/expectations.
> 
> My problems consisted of consuming all 24TB of memory in the pagecache
> by simply creating millions of files, consuming the RAM, and then
> re-reading them back in.  After all memory was consumed in the pagecache I
> applied an anonymous memory load on every CPU that collectively consumed the
> entire 24TB but should not overcommit and cause swapping.  I expected the
> system to quickly reclaim the pagecache memory for the anonymous memory
> load.  Instead the system experienced a lot of difficulty reclaiming the
> pagecache memory and eventually hung with spinlock timeout messages on the
> majority of the 576 CPUs on this system.
> 
> After debugging the kernel code to determine the cause of this problem I
> discovered that the pagecache memory was not evenly distributed between the
> 16 numa nodes on this system because I read the files into the pagecache on
> a small subset of the CPUs.  This resulted in total memory exhaustion on
> some numa nodes and no memory usage on other nodes.  I also realized that
> /proc/sys/vm/zone_reclaim_mode is set to zero even though this system has 16
> nodes and numa factors equal to 30 in the table displayed with "numactl
> --hardware".  This caused the CPUs running on the exhausted nodes to
> skip over to the next node with available memory and attempt the memory
> allocation there.  Eventually, when all memory was exhausted, most of the 576
> CPUs entered direct reclaim on just one or two numa nodes, resulting in
> excessive spinlock contention on the zone->lru_lock for those nodes.
> Eventually the system hung as described above.
> 
> There are a few ways of reducing and even eliminating this problem.  The
> first solution to prevent this hang is to set /proc/sys/vm/zone_reclaim_mode
> to one.  This causes the system to reclaim the pagecache memory on the
> exhausted node rather than skipping over to the node with available memory.
> So, I recommend we set /proc/sys/vm/zone_reclaim_mode to one on all systems
> with this amount of memory, CPUs and numa nodes.  This will help prevent the
> excessive spinlock contention that occurs when the system is exposed to
> total memory exhaustion and enters direct reclaim on all CPUs.  The second
> solution is to disable Transparent Huge Pages.  This does a couple of things:
> 1.) throttles the page allocation so the memory reclaim code can keep up and
> 2.) eliminates the khugepaged splitting up of 2MB pages that happens as part
> of page reclaim while holding spinlocks.
> 
> Finally, the boot time initialization code that sets
> /proc/sys/vm/zone_reclaim_mode was changed upstream and backported to
> RHEL7.1.  Before this change /proc/sys/vm/zone_reclaim_mode was set to one
> whenever the largest numa factor exceeded 20; the change raised that
> threshold so it is only set to one when the largest numa factor exceeds 30.
> This is one reason the 24TB/16Node/576CPU HP Integrity
> Superdome X - BL920s Gen9 System is experiencing this problem and
> therefore needs the tuning of /proc/sys/vm/zone_reclaim_mode.  The other
> reason is that THP allocation is so much faster than 4KB page
> allocation and therefore applies much more memory pressure.

All,
There was a meeting today with NigelC, TomV, LarryW, PaulB, and BarryM.
Larry reviewed the results of his Lload testing with the attendees. There 
was an issue found by LarryW during Lload testing. 
[] HP (NigelC and TomV) will be getting Red Hat access to an equivalent
   system to allow Red Hat (LarryW) to further investigate the issue.
[] An internal Red Hat meeting has been scheduled to discuss the EET
   results and the possibility of a "release note" detailing the Lload
   testing issue.

Best,
-pbunyan
Comment 12 Tom Vaden 2015-12-17 16:55:15 EST
Paul:

HPE had anticipated that success for 24TB for this platform would cover the 12TB case as well.

If we don't have a totally clean bill of health for 24TB, can this level of testing still cover the 12TB case and allow for a full statement of support from Red Hat for 12TB with 7.1 for this platform?

thanks,
tom
Comment 13 PaulB 2015-12-21 11:31:54 EST
(In reply to Tom Vaden from comment #12)
> Paul:
> 
> HPE had anticipated that success for 24TB for this platform would cover the
> 12TB case as well.
> 
> If we don't have a totally clean bill of health for 24TB, can this level of
> testing still cover the 12TB case and allow for a full statement of support
> from Red Hat for 12TB with 7.1 for this platform?

Tom - No. EET testing would need to be rerun for the 12TB configuration.
Best,
-pbunyan

> 
> thanks,
> tom
Comment 14 PaulB 2015-12-21 11:44:06 EST
All,
Larry Woodman will be writing a "release note" regarding the issues he encountered during the Lload Testing stage of the Extended Engineering Testing
(EET). Based on his release note, LarryW stated he will "PASS" the Lload Testing stage of EET.

Note:
This is the result of an internal Red Hat conference call with Larry Woodman held today. 


Best,
-pbunyan
Comment 15 Tom Vaden 2015-12-21 11:53:03 EST
(In reply to PaulB from comment #13)
> (In reply to Tom Vaden from comment #12)
> > Paul:
> > 
> > HPE had anticipated that success for 24TB for this platform would cover the
> > 12TB case as well.
> > 
> > If we don't have a totally clean bill of health for 24TB, can this level of
> > testing still cover the 12TB case and allow for a full statement of support
> > from Red Hat for 12TB with 7.1 for this platform?
> 
> Tom - No. EET testing would need to be rerun for the 12TB configuration.
> Best,
> -pbunyan
> 
> > 
> > thanks,
> > tom

Paul:

Since (I believe) in the 24TB case everything was deemed to pass except for Larry's performance part, could we at least short circuit the 12TB EET to only require Larry's blessing for the performance testing?

thanks,
tom
Comment 16 Tom Vaden 2015-12-21 11:54:59 EST
(In reply to PaulB from comment #14)
> All,
> Larry Woodman will be writing a "release note" regarding the issues he
> encountered during the Lload Testing stage of the Extended Engineering
> Testing
> (EET). Based on his release note LarryW stated he will "PASS" the Lload
> Testing stage of EET.
> 
> Note:
> Result of internal Red Hat conference call with Larry Woodman held today. 
> 
> 
> Best,
> -pbunyan

Paul:

Appreciate the update.

So, what does that translate into for the RHEL7.1 support statement for the 24TB configuration?

thanks,
tom
Comment 17 PaulB 2015-12-21 13:34:26 EST
(In reply to Tom Vaden from comment #16)
> (In reply to PaulB from comment #14)
> > All,
> > Larry Woodman will be writing a "release note" regarding the issues he
> > encountered during the Lload Testing stage of the Extended Engineering
> > Testing
> > (EET). Based on his release note LarryW stated he will "PASS" the Lload
> > Testing stage of EET.
> > 
> > Note:
> > Result of internal Red Hat conference call with Larry Woodman held today. 
> > 
> > 
> > Best,
> > -pbunyan
> 
> Paul:
> 
> Appreciate the update.
> 
> So, what does that translate into for the RHEL7.1 support statement for the
> 24TB configuration?
> 
> thanks,
> tom

Tom,
In short:
System has passed EET testing.


All,
RHEL-7.1 Extended Engineering Testing on the HP Integrity 
Superdome X - BL920s Gen9 (aka GriffinHawk) has passed.
There are three stages of EET testing: Fundamentals, Performance,
and Lload. 

Testing results listed below for reference.
======================================
TARGET HOST DETAILS:
======================================
Hostname = hawk604a.local
           HP Integrity Superdome X - BL920s Gen9
Arch = x86_64
Distro = RHEL-7.1
Kernel = 3.10.0-229.el7.x86_64
CPU count =  576
CPU model name = Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz
BIOS Information = 
 Vendor: HP
 Version: Bundle: 007.006.000 SFW: 033.162.000
 Release Date: 10/30/2015
 ROM Size: 12288 kB
MemTotal = 25364979524 kB

======================================
FUNDAMENTALS: Paul Bunyan
======================================
EET x86_64 Baremetal - ** PASSED **
EET x86_64 Xen -       N/A
EET x86_64 KVM -       ** PASSED **
EET x86_64 kdump -     ** PASSED **

======================================
PERFORMANCE: Barry Marson
======================================
x86_64 Linpack - ** PASSED **
x86_64 Stream  - ** PASSED ** 

======================================
LLOAD: Larry Woodman
======================================
EET x86_64 Lload - ** PASSED - RELEASE NOTE REQUIRED** 

Thank you Barry Marson and Larry Woodman for your time and expertise.

Best,
-pbunyan
Comment 18 Nigel Croxon 2015-12-22 07:44:35 EST
Thank you Red Hatters.

It has been a pleasure working with you on this testing.

-Nigel
Comment 19 PaulB 2015-12-22 08:29:31 EST
(In reply to Nigel Croxon from comment #18)
> Thank you Red Hatters.
> 
> It has been a pleasure working with you on this testing.
> 
> -Nigel

Nigel,
Just wanted to follow up from our joint meeting last week.
HP has agreed to provide a comparable system to Red Hat in 
order for Larry Woodman to further investigate/troubleshoot 
the Lload testing issue. Please contact (email) Larry Woodman and myself
once the test system is available.

Thank you, Nigel.
Best,
-pbunyan
Comment 20 Larry Woodman 2015-12-22 08:42:11 EST
When running large memory and CPU intensive workloads on a system as large as the HP Integrity Superdome X - BL920s Gen9 System (24TB of memory and storage, 16 numa nodes each with 36 CPUs and a total CPU count of 576) it is possible to encounter problems reclaiming pages of memory from the pagecache fast enough.  When this happens free memory will become exhausted, most memory will be in the pagecache and most or even all CPUs will write "CPU x stuck for 22s" messages to the console.  If memory pressure is high enough the system will not recover and must be rebooted.  The likelihood of this happening can be reduced and even eliminated, but with a potential performance reduction.

To reduce the likelihood of encountering this hang, but not totally eliminate it, the system administrator should set /proc/sys/vm/zone_reclaim_mode to 1 instead of the default 0 value.  This will force the system to reclaim pages from memory exhausted numa nodes rather than allocating pages on other nodes.  This can result in slower startup/initialization times for processes that allocate lots of memory but faster run times once they are totally up and running.  This is due to more local node memory references and fewer remote node memory references when the process is running.  Another side effect of this setting is a reduction in the probability of encountering the CPU timeout messages and potential hang described above.

To totally eliminate the hang the system administrator can disable Transparent Huge Pages (THP) by: 
"echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled".  
This totally eliminates the hang by 1.) throttling the page allocation so the memory reclaim code can keep up with the demand and 2.) eliminating the khugepaged splitting up of 2MB pages that happens as part of page reclaim while holding spinlocks.  Disabling THP does however introduce the possibility of a significant performance reduction for some systems and applications because memory is allocated and mapped in only 4KB sections rather than 2MB sections.

If the hang described above is encountered you should first try setting zone_reclaim_mode to 1 and rerun the application(s) to determine whether the hang still occurs.  If the hang continues to occur you should disable THP to eliminate the hang and then attempt to measure the possible performance consequences of doing that.
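
For completeness, a minimal sketch of how these recommendations are typically made persistent (file names below are illustrative; on stock RHEL 7 kernels the THP sysfs path is usually /sys/kernel/mm/transparent_hugepage/enabled rather than the redhat_ prefixed path):

 # persist the zone_reclaim_mode setting across reboots
 echo 'vm.zone_reclaim_mode = 1' > /etc/sysctl.d/99-zone-reclaim.conf
 sysctl -p /etc/sysctl.d/99-zone-reclaim.conf

 # disable THP at boot rather than at runtime, if required,
 # by adding the kernel command line option:
 #   transparent_hugepage=never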
Comment 21 Nigel Croxon 2015-12-22 09:01:25 EST
Hello Paul and Larry,

The system is back, available in the partner lab, but with less memory, 1TB now.
It is yours to use.

Our group did not own that memory (24TB), so we had to return it.
But we are working on getting a comparable system.
We will let you know more, when we know more.

-Nigel
Comment 22 PaulB 2015-12-22 09:54:00 EST
(In reply to Nigel Croxon from comment #21)
> Hello Paul and Larry,
> 
> The system is back, available in the partner lab, but with less memory, 1TB
> now.
> It is yours to use.
> 
> Our group did not own that memory (24TB), so we had to return it.
> But we are working on getting a comparable system.
> We will let you know more, when we know more.
> 
> -Nigel

Nigel,
As we discussed in our joint meeting - a comparable system configuration
of equal RAM / CPU / and remote storage will be required in order to troubleshoot/reproduce the Lload testing issue.

LarryW is out of town at the moment -and- Red Hat is entering a holiday break.
Please contact LarryW via email regarding scheduling his testing on a "comparable system". 

Best,
-pbunyan
Comment 23 Nigel Croxon 2015-12-22 09:59:09 EST
Paul,

Yes, we will be contacting Larry after the Holidays.

Enjoy your break,

-Nigel
Comment 24 PaulB 2016-01-13 10:28:36 EST
(In reply to PaulB from comment #22)
> (In reply to Nigel Croxon from comment #21)
> > Hello Paul and Larry,
> > 
> > The system is back, available in the partner lab, but with less memory, 1TB
> > now.
> > It is yours to use.
> > 
> > Our group did not own that memory (24TB), so we had to return it.
> > But we are working on getting a comparable system.
> > We will let you know more, when we know more.
> > 
> > -Nigel
> 
> Nigel,
> As we discussed in our joint meeting - a comparable system configuration
> of equal RAM / CPU / and remote storage will be required in order to
> troubleshoot/reproduce the Lload testing issue.
> 
> LarryW is out of town at the moment -and- Red Hat is entering a holiday
> break.
> Please contact LarryW via email regarding scheduling his testing on a
> "comparable system". 
> 
> Best,
> -pbunyan

Nigel,
Is there an ETA on the availability of a comparable system for Larry Woodman
to investigate the Lload testing issue, as we had previously discussed?

Thank you.
Best,
-pbunyan
Comment 25 Nigel Croxon 2016-01-13 11:20:48 EST
Hello Paul,

I don't have an ETA yet.  We are working on scheduling the system.
When I know more, I will update this BZ.

-Nigel
Comment 26 PaulB 2016-04-29 16:20:10 EDT
All,
Adding Gary Case, as there is a kbase article / release note required:
 https://bugzilla.redhat.com/show_bug.cgi?id=1277234#c20

Best,
-pbunyan
Comment 27 Gary Case 2016-05-02 17:08:15 EDT
Hi Paul,

I'll talk to the HP EAM (formerly called "pTAM") about writing a kbase based on Larry's excellent comment 20 and get that attached to the certification.
Comment 28 Nigel Croxon 2016-09-06 11:27:16 EDT

Can this bugzilla be closed?
Comment 29 PaulB 2016-09-06 11:30:15 EDT
(In reply to Gary Case from comment #27)
> Hi Paul,
> 
> I'll talk to the HP EAM (formerly called "pTAM") about writing a kbase based
> on Larry's excellent comment 20 and get that attached to the certification.


Gary,
Is this KBASE complete?

best,
-pbunyan
Comment 30 Gary Case 2016-09-06 18:05:20 EDT
It looks like this one's been done for some time. The certification is published and there's a kbase solution attached to it that explains Larry's suggestions:

https://access.redhat.com/articles/1979103
Comment 31 PaulB 2016-09-06 23:25:53 EDT
Thank you, GaryC...

All,
EET Testing: RHEL7.1 HP Integrity Superdome X - BL920s Gen9 System 
was completed successfully:
 https://bugzilla.redhat.com/show_bug.cgi?id=1277234#c17

Required KBASE is complete:
 https://access.redhat.com/articles/1979103

Best,
-pbunyan
