Bug 1479269 - EET Request: RHEL7.3 - Lenovo System x3950 x6 with 24TB RAM.
Status: NEW
Product: Extended Engineering Testing
Classification: Red Hat
Component: Limits-Testing
Assigned To: William Gomeringer
Depends On:
Blocks: 1438583 1450449
 
Reported: 2017-08-08 04:39 EDT by Chris McDermott
Modified: 2017-10-17 05:35 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
'numactl --hardware' output from 24TB configuration (3.11 KB, application/octet-stream)
2017-09-10 19:53 EDT, Chris McDermott

Description Chris McDermott 2017-08-08 04:39:03 EDT
System Under Test "SUT" Hardware Description:
1. Brief description of hardware
A) SUT Info:
   Lenovo System x3950 x6
   
B) CPU Info:
   Intel Xeon E7-8894v4 (Broadwell) 24C 
   8-socket (192C / 384T)
   
C) Memory Info:
   DDR4, 192x128GB DIMM (24TB)
   dimm part#'s=
   memory amount=

2. Link to the Hardware Certifications for existing system

https://access.redhat.com/ecosystem/hardware/2381681

3. List known issues

A) Existing BZ's

https://bugzilla.redhat.com/show_bug.cgi?id=1479234

B) Existing Hardware Errata - N/A
C) Existing KBase articles - N/A

4. Memory specifications
Please provide a brief description for the following:
A) What is the expected bandwidth of the memory subsystem system wide?
   (If we run many instances of memory intensive applications where
   each application does not cross NUMA boundaries, how much
   aggregate bandwidth might we expect on the server?)
2133 MHz.

B) Does the memory subsystem support NORMAL -vs- PERFORMANCE
   mode at the management/BIOS layer? If so what is it set to?
Performance by default. 

C) How many memory channels per socket for specific CPU?
24 DIMM Slots per Socket (192 Total), 4 memory channels w/SMI2 (Jordan Creek) memory buffers per Socket, 6 DIMMs per memory buffer.

D) How many channels per socket are actually populated on the SUT?
All of them.
Comment 2 Monte Knutson 2017-08-18 11:08:15 EDT
Chris, 

Do you know if the test team has posted the supplemental 24TB testing to the existing certification log or did they create a new certification request and add it there? 

The EET team needs to know where it is, and whether it is already completed, so they can schedule this testing.

Thanks, 

Monte
Comment 3 Chris McDermott 2017-08-21 18:12:16 EDT
(In reply to Monte Knutson from comment #2)
> Chris, 
> 
> Do you know if the test team has posted the supplemental 24TB testing to the
> existing certification log or did they create a new certification request
> and add it there? 
> 
> The EET team needs to know where it is, and whether it is already completed,
> so they can schedule this testing.
> 
> Thanks, 
> 
> Monte

Monte, I think the test team posted a new certification with 24TB. But Amy, on CC, should be able to answer that question.
Comment 6 Barry Marson 2017-08-28 10:38:11 EDT
Chris,

Regarding comment #1 Question 4A, I'm looking for the per-NUMA-node and total memory bandwidth (GB/sec), not the memory speed. This way, when the stream-based runs are executed, we know whether our results are in the ballpark of the machine's capabilities.

Thanks
Barry
Comment 7 William Gomeringer 2017-08-28 12:22:55 EDT
Hi Chris,
Do you know if we have this system at Red Hat or do we need to access remotely on the Lenovo side? If the former then I am still looking for the system. If the latter can you please provide information on how to access the system and please see additional EET requirement for systems not within Red Hat:

-----------------------------------------
Additional requirements for "Remote" EET:
-----------------------------------------
• Testing remotely is approximately 3 weeks per Arch/RHEL.

• RHEL/Fedora VPN client instructions and individual credentials.

• Second system to be used as an NFS server and an HTTP yum repository server.

• Access to the "System Under Test" 
   - serial console (ipmi)
     (how to connect to SUT serial console)
   - tty console (ilo)
   - reboot capabilities (ipmi)
     (how to reboot SUT in the event of system hang)

• Access to the RHEL ISO for the "System Under Test"

• "System Under Test" must meet required minimums
   - ie. x86_64 1GB minimum/1 GB/logical CPU

• "System Under Test" to be at production BIOS
   - BIOS in default shipping mode.

• Network diagram of SUT configuration.
   - DHCP or Static addresses for virtualization testing.
     (need static IP for KVM Guest testing)

• Lload testing requires storage = 2X installed ram, for thorough investigation. 
  (If the system has 12TB we should have at least 24TB of storage available)
  (need storage attached, formatted, and added to fstab)

* Run memtest on SUT for 24-48hrs and confirm there are no memory errors.
  (we have had several instances in the past of failing DIMM/DIMMs on
   your target SUT - please confirm there are no issues)

* Please confirm approximate reboot time of SUT and that there are 
   no hardware errors reported in dmesg.
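
The two numeric items in the checklist above can be sanity-checked with a small sketch. This is only an illustration, not an EET tool: the function name is hypothetical, the "1GB minimum/1 GB/logical CPU" line is read here as "at least 1 GB in total and at least 1 GB per logical CPU", and 1 TB is assumed to be 1024 GB.

```python
def meets_remote_eet_minimums(ram_gb, logical_cpus, storage_tb):
    """Check the numeric minimums from the remote-EET checklist (as interpreted here).

    Returns a tuple (ram_ok, storage_ok).
    """
    # Assumed reading: 1 GB minimum overall, and 1 GB per logical CPU.
    ram_ok = ram_gb >= 1 and ram_gb >= logical_cpus * 1
    # Checklist: storage = 2X installed RAM (assuming 1 TB = 1024 GB).
    storage_ok = storage_tb >= 2 * (ram_gb / 1024)
    return ram_ok, storage_ok


# 12 TB RAM example from the checklist: 24 TB of storage is just enough.
print(meets_remote_eet_minimums(12 * 1024, 384, 24))  # (True, True)
# A 24 TB RAM system with the same 24 TB of storage falls short of 2X.
print(meets_remote_eet_minimums(24 * 1024, 384, 24))  # (True, False)
```

The 384 logical CPUs are taken from the SUT description (8-socket, 192C / 384T).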

Testing should start today, as soon as we locate the system and additional EET requirements are met if necessary. Also, just want to double check, are we testing with the 7.3 GA or the latest 7.3.z?  Thanks. 

Will
Comment 8 Amy Gou 2017-08-29 02:09:33 EDT
Hi all,

There is another OS certification ongoing under RHEL 6.8 with the same configuration.
https://hardware.redhat.com/show.cgi?id=1484208

In addition, the Certification team is asking for another EET for RHEL 6.8. I would like to know whether it can also be used for RHEL 6.8. Thanks a lot.

Best Regards,
Amy
Comment 9 William Gomeringer 2017-08-29 12:40:37 EDT
Hello Chris,
Still looking for some more information before we can begin, see comments #6 and #7. The NEEDINFO flag was cleared in the previous comment. 

Thank you,
Will
Comment 10 Dilip Soman 2017-09-05 11:00:25 EDT
Hello Chris,

We are a week into the schedule and have not been able to start on this EET because we are still waiting for information for comment #6.  See below:

Chris,

Regarding comment #1 Question 4A, I'm looking for the per-NUMA-node and total memory bandwidth (GB/sec), not the memory speed. This way, when the stream-based runs are executed, we know whether our results are in the ballpark of the machine's capabilities.

Thanks
Barry


Please help provide the information. As it stands, there is significant risk to completing this testing on schedule, due to conflicts with other testing schedules and tester availability.

Thanks,
Dilip
Comment 11 Chris McDermott 2017-09-07 14:52:33 EDT
(In reply to William Gomeringer from comment #7)
> Hi Chris,
> Do you know if we have this system at Red Hat or do we need to access
> remotely on the Lenovo side? If the former then I am still looking for the
> system. If the latter can you please provide information on how to access
> the system and please see additional EET requirement for systems not within

You have an 8S x3950 X6 at Red Hat. However, not one with 24TB of RAM. We're talking about $1M worth of memory here (192x128GB DIMMs), which is obviously something we can't provide to Red Hat. So, we'll have to give Red Hat remote access to a system at Lenovo. I'll work out the details of access to this configuration with Amy (on CC), and we'll provide answers to the questions you asked in comment #7.
Comment 12 Chris McDermott 2017-09-10 19:50:57 EDT
(In reply to Dilip Soman from comment #10)
> Hello Chris,
> 
> We are a week into the schedule and have not been able to start on this EET
> because we are still waiting for information for comment #6.  See below:
> 
> Chris,
> 
> Regarding comment #1 Question 4A, I'm looking for the per NUMA node and total
> memory bandwidth (GB/sec) not the memory speed.  This way when the stream
> based runs are executed, we know if our results are in the ball park with
> the machine's capabilities.
> 

The memory bandwidth will vary based on memory speed and memory mode.
Memory speed will depend on the CPU (type, sku) and DIMM (type, dpc, speed).

So, the best way to determine the bandwidth is to read the configured clock speed from SMBIOS type 17 (via dmidecode) and combine it with the memory mode.
There are other ways to determine the max memory speed, but some CPU SKUs limit it, whereas SMBIOS type 17 reports the values configured by UEFI
(both fields: max speed and configured speed).
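
As a rough illustration of the SMBIOS approach, the sketch below pulls the max and configured speeds out of `dmidecode -t 17` text. It is a minimal sketch under assumptions: `parse_speeds` is a hypothetical helper, the sample text is made up for illustration (not captured from the SUT), and the configured-speed field name differs between dmidecode versions, so both common spellings are accepted.

```python
import re

# Made-up sample in the general shape of `dmidecode -t 17` output.
SAMPLE = """\
Memory Device
\tSize: 128 GB
\tSpeed: 2133 MHz
\tConfigured Clock Speed: 1333 MHz
Memory Device
\tSize: No Module Installed
\tSpeed: Unknown
\tConfigured Clock Speed: Unknown
"""


def parse_speeds(dmidecode_text):
    """Return [(max_speed, configured_speed), ...] for populated DIMMs."""
    results = []
    # Each type 17 record starts with a "Memory Device" header line.
    for block in dmidecode_text.split("Memory Device")[1:]:
        max_m = re.search(r"^\s*Speed:\s*(\d+)", block, re.M)
        cfg_m = re.search(
            r"^\s*Configured (?:Clock|Memory) Speed:\s*(\d+)", block, re.M
        )
        # Empty slots report "Unknown" and are skipped.
        if max_m and cfg_m:
            results.append((int(max_m.group(1)), int(cfg_m.group(1))))
    return results


print(parse_speeds(SAMPLE))  # [(2133, 1333)]
```

On a live system the text would come from running `dmidecode -t 17` as root rather than from a constant.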

There are 2 memory modes that can be configured in UEFI setup, as defined by the Intel EDS: Independent mode and Lockstep mode. Independent mode is recommended for performance.

Bandwidth example calculations based on configured memory mode below.

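Independent mode:

   Memory Speed (Max 1600MHz x 2) = 3.2 GT/s
   3.2 GT/s   * 8 Bytes   = 25.6 GB/s  (per SMI channel)
   25.6 GB/s  * 2 SMI     = 51.2 GB/s  (per iMC)
   51.2 GB/s  * 2 iMC     = 102.4 GB/s (per socket)
   102.4 GB/s * 8 Sockets = 819.2 GB/s

Lockstep mode:

   Memory Speed (Max 1866MHz x 1) = 1.87 GT/s
   1.87 GT/s  * 8 Bytes   = 14.96 GB/s (per SMI channel)
   14.96 GB/s * 2 SMI     = 29.92 GB/s (per iMC)
   29.92 GB/s * 2 iMC     = 59.84 GB/s (per socket)
   59.84 GB/s * 8 Sockets = 478.72 GB/s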

So, max bandwidth would be in Independent mode despite the slower DDR speed,
as it aggregates the DDR bandwidth.
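
The same multiplication chain can be written as a short sketch for reuse with other clock speeds. The function and parameter names are illustrative only; the defaults (8 bytes per transfer, 2 SMI channels per iMC, 2 iMCs per socket, 8 sockets) are the values used in the example calculations above.

```python
def peak_bandwidth_gbs(clock_mhz, transfers_per_clock, bytes_per_transfer=8,
                       smi_per_imc=2, imc_per_socket=2, sockets=8):
    """Theoretical peak memory bandwidth in GB/s for the whole system."""
    gt_per_s = clock_mhz * transfers_per_clock / 1000.0  # MHz -> GT/s
    per_smi_channel = gt_per_s * bytes_per_transfer       # GB/s per SMI channel
    return per_smi_channel * smi_per_imc * imc_per_socket * sockets


independent = peak_bandwidth_gbs(1600, 2)
lockstep = peak_bandwidth_gbs(1866, 1)
print(f"Independent: {independent:.1f} GB/s")
print(f"Lockstep:    {lockstep:.1f} GB/s")
```

Note that the lockstep result here comes out slightly below the 478.72 GB/s figure above, because the hand calculation rounds 1.866 GT/s up to 1.87 GT/s before multiplying. These are theoretical peaks, not what a stream run should be expected to reach.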


From E7 v2 EDS: (Bandwidth per SMI Channel), Table 6-1. Intel Xeon Processor E7 v2 Product Family Memory Data Paths and Clock Domains.

Independent Mode     Max         Transfers/  Transfer    Data      Channel
                     Frequency   Clock       Rate        Width     Bandwidth
Intel SMI Data Bus   1.33 GHz    2           2.67 GT/s   8 Bytes   21.3 GB/s
DDR Bus              0.67 GHz    2           1.33 GT/s   8 Bytes   10.7 GB/s

Lockstep Mode        Max         Transfers/  Transfer    Data      Channel
                     Frequency   Clock       Rate        Width     Bandwidth
Intel SMI Data Bus   0.93 GHz    2           1.87 GT/s   8 Bytes   15 GB/s
DDR Bus              0.93 GHz    2           1.87 GT/s   8 Bytes   15 GB/s


From x3950 x6 Spec: 

Independent     Max        Transfers/   Transfer     Data      Channel
Mode            Frequency  Clock        Rate         Width     Bandwidth

SMI2 Bus (JC1)  1.33 GHz   2            2.67 GT/s    8 Bytes   21.3 GB/s
DDR3 Bus (JC1)  0.67 GHz   2            1.33 GT/s    8 Bytes   10.6 GB/s
SMI2 Bus (JC2)  1.6 GHz    2            3.2 GT/s     8 Bytes   25.6 GB/s
DDR4 Bus (JC2)  0.8 GHz    2            1.6 GT/s     8 Bytes   12.8 GB/s

Lockstep        Max        Transfers/   Transfer     Data      Channel
Mode            Frequency  Clock        Rate         Width     Bandwidth

SMI2 Bus (JC1)  0.8 GHz    2            1.6 GT/s     8 Bytes   12.8 GB/s
DDR3 Bus (JC1)  0.8 GHz    2            1.6 GT/s     8 Bytes   12.8 GB/s
SMI2 Bus (JC2)  0.93 GHz   2            1.87 GT/s    8 Bytes   14.9 GB/s
DDR4 Bus (JC2)  0.93 GHz   2            1.87 GT/s    8 Bytes   14.9 GB/s
Comment 13 Chris McDermott 2017-09-10 19:53 EDT
Created attachment 1324307 [details]
'numactl --hardware' output from 24TB configuration
Comment 14 Chris McDermott 2017-09-10 19:58:21 EDT
(In reply to Chris McDermott from comment #13)
> Created attachment 1324307 [details]
> 'numactl --hardware' output from 24TB configuration

Note that the NUMA domains are only at socket level.

Memory is interleaved between iMCs, Channels and Ranks, which is not visible to OS, but may impact memory bandwidth, e.g., 

A[6]:    Used to interleave between 2 iMCs
A[8:7]:  Used to interleave between 4 DDR channels which are further 
         hashed by A[27:12].

In a 24TB configuration, 192x128GB DIMMs (3 DIMMS per Channel, 3DS RDIMMs) are limited to 1333MHz max.
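
To make the interleave bits concrete, here is a minimal sketch that extracts A[6] and A[8:7] from a physical address. It is a simplification: the further hashing of the channel bits by A[27:12] is not modeled, since its exact form is not given here, and the function name is hypothetical.

```python
def decode_interleave(phys_addr):
    """Return (imc, channel) selected by the raw interleave bits of an address."""
    imc = (phys_addr >> 6) & 0x1       # A[6]: selects one of 2 iMCs
    channel = (phys_addr >> 7) & 0x3   # A[8:7]: selects one of 4 DDR channels (pre-hash)
    return imc, channel


# Consecutive 64-byte cache lines alternate between iMCs, then move on to the next channel.
for addr in range(0, 512, 64):
    print(hex(addr), decode_interleave(addr))
```

With this simplified decode, a sequential stream of cache lines is spread across both iMCs and all four channels within each 512-byte span, which is why the interleaving matters for bandwidth even though the OS only sees socket-level NUMA nodes.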
Comment 15 Chris McDermott 2017-09-11 17:39:42 EDT
Hopefully, Comments #11-14 provide the information you need. That's all I have.
Comment 16 Dilip Soman 2017-09-12 12:15:34 EDT
Chris,

We still need information for remote testing requested in comment #7. 

For comment #6, I will defer to Barry if he needs any more specific information. 

- Dilip
Comment 17 Chris McDermott 2017-09-12 20:25:33 EDT
(In reply to Dilip Soman from comment #16)
> Chris,
> 
> We still need information for remote testing requested in comment #7. 
> 
> For comment #6, I will defer to Barry if he needs any more specific
> information. 
> 
> - Dilip

Need Amy to assist with that information, as the system is in her lab.
Comment 18 Chris McDermott 2017-09-12 20:42:13 EDT
(In reply to William Gomeringer from comment #7)
> Hi Chris,
> Do you know if we have this system at Red Hat or do we need to access
> remotely on the Lenovo side? If the former then I am still looking for the
> system. If the latter can you please provide information on how to access
> the system and please see additional EET requirement for systems not within
> Red Hat:
> 
> -----------------------------------------
> Additional requirements for "Remote" EET:
> -----------------------------------------
> • Testing remotely is approximately 3 weeks per Arch/RHEL.
> 
> • RHEL/Fedora VPN client instructions and individual credentials.
> 
> • Second system to be used as a NFS server and HTTP server yum repository
> server.

Do you just need access to a second server?  Will you do all of the setup and configuration on this system?  Or are you expecting it to already be configured?  If so, you'll probably need to provide instructions, etc. regarding the configuration you need.

> 
> • Access to the "System Under Test" 
>    - serial console (ipmi)
>      (how to connect to SUT serial console)
>    - tty console (ilo)
>    - reboot capabilities (ipmi)
>      (how to reboot SUT in the event of system hang)

You will have access to the host IP and the IMM IP. You can connect to the serial console via ipmi or via the IMM CLI. You can also reboot the system from both ipmi and from the IMM CLI/GUI. 

Amy can provide IP address details.

> 
> • Access to the RHEL ISO for the "System Under Test"

Not sure I understand this request. You don't have access to your own RHEL ISO?
Or, are you asking how to access the RHEL ISO remotely?

> 
> • "System Under Test" must meet required minimums
>    - ie. x86_64 1GB minimum/1 GB/logical CPU

Not a problem. System exceeds supported limits (24TB RAM, about 86GB/logical CPU).

> 
> • "System Under Test" to be at production BIOS
>    - BIOS in default shipping mode.

This requirement should already be satisfied.

> 
> • Network diagram of SUT configuration.
>    - DHCP or Static addresses for virtualization testing.
>      (need static IP for KVM Guest testing)

Amy?

> 
> • Lload testing requires storage = 2X installed ram, for thorough
> investigation. 
>   (If the system has 12TB we should have at least 24TB of storage available)
>   (need storage attached, formatted, and added to fstab)

I think this is an unrealistic request. This is a 24TB system and I'm not certain that Amy has access to 48TB of storage that she can easily add to this server. But, I'll let her respond to that.

> 
> * Run memtest on SUT for 24-48hrs and confirm there are no memory errors.
>   (we have had several instances in the past of failing DIMM/DIMMs on
>    your target SUT - please confirm there are no issues)

Amy can you do this, or confirm that it has been done?

> 
> * Please confirm approximate reboot time of SUT and that there are 
>   no hardware errors reported in dmesg.

dmesg output is included in the HW Cert sosreport output. I don't have one readily available. Alternatively, Amy could boot the system and provide that
output.

System reboot time is anywhere from 40m-75m, depending on memory retraining.

> 
> Testing should start today, as soon as we locate the system and additional
> EET requirements are met if necessary. Also, just want to double check, are
> we testing with the 7.3 GA or the latest 7.3.z?  Thanks. 

I would suspect the answer to this is the latest 7.3.z. Amy can you confirm?
Comment 19 William Gomeringer 2017-09-13 10:24:39 EDT
(In reply to Chris McDermott from comment #18)
> (In reply to William Gomeringer from comment #7)
> > Hi Chris,
> > Do you know if we have this system at Red Hat or do we need to access
> > remotely on the Lenovo side? If the former then I am still looking for the
> > system. If the latter can you please provide information on how to access
> > the system and please see additional EET requirement for systems not within
> > Red Hat:
> > 
> > -----------------------------------------
> > Additional requirements for "Remote" EET:
> > -----------------------------------------
> > • Testing remotely is approximately 3 weeks per Arch/RHEL.
> > 
> > • RHEL/Fedora VPN client instructions and individual credentials.
> > 
> > • Second system to be used as a NFS server and HTTP server yum repository
> > server.
> 
> Do you just need access to a second server?  You'll do all of the setup
> configuration on this system?  Or, are you expecting this to already be
> configured.  If so, you'll need to probably provide instructions, etc.
> regarding the configuration you need.

Just need access to a second server with RHEL 7.x installed. Something on the same network as the SUT.

> 
> > 
> > • Access to the "System Under Test" 
> >    - serial console (ipmi)
> >      (how to connect to SUT serial console)
> >    - tty console (ilo)
> >    - reboot capabilities (ipmi)
> >      (how to reboot SUT in the event of system hang)
> 
> You will have access to the host IP and the IMM IP. You can connect to the
> serial console via ipmi or via the IMM CLI. You can also reboot the system
> from both ipmi and from the IMM CLI/GUI. 
> 
> Amy can provide IP address details.

This works; I will wait for the details.

> > 
> > • Access to the RHEL ISO for the "System Under Test"
> 
> Not sure I understand this request. You don't have access to your own RHEL
> ISO?
> Or, are you asking how to access the RHEL ISO remotely?
> 

We just need the RHEL 7.3 ISO on the SUT so we can use it to create a repo and install any necessary packages on both the SUT and the secondary server. We will take care of mounting it and setting up the repos.

> > 
> > • "System Under Test" must meet required minimums
> >    - ie. x86_64 1GB minimum/1 GB/logical CPU
> 
> > Not a problem. System exceeds supported limits (24TB RAM, about 86GB/logical
> > CPU).
> 

Good, thanks.

> > • "System Under Test" to be at production BIOS
> >    - BIOS in default shipping mode.
> 
> This requirement should already be satisfied.
> 

Good, thanks.

> > • Network diagram of SUT configuration.
> >    - DHCP or Static addresses for virtualization testing.
> >      (need static IP for KVM Guest testing)
> 
> Amy?
> 



> > • Lload testing requires storage = 2X installed ram, for thorough
> > investigation. 
> >   (If the system has 12TB we should have at least 24TB of storage available)
> >   (need storage attached, formatted, and added to fstab)
> 
> I think this is an unrealistic request. This is a 24TB system and I'm not
> certain that Amy has access to 48TB of storage that she can easily add to
> this server. But, I'll let her respond to that.

I'll let Larry speak to this one. Larry, is it necessary in this case to have 48TB of storage for the Lload test?


> > 
> > * Run memtest on SUT for 24-48hrs and confirm there are no memory errors.
> >   (we have had several instances in the past of failing DIMM/DIMMs on
> >    your target SUT - please confirm there are no issues)
> 
> Amy can you do this, or confirm that it has been done?
> 

Great, thank you

> > * Please confirm approximate reboot time of SUT and that there are 
> >   no hardware errors reported in dmesg.
> 
> dmesg output is included in the HW Cert sosreport output. I don't have one
> readily available. Alternatively, Amy could boot the system and provide that
> output.
> 
> System reboot time is anywhere from 40m-75m, depending on memory retraining.

Perfect

> > 
> > Testing should start today, as soon as we locate the system and additional
> > EET requirements are met if necessary. Also, just want to double check, are
> > we testing with the 7.3 GA or the latest 7.3.z?  Thanks. 
> 
> I would suspect the answer to this is the latest 7.3.z. Amy can you confirm?

Great, thank you.


Thanks for the responses Chris. 

Larry, can you take a look at the Lload discussion above? With 24TB of RAM installed do we need 48TB of storage for this test? Thanks Larry.
Comment 20 Dilip Soman 2017-09-19 11:54:28 EDT
We are unable to schedule this EET request until we get all the information requested in comment #19. Once we have the information, we will schedule the testing based on available EET testing slots.
Comment 21 Monte Knutson 2017-10-10 15:17:06 EDT
Ocean, Can you please help get this information back to Dilip and the team?  Thanks.
Comment 22 Amy Gou 2017-10-17 05:35:57 EDT
Hi all,

For the Draco system with 24TB, we have set up remote access.
For now, because the VPN access application is still pending, we are using TeamViewer as a workaround.
Here is the remote desktop via TeamViewer on Windows 2012R2:
ID: 206909066
Password:1887

Then, from the Windows 2012R2 host, you can use SSH to access the Draco 24TB system:
SSH: 10.245.43.96
User: root
Password: 111111

IMM IP: 10.245.43.108
User: USERID
Password: PASSW0RD

Please feel free to let me know if you have any concerns.
Best Regards,
Amy
