Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 223165

Summary: RHEL5 certification MEMORY test took too long on 256GB memory
Product: [Retired] Red Hat Hardware Certification Program
Reporter: erikj
Component: Test Suite (tests)
Assignee: Greg Nichols <gnichols>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Version: 5
CC: cww, edwardsg, gbeshers, iboverma, jh, martinez, niwa.hideyuki, wei, wwlinuxengineering
Hardware: ia64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-04-09 17:12:41 UTC
Bug Depends On: 230220
Bug Blocks: 222068

Description erikj 2007-01-18 04:14:51 UTC
Irina Boverman requested that a bug be filed against HTS for this.  Jerry, please
make any additions or corrections you feel are needed.

From our QE folks (Jerry):

 The MEMORY2 test in the RHEL5 certification suite took too long on an Altix 4700
 with 256GB of memory.  lmbench is used in the MEMORY2 test; it has been running
 for over 40 hours so far and has still not finished.  Previous RHEL4 cert testing
 on the same 4700 with 256GB took about 17.3 hours to finish (using similar
 lmbench commands).  Now the a4700 running Montecito with 256GB has spent over
 30 hours in the lat_mem_rd command in lmbench and looks like it still has a
 long way to go.

Jerry then added:

 I compared the output of the MEMORY2 test between the current rhel5-rc_s5 and
 the previous rhel4u4.  The basic lmbench commands used were the same.  However,
 the current test used 202871MB as the available memory for testing while the
 previous test used 188949MB.

 As for the memory bandwidth test (using bw_mem), the results were similar except
 for some bw_mem rdwr tests being about six times slower (but that could have
 been something else running at the time).

 The major difference was in the memory read latency test (using lat_mem_rd).
 Although the previous test used 188949MB as the size of memory, it only tested
 up to 530MB.  The current test went past 530MB and looked like it would continue
 up to 202871MB, which takes much longer to run.  So a different version of
 lmbench may have been used.  Would like someone to check with Red Hat about this
 for huge memory configurations.

Comment 3 Greg Nichols 2007-01-23 19:30:58 UTC
Did the test complete?   The use of lat_mem_rd should end with a stride of 1024.
I'm interested in whether these tests completed, took excessively long, or
whether some other problem caused the test to hang.

- Thanks!

Comment 4 erikj 2007-01-23 19:39:55 UTC
Jerry - could you answer comment #3?

Comment 5 Jerry Wei 2007-01-23 19:45:32 UTC
The test didn't complete.  It took about 2 days to finish one stride in
lat_mem_rd, and we killed the test after 4 days (we needed the big machine for
other usage).  So it just took excessively long.
Thanks.


Comment 6 erikj 2007-01-25 20:37:51 UTC
John Hesterberg requested we bump the severity to high.

Comment 7 Greg Nichols 2007-02-12 14:05:45 UTC
*** Bug 227975 has been marked as a duplicate of this bug. ***

Comment 8 Greg Nichols 2007-02-12 17:15:19 UTC
I made the following changes in the interest of reducing test time
for large memory machines:

1) bw_mem cp and bcopy now limit to 1/4 of available memory.

2) lat_mem_rd uses a 1G array size, or all available memory if less than 1G.

Fix is in R25
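In shell terms, the two limits described above amount to something like the
following sketch (variable names are hypothetical, not taken from the actual
HTS MEMORY2 script):

```shell
#!/bin/sh
# Sketch of the R25 limits, assuming MB holds available memory in megabytes.
MB=${MB:-262144}          # example: a 256GB machine

# 1) bw_mem cp and bcopy are capped at 1/4 of available memory
bw_limit=$(( MB / 4 ))

# 2) lat_mem_rd array is capped at 1GB, or all memory if less than 1GB
if [ "$MB" -gt 1024 ]; then
    arraysize=1024
else
    arraysize=$MB
fi

echo "bw_mem cp/bcopy limit: ${bw_limit}MB, lat_mem_rd array: ${arraysize}MB"
```

On a 256GB machine this caps bw_mem cp/bcopy at 65536MB and lat_mem_rd at
1024MB, which is what keeps the latency sweep from walking the whole 200GB+
of available memory.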

Comment 9 Greg Nichols 2007-02-12 17:23:13 UTC
Is this a NUMA system?
Was it the xen kernel being tested?

Comment 10 John Hesterberg 2007-02-13 03:25:55 UTC
Yes, a NUMA system.
No, xen is not being used.
This is Itanium, and xen doesn't work on it yet.

Comment 11 Jerry Wei 2007-02-15 16:08:37 UTC
Ran the memory test with hts-5.0-25 and found redundant output lines in the
memory bandwidth test.  The most time-consuming test, lat_mem_rd on big memory,
wasn't changed.  Here is the relevant script fragment from MEMORY2:

==============
# limit latency test arraysize to 1 GB
if (( "$MB" > "1024")); then
    arraysize=1024
else
    arraysize=$MB
fi
echo "Testing memory read latency (cache-line size detection etc.)"
echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
lat_mem_rd $MB 16 32 64 128 256 512 1024
echo "done."
=============

Looked like lat_mem_rd still used $MB instead of $arraysize.


Comment 12 Greg Nichols 2007-02-15 16:11:33 UTC
Changing to Assigned per above.

Comment 13 Greg Nichols 2007-02-15 16:15:47 UTC
Fixed R26

Comment 14 John Hesterberg 2007-02-16 03:24:32 UTC
If you wanted to provide it (attach it here?), George or Jerry could
probably test out a fix on a 256gb machine.
We're having a hiccup giving you direct access to the 256gb machine
(but working on it).

Comment 15 Greg Nichols 2007-02-16 04:03:41 UTC
The fix is just to change the variable on
line 113 of MEMORY2 (/usr/share/hts/tests/memory/MEMORY2), as in:

lat_mem_rd $arraysize 16 32 64 128 256 512 1024

So please make that change and try it out.
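Applied to the fragment quoted in comment 11, the corrected script would read
roughly as follows (a sketch; the guard around the lat_mem_rd invocation is
added here only so the fragment runs on machines without lmbench installed):

```shell
#!/bin/sh
# Corrected MEMORY2 fragment: the capped array size is now actually
# passed to lat_mem_rd instead of the full available memory ($MB).
MB=${MB:-202871}   # example: available memory reported in comment 11

# limit latency test array size to 1 GB
if [ "$MB" -gt 1024 ]; then
    arraysize=1024
else
    arraysize=$MB
fi

echo "Testing memory read latency (cache-line size detection etc.)"
echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
if command -v lat_mem_rd >/dev/null 2>&1; then
    lat_mem_rd "$arraysize" 16 32 64 128 256 512 1024   # was: lat_mem_rd $MB ...
fi
echo "done."
```

With 202871MB available, the sweep now stops at a 1024MB array rather than
scanning all of memory, which is the multi-day difference reported above.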

Comment 16 George Beshers 2007-02-27 16:50:33 UTC
I have run this on altix3.lab.boston.redhat.com (pw: altix3) and
left the results available.  Unless I am confused about when I started
the test, it ran for the better part of 4 days.

I will save off the information before doing anything more with the system.


Comment 17 George Beshers 2007-02-27 17:33:40 UTC
The attached file is a log from a modified MEMORY2 which ran just the bw_mem
tests.  Things to note:

Size     rd        wr      rdwr     bzero   cp       bcopy
1024m    1:39.20   0:32.29 0:54.74  0:12.34 1:49.47  1:45.97
16384m   27:29.96  9:00.16 15:46.34 3:13.94 27:10.77 30:39.53

397740m  2:20:43   (NOTE: run with -N1; the default is 11 iterations)
795481m  5:17:24   (also hit a libc error; see BZ230220)

Comment 18 Greg Nichols 2007-03-07 14:23:35 UTC
Fixed R29