Bug 223165 - RHEL5 certification MEMORY test took too long on 256GB memory
Product: Red Hat Hardware Certification Program
Classification: Red Hat
Component: Test Suite (tests)
Hardware: ia64 Linux
Priority: medium  Severity: high
Assigned To: Greg Nichols
Duplicates: 227975
Depends On: 230220
Blocks: SGI_Blocker_5.0.0
Reported: 2007-01-17 23:14 EST by Erik Jacobson
Modified: 2008-07-16 17:57 EDT
CC List: 9 users

Doc Type: Bug Fix
Last Closed: 2007-04-09 13:12:41 EDT

Attachments: None
Description Erik Jacobson 2007-01-17 23:14:51 EST
Irina Boverman requested that a bug be filed against HTS for this.  Jerry - please
make any additions or corrections you feel are needed.

From our QE folks (Jerry):

 The MEMORY2 test in the RHEL5 certification suite took too long on an Altix 4700
 with 256GB of memory.  MEMORY2 uses lmbench; the run has taken over 40 hours so
 far and has still not finished.  Previous RHEL4 cert testing on the same Altix
 4700 with 256GB took about 17.3 hours to finish (using similar lmbench commands).
 The a4700 running Montecito with 256GB has now spent over 30 hours in lmbench's
 lat_mem_rd command and appears to have a long way to go.

Jerry then added:

 I compared the MEMORY2 output between the current rhel5-rc_s5 and the previous
 rhel4u4 run.  The basic lmbench commands used were the same.  However, the
 current test used 202871MB as the available memory for testing, while the
 previous test used 188949MB.

 As for the memory bandwidth test (using bw_mem), the results were similar,
 except that some bw_mem rdwr runs were about six times slower (though something
 else may have been running at the time).

 The major difference was in the memory read latency test (using lat_mem_rd).
 Although the previous test also used 188949MB as the memory size, it only
 tested up to 530MB.  The current test went past 530MB and looked set to
 continue all the way to 202871MB, which takes much longer to run.  So a
 different version of lmbench may have been used.  Would like someone to check
 with Red Hat about this behavior on huge memory configurations.
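
For reference, lat_mem_rd takes the maximum array size in MB as its first
argument, followed by the list of strides to walk, so its runtime grows with
the memory size passed in.  A sketch of the kind of invocation at issue (the
202871MB figure comes from the comparison above, and the stride list matches
the MEMORY2 script; this is not the exact HTS command line):

 # sweep arrays from small sizes up to the full 202871MB, once per stride
 lat_mem_rd 202871 16 32 64 128 256 512 1024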
Comment 3 Greg Nichols 2007-01-23 14:30:58 EST
Did the test complete?   The lat_mem_rd run should end with a stride of 1024.
I'm interested in whether these tests completed, merely took excessively long,
or hung because of some other problem.

- Thanks!
Comment 4 Erik Jacobson 2007-01-23 14:39:55 EST
Jerry - could you answer comment #3?
Comment 5 Jerry Wei 2007-01-23 14:45:32 EST
The test didn't complete.  It took about 2 days to finish one stride in lat_mem_rd,
and we killed the test after 4 days (we needed the big machine for other work).
So it just took excessively long.
Comment 6 Erik Jacobson 2007-01-25 15:37:51 EST
John Hesterberg requested we bump the severity to high.
Comment 7 Greg Nichols 2007-02-12 09:05:45 EST
*** Bug 227975 has been marked as a duplicate of this bug. ***
Comment 8 Greg Nichols 2007-02-12 12:15:19 EST
I made the following changes in the interest of reducing test time
for large-memory machines:

1) bw_mem cp and bcopy now limit themselves to 1/4 of available memory.

2) lat_mem_rd uses a 1GB array size, or all of available memory if less than 1GB.

Fix is in R25
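
A minimal sketch of how those limits could be expressed in MEMORY2 (the $MB
and $arraysize variable names follow the script's existing conventions; this
illustrates the intended logic, not the actual R25 diff):

 # cap the bw_mem cp/bcopy buffer at 1/4 of available memory ($MB is in MB)
 bwsize=$(( MB / 4 ))

 # limit the lat_mem_rd array to 1GB, or use all of memory on smaller machines
 if (( MB > 1024 )); then
     arraysize=1024
 else
     arraysize=$MB
 fi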
Comment 9 Greg Nichols 2007-02-12 12:23:13 EST
Is this a NUMA system?
Was it the xen kernel being tested?
Comment 10 John Hesterberg 2007-02-12 22:25:55 EST
Yes, a NUMA system.
No, xen is not being used.
This is Itanium, and xen doesn't work there yet.
Comment 11 Jerry Wei 2007-02-15 11:08:37 EST
Ran the memory test with hts-5.0-25 and found redundant output lines in the
memory bandwidth test.  The most time-consuming test, lat_mem_rd on big memory,
wasn't changed.  Here is the relevant excerpt from MEMORY2:

# limit latency test arraysize to 1 GB
if (( "$MB" > "1024")); then
echo "Testing memory read latency (cache-line size detection etc.)"
echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
lat_mem_rd $MB 16 32 64 128 256 512 1024
echo "done."

It looks like lat_mem_rd still used $MB instead of $arraysize.
Comment 12 Greg Nichols 2007-02-15 11:11:33 EST
Changing to Assigned per above.
Comment 13 Greg Nichols 2007-02-15 11:15:47 EST
Fixed R26
Comment 14 John Hesterberg 2007-02-15 22:24:32 EST
If you wanted to provide it (attach it here?), George or Jerry could
probably test out a fix on a 256gb machine.
We're having a hiccup giving you direct access to the 256gb machine
(but working on it).
Comment 15 Greg Nichols 2007-02-15 23:03:41 EST
The fix is just to change the variable on line 113 of MEMORY2
(/usr/share/hts/tests/memory/MEMORY2), as in:

lat_mem_rd $arraysize 16 32 64 128 256 512 1024

So please make that change and try it out.
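
Putting that together with the excerpt in comment 11, the corrected block
would read as follows (reconstructed from the quotes in this bug, not copied
from the R26 source):

 # limit latency test arraysize to 1 GB
 if (( "$MB" > "1024")); then
     echo "Testing memory read latency (cache-line size detection etc.)"
     echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
     lat_mem_rd $arraysize 16 32 64 128 256 512 1024
     echo "done."
 fi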
Comment 16 George Beshers 2007-02-27 11:50:33 EST
I have run this on altix3.lab.boston.redhat.com (pw: altix3) and
left the results available.  Unless I am confused about when I started
the test, it ran for the better part of 4 days.

I will save off the information before doing anything more with the system.
Comment 17 George Beshers 2007-02-27 12:33:40 EST
The attached file is a log from a modified MEMORY2 which ran just the bw_mem
tests.  Things to note:

Size     rd        wr      rdwr     bzero   cp       bcopy
1024m    1:39.20   0:32.29 0:54.74  0:12.34 1:49.47  1:45.97
16384m   27:29.96  9:00.16 15:46.34 3:13.94 27:10.77 30:39.53

397740m  2:20:43   **NOTE: run with -N1 (default is 11 repetitions)
795481m  5:17:24           also hit a libc error; see BZ230220
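
For reference, bw_mem's -N flag overrides the repetition count (the lmbench
default is 11 passes), which is why the two full-memory rows above were
measured with a single pass.  A hedged example of that style of invocation
(the size is from the table above; the actual command line may have differed):

 # one repetition (-N1) over a 397740MB buffer instead of the default 11
 bw_mem -N1 397740m rd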
Comment 18 Greg Nichols 2007-03-07 09:23:35 EST
Fixed R29
