Red Hat Bugzilla – Bug 223165
RHEL5 certification MEMORY test took too long on 256GB memory
Last modified: 2008-07-16 17:57:36 EDT
Irina Boverman requested a bug filed on this against HTS. Jerry - please
make any additions or corrections you feel are needed.
From our QE folks (Jerry):
The MEMORY2 test in the RHEL5 certification suite took too long on an Altix 4700
with 256GB of memory. lmbench is used by the MEMORY2 test; it has now been running
for over 40 hours and has still not finished. Previous RHEL4 cert testing on the
same 4700 with 256GB took about 17.3 hours to finish (using similar lmbench
commands). The A4700 (running Montecito) with 256GB has now spent over 30 hours
in the lat_mem_rd command in lmbench and looks like it still has a long way to go.
Jerry then added:
Compared the output of the MEMORY2 test between the current rhel5-rc_s5 and the
previous rhel4u4. The basic lmbench commands used were the same. However, the
current test used 202871MB as the available memory for testing while the previous
test used 188949MB.
As for the memory bandwidth test (using bw_mem), the results were similar, except
that some bw_mem rdwr tests were about six times slower (though that could be due
to something else running at the time).
The major difference was in the memory read latency test (using lat_mem_rd).
Although the previous test used 188949MB as the size of memory, it only tested up
to 530MB. The current test went past 530MB and looked like it would continue up
to 202871MB, which takes much longer to run. So a different version of lmbench
may have been used. Would like someone to check with Red Hat about this for huge
memory configurations.
Did the test complete? The use of lat_mem_rd should end with a stride of 1024.
I'm interested in whether these tests completed but took excessively long, or
whether some other problem caused the test to hang.
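For context, lat_mem_rd takes the maximum array size (in MB) followed by the list
of strides to test, and sweeps array sizes up to that maximum for each stride. A
guarded sketch of the invocation (it only runs the real benchmark if lmbench's
lat_mem_rd happens to be on PATH; the size and stride values are taken from the
MEMORY2 command line quoted in this report):

```shell
#!/bin/bash
# lat_mem_rd <max_size_MB> <stride>... ; each stride sweeps array sizes
# up to the maximum, so a ~200000MB maximum multiplies the run time.
size_mb=1024
strides="16 32 64 128 256 512 1024"
if command -v lat_mem_rd >/dev/null 2>&1; then
    lat_mem_rd "$size_mb" $strides
else
    echo "would run: lat_mem_rd $size_mb $strides"
fi
```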
Jerry - could you answer comment #3?
The test didn't complete. It took about 2 days to finish one stride in
lat_mem_rd, and we killed the test after 4 days (we needed the big machine for
other usage). So it just took excessively long.
John Hesterberg requested we bump the severity to high.
*** Bug 227975 has been marked as a duplicate of this bug. ***
I made the following changes in the interest of reducing test time
for large memory machines:
1) bw_mem cp and bcopy now limit to 1/4 of available memory.
2) lat_mem_rd now uses a 1GB array size, or available memory if less than 1GB
Fix is in R25
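As a rough illustration, the two limits above could be implemented along these
lines (a bash sketch only; the variable names MB, bw_size, and arraysize are
assumptions, not the actual R25 code):

```shell
#!/bin/bash
# Sketch of the R25 size limits described above (hypothetical variable
# names; the real MEMORY2 script may differ).
MB=${1:-202871}              # available memory in MB on the machine under test

# 1) bw_mem cp and bcopy: limit to 1/4 of available memory
bw_size=$(( MB / 4 ))

# 2) lat_mem_rd: use a 1GB (1024MB) array, or all memory if less than 1GB
if (( MB > 1024 )); then
    arraysize=1024
else
    arraysize=$MB
fi

echo "bw_mem size: ${bw_size}MB  lat_mem_rd size: ${arraysize}MB"
```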
Is this a NUMA system?
Was it the xen kernel being tested?
Yes, a NUMA system.
No, xen is not being used.
This is Itanium, and xen doesn't work there yet.
Ran the memory test with hts-5.0-25 and found that there were redundant output
lines in the memory bandwidth test. The most time-consuming test, lat_mem_rd on
big memory, wasn't changed. Here are the related lines in MEMORY2:
# limit latency test arraysize to 1 GB
if (( "$MB" > "1024")); then
echo "Testing memory read latency (cache-line size detection etc.)"
echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
lat_mem_rd $MB 16 32 64 128 256 512 1024
Looks like lat_mem_rd still uses $MB instead of $arraysize.
Changing to Assigned per above.
If you wanted to provide it (attach it here?), George or Jerry could
probably test out a fix on a 256gb machine.
We're having a hiccup giving you direct access to the 256gb machine
(but working on it).
The fix is just to change the variable in
line 113 of MEMORY2 (/usr/share/hts/tests/memory/MEMORY2), as in:
lat_mem_rd $arraysize 16 32 64 128 256 512 1024
So please make that change and try it out.
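Combined with the excerpt quoted earlier, the corrected section of MEMORY2 would
presumably look something like the following reconstruction (everything except
the lat_mem_rd argument change itself is an assumption about the surrounding
script):

```shell
#!/bin/bash
# Reconstructed sketch of the fixed MEMORY2 latency section; only the
# change from $MB to $arraysize on the lat_mem_rd line is the proposed
# fix, the rest is assumed context.
MB=${1:-202871}
arraysize=$MB
# limit latency test arraysize to 1 GB
if (( MB > 1024 )); then
    arraysize=1024
fi
echo "Testing memory read latency (cache-line size detection etc.)"
echo "Running: lat_mem_rd $arraysize 16 32 64 128 256 512 1024"
# lat_mem_rd "$arraysize" 16 32 64 128 256 512 1024   # needs lmbench installed
```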
I have run this on altix3.lab.boston.redhat.com (pw: altix3) and
left the results available. Unless I am confused about when I started
the test, it ran for the better part of 4 days.
I will save off the information before doing anything more with the system.
The attached file is a log from a modified MEMORY2 which ran just the bw_mem
tests. Things to note:
Size rd wr rdwr bzero cp bcopy
1024m 1:39.20 0:32.29 0:54.74 0:12.34 1:49.47 1:45.97
16384m 27:29.96 9:00.16 15:46.34 3:13.94 27:10.77 30:39.53
397740m 2:20:43 **NOTE: -N1 (default is 11 times)
795481m 5:17:24 also libc error see BZ230220