Red Hat Bugzilla – Bug 233166
Memory test on 128GB is taking too long
Last modified: 2008-07-16 17:58:59 EDT
Description of problem:
The memory test on a 4 socket Montecito 128GB system is taking
too much time to complete. It is currently at 144hrs with no
indication of when it might complete. The test was initiated
from the system console, but I have been unable to ssh in on
another port to view logs or process times. Console output
seems to indicate that the oom-killer is being invoked
frequently. A sample of console output is included in
the attached file.
A similar 2 socket system with 32GB completed the test
in 2-3 hrs.
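To put a number on "frequently", the oom-killer messages in a saved copy of the console output could be counted with a short script. This is only a sketch: the function name `count_oom_invocations` is hypothetical, and it assumes each invocation appears on a console line containing the text "invoked oom-killer", which may differ between kernel versions.

```python
import re

# Assumption: the kernel reports each invocation on a console line that
# contains the text "invoked oom-killer"; adjust the pattern if the
# captured console output words it differently.
OOM_PATTERN = re.compile(r"invoked oom-killer", re.IGNORECASE)


def count_oom_invocations(console_text):
    """Return the number of console lines that report an oom-killer invocation."""
    return sum(
        1 for line in console_text.splitlines() if OOM_PATTERN.search(line)
    )
```

Reading the captured console file into a string and passing it to `count_oom_invocations` gives a single count that can be compared between runs.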
Version-Release number of selected component (if applicable):
How reproducible:
Have only invoked the test once on a system with this much memory.
Steps to Reproduce:
1. run the hts memory cert test on a system with 128GB
Actual results:
The test does not appear to complete.
Expected results:
The test completes within a day or less.
Created attachment 150518 [details]
Sampling of memory test console output
Please supply the test output. The current attachment gives no indication
of how far the test has progressed.
I have run the memory test on a 64 processor ia64 with 128GB in roughly
I am not as familiar with the hts test structure. Where would I find these
test results? I may have to restart the test because I have since uninstalled
hts-5.0-31 and installed hts-5.0-26 so I could get the test run.
If I do need to restart the test, how long do I need to run it to be able
to get you useful data?
Well, that system has been reconfigured, so no past test results are available.
Is there something you need me to try?
Please re-run the test:
hts certify --test memory
Then attach the log.
Sure, I can do that - but how long should I let it run to get useful information?
I killed it at 7 days and it was showing no signs of finishing at that point.
Would it make sense to stop it when I start seeing the OOM-killer messages?
Can you let it run for 24 hours or so?
Sure - np. Will that be enough for you?
Ok, somehow I just knew this was going to happen. The test on the 128GB
system with 5.0-31 ran to completion within 12 hrs whereas previously it had
not completed in 7 days.
In considering what was different, I can think of at least three things -
+ the system was installed on the same disks but now using a U320 controller
instead of a U160 controller
+ the test was invoked with hts certify --test memory whereas the first time
the test was invoked with hts certify --test core --test memory
+ after the test that had been running for 7 days was aborted, I discovered that
one quad of memory (16GB) had been deconfigured. I physically reseated that memory
before running this test and noted that it was not deconfigured this time
I doubt that using a U320 vs. U160 controller had an impact on the memory test.
I'll rerun this test and invoke the core test as well.
I suspect that the memory deconfig had the most impact on the test. I do not
know whether that mem deconfig occurred during the test or prior to the boot.
If it did occur prior to the boot, the OS would never have known that memory
was there. I think the most likely explanation is that the memory failed
sometime during the test. Because I had to reboot the system to terminate
the memory test, the failed DIMMs were logged in the fw event log at the
reboot. I'll try manually deconfiguring those DIMMs and rerun the test
after the core/memory run, but I suspect that will not reproduce the problem.
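One way to tell whether the OS booted without the deconfigured quad would be to compare MemTotal from /proc/meminfo against the installed amount before starting the test. The following is a rough sketch, not part of hts: the function names are hypothetical, and it assumes the usual `MemTotal: <n> kB` line. MemTotal is normally somewhat below the installed size because of kernel-reserved memory, so only a gap on the order of a whole quad (16GB) would be meaningful.

```python
def mem_total_gib(meminfo_text):
    """Parse MemTotal (reported in kB) from /proc/meminfo text and return GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal line not found")


def missing_memory_gib(meminfo_text, installed_gib):
    """Return how many GiB of the installed memory the OS does not report."""
    return installed_gib - mem_total_gib(meminfo_text)


# Example use on a live system (installed size of 128 is taken from this report):
#   with open("/proc/meminfo") as fh:
#       print(missing_memory_gib(fh.read(), 128))
```

Running this check before and after a test run would also show whether the memory dropped out before boot or only after the reboot that ended the test.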
I have a completed test result that I can attach if you think it will be of
interest. Would you like for me to attach it?
I ran the hts certify --test core --test memory
test with 5.0-31 and it also completed within a reasonable amount of time.
I wanted to manually deconfigure the memory rank that had been deconfigured
before, but unfortunately the capability to do that is only available on
our higher end systems.
So, for the time being, I'm going to assume that this problem was a result
of a hardware memory failure and close this defect. Will reopen it if I
encounter the problem again.