Bug 233166 - Memory test on 128GB is taking too long
Status: CLOSED NOTABUG
Product: Red Hat Hardware Certification Program
Classification: Red Hat
Component: Test Suite (tests)
Version: 5
Hardware: ia64 Linux
Priority: medium
Severity: high
Assigned To: Greg Nichols
Reported: 2007-03-20 14:59 EDT by Rick Hester
Modified: 2008-07-16 17:58 EDT (History)
Doc Type: Bug Fix
Last Closed: 2007-03-29 11:28:33 EDT

Attachments:
Sampling of memory test console output (63.14 KB, text/plain)
2007-03-20 14:59 EDT, Rick Hester

Description Rick Hester 2007-03-20 14:59:23 EDT
Description of problem:
The memory test on a 4 socket Montecito 128GB system is taking
too much time to complete.   It is currently at 144hrs with no
indication of when it might complete. The test was initiated
from the system console, but I have been unable to ssh in on
another port to view logs or process times.  Console output 
seems to indicate that the oom-killer is being invoked 
frequently.  A sample of console output is included in 
the attached file.

A similar 2 socket system with 32GB completed the test
in 2-3 hrs.

Version-Release number of selected component (if applicable):
hts-5.0-31

How reproducible:
I have only run the test once on a system with this much memory.

Steps to Reproduce:
1. run the hts memory cert test on a system with 128GB
  
Actual results:
Test does not appear to complete

Expected results:
Test completion within a day or less.


Additional info:
Comment 1 Rick Hester 2007-03-20 14:59:24 EDT
Created attachment 150518 [details]
Sampling of  memory test console output
Comment 2 Greg Nichols 2007-03-26 10:52:26 EDT
Please supply the test output.  The current attachment bears no indication
of how far the test has progressed.

I have run the memory test on a 64 processor ia64 with 128GB in roughly
three hours.



Comment 3 Rick Hester 2007-03-27 00:32:17 EDT
I am not as familiar with the hts test structure.   Where would I find these
test results?   I may have to restart the test because I have since uninstalled
hts-5.0-31 and installed hts-5.0-26 so I could get the test run.

If I do need to restart the test, how long do I need to run it to be able
to get you useful data?
Comment 4 Rick Hester 2007-03-27 14:51:30 EDT
Well, that system has been reconfigured, so no past test results are available.

Is there something you need me to try?
Comment 5 Greg Nichols 2007-03-27 16:16:24 EDT
Please re-run the test:

hts certify --test memory

Then attach the log.

- Thanks!
Comment 6 Rick Hester 2007-03-27 17:29:09 EDT
Sure, I can do that - but how long should I let it run to get useful information?

I killed it at 7 days and it was showing no signs of finishing at that point.

Would it make sense to stop it when I start seeing the OOM-killer messages?
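
For anyone facing the same question, here is a minimal sketch of how captured
console output could be scanned for oom-killer activity. It is not part of
hts. The function name is hypothetical, and because the exact kernel message
format is not reproduced in this report, it only assumes that each relevant
line contains the string "oom-killer".

```python
def count_oom_events(console_text):
    """Count console lines that mention the oom-killer.

    Assumption: a relevant line contains the substring "oom-killer"
    (matched case-insensitively). No other message format is assumed.
    """
    return sum(
        1
        for line in console_text.splitlines()
        if "oom-killer" in line.lower()
    )
```

A count that keeps rising between two captures of the console log would be a
reasonable signal to stop the run and collect the logs.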
Comment 7 Greg Nichols 2007-03-27 17:56:38 EDT
Can you let it run for 24 hours or so?

- Thanks
Comment 8 Rick Hester 2007-03-27 18:06:17 EDT
Sure - np.   Will that be enough for you?
Comment 9 Rick Hester 2007-03-28 13:36:30 EDT
Ok, somehow I just knew this was going to happen.   The test on the 128GB 
system with 5.0-31 ran to completion within 12 hrs whereas previously it had
not completed in 7 days.

In considering what was different, I can think of at least three things -
+ the system was installed on the same disks but now using a U320 controller
  instead of a U160 controller
+ the test was invoked with hts certify --test memory whereas the first time
  the test was invoked with hts certify --test core --test memory
+ after the test that had been running for 7 days was aborted, I discovered
  that one quad of memory (16GB) had been deconfigured.  I physically reseated
  that memory before running this test and noted that it was not deconfigured
  this time

I doubt that using a U320 vs. U160 controller had an impact on the memory test.
I'll rerun this test and invoke the core test as well.

I suspect that the memory deconfig had the most impact on the test.  I do not
know whether that mem deconfig occurred during the test or prior to the boot.
If it did occur prior to the boot, the OS would never have known that memory
was there.   I think the most likely explanation is that the memory failed
sometime during the test and because I had to reboot the system to terminate
the memory test - the failed dimms were logged in the fw event log at the
reboot.   I'll try manually deconfiguring those DIMMs and rerunning the test
after the core/memory run, but I suspect that will not reproduce the problem
either.
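
One way to tell whether memory was already deconfigured before boot is to
compare what the OS reports against what is physically installed. The sketch
below is illustrative only and not part of hts: it reads the MemTotal field
from /proc/meminfo-style text and flags a shortfall. The function names and
the 8 GiB tolerance are assumptions. MemTotal is always somewhat below the
installed size because the kernel reserves memory, so the tolerance would
need tuning for a given system.

```python
def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # /proc/meminfo reports this field in kB
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def memory_looks_deconfigured(meminfo_text, installed_gib, tolerance_gib=8.0):
    """True if the OS sees noticeably less memory than is installed.

    tolerance_gib is a hypothetical allowance for kernel-reserved memory.
    """
    return installed_gib - mem_total_gib(meminfo_text) > tolerance_gib
```

On the system described here, a 16GB quad missing from 128GB would leave the
OS with at most 112GB, which this check would flag when passed the contents
of /proc/meminfo and installed_gib=128.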

I have a completed test result that I can attach if you think it will be of
interest.  Would you like for me to attach it?
Comment 10 Rick Hester 2007-03-29 11:28:33 EDT
I ran the hts certify --test core --test memory test with 5.0-31, and it also
completed within a reasonable amount of time.

I wanted to manually deconfigure the memory rank that had been deconfigured
before, but unfortunately the capability to do that is only available on
our higher end systems.

So, for the time being, I'm going to assume that this problem was a result
of a hardware memory failure and close this defect.  Will reopen it if I
encounter the problem again.
