Description of problem:
The memory test on a 4-socket Montecito system with 128GB is taking too long to complete. It is currently at 144 hrs with no indication of when it might complete. The test was initiated from the system console, but I have been unable to ssh in on another port to view logs or process times. Console output seems to indicate that the oom-killer is being invoked frequently. A sample of console output is included in the attached file. A similar 2-socket system with 32GB completed the test in 2-3 hrs.

Version-Release number of selected component (if applicable):
hts-5.0-31

How reproducible:
Have only invoked the test once on a system with this much memory.

Steps to Reproduce:
1. Run the hts memory cert test on a system with 128GB

Actual results:
Test does not appear to complete

Expected results:
Test completion within a day or less.
Created attachment 150518: Sampling of memory test console output
Please supply the test output. The current attachment gives no indication of how far the test has progressed. I have run the memory test on a 64-processor ia64 system with 128GB in roughly three hours.
I am not very familiar with the hts test structure. Where would I find these test results? I may have to restart the test because I have since uninstalled hts-5.0-31 and installed hts-5.0-26 so I could get the test to run. If I do need to restart the test, how long do I need to run it to get you useful data?
Well, that system has been reconfigured, so no past test results are available. Is there something you need me to try?
Please re-run the test:

    hts certify --test memory

Then attach the log. Thanks!
Sure, I can do that, but how long should I let it run to get useful information? I killed it at 7 days and it was showing no signs of finishing at that point. Would it make sense to stop it when I start seeing the OOM-killer messages?
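One way to decide when the OOM-killer messages have started is to scan a saved copy of the console output. The following is only a rough sketch: it assumes the console output has been captured to a text file and that the relevant kernel lines contain the string "oom-killer", which may differ between kernel versions. The function names are made up for illustration and are not part of hts.

```python
import re

# Assumption: kernel OOM messages contain "oom-killer" somewhere on the line.
OOM_PATTERN = re.compile(r"oom-killer", re.IGNORECASE)


def count_oom_lines(lines):
    """Count the lines that mention the oom-killer."""
    return sum(1 for line in lines if OOM_PATTERN.search(line))


def first_oom_line(lines):
    """Return the 1-based number of the first oom-killer line, or None."""
    for number, line in enumerate(lines, start=1):
        if OOM_PATTERN.search(line):
            return number
    return None
```

With a captured log, something like `count_oom_lines(open("console.log"))` gives the number of matching lines; the file name here is hypothetical.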
Can you let it run for 24 hours or so? - Thanks
Sure - np. Will that be enough for you?
Ok, somehow I just knew this was going to happen. The test on the 128GB system with 5.0-31 ran to completion within 12 hrs, whereas previously it had not completed in 7 days. In considering what was different, I can think of at least three things:

+ the system was installed on the same disks, but now using a U320 controller instead of a U160 controller
+ the test was invoked with hts certify --test memory, whereas the first time the test was invoked with hts certify --test core --test memory
+ after the test that had been running for 7 days was aborted, I discovered that one quad of memory (16GB) had been deconfigured. I physically reseated that memory before running this test and noted that it was not deconfigured this time

I doubt that using a U320 vs. U160 controller had an impact on the memory test. I'll rerun this test and invoke the core test as well.

I suspect that the memory deconfig had the most impact on the test. I do not know whether that mem deconfig occurred during the test or prior to the boot. If it occurred prior to the boot, the OS would never have known that memory was there. I think the most likely explanation is that the memory failed sometime during the test, and because I had to reboot the system to terminate the memory test, the failed DIMMs were logged in the fw event log at the reboot. I'll try manually deconfiguring those DIMMs and rerun the test after the core/memory run, but I suspect that will not reproduce the problem either.

I have a completed test result that I can attach if you think it will be of interest. Would you like me to attach it?
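For the question of whether memory was already deconfigured before boot, a quick check before starting the test is to compare what the OS reports against what is installed. The sketch below assumes a Linux /proc/meminfo with a "MemTotal: N kB" line; the function names and the 4 GiB tolerance are illustrative choices (MemTotal is always somewhat below installed memory because of kernel and firmware reservations), not anything hts provides.

```python
import re


def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 * 1024)  # kB -> GiB


def memory_looks_deconfigured(meminfo_text, expected_gib, tolerance_gib=4.0):
    """True if visible memory is short of expected_gib by more than the tolerance."""
    shortfall = expected_gib - mem_total_gib(meminfo_text)
    return shortfall > tolerance_gib
```

On the system under test this could be called as `memory_looks_deconfigured(open("/proc/meminfo").read(), expected_gib=128)`. A missing 16GB quad would show up as a shortfall well above the tolerance, but only if it was deconfigured before boot.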
I ran the hts certify --test core --test memory test with 5.0-31 and it also completed within a reasonable amount of time. I wanted to manually deconfigure the memory rank that had been deconfigured before, but unfortunately the capability to do that is only available on our higher end systems. So, for the time being, I'm going to assume that this problem was a result of a hardware memory failure and close this defect. Will reopen it if I encounter the problem again.