Bug 430552
Summary: LTC41392-3950 M2 System hard locks during RH Cert memory test
Product: [Retired] Red Hat Hardware Certification Program
Component: Test Suite (tests)
Version: 5
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Issue Tracker <tao>
Assignee: Greg Nichols <gnichols>
CC: bxu, tao, ykun
Hardware: All
OS: Linux
Fixed In Version: hts-5.2-16.el5
Doc Type: Bug Fix
Last Closed: 2008-06-20 17:00:18 UTC
Description
Issue Tracker
2008-01-28 18:54:55 UTC
LTC Owner is: mcdermoc.com  LTC Originator is: zorek.com

---Problem Description---
System hard locks during RH Cert memory test.

Contact Information = zorek.com

---uname output---
2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux (RHEL5.1)

Machine Type = 3950 M2

---System Hang---
System hard locks during RH Cert memory test. The system becomes unresponsive to keyboard and requires a hard reset after approximately 30 minutes in test.

---Debugger---
A debugger is not configured.

---Steps to Reproduce---
Run the RH Cert memory test. The system becomes unresponsive to keyboard and requires a hard reset after approximately 30 minutes in test.

---Kernel - NUMA Component Data---
Stack trace output: no
Oops output: no
System Dump Info: The system is not configured to capture a system dump.

Tester: Adam Sheltz
Phone: 6-0066
Date: Thu Jan 3 14:52:07 EST 2008
Machine Name:
Product Family: x3950
Model Type: 7141
Physical CPUs: 8 x 2933 MHz (Intel Xeon MP)
Logical CPUs (as seen by the OS): 32
Physical DIMMs: 52 (48 x 4096MB + 4 x 2048MB)
Total Memory (in MB) as seen by the BIOS: 204800
Total Memory (in MB) as seen by the OS: 201024
BIOS Version: 1.03
BIOS Build: A3E123TUS
BIOS Release Date: 12/13/2007
BMC Version: no BMC detected, or version couldn't be determined
BMC Build: A3BT25A
Diagnostics Version:
Diagnostics Build:
CPLD INFO: no CPLD detected
IPMI Specification: 2.0
OS: Red Hat Enterprise Linux Server release 5.1 (Tikanga), 64 bit
KERNEL INFO: Linux localhost.localdomain 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

PCI devices:
00:1f.1 hda 'IDE interface' 'Intel Corporation, 82801G (ICH7 Family) IDE Controller'
02:00.0 eth0 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5709 Gigabit Ethernet'
02:00.1 eth1 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5709 Gigabit Ethernet'
04:00.0 sda 'SCSI storage controller' 'LSI Logic / Symbios Logic, SAS1078 PCI-Express Fusion-MPT SAS'
31:00.0 eth3 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5709 Gigabit Ethernet'
31:00.1 eth4 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5709 Gigabit Ethernet'
33:00.0 sdb 'RAID bus controller' 'LSI Logic / Symbios Logic, MegaRAID SAS 1078, MegaRAID SAS PCI Express ROMB'
3c:00.0 ... 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5708 Gigabit Ethernet'
42:00.0 ... 'Ethernet controller' 'Broadcom Corporation, NetXtreme II BCM5708 Gigabit Ethernet'

This is a 2-node system and the problem reproduces on multiple setups. Adam also reports that the xen kernel fails this test. Adam reports the following, which may or may not be relevant to the rhel5.1 cert memory test hang with a large memory array installed: "the RHEL4 Cert memory test works without any problems." Same system and same memory.

Adam reports the following: "I tried the memory test with 64 GB in the secondary and 32 GB in the primary and the system hard locks after about 45 minutes. I also tried 64GB in the primary and 32GB in the secondary and the system still hard locks. All the tests hang the system when they get to the threaded_memtest portion of the script."

This issue is blocking certification testing for RHEL5.1.

memory.py ran successfully to completion on a single-node x3950 M2 with 64GB.
Currently running memory.py on a 2-node configuration with 96GB: 64GB (2GB DIMMs) on node 0 and 32GB (1GB DIMMs) on node 1.

The 2-node memory.py test ran to completion on my 2-node, 96GB x3850 M2, although I don't know how long it actually took, since I let it run overnight. I have restarted the test to determine how long it takes to complete. There were a couple of issues I noticed while the test was running that may be problematic.

1) The following error occurred during a number of the tests:
*** glibc detected *** bw_mem: munmap_chunk(): invalid pointer: 0x000000000040276d ***
This message was followed by a backtrace. Ed, do you know whether Adam saw this error during his testing?

2) After the Threaded Memory Test started, the following kernel messages were displayed:
warning: many lost ticks
Your time source seems to be instable or some driver is hogging interrupts
This implies that interrupts are being masked by a driver/kernel module (or some other kernel subsystem) for possibly abnormally extended periods of time, indicating the timer handler isn't running at its normal interval. Typically, the kernel's lost-tick compensation code will correct this and no incorrect system behavior will be noticed. Do you know whether Adam's system experienced any errors similar to this one?

One possible cause for the system hang could be swap space depletion. Ed, do you know how much swap space is configured on Adam's system?

The 2nd 2-node test completed in 1hr 41min wall clock time. Started at 9:08, completed at 10:49. 'time' output:
real 101m33.828s
user 71m52.300s
sys 734m40.173s
I did not get the 'lost ticks' messages during this run, but did still see the glibc invalid pointer messages. BTW, this 2-node is installed with RHEL5.1, 2.6.18-53.el5. I am using 4GB swap (which is actually quite low for a 96GB system). I believe the default install configures 2GB swap, so I'm going to run the memory test again with only 2GB swap to see what happens.
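The swap-depletion hypothesis above can be sanity-checked before kicking off a run. A minimal sketch; the 64GB cutoff and the 4GB floor are assumptions drawn from the results in this report (2GB swap hangs at 96GB+ RAM, 4GB passes), not an official Red Hat sizing rule:

```shell
#!/bin/sh
# Flag configurations that match the failing pattern seen in this bug.
# Thresholds are assumptions taken from this report, not official guidance.
min_swap_mb=4096

swap_ok() {
    # $1 = total RAM in MB, $2 = configured swap in MB
    if [ "$1" -ge 65536 ] && [ "$2" -lt "$min_swap_mb" ]; then
        echo "WARN: $2 MB swap is likely too small for $1 MB RAM"
        return 1
    fi
    echo "OK: $2 MB swap for $1 MB RAM"
}

# Check the live system when /proc is available (Linux only).
if [ -r /proc/meminfo ]; then
    ram=$(awk '/^MemTotal:/ {print int($2/1024)}' /proc/meminfo)
    swp=$(awk '/^SwapTotal:/ {print int($2/1024)}' /proc/meminfo)
    swap_ok "$ram" "$swp" || true
fi
```

On the failing 96GB boxes with the default 2GB swap, this check would print the WARN line.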
Adam reports back that he did not see any glibc errors. Also, he has the default amount of swap, which is 2GB. He has seen it hang as an unresponsive system and has also seen it eventually reset after a length of time.

So far, the memory test ran to completion with a PASS result on my 2-node with 32GB swap, 4GB swap, and 2GB swap. I have run the memory test 5 times, but have been unable to reproduce any hangs/resets at this point. You mentioned that Adam has seen resets. Has he confirmed that there are no MCKs/SPINTs occurring? Perhaps you can ask that he gather sysreport/sosreport output and attach it to this bug. At this point, I need to figure out what is different between his configuration and mine. Also, do you know whether Adam has a serial console connected to his system? What I would like to do is determine whether he can capture <Alt><SysRq><t> output from the system when it is in the 'hung' state.

Running memory test overnight in a loop. Swap space at 2GB.

I have set up my own 96GB 2-node and will see if I can duplicate Chris's results.

(In reply to comment #14)
> Running memory test overnight in a loop. swap space at 2GB.

Test ran 9 times successfully overnight. No hangs/resets. Ed, can you find out how Adam is invoking the test? I'm just executing the runtest.sh script from within the /usr/share/hts/tests/memory directory. Perhaps there's something different in the way the test is being executed?

Adam installs the suite to the default location and then executes:
hts plan
hts certify --test memory
This runs only the memory test, I am told. We just kicked it off on my 2-node with 96GB. These are all new dimms on my machine. I am running rhel5.1 64bit xen on that system. There were no fault indicators illuminated on Adam's machines when the fails happened. He is going to recheck the RSA logs but has not noted any warnings or errors in the past. What type of 4GB dimms are you using? Adam is using existing ridgeback 4GB dimms.
My system will be running new elpida 4GB dimms.

My 2-node xen system with 96GB became unresponsive somewhere between the 30-minute and 1-hour mark. I am going to rerun it, launching the script directly as Chris was doing.

(In reply to comment #17)
> Adam installs the suite to the default location and then executes:
> hts plan
> hts certify --test memory

OK. I have restarted the test using this method.

> What type of 4GB dimms are you using? Adam is using existing ridgeback 4GB
> dimms. My system will be running new elpida 4GB dimms.

I don't have any 4GB DIMMs. I have 32x2GB DIMMs in node1 and 32x1GB DIMMs in node2.

(In reply to comment #18)
> My 2-node xen system with 96GB became unresponsive somewhere between the
> 30minute and 1hr mark.

Was this a complete system hang? You had to reboot to recover? Also, did this occur while the Threaded Memory Test was running, similar to what Adam saw?

My system was unresponsive so I had to hard power it off. Running the memory script by itself from the memory directory completed, but it never ran the threaded test, as that executable did not exist. So I did a make in that directory, which built the threaded module and automatically kicked off the memory test. I'll let it run and see where it goes.

Running 'hts certify --test memory' ran just fine on my system. This is strange. At this point, the only differences I see are that you and Adam are using 4GB DIMMs and I am not. But if you can get sysreport or sosreport output from one of the two systems, I will compare it with my configuration to see whether there are any other differences. I'll post my sysreport data as soon as this 2nd memory test run completes.

Chris, did you have to manually do the make in the memory directory to get the threaded test to run? The executable was not in that directory on my system. I don't know if it was built and moved by the hts interface or not, but there was no residual evidence in the directory that the threaded test had been built.
> Chris, did you have to manually do the make in the memory directory...

Yes. In fact, 'make' is how I have actually been running the memtest. If you look at the Makefile, it just calls 'runtest.sh' once everything is compiled, built, and installed. I should have been more specific in my previous comment. Specifically, I had been running:

while sleep 5
do
    date
    make
    date
done 2>&1 | tee memtest.results

Incidentally, 'hts certify --test memory' (running in a similar loop) has successfully completed 3 iterations now.

Created an attachment (id=34093): sosreport output from 2-node x3950 M2 in Beaverton

Checked the RSA logs of both nodes for my failing setup. Nothing logged on either system and no error lights. Pulled the two cards of 4GB dimms from node1 and replaced them with 4 new cards full of 2GB dimms. Rerunning memory test via the hts interface method.

Ran memory test overnight without any hangs/resets using the 2.6.18-53.el5xen kernel.

How long is the threaded memtest supposed to take with 96GB? I am running in a terminal and can see that when the threaded test started it consumed all of memory except for about 38k worth, and also consumed all of the 2GB swap. At least that is what 'top' displays. However, the system will allow me to switch between terminals, etc., so it is still responsive to some capacity. I cannot really run anything in another terminal, but isn't that expected with all of memory and swap used up? Is the allocation of 99.9% of memory and 100% of swap normal? I know this system, running out of an xterm, sat for over half a day and never recovered. But really, how long should I expect this test to take?

Ok, actual free memory according to top is 39856k, which is 39M.
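Watching 'top' interactively loses the history once the console freezes. A small sketch for logging free memory and swap non-interactively while the test runs; the one-minute interval and log filename are illustrative choices, not from the report:

```shell
#!/bin/sh
# Log MemFree/SwapFree snapshots so the numbers survive a hang.
LOG=${LOG:-memwatch.log}   # hypothetical log path

sample() {
    # One timestamped snapshot from /proc/meminfo (Linux only).
    date
    [ -r /proc/meminfo ] && \
        awk '/^(MemFree|SwapFree):/ {print}' /proc/meminfo || true
}

# While the memtest runs, sample in the background, e.g.:
#   while sleep 60; do sample >> "$LOG"; done &
sample
```

Reviewing the log after a hard reset shows how quickly free memory and swap were consumed before the lockup.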
At ~3hrs of sitting with no update to 'top' (frozen, showing available memory as 39856k and available swap as 0k), the system abruptly posted:
Out of Memory: Killed process 8832 (threaded_memtes)
along with a bunch of preceding messages essentially reflecting what 'top' had been showing in the frozen screen regarding available pages (39M), etc. The system was running the xen 5.1 kernel with 96GB installed and seen in dom0. The memory testing was launched from terminal 1 from the test directory via 'make'. 'top' was running in terminal 2. All brand new 2GB dimms.

(In reply to comment #29)
> How long is the threaded memtest supposed to take with 96GB?

Ed, I have no idea how long the threaded memtest should take on a 96GB system. That's probably a reasonable question to ask Red Hat. I'll try to ensure that this gets on our agenda for the System x technical call this week. At any rate, the memory test consistently completes on my 2-node in anywhere from 1 hr 25 minutes to 1 hr 40 minutes. I'm not sure how much of that time is spent in the threaded memtest, but I can attempt to nail that down.

> Is the allocation of 99.9% of memory and 100% of swap normal?

Another good question. I've been monitoring swap space consumption while the memory test is running, and I haven't ever seen total swap space used go above about 700MB while running the test. I captured 4 bmps of the console while the test was running; I'll attach those to the bug.

> But really, how long should I expect this test to take?

As I said, my test completes in about 1 1/2 hours. Anything over a couple of hours, I would suspect there might be some problems.

Created an attachment (id=34177): screenshot as threaded memtest is beginning

Created an attachment (id=34178): sosreport from system which overcommits memory during threaded test
This is a rhel5.1 64bit xen system. It reports 96GB of memory in dom0. Memory test was launched via 'make' in the memory directory.
First portion of memory test completed, then the threaded test started and 'top' showed all of memory and swap consumed. The system was generally unresponsive, then after ~3hrs put the out-of-memory error on the screen and killed threaded_memtes. The system eventually came back to normal operation. sosreport attached.

Created an attachment (id=34179): screenshot after threaded memtest has initialized

Created an attachment (id=34182): screenshot while memtest is running
I took this screenshot because the display hung at this point for several minutes, although the system was still active, top continued to run, etc.

Created an attachment (id=34183): screenshot after threaded memtest has completed

Doh! Sorry about that. I already blew away the partitions for a fresh install...

I have installed rhel5.1 64bit non-xen on my system that was failing with the xen kernel, with no other changes (not even an AC cycle), and now have the following results over the past 24hrs. Installed rhel5.1 non-xen 64bit and selected only the webserver package (did not select software development).

Run1: installed hts-32, all supporting libs, gcc, rpm-build, mt, dt, lmbench, and stress; launched memory test via 'make' in the /usr/share/hts/tests/memory directory; test completed with exit code 0 and pass.

Run2: uninstalled hts-32 and installed hts-48; launched memory test via 'make' in the /usr/share/hts/tests/memory directory; test completed with exit code 0 and pass.

Run3: uninstalled hts-48, then deleted all hts directories under /usr/share; reinstalled hts-48; launched the memory test via 'hts plan', 'hts certify --test memory'; test completed with exit code 0 and pass.

All three runs were launched from an xterm locally in the gui. I am unconvinced at this point that non-xen has any issue and am reinstalling rhel5.1 xen.
Created an attachment (id=34309): sosreport for 96GB rhel5.1 non-xen 64bit passing config
This event sent from IssueTracker by streeter [Support Engineering Group] issue 160220

----- Additional Comments From zorek.com 2008-01-25 13:41 EDT -------
sosreport from 96GB rhel5.1 xen 64bit that now passes memory test

Reinstalled rhel5.1 64bit xen on the 2-node that previously failed 100% of memory test runs in xen. The system has 96GB of memory installed. I installed using a xen activation key, selected to delete all partitions and let the NOS use default partitioning, and chose the webserver and software development packages. The system installed and allocated 4GB of swap.

Run1: installed hts-32 and all supporting packages; memory test launched via 'make' and ran to completion, exit 0, PASS.
Run2: uninstalled hts-32 and installed hts-48; memory test launched via 'make' and ran to completion, exit 0, PASS.

Reinstalling xen, strictly limiting swap to 2GB, and will rerun.

----- Additional Comments From zorek.com 2008-01-25 16:02 EDT -------
Reinstalled rhel5.1 64bit xen, and with all 2nd-node drives pulled it installed and defaulted to ~2G swap. Installed hts and ran the memory test via 'make'; it consumed all of swap and hung trying to recover during the threaded_memtest.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

----- Additional Comments From zorek.com 2008-01-28 09:25 EDT -------
With the rhel5.1 64bit xen kernel, the amount of swap that the NOS defaults to at installation time appears to be insufficient to properly run the threaded portion of the memory test. When the NOS has 96GB-224GB, we have seen the memory test fail due to the NOS only allocating ~2GB of swap space. When we manually force swap to 4GB at installation time, the memory test passes.
We have not tested memory arrays between 64GB and 96GB. We have only tested 96GB to 224GB for this debug.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

----- Additional Comments From mcdermoc.com (prefers email at lcm.com) 2008-01-28 10:43 EDT -------
Reproduced the hang on the Beaverton system with the RHEL5.1 Xen kernel and 2GB swap (created a 2GB swap file and disabled the 4GB primary swap partition). This was on a 3-node system with 112GB memory. The test passes with the default 4GB swap partition setup on this system (I manually set up 4GB swap during install).

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

Hello SEG,
I am sending this up. It appears to be an issue with the hardware cert test: if there is only 2GB of swap, it fails the memory test. With 4GB of swap it passes. This is only an issue if the server has large amounts of memory.
Thank You
Joe Kachuck
Issue escalated to Support Engineering Group by: jkachuck.
Internal Status set to 'Waiting on SEG'

----- Additional Comments From hansendc.com (prefers email at haveblue.com) 2008-01-28 13:55 EDT -------
I'd also like to see a test of whether raising min_free_kbytes helps the system pass. The deadlock (or livelock) that occurs is probably because the system is too low on memory to get any work done at all. Raising min_free_kbytes will hopefully kick the system into reclaim earlier, before it can really get into trouble. Taking the existing value and doubling or tripling it would be a good start:
$ cat /proc/sys/vm/min_free_kbytes
3816
$ echo 15000 > /proc/sys/vm/min_free_kbytes
BTW, that's on my 2GB laptop. It should be MUCH larger on a 96GB system.
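The doubling-or-tripling suggestion above can be scripted so the arithmetic is separated from the privileged write. A minimal sketch; the apply step is left commented out since it needs root and, as the later comments show, did not ultimately fix this hang:

```shell
#!/bin/sh
# Double vm.min_free_kbytes, per the tuning suggestion above, so page
# reclaim starts earlier. The write requires root and does not persist
# across reboot.

doubled() {
    # $1 = current min_free_kbytes value
    echo $(( $1 * 2 ))
}

if [ -r /proc/sys/vm/min_free_kbytes ]; then
    cur=$(cat /proc/sys/vm/min_free_kbytes)
    new=$(doubled "$cur")
    echo "min_free_kbytes: $cur -> $new"
    # Uncomment to apply (root only):
    # echo "$new" > /proc/sys/vm/min_free_kbytes
fi
```

On the 112GB system discussed below, this is exactly the 42895 -> 85790 change that was tried.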
This event sent from IssueTracker by Glen Johnson issue 160220

----- Additional Comments From zorek.com 2008-01-28 14:12 EDT -------
We just need to know what is acceptable to Red Hat for IBM to do in order to pass certification using the hts-048 package on this product with a large memory array installed. We need to start our test certification runs tomorrow (1/29/2008).

----- Additional Comments From hansendc.com (prefers email at haveblue.com) 2008-01-28 14:26 EDT -------
I'm suggesting you try an alternate method that wouldn't require repartitioning or changing the swap setup. It might be an easier way to get RH to certify the configuration if it's a simpler config tweak.

----- Additional Comments From mcdermoc.com (prefers email at lcm.com) 2008-01-28 14:59 EDT -------
I'm currently testing a 64GB configuration to determine whether the failure occurs with less than 96GB. Once this is finished, I'll test min_free_kbytes so the test team can concentrate on their certification runs.

----- Additional Comments From mcdermoc.com (prefers email at lcm.com) 2008-01-28 17:27 EDT -------
Memory test completed with a PASS result on the 64GB configuration. So the failure point is somewhere between 64GB and 96GB. I'm moving back to 112GB and will start playing with min_free_kbytes.

----- Additional Comments From mcdermoc.com (prefers email at lcm.com) 2008-01-29 19:28 EDT -------
On the 112GB system, increased min_free_kbytes from 42895 to 85790 (doubled).
The system still hangs, but I was able to capture this repeatedly scrolling on the serial console:

DMA per-cpu:
cpu 0 hot: high 186, batch 31 used:27    cpu 0 cold: high 62, batch 15 used:17
cpu 1 hot: high 186, batch 31 used:48    cpu 1 cold: high 62, batch 15 used:53
cpu 2 hot: high 186, batch 31 used:40    cpu 2 cold: high 62, batch 15 used:50
cpu 3 hot: high 186, batch 31 used:131   cpu 3 cold: high 62, batch 15 used:49
cpu 4 hot: high 186, batch 31 used:175   cpu 4 cold: high 62, batch 15 used:15
cpu 5 hot: high 186, batch 31 used:34    cpu 5 cold: high 62, batch 15 used:33
cpu 6 hot: high 186, batch 31 used:64    cpu 6 cold: high 62, batch 15 used:55
cpu 7 hot: high 186, batch 31 used:35    cpu 7 cold: high 62, batch 15 used:50
cpu 8 hot: high 186, batch 31 used:139   cpu 8 cold: high 62, batch 15 used:44
cpu 9 hot: high 186, batch 31 used:180   cpu 9 cold: high 62, batch 15 used:54
cpu 10 hot: high 186, batch 31 used:47   cpu 10 cold: high 62, batch 15 used:40
cpu 11 hot: high 186, batch 31 used:77   cpu 11 cold: high 62, batch 15 used:55
cpu 12 hot: high 186, batch 31 used:9    cpu 12 cold: high 62, batch 15 used:14
cpu 13 hot: high 186, batch 31 used:179  cpu 13 cold: high 62, batch 15 used:53
cpu 14 hot: high 186, batch 31 used:58   cpu 14 cold: high 62, batch 15 used:60
cpu 15 hot: high 186, batch 31 used:165  cpu 15 cold: high 62, batch 15 used:49
cpu 16 hot: high 186, batch 31 used:162  cpu 16 cold: high 62, batch 15 used:57
cpu 17 hot: high 186, batch 31 used:157  cpu 17 cold: high 62, batch 15 used:57
cpu 18 hot: high 186, batch 31 used:21   cpu 18 cold: high 62, batch 15 used:53
cpu 19 hot: high 186, batch 31 used:36   cpu 19 cold: high 62, batch 15 used:59
cpu 20 hot: high 186, batch 31 used:8    cpu 20 cold: high 62, batch 15 used:43
cpu 21 hot: high 186, batch 31 used:10   cpu 21 cold: high 62, batch 15 used:46
cpu 22 hot: high 186, batch 31 used:70   cpu 22 cold: high 62, batch 15 used:48
cpu 23 hot: high 186, batch 31 used:14   cpu 23 cold: high 62, batch 15 used:58
cpu 24 hot: high 186, batch 31 used:111  cpu 24 cold: high 62, batch 15 used:48
cpu 25 hot: high 186, batch 31 used:6    cpu 25 cold: high 62, batch 15 used:57
cpu 26 hot: high 186, batch 31 used:175  cpu 26 cold: high 62, batch 15 used:49
cpu 27 hot: high 186, batch 31 used:154  cpu 27 cold: high 62, batch 15 used:49
cpu 28 hot: high 186, batch 31 used:83   cpu 28 cold: high 62, batch 15 used:27
cpu 29 hot: high 186, batch 31 used:24   cpu 29 cold: high 62, batch 15 used:50
cpu 30 hot: high 186, batch 31 used:31   cpu 30 cold: high 62, batch 15 used:60
cpu 31 hot: high 186, batch 31 used:145  cpu 31 cold: high 62, batch 15 used:53
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 84708kB (0kB HighMem)
Active:16392192 inactive:11675846 dirty:0 writeback:0 unstable:0 free:21177 slab:9090 mapped-file:256 mapped-anon:28069897 pagetables:61353
DMA free:84708kB min:85788kB low:107232kB high:128680kB active:65568768kB inactive:46703384kB present:115003652kB pages_scanned:719607243 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 20*4096kB = 84708kB
DMA32: empty
Normal: empty
HighMem: empty
Swap cache: add 512785, delete 512828, find 49/159, race 0+0
Free swap = 0kB
Total swap = 2047992kB
threaded_memtes invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
 [<ffffffff802b45f6>] out_of_memory+0x8b/0x203
 [<ffffffff8020f04d>] __alloc_pages+0x22b/0x2b4
 [<ffffffff8023265d>] read_swap_cache_async+0x42/0xd1
 [<ffffffff802b8d4f>] swapin_readahead+0x4e/0x77
 [<ffffffff802092aa>] __handle_mm_fault+0xae2/0xf4d
 [<ffffffff802641bf>] do_page_fault+0xe4c/0x11e0
 [<ffffffff8025d823>] error_exit+0x0/0x6e
 [<ffffffff8025ee7c>] __get_user_8+0x20/0x2c
 [<ffffffff80297a90>] exit_robust_list+0x20/0xd0
 [<ffffffff8026187d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80215071>] do_exit+0x232/0x88a
 [<ffffffff80247a96>] cpuset_exit+0x0/0x6b
 [<ffffffff8022acce>] get_signal_to_deliver+0x43e/0x470
 [<ffffffff8025a12c>] do_notify_resume+0x9c/0x7b4
 [<ffffffff80221f5c>] __up_read+0x19/0x7f
 [<ffffffff8028186a>] default_wake_function+0x0/0xe
 [<ffffffff802641f2>] do_page_fault+0xe7f/0x11e0
 [<ffffffff802a84c7>] audit_syscall_exit+0x2fb/0x319
 [<ffffffff8025d424>] int_signal+0x12/0x17

So, it looks like the kernel is almost doing the right thing (note that DMA free:84708kB sits just below min:85788kB, with swap fully depleted, so reclaim cannot make progress). I'll try increasing min_free_kbytes a bit more to see what happens. In the meantime, Dave, is there a recipe for what the min_free_kbytes threshold should be?

----- Additional Comments From mcdermoc.com (prefers email at lcm.com) 2008-01-31 14:05 EDT -------
I tried several values for min_free_kbytes (up to 1GB). In every case, the OOM code kicks in and attempts to kill the memory test. Unfortunately, in every case the system still hung.
The console always shows:

crond invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
 [<ffffffff802b45f6>] out_of_memory+0x8b/0x203
 [<ffffffff8020f04d>] __alloc_pages+0x22b/0x2b4
 [<ffffffff8025b6cd>] cache_alloc_refill+0x269/0x4ba
 [<ffffffff8020ab37>] kmem_cache_alloc+0x50/0x6d
 [<ffffffff802121fd>] getname+0x25/0x1c1
 [<ffffffff8022354b>] __user_walk_fd+0x19/0x4c
 [<ffffffff802283f9>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff8024ce49>] lock_hrtimer_base+0x26/0x4c
 [<ffffffff8023ad92>] hrtimer_try_to_cancel+0x4a/0x53
 [<ffffffff8025992a>] hrtimer_cancel+0xc/0x16
 [<ffffffff80260a45>] do_nanosleep+0x47/0x70
 [<ffffffff802232eb>] sys_newstat+0x19/0x31
 [<ffffffff8025d291>] tracesys+0x47/0xb2
 [<ffffffff8025d2f1>] tracesys+0xa7/0xb2
Mem-info:
DMA per-cpu:
cpu 0 hot: high 186, batch 31 used:104   cpu 0 cold: high 62, batch 15 used:17
cpu 1 hot: high 186, batch 31 used:110   cpu 1 cold: high 62, batch 15 used:52
cpu 2 hot: high 186, batch 31 used:133   cpu 2 cold: high 62, batch 15 used:14
cpu 3 hot: high 186, batch 31 used:35    cpu 3 cold: high 62, batch 15 used:53
cpu 4 hot: high 186, batch 31 used:162   cpu 4 cold: high 62, batch 15 used:58
cpu 5 hot: high 186, batch 31 used:158   cpu 5 cold: high 62, batch 15 used:49
cpu 6 hot: high 186, batch 31 used:104   cpu 6 cold: high 62, batch 15 used:38
cpu 7 hot: high 186, batch 31 used:173   cpu 7 cold: high 62, batch 15 used:55
cpu 8 hot: high 186, batch 31 used:78    cpu 8 cold: high 62, batch 15 used:54
cpu 9 hot: high 186, batch 31 used:171   cpu 9 cold: high 62, batch 15 used:52
cpu 10 hot: high 186, batch 31 used:133  cpu 10 cold: high 62, batch 15 used:58
cpu 11 hot: high 186, batch 31 used:35   cpu 11 cold: high 62, batch 15 used:50
cpu 12 hot: high 186, batch 31 used:107  cpu 12 cold: high 62, batch 15 used:24
cpu 13 hot: high 186, batch 31 used:160  cpu 13 cold: high 62, batch 15 used:50
cpu 14 hot: high 186, batch 31 used:162  cpu 14 cold: high 62, batch 15 used:61
cpu 15 hot: high 186, batch 31 used:79   cpu 15 cold: high 62, batch 15 used:47
cpu 16 hot: high 186, batch 31 used:168  cpu 16 cold: high 62, batch 15 used:47
cpu 17 hot: high 186, batch 31 used:161  cpu 17 cold: high 62, batch 15 used:54
cpu 18 hot: high 186, batch 31 used:30   cpu 18 cold: high 62, batch 15 used:45
cpu 19 hot: high 186, batch 31 used:170  cpu 19 cold: high 62, batch 15 used:55
cpu 20 hot: high 186, batch 31 used:113  cpu 20 cold: high 62, batch 15 used:53
cpu 21 hot: high 186, batch 31 used:135  cpu 21 cold: high 62, batch 15 used:48
cpu 22 hot: high 186, batch 31 used:30   cpu 22 cold: high 62, batch 15 used:59
cpu 23 hot: high 186, batch 31 used:34   cpu 23 cold: high 62, batch 15 used:56
cpu 24 hot: high 186, batch 31 used:32   cpu 24 cold: high 62, batch 15 used:51
cpu 25 hot: high 186, batch 31 used:166  cpu 25 cold: high 62, batch 15 used:51
cpu 26 hot: high 186, batch 31 used:86   cpu 26 cold: high 62, batch 15 used:54
cpu 27 hot: high 186, batch 31 used:159  cpu 27 cold: high 62, batch 15 used:61
cpu 28 hot: high 186, batch 31 used:31   cpu 28 cold: high 62, batch 15 used:47
cpu 29 hot: high 186, batch 31 used:182  cpu 29 cold: high 62, batch 15 used:54
cpu 30 hot: high 186, batch 31 used:168  cpu 30 cold: high 62, batch 15 used:49
cpu 31 hot: high 186, batch 31 used:30   cpu 31 cold: high 62, batch 15 used:51
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 1029384kB (0kB HighMem)
Active:13968223 inactive:13864378 dirty:0 writeback:0 unstable:0 free:257346 slab:9218 mapped-file:256 mapped-anon:27833002 pagetables:60944
DMA free:1029384kB min:1029480kB low:1286848kB high:1544220kB active:55872892kB inactive:55457512kB present:115003652kB pages_scanned:386262567 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 1*8kB 2*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 251*4096kB = 1029384kB
DMA32: empty
Normal: empty
HighMem: empty
Swap cache: add 1631998, delete 1632141, find 5731/11296, race 0+0
Free swap = 0kB
Total swap = 2047992kB
Free swap: 0kB
28750913 pages of RAM
575476 reserved pages
6869 pages shared
0 pages swap cached
Out of memory: Killed process 13120 (threaded_memtes).

A couple of observations. First and foremost, this problem can be rectified by ensuring swap space is > 2GB (e.g., the memtest runs to a successful completion if swap is increased to 4GB). Obviously, 2GB is not enough swap for a system configured with 96GB of memory, particularly when running a workload (in this case the RHEL5 cert test) that by design maps and writes to all available memory pages.

The second observation is that this only occurs with the Xen kernel. The standard RHEL5.1 SMP kernel passes the memory test without issues, even with only 2GB of swap.

So, the questions for Red Hat, given that we are submitting certifications using the Xen kernel and the Red Hat installer evidently does NOT by default enable enough swap space on large memory systems:
- Does Red Hat offer any recommendations/guidelines for customers regarding swap space allocation on large memory systems?
- Will Red Hat accept certification results from a system where the default swap space configuration has been altered in order to get around the system hangs resulting from the hts threaded memory test?

Fixed in hts 5.2.

Greg, could you please provide an environment for testing? Thank you.

Resolving due to lack of hardware for verification.
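The workaround that made the test pass, growing swap to at least 4GB, can be done with a swap file as described above, without repartitioning. A sketch of the standard mkswap/swapon sequence; the path and the dry-run wrapper are illustrative (hypothetical), not commands copied from the report. Run as root with DRYRUN=0 to actually apply:

```shell
#!/bin/sh
# Add a swap file so total swap reaches the 4 GB that passed in this
# report. SWAPFILE is a hypothetical path; DRYRUN=1 only prints commands.
SWAPFILE=${SWAPFILE:-/swapfile-cert}
SIZE_MB=${SIZE_MB:-4096}
DRYRUN=${DRYRUN:-1}

run() {
    # Print instead of execute unless DRYRUN is disabled.
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MB"
run chmod 600 "$SWAPFILE"
run mkswap "$SWAPFILE"
run swapon "$SWAPFILE"
```

Disabling an existing partition instead (as was done to force the 2GB reproduction) is the reverse: swapoff the partition and comment out its /etc/fstab entry.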