A patch was provided by Larry Woodman in BZ 205772 that helped, but now, with the patch, the customer is experiencing an issue where a certain percentage of the nodes (around 20%+) will hang for a few minutes before the system decides to do an oomkill. The node eventually recovers, but the delay has been long enough that the customer's SLURM utility will mark the nodes as being offline. We do have what appears to be a reproducer for the current problem. This problem shows up on the customer's swapless, diskless, quad-socket dual-core x86_64 nodes (and it appears all of those items listed may play a part in creating this problem). The nodes already have min_free_kbytes quadrupled and Larry's patch to deal with the handoff of the lock between CPUs. The amount of memory in the system appears to play a factor in how long the delay is. On the customer's nodes (with 16GB of memory) the delay lasted several minutes before the OOM killer launched and recovered the system. I tried this same reproducer out on a quad-socket dual-core AMD64 system in the lab. This system, however, had 32GB of memory, and it took several hours before the OOM killer launched and the system recovered. I've attached a log with some oomkill information in it.
I've gotten a bit confused as to which patches we should be trying out here. We still have the unfortunately long pauses when memory gets tight. Can you help me sort out what is what?

rhel4-swapout_limit.patch - the one in this email
rhel4-blkio.patch - this might have improved things over stock RHEL4.5 but we're not sure.
rhel4-swap.patch (there seem to be several of these and I've gotten myself really confused by that)

I think what happened is you made the blkio patch, then you made a rhel4-swap.patch which caused problems for us on diskless nodes. Then you sent me a new version of the rhel4-swap.patch which fixed the problem with the first rhel4-swap.patch. Now that you have replied to Rik's comments, you have a new rhel4-swap.patch.

rhel4-inactive_list.patch

Larry Woodman wrote:
> Another problem with the 2.6 VM is that it will continue to swap and
> reclaim pagecache memory heavily even if several processes exit and free
> most of the memory once it starts reclaiming. The reason for this is
> that shrink_zone() never looks at the zone's free count so if it
> determines that it needs to reclaim thousands of pages it won't stop
> until the memory is freed even if there is no longer a need. The
> attached patch fixes this problem by limiting the pages reclaimed if the
> free list ever exceeds twice the pages_high watermark in shrink_zone().
>
> Fixes BZ 234572
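
For anyone following along, the early-exit idea Larry describes can be illustrated with a small userspace sketch. This is not the attached patch and not kernel code: `struct toy_zone`, `toy_shrink_zone()`, and the batch parameter are hypothetical names made up for illustration, and "reclaiming" is modeled simply as moving pages onto the free count. It only shows the control flow of checking the free count against twice pages_high on every pass instead of reclaiming a fixed target blindly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory zone.
 * This is NOT the kernel's struct zone. */
struct toy_zone {
    unsigned long free_pages;  /* pages currently on the free list */
    unsigned long pages_high;  /* high watermark for this zone */
};

/* Reclaim up to nr_to_reclaim pages in batches of at most `batch` pages,
 * but stop early as soon as the free count exceeds twice pages_high.
 * Returns the number of pages actually reclaimed. */
unsigned long toy_shrink_zone(struct toy_zone *z,
                              unsigned long nr_to_reclaim,
                              unsigned long batch)
{
    unsigned long reclaimed = 0;

    assert(z != NULL && batch > 0);

    while (reclaimed < nr_to_reclaim) {
        unsigned long n;

        /* The early exit: re-check the free list on every pass, so memory
         * freed elsewhere (e.g. by exiting processes) ends the reclaim. */
        if (z->free_pages > 2 * z->pages_high)
            break;

        n = nr_to_reclaim - reclaimed;
        if (n > batch)
            n = batch;

        /* Model reclaim as pages moving onto the free list. */
        z->free_pages += n;
        reclaimed += n;
    }
    return reclaimed;
}
```

With a zone that already has plenty of free memory (say free_pages well above 2 * pages_high), the function returns 0 right away even if it was asked to reclaim thousands of pages, which is the behavior the quoted description says stock shrink_zone() lacks.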
I'm currently working on this issue. I know that Ben wants faster OOM killing to occur and I am working on that, but did the system get worse with the RHEL4-U6 patches? Larry
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 295331: Fix for this issue. BTW, this is the patch I posted to rhkernel-list on 12/22/07. Larry
Do you have the RHEL5 version of this patch, the one that went into the test kernel you sent me a few days ago? That is the patch I was interested in. Alternatively, is this patch directly adaptable to RHEL5? Is there another BZ?
Committed in 68.12. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. A mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html Please use the aforementioned link to verify whether your bugzilla is already in the release notes (if it needs to be). Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html