Bug 252939 - Long Delay before OOMKill launches
Summary: Long Delay before OOMKill launches
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 246627 392351 422551 430698
Reported: 2007-08-16 07:30 UTC by Norm Murray
Modified: 2018-10-19 21:16 UTC (History)

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:15:27 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Fix for this issue. (1.37 KB, patch)
2008-02-19 20:18 UTC, Larry Woodman


Links
System: Red Hat Product Errata
ID: RHSA-2008:0665
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7
Last Updated: 2008-07-24 16:41:06 UTC

Description Norm Murray 2007-08-16 07:30:28 UTC
A patch provided by Larry Woodman in BZ 205772 helped, but now, with the
patch applied, the customer is experiencing an issue where a certain
percentage of the nodes (around 20%+) hang for a few minutes before the system
decides to do oomkill.  The node eventually recovers, but the delay is long
enough that the customer's SLURM utility marks the nodes as being offline.

We do have what appears to be a reproducer for the current problem.

This problem shows up on the customer's swapless, diskless, quad-socket
dual-core x86_64 nodes (and it appears that all of those attributes may play a
part in creating this problem).  The nodes already have min_free_kbytes
quadrupled and Larry's patch to deal with the handoff of the lock between CPUs.

The amount of memory in the system appears to be a factor in how long the
delay is.  On the customer's nodes (with 16GB of memory) the delay lasted
several minutes before the OOMkiller launched and recovered the system.

I tried this same reproducer on a quad-socket dual-core AMD64 system in the
lab.  This system, however, had 32GB of memory, and it took several hours
before the OOMkiller launched and the system recovered.  I've attached a log
with some oomkill information in it.

Comment 1 Norm Murray 2007-08-16 07:32:32 UTC
I've gotten a bit confused as to which patches we should be trying out
here. We still have the unfortunately long pauses when memory gets
tight. Can you help me sort out what is what?

rhel4-swapout_limit.patch - the one in this email

rhel4-blkio.patch - this might have improved things over stock RHEL4.5
but we're not sure.

rhel4-swap.patch (there seem to be several of these, and I've gotten
myself really confused by that). I think what happened is that you made
the blkio patch, then you made a rhel4-swap.patch which caused problems
for us on diskless nodes. Then you sent me a new version of
rhel4-swap.patch which fixed the problem with the first rhel4-swap.patch.
Now, with your reply to Rik's comments, there is yet another
rhel4-swap.patch.

rhel4-inactive_list.patch

Larry Woodman wrote:
> Another problem with the 2.6 VM is that it will continue to swap and
> reclaim pagecache memory heavily even if several processes exit and free
> most of the memory once it starts reclaiming.  The reason for this is
> that shrink_zone() never looks at the zone's free count so if it
> determines that it needs to reclaim thousands of pages it won't stop
> until the memory is freed even if there is no longer a need.  The
> attached patch fixes this problem by limiting the pages reclaimed if the
> free list ever exceeds twice the pages_high watermark in shrink_zone().
>
> Fixes BZ 234572
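
For readers trying to keep the patches straight, the early exit Larry describes above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the actual patch: struct zone_model, zone_has_enough_free() and shrink_zone_model() are hypothetical names, the batch size is arbitrary, and the model pretends every scanned page is freed. Only the "stop once the free list exceeds twice pages_high" rule comes from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-zone counters. */
struct zone_model {
    unsigned long free_pages;  /* pages currently on the free list */
    unsigned long pages_high;  /* high watermark for the zone */
};

/* True once the free list exceeds twice the pages_high watermark. */
bool zone_has_enough_free(const struct zone_model *z)
{
    return z->free_pages > 2 * z->pages_high;
}

/*
 * Reclaim up to nr_to_reclaim pages in batches, rechecking the free
 * count before every batch so reclaim stops as soon as the zone is
 * healthy again.  Returns the number of pages reclaimed.
 * Assumption: each batch frees exactly the pages it asks for.
 */
unsigned long shrink_zone_model(struct zone_model *z,
                                unsigned long nr_to_reclaim,
                                unsigned long batch)
{
    unsigned long reclaimed = 0;

    assert(batch > 0);  /* a zero batch would never make progress */

    while (reclaimed < nr_to_reclaim) {
        unsigned long n;

        /* The check the description says was missing. */
        if (zone_has_enough_free(z))
            break;

        n = nr_to_reclaim - reclaimed;
        if (n > batch)
            n = batch;

        z->free_pages += n;
        reclaimed += n;
    }
    return reclaimed;
}
```

In this model, a zone whose free count already exceeds twice pages_high (for example because several processes just exited) reclaims nothing, no matter how many pages were originally requested.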

Comment 3 Larry Woodman 2007-08-27 15:41:25 UTC
I'm currently working on this issue.  I know that Ben wants faster OOM killing
to occur, and I am working on that, but did the system get worse with the
RHEL4-U6 patches?

Larry


Comment 5 RHEL Program Management 2008-01-08 17:07:55 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Larry Woodman 2008-02-19 20:18:11 UTC
Created attachment 295331
Fix for this issue.


BTW, this is the patch I posted to rhkernel-list on 12/22/07

Larry

Comment 8 Ben Woodard 2008-02-21 03:48:30 UTC
Do you have the RHEL5 version of this patch, the one that went into the test
kernel you sent me a few days ago? That is the patch I was interested in.
Alternatively, is this patch directly adaptable to RHEL5?
Is there another BZ?

Comment 9 Vivek Goyal 2008-02-21 14:50:10 UTC
Committed in 68.12. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 10 Don Domingo 2008-04-02 02:13:49 UTC
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 14 errata-xmlrpc 2008-07-24 19:15:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

