A patch was provided by Larry Woodman in BZ 205772 that helped, but now, with
the patch, the customer is experiencing an issue where a certain percentage of
the nodes (around 20%+) will hang for a few minutes before the system decides
to oomkill. The node eventually recovers, but the delay has been long enough
that the customer's SLURM utility marks the nodes as offline.
We do have what appears to be a reproducer for the current problem.
This problem shows up on the customer's swapless, diskless, quad-socket,
dual-core x86_64 nodes (and it appears all of those attributes may play a part
in creating this problem). The nodes already have min_free_kbytes quadrupled
and Larry's patch applied to deal with the handoff of the lock between CPUs.
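For reference, quadrupling min_free_kbytes on such a node might look like the
following sketch; the customer's actual values are not recorded in this report,
so the numbers and persistence step here are illustrative only:

```shell
# Read the current reserve and quadruple it (illustrative; run as root).
cur=$(cat /proc/sys/vm/min_free_kbytes)
echo $((cur * 4)) > /proc/sys/vm/min_free_kbytes

# To persist across reboots, the equivalent vm.min_free_kbytes setting
# would go in /etc/sysctl.conf.
```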
The amount of memory in the system appears to be a factor in how long the
delay lasts. On the customer's nodes (with 16GB of memory) the delay lasted
several minutes before the OOMkiller launched and recovered the system.
I tried this same reproducer out on a quad-socket dual-core AMD64 system in the
lab. This system, however, had 32GB of memory, and it took several hours before
the OOMkiller launched and the system recovered. I've attached a log with some
oomkill information in it.
I've gotten a bit confused as to which patches we should be trying out
here. We still have the unfortunately long pauses when memory gets
tight. Can you help me sort out what is what?
rhel4-swapout_limit.patch - the one in this email
rhel4-blkio.patch - this might have improved things over stock RHEL4.5,
but we're not sure.
rhel4-swap.patch - there seem to be several of these, and I've gotten
myself really confused by that. I think what happened is: you made the
blkio patch, then you made a rhel4-swap.patch which caused problems for
us on diskless nodes. Then you sent me a new version of the
rhel4-swap.patch which fixed the problem with the first rhel4-swap.patch.
Now, when you replied to Rik's comments, you have a new rhel4-swap.patch.
Larry Woodman wrote:
> Another problem with the 2.6 VM is that it will continue to swap and
> reclaim pagecache memory heavily even if several processes exit and free
> most of the memory once it starts reclaiming. The reason for this is
> that shrink_zone() never looks at the zone's free count so if it
> determines that it needs to reclaim thousands of pages, it won't stop
> until the memory is freed, even if there is no longer a need. The
> attached patch fixes this problem by limiting the pages reclaimed if the
> free list ever exceeds twice the pages_high watermark in shrink_zone().
> Fixes BZ 234572
I'm currently working on this issue. I know that Ben wants faster OOM killing
to occur, and I am working on that, but did the system get worse with the
RHEL4-U6 kernel?
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 295331 [details]
Fix for this issue.
BTW, this is the patch I posted to rhkernel-list on 12/22/07
Do you have the RHEL5 version of this patch? The one that went into the test
kernel you sent me a few days ago. That is the patch I was interested in.
Alternatively, is this patch directly adaptable to RHEL5?
Is there another BZ?
Committed in 68.12. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
The RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.
A mockup of the RHEL5.2 release notes can be viewed at the following link:
Please use the aforementioned link to verify whether your bugzilla is already in
the release notes (if it needs to be). Each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.