Bug 1105225 - Watchman OOM plugin fails to restart gears
Summary: Watchman OOM plugin fails to restart gears
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Brenton Leanhardt
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 1096863 1104902
Blocks:
Reported: 2014-06-05 15:43 UTC by Brenton Leanhardt
Modified: 2014-08-04 13:27 UTC (History)

Fixed In Version: openshift-origin-node-util-1.22.11.1-1.el6op
Doc Type: Bug Fix
Doc Text:
In certain scenarios when using the Watchman OOM plug-in, gears would fail to be restarted after running out of memory. This bug fix addresses several Watchman issues, and Watchman now restarts gears that have run out of memory, as expected.
Clone Of: 1104902
Environment:
Last Closed: 2014-08-04 13:27:19 UTC
Target Upstream Version:
Embargoed:



Links
System ID: Red Hat Product Errata RHBA-2014:0999
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise 2.1.4 bug fix and enhancement update
Last Updated: 2014-08-04 17:26:43 UTC

Description Brenton Leanhardt 2014-06-05 15:43:28 UTC
+++ This bug was initially created as a clone of Bug #1104902 +++

Description of problem:

Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart.  Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.
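
The failure mode described above can be sketched in a few lines of Ruby. This is an illustrative model only, not the actual Watchman or OOM plugin code: the Gear class, its state field, and the restart_if_running, handle_oom_buggy, and handle_oom_with_start helpers are hypothetical names invented for this example.

```ruby
# Hypothetical model of a gear; not the real OpenShift node API.
class Gear
  attr_reader :uuid, :state

  def initialize(uuid)
    @uuid = uuid
    @state = :running
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :running
  end
end

# Models a restart that only acts on running gears, as this report
# describes for watchman's restart. Returns true if a restart happened.
def restart_if_running(gear)
  return false unless gear.state == :running

  gear.stop
  gear.start
  true
end

# Buggy flow: the gear is always stopped first, so the guarded restart
# becomes a no-op and the gear stays stopped.
def handle_oom_buggy(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible remedy (an assumption, not necessarily the upstream fix):
# start the gear explicitly after stopping it.
def handle_oom_with_start(gear)
  gear.stop
  gear.start
  true
end

buggy = Gear.new('XXXX')
handle_oom_buggy(buggy)
puts "buggy flow leaves gear #{buggy.state}"

fixed = Gear.new('XXXX')
handle_oom_with_start(fixed)
puts "stop-then-start flow leaves gear #{fixed.state}"
```

In this model the buggy flow ends with the gear stopped, which matches the "Stopping gear XXXX ... [ OK ]" line being the last thing logged in platform.log.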

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:

Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:

Watchman logs two lines:

Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge
kill -9 94742
kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:

In addition to the above logs, we should see a restart attempt with a message like:

watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})
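
A quick way to check for this symptom is to scan the syslog lines for gears that logged an OOM detection with no later restart message. The sketch below assumes the two message formats quoted in this report; the gears_missing_restart helper and both regular expressions are hypothetical and are not part of Watchman.

```ruby
# Patterns based on the log messages quoted in this report.
OOM_FOUND = /OOM Plugin: Found gear (\S+) under OOM/
RESTARTED = /watchman restarted user (\S+?):/

# Hypothetical helper: returns the gear IDs that logged an OOM
# detection but no later "watchman restarted user" message.
def gears_missing_restart(lines)
  pending = []
  lines.each do |line|
    if (m = OOM_FOUND.match(line))
      pending << m[1] unless pending.include?(m[1])
    elsif (m = RESTARTED.match(line))
      pending.delete(m[1])
    end
  end
  pending
end

sample = [
  'Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.',
  'Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting'
]
puts gears_missing_restart(sample).inspect
```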

Comment 1 Brenton Leanhardt 2014-07-14 19:11:08 UTC
Upstream commits:

commit 115b72f8260e4fea936161d22d6e3dab8407e46e
Author: Andy Grimm <agrimm>
Date:   Thu Jun 5 10:37:34 2014 -0400

    Bug 1104902 - Fix several bugs in OOM Plugin app restarts
    
    Several code paths had not been properly tested in this code, and
    various typos and logic errors have been corrected.

commit 4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Author: Jhon Honce <jhonce>
Date:   Tue Jun 17 11:55:45 2014 -0700

    Bug 1104902 - Fix unit tests

Comment 4 Anping Li 2014-07-16 10:15:16 UTC
Verified; passes on puddle-2-1-2014-07-15.

Highlight: in puddle-2-1-2014-07-15, the oom_plugin was imported into OSE 2.1z. oo-cgroup-disable/enable must be executed for all containers to enable this plugin.

1. rhc app create jbosseap jbosseap
2. Run the application so that it consumes memory until it runs out.
3. Watch /var/log/messages; the task is killed by the kernel when it hits the resource limit.

Jul 16 04:07:12 node kernel: Task in /openshift/53c62dd24cfeff6c83000001 killed as a result of limit of /openshift/53c62dd24cfeff6c83000001
Jul 16 04:07:12 node kernel: memory: usage 524224kB, limit 524288kB, failcnt 21135
Jul 16 04:07:12 node kernel: memory+swap: usage 626688kB, limit 626688kB, failcnt 27


Verified on puddle-2-1-2014-07-15, running the same steps as above.

1) The OOM plugin works.
Jul 16 06:04:15 node dhclient[1016]: bound to 192.168.55.38 -- renewal in 51 seconds.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Found gear 53c64d4e4cfeff1e1b00003f under OOM.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Increasing memory for gear 53c64d4e4cfeff1e1b00003f to 705901363 and restarting


2) The gear was restarted by watchman.
July 16 06:04:41 INFO AdminGearsControl: initialized for gear(s) 53c64d4e4cfeff1e1b00003f
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
July 16 06:04:42 INFO 53c64d4e4cfeff1e1b00003f start against 'jbosseap'
July 16 06:05:07 INFO Shell command '/sbin/runuser -s /bin/sh 53c64d4e4cfeff1e1b00003f ******
Found 127.2.247.129:8080 listening port
Found 127.2.247.129:9999 listening port
~/jbosseap/standalone/deployments ~/jbosseap
~/jbosseap
Artifacts deployed: ./ROOT.war

July 16 06:05:07 INFO (20525) Starting gear 53c64d4e4cfeff1e1b00003f ... [ OK ]
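
The value 705901363 in the "Increasing memory" log line is consistent with a 10% increase over the 626688kB memory+swap limit shown in the kernel log, assuming kB means 1024 bytes and that the gears involved share the same limit. The 10% factor is inferred from these numbers and is not confirmed anywhere in this report; bumped_limit_bytes is a hypothetical helper used only to show the arithmetic.

```ruby
# Hypothetical helper: converts a limit in kB (1024 bytes) to bytes and
# raises it by the given percentage, using integer arithmetic throughout.
def bumped_limit_bytes(limit_kb, percent = 10)
  limit_bytes = limit_kb * 1024
  limit_bytes * (100 + percent) / 100
end

# 626688 kB is the memory+swap limit from the kernel log above.
puts bumped_limit_bytes(626_688) # prints 705901363
```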

Comment 6 errata-xmlrpc 2014-08-04 13:27:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html

