Bug 1105225 - Watchman OOM plugin fails to restart gears
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Assigned To: Brenton Leanhardt
QA Contact: libra bugs
Keywords: Upstream
Depends On: 1096863 1104902
Blocks:
Reported: 2014-06-05 11:43 EDT by Brenton Leanhardt
Modified: 2014-08-04 09:27 EDT

See Also:
Fixed In Version: openshift-origin-node-util-1.22.11.1-1.el6op
Doc Type: Bug Fix
Doc Text:
In certain scenarios when using the Watchman OOM plug-in, gears would fail to be restarted after running out of memory. This bug fix addresses several Watchman issues, and Watchman now restarts gears that have run out of memory, as expected.
Story Points: ---
Clone Of: 1104902
Environment:
Last Closed: 2014-08-04 09:27:19 EDT
Type: Bug

External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2014:0999
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise 2.1.4 bug fix and enhancement update
Last Updated: 2014-08-04 13:26:43 EDT

Description Brenton Leanhardt 2014-06-05 11:43:28 EDT
+++ This bug was initially created as a clone of Bug #1104902 +++

Description of problem:

Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart.  Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.
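
The interaction described above can be modelled in a few lines. This is only a sketch: the Gear class and the three helper methods are hypothetical names chosen for illustration and are not taken from the OpenShift source, and the "explicit start" variant is one possible way around the problem, not necessarily what the actual patch does.

```ruby
# Hypothetical model of the control flow described above; the class and
# method names are illustrative and are not taken from the OpenShift source.
class Gear
  attr_reader :state

  def initialize
    @state = :running
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :running
  end
end

# Mirrors the described behaviour of watchman's restart: it acts only on a
# gear that is currently running and does nothing otherwise.
def restart_if_running(gear)
  return false unless gear.state == :running

  gear.stop
  gear.start
  true
end

# Reported flow: stop the OOMed gear, then call restart.
# The restart is skipped because the gear is already stopped.
def handle_oom_as_reported(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible alternative (an assumption, not the actual patch): after the
# stop, start the gear explicitly instead of relying on restart.
def handle_oom_with_explicit_start(gear)
  gear.stop
  gear.start
  true
end

reported = Gear.new
handle_oom_as_reported(reported)
puts "reported flow leaves gear: #{reported.state}"

alternative = Gear.new
handle_oom_with_explicit_start(alternative)
puts "explicit start leaves gear: #{alternative.state}"
```

In this model the reported flow leaves the gear stopped, which matches the platform.log excerpt further down, where only a stop is recorded.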

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:

Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:

Watchman logs two lines:

Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge
kill -9 94742
kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:

In addition to the above logs, we should see a restart attempt with a message like:

"watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})"
Comment 1 Brenton Leanhardt 2014-07-14 15:11:08 EDT
Upstream commits:

commit 115b72f8260e4fea936161d22d6e3dab8407e46e
Author: Andy Grimm <agrimm@redhat.com>
Date:   Thu Jun 5 10:37:34 2014 -0400

    Bug 1104902 - Fix several bugs in OOM Plugin app restarts
    
    Several code paths had not been properly tested in this code, and
    various typos and logic errors have been corrected.

commit 4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Author: Jhon Honce <jhonce@redhat.com>
Date:   Tue Jun 17 11:55:45 2014 -0700

    Bug 1104902 - Fix unit tests
Comment 4 Anping Li 2014-07-16 06:15:16 EDT
Verified; passes on puddle-2-1-2014-07-15.

Highlight: in puddle-2-1-2014-07-15, the oom_plugin was imported into OSE 2.1z. oo-cgroup-disable/enable must be executed for all containers to enable this plugin.

1. rhc app create jbosseap jbosseap
2. Run the application so that it consumes memory until it runs out.
3. Watch /var/log/messages; the task is killed by the kernel for exceeding its resource limit.

Jul 16 04:07:12 node kernel: Task in /openshift/53c62dd24cfeff6c83000001 killed as a result of limit of /openshift/53c62dd24cfeff6c83000001
Jul 16 04:07:12 node kernel: memory: usage 524224kB, limit 524288kB, failcnt 21135
Jul 16 04:07:12 node kernel: memory+swap: usage 626688kB, limit 626688kB, failcnt 27


Verified on puddle-2-1-2014-07-15 by running the same steps as above.

1) The OOM plugin works.
Jul 16 06:04:15 node dhclient[1016]: bound to 192.168.55.38 -- renewal in 51 seconds.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Found gear 53c64d4e4cfeff1e1b00003f under OOM.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Increasing memory for gear 53c64d4e4cfeff1e1b00003f to 705901363 and restarting
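
As a side note on the number in the "Increasing memory" line: it is consistent with the memory+swap limit from the kernel log above raised by 10% and truncated to whole bytes. The 10% factor is an inference from the logged values, not something stated in this report, and the check assumes the gear in the watchman log has the same limits as the gear in the kernel log. A short arithmetic sketch:

```ruby
# Limit taken from the kernel log line
# "memory+swap: usage 626688kB, limit 626688kB".
MEMSW_LIMIT_KB = 626_688

# Value watchman reported in "Increasing memory for gear ... to 705901363".
LOGGED_NEW_LIMIT_BYTES = 705_901_363

memsw_limit_bytes = MEMSW_LIMIT_KB * 1024

# Assumption: a 10% bump, computed with truncating integer arithmetic.
bumped_bytes = memsw_limit_bytes * 11 / 10

puts "memory+swap limit: #{memsw_limit_bytes} bytes"
puts "limit + 10%:       #{bumped_bytes} bytes"
puts "matches log value: #{bumped_bytes == LOGGED_NEW_LIMIT_BYTES}"
```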


2) The gear was restarted by watchman.
July 16 06:04:41 INFO AdminGearsControl: initialized for gear(s) 53c64d4e4cfeff1e1b00003f
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
July 16 06:04:42 INFO 53c64d4e4cfeff1e1b00003f start against 'jbosseap'
July 16 06:05:07 INFO Shell command '/sbin/runuser -s /bin/sh 53c64d4e4cfeff1e1b00003f ******
Found 127.2.247.129:8080 listening port
Found 127.2.247.129:9999 listening port
~/jbosseap/standalone/deployments ~/jbosseap
~/jbosseap
Artifacts deployed: ./ROOT.war

July 16 06:05:07 INFO (20525) Starting gear 53c64d4e4cfeff1e1b00003f ... [ OK ]
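
The manual check above (an OOM line in syslog followed by a successful start in platform.log) could be scripted roughly as follows. This is a sketch only: the method name gears_left_stopped is hypothetical, the regular expressions are derived from the log lines quoted in this report, and the check ignores timestamps, so it only tells whether a start line exists at all for each flagged gear.

```ruby
# Patterns derived from the log excerpts quoted in this report.
OOM_FOUND = /OOM Plugin: Found gear (\S+) under OOM/
GEAR_STARTED = /Starting gear (\S+) \.\.\. \[ OK \]/

# Hypothetical helper: returns the UUIDs of gears that watchman flagged as
# OOM but for which no successful "Starting gear" line appears in the
# platform log lines. Timestamps and ordering are not considered.
def gears_left_stopped(syslog_lines, platform_lines)
  flagged = syslog_lines.map { |line| line[OOM_FOUND, 1] }.compact.uniq
  started = platform_lines.map { |line| line[GEAR_STARTED, 1] }.compact.uniq
  flagged - started
end

syslog = [
  "Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Found gear 53c64d4e4cfeff1e1b00003f under OOM."
]
platform = [
  "July 16 06:05:07 INFO (20525) Starting gear 53c64d4e4cfeff1e1b00003f ... [ OK ]"
]

left_stopped = gears_left_stopped(syslog, platform)
puts "gears left stopped: #{left_stopped.inspect}"
```

With the lines from the original report, where platform.log only contains a "Stopping gear" entry, the same helper would return the affected gear instead of an empty list.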
Comment 6 errata-xmlrpc 2014-08-04 09:27:19 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html
