Bug 1105225

Summary: Watchman OOM plugin fails to restart gears
Product: OpenShift Container Platform Reporter: Brenton Leanhardt <bleanhar>
Component: Containers Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED ERRATA QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: high    
Version: 2.1.0 CC: adellape, agrimm, anli, jkeck, jokerman, libra-onpremise-devel, mmccomas, xjia
Target Milestone: --- Keywords: Upstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-origin-node-util-1.22.11.1-1.el6op Doc Type: Bug Fix
Doc Text:
In certain scenarios when using the Watchman OOM plug-in, gears would fail to be restarted after running out of memory. This bug fix addresses several Watchman issues, and Watchman now restarts gears that have run out of memory, as expected.
Story Points: ---
Clone Of: 1104902 Environment:
Last Closed: 2014-08-04 13:27:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1096863, 1104902    
Bug Blocks:    

Description Brenton Leanhardt 2014-06-05 15:43:28 UTC
+++ This bug was initially created as a clone of Bug #1104902 +++

Description of problem:

Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart.  Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.
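
The failure mode can be sketched in a few lines of Ruby. This is an illustration only: `Gear`, `watchman_restart`, `handle_oom_buggy`, and `handle_oom_fixed` are hypothetical names, not the actual Watchman or OOM plugin code, and the fix shown (starting the gear explicitly after the stop) is just one possible way to get the gear running again.

```ruby
# Hypothetical model of a gear; not the real OpenShift node API.
class Gear
  attr_reader :state

  def initialize
    @state = :running
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :running
  end
end

# Models the behavior described above: restart acts only on a running gear.
def watchman_restart(gear)
  return false unless gear.state == :running

  gear.stop
  gear.start
  true
end

# Buggy flow: the gear is stopped first, so the restart is a no-op
# and the gear is left stopped.
def handle_oom_buggy(gear)
  gear.stop
  watchman_restart(gear)
end

# One possible fix (an assumption, not the upstream patch):
# start the gear explicitly after stopping it.
def handle_oom_fixed(gear)
  gear.stop
  gear.start
  true
end
```

With this model, `handle_oom_buggy(Gear.new)` returns `false` and leaves the gear in the `:stopped` state, which matches the platform.log excerpt below where only a stop is recorded.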

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:

Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:

Watchman logs two lines:

Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge
kill -9 94742
kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:

In addition to the above logs, we should see a restart attempt with a message like:

"watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})"

Comment 1 Brenton Leanhardt 2014-07-14 19:11:08 UTC
Upstream commits:

commit 115b72f8260e4fea936161d22d6e3dab8407e46e
Author: Andy Grimm <agrimm>
Date:   Thu Jun 5 10:37:34 2014 -0400

    Bug 1104902 - Fix several bugs in OOM Plugin app restarts
    
    Several code paths had not been properly tested in this code, and
    various typos and logic errors have been corrected.

commit 4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Author: Jhon Honce <jhonce>
Date:   Tue Jun 17 11:55:45 2014 -0700

    Bug 1104902 - Fix unit tests

Comment 4 Anping Li 2014-07-16 10:15:16 UTC
Verified and passed on puddle-2-1-2014-07-15.

Highlight: in puddle-2-1-2014-07-15, the oom_plugin was imported to OSE2.1z. oo-cgroup-disable/enable must be executed for all containers to enable this plugin.

1. rhc app create jbosseap jbosseap
2. Run the application so that it consumes memory until it runs out of memory.
3. Watch /var/log/messages; the task is killed by the kernel for exceeding its resource limit.

Jul 16 04:07:12 node kernel: Task in /openshift/53c62dd24cfeff6c83000001 killed as a result of limit of /openshift/53c62dd24cfeff6c83000001
Jul 16 04:07:12 node kernel: memory: usage 524224kB, limit 524288kB, failcnt 21135
Jul 16 04:07:12 node kernel: memory+swap: usage 626688kB, limit 626688kB, failcnt 27
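
When checking step 3 across many gears, the kernel lines above can be picked out of the log programmatically. The following is a minimal sketch, assuming the log lines have exactly the shape quoted here; `KILL_RE`, `USAGE_RE`, and `parse_oom_report` are hypothetical names and not part of any OpenShift tool.

```ruby
# Patterns for the kernel log lines quoted above (shape assumed from this report).
KILL_RE  = %r{killed as a result of limit of /openshift/(?<gear>\h+)}
USAGE_RE = /kernel: (?<kind>memory(?:\+swap)?): usage (?<usage>\d+)kB, limit (?<limit>\d+)kB, failcnt (?<failcnt>\d+)/

# Hypothetical helper: collect the gear UUID and the per-counter values
# from a run of log lines.
def parse_oom_report(lines)
  report = { gear: nil, counters: {} }
  lines.each do |line|
    if (m = KILL_RE.match(line))
      report[:gear] = m[:gear]
    elsif (m = USAGE_RE.match(line))
      report[:counters][m[:kind]] = {
        usage_kb: m[:usage].to_i,
        limit_kb: m[:limit].to_i,
        failcnt:  m[:failcnt].to_i
      }
    end
  end
  report
end
```

Feeding it the three lines above yields the gear UUID plus one entry each for `memory` and `memory+swap`, which makes it easy to see that the memory+swap usage had reached its limit.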


Verified on puddle-2-1-2014-07-15. Ran the same steps as above.

1) The OOM plugin works.
Jul 16 06:04:15 node dhclient[1016]: bound to 192.168.55.38 -- renewal in 51 seconds.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Found gear 53c64d4e4cfeff1e1b00003f under OOM.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Increasing memory for gear 53c64d4e4cfeff1e1b00003f to 705901363 and restarting


2) The gear was restarted by watchman.
July 16 06:04:41 INFO AdminGearsControl: initialized for gear(s) 53c64d4e4cfeff1e1b00003f
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
July 16 06:04:42 INFO 53c64d4e4cfeff1e1b00003f start against 'jbosseap'
July 16 06:05:07 INFO Shell command '/sbin/runuser -s /bin/sh 53c64d4e4cfeff1e1b00003f ******
Found 127.2.247.129:8080 listening port
Found 127.2.247.129:9999 listening port
~/jbosseap/standalone/deployments ~/jbosseap
~/jbosseap
Artifacts deployed: ./ROOT.war

July 16 06:05:07 INFO (20525) Starting gear 53c64d4e4cfeff1e1b00003f ... [ OK ]
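
The new limit logged by the OOM plugin (705901363) is consistent with a 10% increase over the 626688kB memory+swap limit that the kernel reported in the earlier log excerpt. This is an inference from the numbers, not something stated in this report, and it assumes both gears had the same limits. `bump_limit` is a hypothetical helper used only to show the arithmetic.

```ruby
# Hypothetical helper: raise a byte limit by a whole-number percentage,
# using integer arithmetic so the result is exact and truncated.
def bump_limit(limit_bytes, percent)
  limit_bytes * (100 + percent) / 100
end

memsw_limit_bytes = 626_688 * 1024   # 626688kB from the kernel log, in bytes
new_limit = bump_limit(memsw_limit_bytes, 10)
puts new_limit                       # prints 705901363
```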

Comment 6 errata-xmlrpc 2014-08-04 13:27:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html