+++ This bug was initially created as a clone of Bug #1104902 +++

Description of problem:
Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart. Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:
Watchman logs two lines:

Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
AdminGearsControl: initialized with timeout 360s
AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge
kill -9 94742
kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:
In addition to the above logs, we should see a restart attempt with a message like:

watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})
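
The failure mode in the description can be modeled with a small sketch. Everything here is hypothetical: the Gear class, restart_if_running, and both handlers are illustrative names, not code from openshift-origin-node-util, and the second handler is only one possible remedy, not necessarily what the upstream fix does.

```ruby
# Hypothetical model of the reported behavior. None of these names come from
# the openshift-origin-node-util source.
class Gear
  attr_reader :state

  def initialize
    @state = :started
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :started
  end

  def running?
    @state == :started
  end
end

# Models the restart semantics from the description: a gear that is not
# running is skipped, and the method reports that nothing was restarted.
def restart_if_running(gear)
  return false unless gear.running?

  gear.stop
  gear.start
  true
end

# Reported behavior: the gear is stopped first, so the restart is a no-op
# and the gear stays stopped.
def handle_oom_buggy(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible remedy (an assumption): start the gear explicitly after the
# stop instead of relying on a restart that skips stopped gears.
def handle_oom_with_explicit_start(gear)
  gear.stop
  gear.start
  true
end
```

Calling handle_oom_buggy on a fresh Gear returns false and leaves the gear in the :stopped state, which matches the "Actual results" above where the platform.log ends at "Stopping gear".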
Upstream commits:

commit 115b72f8260e4fea936161d22d6e3dab8407e46e
Author: Andy Grimm <agrimm>
Date: Thu Jun 5 10:37:34 2014 -0400

    Bug 1104902 - Fix several bugs in OOM Plugin app restarts

    Several code paths had not been properly tested in this code, and
    various typos and logic errors have been corrected.

commit 4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Author: Jhon Honce <jhonce>
Date: Tue Jun 17 11:55:45 2014 -0700

    Bug 1104902 - Fix unit tests
Verified; passes on puddle-2-1-2014-07-15.

Highlight: in puddle-2-1-2014-07-15, the oom_plugin was imported into OSE 2.1z. oo-cgroup-disable/enable must be executed for all containers to enable this plugin.

1. rhc app create jbosseap jbosseap
2. Run the application so that it consumes memory until it runs out of memory.
3. Watch /var/log/messages; the task is killed by the kernel for exceeding its resource limit.

Jul 16 04:07:12 node kernel: Task in /openshift/53c62dd24cfeff6c83000001 killed as a result of limit of /openshift/53c62dd24cfeff6c83000001
Jul 16 04:07:12 node kernel: memory: usage 524224kB, limit 524288kB, failcnt 21135
Jul 16 04:07:12 node kernel: memory+swap: usage 626688kB, limit 626688kB, failcnt 27

Verified on puddle-2-1-2014-07-15 by running the same steps as above.

1) The OOM plugin works.

Jul 16 06:04:15 node dhclient[1016]: bound to 192.168.55.38 -- renewal in 51 seconds.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Found gear 53c64d4e4cfeff1e1b00003f under OOM.
Jul 16 06:04:31 node watchman[24846]: OOM Plugin: Increasing memory for gear 53c64d4e4cfeff1e1b00003f to 705901363 and restarting

2) The gear was restarted by watchman.

July 16 06:04:41 INFO AdminGearsControl: initialized for gear(s) 53c64d4e4cfeff1e1b00003f
AdminGearsControl: initialized with timeout 360s
AdminGearsControl: initialized with 1 process per CPU
July 16 06:04:42 INFO 53c64d4e4cfeff1e1b00003f start against 'jbosseap'
July 16 06:05:07 INFO Shell command '/sbin/runuser -s /bin/sh 53c64d4e4cfeff1e1b00003f ******
Found 127.2.247.129:8080 listening port
Found 127.2.247.129:9999 listening port
~/jbosseap/standalone/deployments ~/jbosseap
~/jbosseap
Artifacts deployed: ./ROOT.war
July 16 06:05:07 INFO (20525) Starting gear 53c64d4e4cfeff1e1b00003f ... [ OK ]
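
The manual check above (an OOM detection in syslog followed by a "start against" entry in platform.log) could be scripted roughly as follows. This is only a sketch: the regular expressions are derived from the log lines quoted in this report, the function names are made up for illustration, and it does not compare timestamps, so a start logged before the OOM would still count.

```ruby
# Patterns inferred from the log lines quoted in this report (assumption:
# other versions may log different formats).
OOM_RE   = /OOM Plugin: Found gear (\S+) under OOM\./
START_RE = /(\S+) start against '/

# Gear IDs that watchman reported as under OOM.
def oomed_gears(syslog_lines)
  syslog_lines.map { |line| line[OOM_RE, 1] }.compact.uniq
end

# Gear IDs for which platform.log shows a start being issued.
def started_gears(platform_lines)
  platform_lines.map { |line| line[START_RE, 1] }.compact.uniq
end

# Gears that were detected as OOMed but never show a start entry.
def gears_left_stopped(syslog_lines, platform_lines)
  oomed_gears(syslog_lines) - started_gears(platform_lines)
end
```

Fed the lines from the verification above, gears_left_stopped returns an empty array; fed the "Actual results" logs from the original description, it returns the OOMed gear's ID.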
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0999.html