Bug 1104902 - Watchman OOM plugin fails to restart gears
Summary: Watchman OOM plugin fails to restart gears
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.x
Assignee: Andy Grimm
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks: 1105225
Reported: 2014-06-05 00:19 UTC by Andy Grimm
Modified: 2016-11-08 03:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1105225
Environment:
Last Closed: 2014-07-15 10:30:11 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2014-06-05 00:19:31 UTC
Description of problem:

Due to a somewhat last-minute change in the OOM plugin code, the plugin always stops an OOMed gear before calling restart.  Since watchman's restart only acts on a gear that is currently running, the restart is skipped and OOMed apps are simply left stopped.
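
To make the failure mode concrete, here is a minimal Ruby sketch of the sequence described above. It is a toy model only: the Gear class and the restart_if_running, handle_oom_buggy and handle_oom_fixed methods are hypothetical names invented for illustration and are not taken from origin-server. The "fixed" variant is just one possible correction and may not match what the eventual patch does.

```ruby
# Toy model of the reported behaviour. All names here are hypothetical
# and are not the actual watchman / OOM plugin API.
class Gear
  attr_reader :uuid, :state

  def initialize(uuid)
    @uuid = uuid
    @state = :started
  end

  def running?
    @state == :started
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :started
  end
end

# Models a restart that only acts on a running gear, as described above.
def restart_if_running(gear)
  return false unless gear.running?
  gear.stop
  gear.start
  true
end

# Buggy sequence: the gear is stopped first, so the guard in
# restart_if_running skips it and the gear stays stopped.
def handle_oom_buggy(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible correction (an assumption, not the confirmed fix):
# stop the gear, then start it unconditionally.
def handle_oom_fixed(gear)
  gear.stop
  gear.start
  true
end

buggy = Gear.new('XXXX')
handle_oom_buggy(buggy)
puts buggy.state   # prints "stopped"

fixed = Gear.new('XXXX')
handle_oom_fixed(fixed)
puts fixed.state   # prints "started"
```

In this model the explicit stop followed by a guarded restart is what leaves the gear down; either dropping the guard or issuing a plain start after the stop avoids it.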

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:

Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM
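
For step 1, any code that keeps allocating memory will do. The following is a minimal sketch in Ruby, assuming a cartridge that can run Ruby code; consume_memory and its parameters are hypothetical names chosen for illustration, and how the code is deployed inside an app is left open.

```ruby
# Hypothetical helper: allocate memory in fixed-size chunks and hold on to
# it until roughly max_mb megabytes have been allocated. chunk_mb, max_mb
# and pause are illustrative parameters, not OpenShift settings.
def consume_memory(max_mb:, chunk_mb: 10, pause: 0.1)
  chunks = []
  while chunks.size * chunk_mb < max_mb
    # 'x' * n builds and fills an n-byte string, so the memory is really used.
    chunks << ('x' * (chunk_mb * 1024 * 1024))
    sleep(pause)
  end
  # Return the number of megabytes held when the loop ended.
  chunks.size * chunk_mb
end
```

Calling consume_memory with max_mb set above the gear's memory limit should push the gear into OOM before the method returns.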

Actual results:

Watchman logs two lines:

Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun  4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
  AdminGearsControl: initialized with timeout 360s
  AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge
kill -9 94742
kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:

In addition to the above logs, we should see a restart attempt with a message like:

watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})

Comment 1 Jhon Honce 2014-06-18 16:23:00 UTC
Fixed in https://github.com/openshift/origin-server/pull/5478

Comment 2 openshift-github-bot 2014-06-18 17:14:11 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/115b72f8260e4fea936161d22d6e3dab8407e46e
Bug 1104902 - Fix several bugs in OOM Plugin app restarts

Several code paths had not been properly tested in this code, and
various typos and logic errors have been corrected.

https://github.com/openshift/origin-server/commit/4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Bug 1104902 - Fix unit tests

Comment 3 Meng Bo 2014-06-19 05:41:06 UTC
Checked on devenv_4890, issue has been fixed.

# tailf /var/log/messages | grep watch
Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Found gear 53a2a9575d98fb2313000057 under OOM.
Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Increasing memory for gear 53a2a9575d98fb2313000057 to 705901363 and restarting
Jun 19 05:28:02 ip-10-69-20-117 watchman[2036]: starting 53a2a9575d98fb2313000057
Jun 19 05:28:07 ip-10-69-20-117 watchman[2036]: watchman started user 53a2a9575d98fb2313000057: application perl1 (retries: 0)

# rhc app show perl1 --state
Cartridge perl-5.10 is started

