Description of problem:
Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart. Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:
Watchman logs two lines:

Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX
AdminGearsControl: initialized with timeout 360s
AdminGearsControl: initialized with 1 process per CPU
June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge kill -9 94742 kill -9 94521
June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:
In addition to the above logs, we should see a restart attempt with a message like:

watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})
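
For anyone trying to picture the failure mode, here is a minimal Ruby sketch of the ordering problem described above. It is an illustration only: `Gear`, `restart_if_running`, `handle_oom_buggy` and `handle_oom_stop_then_start` are hypothetical names made up for this comment and are not the actual watchman or OOM plugin code. It assumes only what the description states, namely that restart acts on a running gear and that the plugin stops the gear first.

```ruby
# Hypothetical stand-in for a gear; NOT the real origin-server class.
class Gear
  attr_reader :uuid, :state

  def initialize(uuid)
    @uuid = uuid
    @state = :started
  end

  def running?
    @state == :started
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :started
  end
end

# Assumption taken from the description: restart only acts on a running gear.
def restart_if_running(gear)
  return false unless gear.running?

  gear.stop
  gear.start
  true
end

# Buggy ordering: the gear is stopped first, so the guarded restart is a no-op.
def handle_oom_buggy(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible ordering that avoids the problem (an assumption, not
# necessarily what the real fix does): stop, then start explicitly.
def handle_oom_stop_then_start(gear)
  gear.stop
  gear.start
  true
end

buggy = Gear.new('XXXX')
handle_oom_buggy(buggy)
puts "buggy ordering: #{buggy.state}"            # prints "buggy ordering: stopped"

fixed = Gear.new('XXXX')
handle_oom_stop_then_start(fixed)
puts "stop-then-start ordering: #{fixed.state}"  # prints "stop-then-start ordering: started"
```

The point of the sketch is only that a "restart if running" guard can never succeed once the caller has already stopped the gear, which matches the platform.log above ending at "Stopping gear XXXX ... [ OK ]".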
Fixed in https://github.com/openshift/origin-server/pull/5478
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/115b72f8260e4fea936161d22d6e3dab8407e46e
Bug 1104902 - Fix several bugs in OOM Plugin app restarts

Several code paths had not been properly tested in this code, and various typos and logic errors have been corrected.

https://github.com/openshift/origin-server/commit/4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Bug 1104902 - Fix unit tests
Checked on devenv_4890, issue has been fixed.

# tailf /var/log/messages | grep watch
Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Found gear 53a2a9575d98fb2313000057 under OOM.
Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Increasing memory for gear 53a2a9575d98fb2313000057 to 705901363 and restarting
Jun 19 05:28:02 ip-10-69-20-117 watchman[2036]: starting 53a2a9575d98fb2313000057
Jun 19 05:28:07 ip-10-69-20-117 watchman[2036]: watchman started user 53a2a9575d98fb2313000057: application perl1 (retries: 0)

# rhc app show perl1 --state
Cartridge perl-5.10 is started