Bug 1104902
| Summary: | Watchman OOM plugin fails to restart gears | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Andy Grimm <agrimm> |
| Component: | Containers | Assignee: | Andy Grimm <agrimm> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.x | CC: | bmeng, jgoulding, jhonce, jokerman, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 2.x | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1105225 | Environment: | |
| Last Closed: | 2014-07-15 10:30:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1105225 | | |
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/115b72f8260e4fea936161d22d6e3dab8407e46e
Bug 1104902 - Fix several bugs in OOM Plugin app restarts

Several code paths had not been properly tested in this code, and various typos and logic errors have been corrected.

https://github.com/openshift/origin-server/commit/4a5e999a9561ff0aef01b203f360e6a2b87be0cc
Bug 1104902 - Fix unit tests

Checked on devenv_4890, issue has been fixed.

    # tailf /var/log/messages | grep watch
    Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Found gear 53a2a9575d98fb2313000057 under OOM.
    Jun 19 05:27:27 ip-10-69-20-117 watchman[2036]: OOM Plugin: Increasing memory for gear 53a2a9575d98fb2313000057 to 705901363 and restarting
    Jun 19 05:28:02 ip-10-69-20-117 watchman[2036]: starting 53a2a9575d98fb2313000057
    Jun 19 05:28:07 ip-10-69-20-117 watchman[2036]: watchman started user 53a2a9575d98fb2313000057: application perl1 (retries: 0)

    # rhc app show perl1 --state
    Cartridge perl-5.10 is started
Description of problem:
Due to a somewhat last-minute change in the OOM plugin code, it always stops an OOMed gear before calling restart. Since watchman's restart only restarts a running gear, the result is that OOMed apps are simply left stopped.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.24.5-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create an app that consumes too much memory
2. Wait for it to OOM

Actual results:
Watchman logs two lines:

    Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Found gear XXXX under OOM.
    Jun 4 18:53:06 ex-std-nodeNNN watchman[92335]: OOM Plugin: Increasing memory for gear XXXX to 705901363 and restarting

And in the platform.log, you'll see:

    June 04 18:53:07 INFO [] AdminGearsControl: initialized for gear(s) XXXX AdminGearsControl: initialized with timeout 360s AdminGearsControl: initialized with 1 process per CPU
    June 04 18:53:08 INFO [] XXXX stop against 'jbossews'
    June 04 18:53:09 INFO [] Shell command '/sbin/runuser -s /bin/sh XXXX -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c3,c118' /bin/sh -c \"set -e; /var/lib/openshift/XXXX/jbossews/bin/control stop \""' ran. rc=0 out=Stopping jbossews cartridge kill -9 94742 kill -9 94521
    June 04 18:53:21 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=0 out=
    June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
    June 04 18:53:22 INFO [] Shell command '/usr/bin/pkill -9 -u 3181' ran. rc=1 out=
    June 04 18:53:22 INFO [] Shell command '/usr/bin/pgrep -u 3181' ran. rc=1 out=
    June 04 18:53:22 INFO [] (98092) Stopping gear XXXX ... [ OK ]

Expected results:
In addition to the above logs, we should see a restart attempt with a message like:

    watchman restarted user #{uuid}: application #{gear_name(uuid)} (retries: #{retries})
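
The failure mode in the description (stop the gear, then call a restart that only acts on running gears) can be illustrated with a small Ruby model. This is only a sketch: `FakeGear`, `restart_if_running`, `handle_oom_buggy`, and `handle_oom_stop_then_start` are hypothetical names invented here, not origin-server code, and the "stop then start" variant is just one plausible remedy, not necessarily what the linked commits do.

```ruby
# Hypothetical stand-in for a gear; it tracks only a started/stopped state.
class FakeGear
  attr_reader :state

  def initialize
    @state = :started
  end

  def stop
    @state = :stopped
  end

  def start
    @state = :started
  end

  def running?
    @state == :started
  end
end

# Models a restart that is a no-op unless the gear is currently running,
# as the report says watchman's restart behaves.
def restart_if_running(gear)
  return false unless gear.running?
  gear.stop
  gear.start
  true
end

# Sequence described in the report: the gear is stopped first, so the
# subsequent restart finds nothing running and leaves the gear stopped.
def handle_oom_buggy(gear)
  gear.stop
  restart_if_running(gear)
end

# One possible remedy (an assumption): start the gear explicitly after
# stopping it instead of relying on a restart that requires a running gear.
def handle_oom_stop_then_start(gear)
  gear.stop
  gear.start
  true
end

buggy = FakeGear.new
handle_oom_buggy(buggy)

fixed = FakeGear.new
handle_oom_stop_then_start(fixed)

puts "buggy sequence leaves gear: #{buggy.state}"
puts "stop-then-start leaves gear: #{fixed.state}"
```

In this model the buggy sequence reports no restart and the gear ends up stopped, which matches the "Actual results" above where the log ends at "Stopping gear XXXX ... [ OK ]" with no restart message.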