Bug 1173246

Summary: watchman OOMPlugin should background pkill commands
Product: OpenShift Container Platform Reporter: Brenton Leanhardt <bleanhar>
Component: Containers Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED ERRATA QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.2.0 CC: agrimm, anli, jokerman, libra-bugs, libra-onpremise-devel, mmccomas, nicholas_schuetz, pruan, tiwillia
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-origin-node-util-1.32.4.1-1 Doc Type: Bug Fix
Doc Text:
Cause: Previously, watchman OOMPlugin waited for pkill to exit.
Consequence: Watchman would unnecessarily wait for pkill to exit, which may take a long time and block other tasks.
Fix: Watchman now backgrounds pkill tasks.
Result: Watchman will now continue processing other tasks while pkill operations are processed in the background.
Story Points: ---
Clone Of: 1171289 Environment:
Last Closed: 2015-01-08 15:34:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1171289    
Bug Blocks:    

Description Brenton Leanhardt 2014-12-11 18:18:06 UTC
+++ This bug was initially created as a clone of Bug #1171289 +++

Description of problem:
Watchman's OOMPlugin can hang indefinitely on pkill commands if an OOM gear is holding locks on kernel task objects.

Possible solutions:

* fork the ruby process and call app.kill_procs() in the child.
* Kernel.spawn the pkill command(s)

In either case, if we wait for them at all, it should be after the memory limit bump.  Maybe we don't even care and should run Process.detach on the PID.
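
A minimal sketch of the second option (Kernel.spawn plus Process.detach), for illustration only. The helper names spawn_detached and background_pkill are hypothetical and are not taken from the watchman source; the pkill arguments shown (signal and -u with a gear UID) are likewise an assumption about how the plugin targets a gear's processes.

```ruby
# Hypothetical helper: run a command without blocking the caller.
# Process.spawn returns as soon as the child is started; Process.detach
# starts a reaper thread so the child does not linger as a zombie.
def spawn_detached(*cmd)
  pid = Process.spawn(*cmd, in: File::NULL, out: File::NULL, err: File::NULL)
  Process.detach(pid) # returns a Thread; its value is the Process::Status
end

# Hypothetical wrapper: background a pkill for all processes owned by a
# gear's UID. Assumes the gear is identified by its UNIX user ID.
def background_pkill(uid, signal: 'KILL')
  spawn_detached('pkill', "-#{signal}", '-u', uid.to_s)
end
```

With this shape, the caller could start the pkill, bump the memory limit right away, and only afterwards decide whether to join the returned thread (optionally with a timeout) or ignore it entirely.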

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.31.3-1.el6oso.noarch

--- Additional comment from Andy Grimm on 2014-12-11 09:13:44 EST ---

PR for master is https://github.com/openshift/origin-server/pull/6010

It needs another [merge], as the first attempt failed tests.

The corresponding PR for stage has been merged, and should be tagged into a hotfix today.

Comment 3 Anping Li 2014-12-12 10:47:13 UTC
Verified; passes on puddle-2-2-2014-12-11.
1) Checked the code: safe_pkill was added in this puddle.
2) Created an app and increased the memory usage in the gear (perl -np -e '$x="0123456789"x1000000' < /dev/zero).
3) Waited a few minutes, then checked /var/log/messages; the log shows that the gear was killed.

Dec 12 03:43:33 ose2 watchman[29375]: OOM Plugin: Found gear 548ac1186bb25e95a50000de under OOM.
Dec 12 03:43:33 ose2 watchman[29375]: OOM Plugin: Increasing memory for gear 548ac1186bb25e95a50000de to 705901363 and killing processe

Comment 5 errata-xmlrpc 2015-01-08 15:34:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0019.html