Bug 1171289 - watchman OOMPlugin should background pkill commands
Summary: watchman OOMPlugin should background pkill commands
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 1.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.x
Assignee: Andy Grimm
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks: 1173246
 
Reported: 2014-12-05 20:21 UTC by Andy Grimm
Modified: 2019-03-22 07:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1173246
Environment:
Last Closed: 2015-07-07 23:49:24 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2014-12-05 20:21:51 UTC
Description of problem:
Watchman's OOMPlugin can hang indefinitely on pkill commands if an OOM gear is holding locks on kernel task objects.

Possible solutions:

* Fork the Ruby process and call app.kill_procs() in the child.
* Run the pkill command(s) with Kernel.spawn so they do not block.

In either case, if we wait for them at all, it should be after the memory limit bump.  Maybe we don't even care and should run Process.detach on the PID.
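
A minimal sketch of the spawn-and-detach option, to make the idea concrete. It is not the plugin's actual code: the helper names spawn_detached and wait_briefly are hypothetical, and a harmless `sleep` stands in for the real pkill invocation.

```ruby
# Hypothetical helper (not taken from oom_plugin.rb): start a command in the
# background and return immediately, so a hung command cannot block the caller.
def spawn_detached(*cmd)
  pid = Process.spawn(*cmd, in: File::NULL, out: File::NULL, err: File::NULL)
  # Process.detach returns a thread that reaps the child, so no zombie is left
  # behind even if nobody ever waits for it.
  Process.detach(pid)
end

# Optionally wait for the background command, but never longer than `seconds`.
# Returns the Process::Status, or nil if the command is still running.
def wait_briefly(waiter, seconds)
  waiter.join(seconds) ? waiter.value : nil
end

# Stand-in for the pkill command; `sleep 0.2` keeps the sketch harmless.
waiter = spawn_detached('sleep', '0.2')
# ... the memory limit bump would happen here, before any waiting ...
status = wait_briefly(waiter, 5)
puts status ? "exited with #{status.exitstatus}" : 'still running'
```

If the result is of no interest at all, the thread returned by Process.detach can simply be ignored.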

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.31.3-1.el6oso.noarch

Comment 1 Andy Grimm 2014-12-11 14:13:44 UTC
PR for master is https://github.com/openshift/origin-server/pull/6010

It needs another [merge], as the first attempt failed tests.

The corresponding PR for stage has been merged, and should be tagged into a hotfix today.

Comment 2 openshift-github-bot 2015-01-20 17:37:28 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3f92ecc1b1bc6b65b3db738db1b317a6aca7bfea
Bug 1171289 - background pkill command in OOMPlugin

Comment 3 Meng Bo 2015-02-03 07:31:33 UTC
Tested the new oom_plugin. Since the devenv-stage_1137 build failed, I copied the latest oom_plugin.rb from the origin-server repo to /etc/openshift/watchman/plugins/ and restarted openshift-watchman. The new OOM plugin is noticeably more efficient.

It handles the OOM kill and restarts the gear within a short period (10 seconds in this log, after a single memory increase):
Feb  3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Found gear 54d0b0a87e2fb21876000001 under OOM.
Feb  3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Increasing memory for gear 54d0b0a87e2fb21876000001 to 705901363 and killing processes
Feb  3 06:31:54 ip-10-171-100-122 watchman[11693]: watchman restarted user 54d0b0a87e2fb21876000001: application perl1 (retries: 0)


For comparison, here is the old plugin's result (20 seconds and three memory increases before the restart):
Feb  3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Found gear 54d0bd5bb0929a4b9c000001 under OOM.
Feb  3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 705901363 and killing processes
Feb  3 07:24:20 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 776491499 and killing processes
Feb  3 07:24:25 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 854140649 and killing processes
Feb  3 07:24:35 ip-10-147-36-222 watchman[2527]: watchman restarted user 54d0bd5bb0929a4b9c000001: application perl1 (retries: 0)
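
The two logs above can be compared with a short script. This is only a sketch: the helper names are hypothetical, the log lines are copied from this comment, and it assumes all lines of one log fall on the same day.

```ruby
# Hypothetical helper: seconds since midnight, taken from the first HH:MM:SS
# field in a syslog line.
def clock_seconds(line)
  h, m, s = line[/\d{2}:\d{2}:\d{2}/].split(':').map(&:to_i)
  h * 3600 + m * 60 + s
end

# Returns [seconds from "Found gear" to "restarted user", memory increases].
def summarize(log)
  lines = log.lines.map(&:strip).reject(&:empty?)
  found = lines.find { |l| l.include?('Found gear') }
  restarted = lines.find { |l| l.include?('restarted user') }
  bumps = lines.count { |l| l.include?('Increasing memory') }
  [clock_seconds(restarted) - clock_seconds(found), bumps]
end

NEW_LOG = <<~LOG
  Feb  3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Found gear 54d0b0a87e2fb21876000001 under OOM.
  Feb  3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Increasing memory for gear 54d0b0a87e2fb21876000001 to 705901363 and killing processes
  Feb  3 06:31:54 ip-10-171-100-122 watchman[11693]: watchman restarted user 54d0b0a87e2fb21876000001: application perl1 (retries: 0)
LOG

OLD_LOG = <<~LOG
  Feb  3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Found gear 54d0bd5bb0929a4b9c000001 under OOM.
  Feb  3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 705901363 and killing processes
  Feb  3 07:24:20 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 776491499 and killing processes
  Feb  3 07:24:25 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 854140649 and killing processes
  Feb  3 07:24:35 ip-10-147-36-222 watchman[2527]: watchman restarted user 54d0bd5bb0929a4b9c000001: application perl1 (retries: 0)
LOG

new_secs, new_bumps = summarize(NEW_LOG)
old_secs, old_bumps = summarize(OLD_LOG)
puts "new plugin: #{new_secs}s, #{new_bumps} memory increase(s)"   # 10s, 1
puts "old plugin: #{old_secs}s, #{old_bumps} memory increase(s)"   # 20s, 3
```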

Comment 4 Meng Bo 2015-02-11 03:18:40 UTC
@agrimm

The PR has been merged into master, could you help move the bug to ON_QA so that we can verify it?

Thanks.

Comment 5 Qixuan Wang 2015-06-24 07:44:54 UTC
Tested on devenv_5556: the OOM plugin detected the OOM, killed the process, and restarted the gear. Moving the bug to VERIFIED, thanks.

Here is the result: 

1. Create a gear and make it consume memory
# rhc app create rb20 ruby-2.0
# rhc ssh rb20
# perl -np -e '$x="0123456789"x1000000' < /dev/zero &

2. Check watchman log
[root@ip-10-169-88-220 ~]# tailf /var/log/messages
Jun 24 03:36:58 ip-10-169-88-220 watchman[2642]: OOM Plugin: Found gear 558a5c0e82fa275604000009 under OOM.
Jun 24 03:36:58 ip-10-169-88-220 watchman[2642]: OOM Plugin: Increasing memory for gear 558a5c0e82fa275604000009 to 705901363 and killing processes
Jun 24 03:37:04 ip-10-169-88-220 root[28200]: user-cron-jobs :START: minutely run of all scheduled jobs
Jun 24 03:37:04 ip-10-169-88-220 root[28208]: user-cron-jobs :END: minutely run of all scheduled jobs
Jun 24 03:37:09 ip-10-169-88-220 watchman[2642]: watchman restarted user 558a5c0e82fa275604000009: application rb20 (retries: 0)

