Description of problem:
Watchman's OOMPlugin can hang indefinitely on pkill commands if an OOM gear is holding locks on kernel task objects.

Possible solutions:
* Fork the Ruby process and call app.kill_procs() in the child.
* Kernel.spawn the pkill command(s).

In either case, if we wait for them at all, it should be after the memory limit bump. Maybe we don't even care and should run Process.detach on the PID.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.31.3-1.el6oso.noarch
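To make the second option concrete, here is a minimal sketch of spawning pkill without blocking on it. It is not the actual OOMPlugin code: the helper names spawn_detached and kill_gear_procs_in_background are hypothetical, and the pkill arguments (-9 -u <uid>) are an assumption about how the gear's processes would be targeted.

```ruby
# Hypothetical helper: run a command without waiting for it to finish.
# Process.spawn returns the child's PID immediately, so a command that
# is stuck in the kernel cannot block the calling thread.
def spawn_detached(*argv)
  pid = Process.spawn(*argv, out: File::NULL, err: File::NULL)
  # Process.detach starts a reaper thread so the child does not linger
  # as a zombie; the returned thread can be ignored or joined later.
  Process.detach(pid)
end

# Hypothetical wrapper; assumes the gear's processes run under `uid`
# and that SIGKILL via pkill is the desired action.
def kill_gear_procs_in_background(uid)
  spawn_detached("pkill", "-9", "-u", uid.to_s)
end
```

With this shape, the caller could start the kill, bump the memory limit, and only then decide whether to join the returned thread (ideally with a timeout) or drop it entirely.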
PR for master is https://github.com/openshift/origin-server/pull/6010

It needs another [merge], as the first attempt failed tests. The corresponding PR for stage has been merged, and should be tagged into a hotfix today.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3f92ecc1b1bc6b65b3db738db1b317a6aca7bfea
Bug 1171289 - background pkill command in OOMPlugin
Tested the new oom_plugin. Since the devenv-stage_1137 build failed, I copied the latest oom_plugin.rb from the origin-server repo to /etc/openshift/watchman/plugins/ and restarted openshift-watchman. The new OOM plugin is noticeably more efficient: it handles the OOM kill and restarts the gear after a single memory increase, about 10 seconds after detection.

Feb 3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Found gear 54d0b0a87e2fb21876000001 under OOM.
Feb 3 06:31:44 ip-10-171-100-122 watchman[11693]: OOM Plugin: Increasing memory for gear 54d0b0a87e2fb21876000001 to 705901363 and killing processes
Feb 3 06:31:54 ip-10-171-100-122 watchman[11693]: watchman restarted user 54d0b0a87e2fb21876000001: application perl1 (retries: 0)

For comparison, here is the result with the old plugin, which needed three memory increases and about 20 seconds:

Feb 3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Found gear 54d0bd5bb0929a4b9c000001 under OOM.
Feb 3 07:24:15 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 705901363 and killing processes
Feb 3 07:24:20 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 776491499 and killing processes
Feb 3 07:24:25 ip-10-147-36-222 watchman[2527]: OOM Plugin: Increasing memory for gear 54d0bd5bb0929a4b9c000001 to 854140649 and killing processes
Feb 3 07:24:35 ip-10-147-36-222 watchman[2527]: watchman restarted user 54d0bd5bb0929a4b9c000001: application perl1 (retries: 0)
@agrimm The PR has been merged into master. Could you move the bug to ON_QA so that we can verify it? Thanks.
Tested on devenv_5556. The OOM plugin detects the OOM condition, kills the processes, and restarts the gear. Moving the bug to VERIFIED, thanks.

Here is the result:

1. Create a gear and make it consume memory
# rhc app create rb20 ruby-2.0
# rhc ssh rb20
# perl -np -e '$x="0123456789"x1000000' < /dev/zero &

2. Check watchman log
[root@ip-10-169-88-220 ~]# tailf /var/log/messages
Jun 24 03:36:58 ip-10-169-88-220 watchman[2642]: OOM Plugin: Found gear 558a5c0e82fa275604000009 under OOM.
Jun 24 03:36:58 ip-10-169-88-220 watchman[2642]: OOM Plugin: Increasing memory for gear 558a5c0e82fa275604000009 to 705901363 and killing processes
Jun 24 03:37:04 ip-10-169-88-220 root[28200]: user-cron-jobs :START: minutely run of all scheduled jobs
Jun 24 03:37:04 ip-10-169-88-220 root[28208]: user-cron-jobs :END: minutely run of all scheduled jobs
Jun 24 03:37:09 ip-10-169-88-220 watchman[2642]: watchman restarted user 558a5c0e82fa275604000009: application rb20 (retries: 0)