+++ This bug was initially created as a clone of Bug #1091433 +++

Description of problem:
Sometime in the past couple of releases, watchman went from consuming a little under 10% of a CPU to somewhere in the 20-30% range. As I understand it from looking at our configs, we are using the new gear state plugin, but the metrics plugin is not enabled. I have not looked for a root cause yet, nor have I tried disabling individual plugins.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.22.6-1.el6oso.noarch

How reproducible:
Always (at least, it appears pretty consistent across our nodes)

Steps to Reproduce:
1. Create a node with hundreds of gears (500 should be sufficient)
2. Run watchman for a while
3. Check CPU usage using "ps auxww --cumulative | grep watchman". The third column shows the percentage of CPU used by watchman and its child processes.

Actual results:
CPU usage is over 20%

Expected results:
Less than that. :)

--- Additional comment from Jhon Honce on 2014-05-06 15:58:32 EDT ---

Added element STATE_CHECK_PERIOD to /etc/sysconfig/watchman to allow detuning of state checks.

https://github.com/openshift/origin-server/pull/5383

--- Additional comment from openshift-github-bot on 2014-05-06 16:53:59 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/c84642a6f0c03af10fad08c6064f686f74e2dedf
Bug 1091433 - Add setting to detune GearStatePlugin

* Add sysconfig/watchman element STATE_CHECK_PERIOD to control frequency of running GearStatePlugin

--- Additional comment from Yan Du on 2014-05-07 05:43:46 EDT ---

Tested on devenv_4769; STATE_CHECK_PERIOD takes effect for watchman.

Steps:
1. Set the following in /etc/sysconfig/watchman and restart watchman:
   STATE_CHANGE_DELAY=60
   STATE_CHECK_PERIOD=60
2. Change the gear state and check the syslog; the gear state change info appears in the syslog after about 2 min.
3. Check the CPU usage; it is lower than 20%.

Moving bug to verified.
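For anyone repeating the verification above, here is a minimal sketch for confirming which values are set in the watchman sysconfig file before restarting the service. It assumes the file uses plain shell-style KEY=VALUE lines with optional "#" comments; the function name parse_sysconfig and the sample text are hypothetical and are not part of watchman.

```python
def parse_sysconfig(text):
    """Parse shell-style KEY=VALUE lines into a dict of strings.

    Assumption: simple assignments only, no shell expansion, no
    continuation lines. Blank lines and '#' comments are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


# Hypothetical file content mirroring the settings used in the test above.
SAMPLE_SYSCONFIG = """\
# detune the gear state checks
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
"""

settings = parse_sysconfig(SAMPLE_SYSCONFIG)
missing = [k for k in ("STATE_CHANGE_DELAY", "STATE_CHECK_PERIOD") if k not in settings]
```

On a node, the text would be read from /etc/sysconfig/watchman; an empty "missing" list means both elements are present.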
We should pull in this upstream PR too: https://github.com/openshift/origin-server/pull/5418/files
These are two additional pull requests that ship important updates for watchman: https://github.com/openshift/origin-server/pull/5429 https://github.com/openshift/origin-server/pull/5437
When the OOM plugin is backported we should consider pulling in https://github.com/openshift/origin-server/pull/5494 as well.
Upstream commits:

commit c84642a6f0c03af10fad08c6064f686f74e2dedf
Author: Jhon Honce <jhonce>
Date:   Tue May 6 08:40:56 2014 -0700

    Bug 1091433 - Add setting to detune GearStatePlugin

    * Add sysconfig/watchman element STATE_CHECK_PERIOD to control
      frequency of running GearStatePlugin

commit dbc9cfadb7c82eba7b17638e7f79e2c0a01bdf8e
Author: Jhon Honce <jhonce>
Date:   Thu May 15 11:41:36 2014 -0700

    Bug 1097959 - Add THROTTLER_CHECK_PERIOD to detune Throttler

    * Add THROTTLER_CHECK_PERIOD element to /etc/sysconfig/watchman to
      allow Operator to set period for checking cgroup counters

commit 6188dd63856e048aa51071e059618141ce13fd04
Author: Andy Grimm <agrimm>
Date:   Mon May 12 16:05:30 2014 -0400

    Introduce oom plugin and disable syslog plugin

    The oom plugin is improves handling of out-of-memory conditions in
    gears by dynamically adjusting a cgroup's memory limit while
    cleaning up its tasks.

commit efec8b5f07988f3e95de5b5c54aae380b0879b98
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:22:57 2014 -0400

    Remove an incorrect comment line in oom_plugin

commit a43a0d461974087568d3e7e60f61e890a1e9b0d1
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:25:30 2014 -0400

    Disable OOM kills for gear cgroups

commit ba9636528748d0cb24b455e102b9f3098072c7c6
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:31:20 2014 -0400

    Add OOM_CHECK_PERIOD to oo-watchman man page

commit 322cb2dacc7c8cc3c1cbbb35fc2e98248a8a5d61
Author: Jhon Honce <jhonce>
Date:   Wed May 21 16:00:11 2014 -0700

    WIP Node Platform - Skip syslog_plugin test if it has been disabled
Verified; passes on puddle-2-1-2014-07-15.

CPU usage dropped after updating to puddle-2-1-2014-07-15, and the configured values also take effect.

1) On the OSE GA build, watchman consumes 42% CPU time.

[root@node ~]# ps auxww --cumulative | grep watchman
root      23276   42  0.1 13263832 184336 ?     Sl   17:17   8:31 watchman
root     110942  0.0  0.0 103256    856 pts/1   S+   17:25   0:00 grep watchman

2) On puddle-2-1-2014-07-15, only 11.5% CPU time.

[root@node ~]# ps auxww --cumulative | grep watchman
root       2683 11.5  0.3 13001500 163292 ?     Sl   20:05  12:45 watchman
root      18410  0.0  0.0 103256    888 pts/1   S+   21:55   0:00 grep watchman

3) After adding the following configuration, 10.8% CPU time.

STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60

[root@node ~]# ps auxww --cumulative | grep watchman
root      10021 10.8  0.1 12905248 82308 ?      Sl   22:00   0:56 watchman
root      24596  0.0  0.0 103256    852 pts/2   S+   22:08   0:00 grep watchman
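The CPU check in the steps above can also be scripted. This is a rough sketch, assuming the standard 11-column "ps aux" layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) and that the process command begins with "watchman" as in the output quoted here. The function name watchman_cpu is hypothetical; the sample text is copied from the GA build measurement.

```python
def watchman_cpu(ps_output):
    """Return (pid, %CPU) pairs for watchman rows in `ps aux`-style output.

    The third column is %CPU. Rows whose command merely mentions
    watchman (such as the grep itself) are ignored.
    """
    results = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # COMMAND stays intact as the 11th field
        if len(fields) < 11:
            continue
        try:
            pid = int(fields[1])
            cpu = float(fields[2])
        except ValueError:
            continue  # header row or malformed line
        program = fields[10].split()[0].rsplit("/", 1)[-1]
        if program == "watchman":
            results.append((pid, cpu))
    return results


# Output quoted from the OSE GA build measurement in this comment.
GA_BUILD_OUTPUT = """\
root 23276 42 0.1 13263832 184336 ? Sl 17:17 8:31 watchman
root 110942 0.0 0.0 103256 856 pts/1 S+ 17:25 0:00 grep watchman
"""

ga_usage = watchman_cpu(GA_BUILD_OUTPUT)
```

Matching on the first word of the command, rather than searching the whole line, avoids the need to filter the "grep watchman" row by hand.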
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0999.html