Bug 1096863 - watchman consumes too much CPU
Summary: watchman consumes too much CPU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Brenton Leanhardt
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 1091433 1097959
Blocks: 1105225
 
Reported: 2014-05-12 14:33 UTC by Brenton Leanhardt
Modified: 2014-08-04 13:27 UTC (History)

Fixed In Version: openshift-origin-node-util-1.22.11.1-1.el6op
Doc Type: Bug Fix
Doc Text:
Previously, Watchman's frequency for checking gear state was hard-coded in the tool, and it could consume too much CPU as a result. This bug fix adds many additional configuration parameters along with documentation to the /etc/sysconfig/watchman file, and administrators now have access to more tuning options when using Watchman.
Clone Of: 1091433
Environment:
Last Closed: 2014-08-04 13:27:06 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1100766 | 0 | high | CLOSED | watchman throttler's math is wrong | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2014:0999 | 0 | normal | SHIPPED_LIVE | Red Hat OpenShift Enterprise 2.1.4 bug fix and enhancement update | 2014-08-04 17:26:43 UTC

Internal Links: 1100766

Description Brenton Leanhardt 2014-05-12 14:33:43 UTC
+++ This bug was initially created as a clone of Bug #1091433 +++

Description of problem:

Sometime in the past couple of releases, watchman went from consuming a little under 10% of a CPU to somewhere in the 20-30% range.  As I understand it from looking at our configs, we are using the new gear state plugin, but the metrics plugin is not enabled.  I have not looked for a root cause yet, nor have I tried disabling individual plugins.

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.22.6-1.el6oso.noarch

How reproducible:

Always (at least, it appears pretty consistent across our nodes)

Steps to Reproduce:
1. Create a node with hundreds of gears (500 should be sufficient)
2. Run watchman for a while
3. Check CPU usage using "ps auxww --cumulative | grep watchman".  The third column shows the percentage of CPU used by watchman and its child processes.

Actual results:

CPU usage is over 20%

Expected results:

Less than that.  :)
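
The check in step 3 can be scripted. The following is a minimal Python sketch, not part of watchman or of the fix for this bug. It assumes the standard `ps aux` column layout (USER, PID, %CPU, ...), and the helper name `watchman_cpu_percent` is hypothetical. It takes `ps` output as text and returns the %CPU value of each watchman process, skipping the `grep` line. The sample text is the output quoted later in this report.

```python
def watchman_cpu_percent(ps_output):
    """Return {pid: %CPU} for watchman processes found in `ps aux`-style text.

    Assumed column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    usage = {}
    for line in ps_output.splitlines():
        # Split into at most 11 columns so COMMAND keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # blank or malformed line
        command = fields[10].strip()
        # Keep the process itself, not the "grep watchman" that searched for it.
        if command.split()[0] != "watchman":
            continue
        usage[int(fields[1])] = float(fields[2])
    return usage


SAMPLE = """\
root      23276  42  0.1 13263832 184336 ?     Sl   17:17   8:31 watchman
root     110942  0.0  0.0 103256   856 pts/1    S+   17:25   0:00 grep watchman
"""

for pid, cpu in watchman_cpu_percent(SAMPLE).items():
    print(f"watchman pid {pid}: {cpu}% CPU")
```

On a node, the same function could be fed the real output of `ps auxww --cumulative` instead of the sample text.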

--- Additional comment from Jhon Honce on 2014-05-06 15:58:32 EDT ---

Added element STATE_CHECK_PERIOD to /etc/sysconfig/watchman to allow detuning of state checks.

https://github.com/openshift/origin-server/pull/5383

--- Additional comment from openshift-github-bot on 2014-05-06 16:53:59 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/c84642a6f0c03af10fad08c6064f686f74e2dedf
Bug 1091433 - Add setting to detune GearStatePlugin

* Add sysconfig/watchman element STATE_CHECK_PERIOD to control
  frequency of running GearStatePlugin

--- Additional comment from Yan Du on 2014-05-07 05:43:46 EDT ---

Tested on devenv_4769; STATE_CHECK_PERIOD takes effect for watchman.

steps:
1. Set the following in /etc/sysconfig/watchman and restart watchman:
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
2. Change a gear's state and check the syslog; the gear state change is logged after about 2 min.
3. Check the CPU usage; it is lower than 20%.

Move bug to verified.
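
The "about 2 min" observation is consistent with the two settings adding up. The sketch below rests on an assumption this report does not confirm: a state change is noticed at the next check, at most STATE_CHECK_PERIOD seconds later, and is then acted on after a further STATE_CHANGE_DELAY seconds. The helper names `parse_sysconfig` and `worst_case_latency` are hypothetical.

```python
def parse_sysconfig(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def worst_case_latency(settings):
    """Assumed model: up to one check period to notice, plus the change delay."""
    return int(settings["STATE_CHECK_PERIOD"]) + int(settings["STATE_CHANGE_DELAY"])


config = """
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
"""
settings = parse_sysconfig(config)
print(worst_case_latency(settings))  # 120 seconds, i.e. about 2 minutes
```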

Comment 1 Brenton Leanhardt 2014-05-16 12:32:48 UTC
We should pull in this upstream PR too: https://github.com/openshift/origin-server/pull/5418/files

Comment 2 Brenton Leanhardt 2014-05-22 12:54:30 UTC
These are two additional pull requests that ship important updates for watchman:

https://github.com/openshift/origin-server/pull/5429
https://github.com/openshift/origin-server/pull/5437

Comment 3 Brenton Leanhardt 2014-06-11 15:19:43 UTC
When the OOM plugin is backported we should consider pulling in https://github.com/openshift/origin-server/pull/5494 as well.

Comment 4 Brenton Leanhardt 2014-07-14 19:10:09 UTC
Upstream commits:

commit c84642a6f0c03af10fad08c6064f686f74e2dedf
Author: Jhon Honce <jhonce>
Date:   Tue May 6 08:40:56 2014 -0700

    Bug 1091433 - Add setting to detune GearStatePlugin
    
    * Add sysconfig/watchman element STATE_CHECK_PERIOD to control
      frequency of running GearStatePlugin

commit dbc9cfadb7c82eba7b17638e7f79e2c0a01bdf8e
Author: Jhon Honce <jhonce>
Date:   Thu May 15 11:41:36 2014 -0700

    Bug 1097959 - Add THROTTLER_CHECK_PERIOD to detune Throttler
    
    * Add THROTTLER_CHECK_PERIOD element to /etc/sysconfig/watchman to
      allow Operator to set period for checking cgroup counters

commit 6188dd63856e048aa51071e059618141ce13fd04
Author: Andy Grimm <agrimm>
Date:   Mon May 12 16:05:30 2014 -0400

    Introduce oom plugin and disable syslog plugin
    
    The oom plugin improves handling of out-of-memory conditions
    in gears by dynamically adjusting a cgroup's memory limit while
    cleaning up its tasks.

commit efec8b5f07988f3e95de5b5c54aae380b0879b98
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:22:57 2014 -0400

    Remove an incorrect comment line in oom_plugin

commit a43a0d461974087568d3e7e60f61e890a1e9b0d1
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:25:30 2014 -0400

    Disable OOM kills for gear cgroups

commit ba9636528748d0cb24b455e102b9f3098072c7c6
Author: Andy Grimm <agrimm>
Date:   Tue May 20 15:31:20 2014 -0400

    Add OOM_CHECK_PERIOD to oo-watchman man page

commit 322cb2dacc7c8cc3c1cbbb35fc2e98248a8a5d61
Author: Jhon Honce <jhonce>
Date:   Wed May 21 16:00:11 2014 -0700

    WIP Node Platform - Skip syslog_plugin test if it has been disabled
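
The *_CHECK_PERIOD settings introduced by these commits share one idea: run a plugin less often so that it spends less CPU. The Python sketch below illustrates that idea only. It is not watchman's implementation, and the class, function, and plugin names as well as the 5-second and 60-second periods are invented for illustration.

```python
class PeriodicPlugin:
    """Wraps a callable so it runs at most once per `period` seconds."""

    def __init__(self, name, period, action):
        self.name = name
        self.period = period
        self.action = action
        self.next_run = 0  # due immediately on the first tick

    def tick(self, now):
        """Run the action if its period has elapsed; return True if it ran."""
        if now < self.next_run:
            return False
        self.action()
        self.next_run = now + self.period
        return True


def simulate(plugins, duration, step=1):
    """Advance a fake clock and count how often each plugin runs."""
    runs = {p.name: 0 for p in plugins}
    now = 0
    while now < duration:
        for p in plugins:
            if p.tick(now):
                runs[p.name] += 1
        now += step
    return runs


# Hypothetical periods: over 10 minutes, the 60-second plugin runs
# one twelfth as often as the 5-second one.
plugins = [
    PeriodicPlugin("gear_state_fast", 5, lambda: None),
    PeriodicPlugin("gear_state_detuned", 60, lambda: None),
]
print(simulate(plugins, duration=600))
```

The trade-off is latency: a longer period means a change can go unnoticed for up to one full period.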

Comment 7 Anping Li 2014-07-16 14:24:03 UTC
Verified; passes on puddle-2-1-2014-07-15.

CPU usage dropped after updating to puddle-2-1-2014-07-15, and the configured values also take effect.

1) On the OSE GA build, watchman consumes 42% of CPU time.
[root@node ~]# ps auxww --cumulative | grep watchman
root      23276  42  0.1 13263832 184336 ?     Sl   17:17   8:31 watchman                                 
root     110942  0.0  0.0 103256   856 pts/1    S+   17:25   0:00 grep watchman

2) On puddle-2-1-2014-07-15, only 11.5% of CPU time.
[root@node ~]# ps auxww --cumulative | grep watchman
root      2683 11.5  0.3 13001500 163292 ?     Sl   20:05  12:45 watchman        
root     18410  0.0  0.0 103256   888 pts/1    S+   21:55   0:00 grep watchman

3) After adding the following configuration:
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
[root@node ~]# ps auxww --cumulative | grep watchman
root     10021 10.8  0.1 12905248 82308 ?      Sl   22:00   0:56 watchman                                   
root     24596  0.0  0.0 103256   852 pts/2    S+   22:08   0:00 grep watchman
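
To put the three measurements side by side, the short calculation below computes the relative drop in %CPU between them. The figures are the ones shown in the `ps` output above; the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


ga_build = 42.0      # OSE GA build
puddle = 11.5        # puddle-2-1-2014-07-15, before the extra settings
puddle_tuned = 10.8  # puddle with STATE_CHANGE_DELAY=60 and STATE_CHECK_PERIOD=60

print(f"GA -> puddle:           {relative_reduction(ga_build, puddle):.1f}%")
print(f"puddle -> tuned puddle: {relative_reduction(puddle, puddle_tuned):.1f}%")
```

By this arithmetic the update accounts for a drop of roughly 73%, and the extra settings for about 6% more. These are single samples from processes with different run times, so the smaller difference in particular should be read with caution.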

Comment 9 errata-xmlrpc 2014-08-04 13:27:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html

