Bug 1096863 - watchman consumes too much CPU
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Assigned To: Brenton Leanhardt
QA Contact: libra bugs
Keywords: Upstream
Depends On: 1091433 1097959
Blocks: 1105225
Reported: 2014-05-12 10:33 EDT by Brenton Leanhardt
Modified: 2014-08-04 09:27 EDT

See Also:
Fixed In Version: openshift-origin-node-util-1.22.11.1-1.el6op
Doc Type: Bug Fix
Doc Text:
Previously, Watchman's frequency for checking gear state was hard-coded in the tool, and it could consume too much CPU as a result. This bug fix adds many additional configuration parameters along with documentation to the /etc/sysconfig/watchman file, and administrators now have access to more tuning options when using Watchman.
Story Points: ---
Clone Of: 1091433
Environment:
Last Closed: 2014-08-04 09:27:06 EDT
Type: Bug

External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2014:0999
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise 2.1.4 bug fix and enhancement update
Last Updated: 2014-08-04 13:26:43 EDT

Description Brenton Leanhardt 2014-05-12 10:33:43 EDT
+++ This bug was initially created as a clone of Bug #1091433 +++

Description of problem:

Sometime in the past couple of releases, watchman went from consuming a little under 10% of a CPU to somewhere in the 20-30% range.  As I understand it from looking at our configs, we are using the new gear state plugin, but the metrics plugin is not enabled.  I have not looked for a root cause yet, nor have I tried disabling individual plugins.

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.22.6-1.el6oso.noarch

How reproducible:

Always (at least, it appears pretty consistent across our nodes)

Steps to Reproduce:
1. Create a node with hundreds of gears (500 should be sufficient)
2. Run watchman for a while
3. Check CPU usage using "ps auxww --cumulative | grep watchman".  The third column shows the percentage of CPU used by watchman and its child processes.

Actual results:

CPU usage is over 20%

Expected results:

Less than that.  :)
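
The check in step 3 can also be scripted. The following is a minimal Python sketch, not part of watchman or of the original report. It assumes the usual 11-column "ps aux" layout (USER, PID, %CPU, ..., COMMAND) and that the process shows up with the command name "watchman"; the helper name watchman_cpu is hypothetical.

```python
def watchman_cpu(ps_output, name="watchman"):
    """Return (pid, %CPU) pairs for processes whose command is `name`.

    `ps_output` is the text printed by "ps auxww --cumulative".
    """
    matches = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; COMMAND is last and may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        command = fields[10].split()
        # Compare the first word only, so "grep watchman" is not counted.
        if command and command[0] == name:
            matches.append((int(fields[1]), float(fields[2])))
    return matches
```

Feeding it the captured output of "ps auxww --cumulative" yields one pair per watchman process, which is easier to log or alert on than the raw grep output.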

--- Additional comment from Jhon Honce on 2014-05-06 15:58:32 EDT ---

Added element STATE_CHECK_PERIOD to /etc/sysconfig/watchman to allow detuning of state checks.

https://github.com/openshift/origin-server/pull/5383

--- Additional comment from openshift-github-bot on 2014-05-06 16:53:59 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/c84642a6f0c03af10fad08c6064f686f74e2dedf
Bug 1091433 - Add setting to detune GearStatePlugin

* Add sysconfig/watchman element STATE_CHECK_PERIOD to control
  frequency of running GearStatePlugin
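
To illustrate what "detuning" with a check period means, here is a minimal Python sketch of period gating. It is an assumption-laden illustration, not the GearStatePlugin code from the commit above: the class name PeriodicCheck and its methods are hypothetical, and the period is simply the number of seconds that must pass between two runs of an expensive check.

```python
import time


class PeriodicCheck:
    """Run `check` at most once every `period` seconds."""

    def __init__(self, check, period, clock=time.monotonic):
        self.check = check
        self.period = period
        self.clock = clock      # injectable so the gate can be tested
        self.last_run = None

    def tick(self):
        """Call from the main loop; returns True if the check ran."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.period:
            return False  # still inside the period, skip the expensive work
        self.last_run = now
        self.check()
        return True
```

With a period of 60, a main loop that calls tick() every few seconds runs the check roughly once a minute, which trades slower detection of state changes for lower CPU usage.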

--- Additional comment from Yan Du on 2014-05-07 05:43:46 EDT ---

Tested on devenv_4769; STATE_CHECK_PERIOD takes effect for watchman.

Steps:
1. Set the following in /etc/sysconfig/watchman and restart watchman:
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
2. Change the gear state and check the syslog; the gear state change info appears in syslog after about 2 minutes.
3. Check the CPU usage; it is lower than 20%.

Move bug to verified.
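
For tooling that needs to confirm which values are set before restarting watchman, a sysconfig-style file can be read with a few lines of Python. This is a sketch under stated assumptions: it treats the file as plain KEY=VALUE lines with "#" comments, the function name read_sysconfig is hypothetical, and it does not reflect how watchman itself loads its settings.

```python
def read_sysconfig(text, defaults=None):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = dict(defaults or {})
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes; values stay strings.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings
```

Passing the contents of /etc/sysconfig/watchman returns a dict in which STATE_CHANGE_DELAY and STATE_CHECK_PERIOD can be looked up and converted with int().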
Comment 1 Brenton Leanhardt 2014-05-16 08:32:48 EDT
We should pull in this upstream PR too: https://github.com/openshift/origin-server/pull/5418/files
Comment 2 Brenton Leanhardt 2014-05-22 08:54:30 EDT
These are two additional pull requests that ship important updates for watchman:

https://github.com/openshift/origin-server/pull/5429
https://github.com/openshift/origin-server/pull/5437
Comment 3 Brenton Leanhardt 2014-06-11 11:19:43 EDT
When the OOM plugin is backported we should consider pulling in https://github.com/openshift/origin-server/pull/5494 as well.
Comment 4 Brenton Leanhardt 2014-07-14 15:10:09 EDT
Upstream commits:

commit c84642a6f0c03af10fad08c6064f686f74e2dedf
Author: Jhon Honce <jhonce@redhat.com>
Date:   Tue May 6 08:40:56 2014 -0700

    Bug 1091433 - Add setting to detune GearStatePlugin
    
    * Add sysconfig/watchman element STATE_CHECK_PERIOD to control
      frequency of running GearStatePlugin

commit dbc9cfadb7c82eba7b17638e7f79e2c0a01bdf8e
Author: Jhon Honce <jhonce@redhat.com>
Date:   Thu May 15 11:41:36 2014 -0700

    Bug 1097959 - Add THROTTLER_CHECK_PERIOD to detune Throttler
    
    * Add THROTTLER_CHECK_PERIOD element to /etc/sysconfig/watchman to
      allow Operator to set period for checking cgroup counters

commit 6188dd63856e048aa51071e059618141ce13fd04
Author: Andy Grimm <agrimm@redhat.com>
Date:   Mon May 12 16:05:30 2014 -0400

    Introduce oom plugin and disable syslog plugin
    
    The oom plugin improves handling of out-of-memory conditions
    in gears by dynamically adjusting a cgroup's memory limit while
    cleaning up its tasks.

commit efec8b5f07988f3e95de5b5c54aae380b0879b98
Author: Andy Grimm <agrimm@redhat.com>
Date:   Tue May 20 15:22:57 2014 -0400

    Remove an incorrect comment line in oom_plugin

commit a43a0d461974087568d3e7e60f61e890a1e9b0d1
Author: Andy Grimm <agrimm@redhat.com>
Date:   Tue May 20 15:25:30 2014 -0400

    Disable OOM kills for gear cgroups

commit ba9636528748d0cb24b455e102b9f3098072c7c6
Author: Andy Grimm <agrimm@redhat.com>
Date:   Tue May 20 15:31:20 2014 -0400

    Add OOM_CHECK_PERIOD to oo-watchman man page

commit 322cb2dacc7c8cc3c1cbbb35fc2e98248a8a5d61
Author: Jhon Honce <jhonce@redhat.com>
Date:   Wed May 21 16:00:11 2014 -0700

    WIP Node Platform - Skip syslog_plugin test if it has been disabled
Comment 7 Anping Li 2014-07-16 10:24:03 EDT
Verified; passes on puddle-2-1-2014-07-15.

CPU usage dropped after updating to puddle-2-1-2014-07-15, and the configured values also take effect.

1) On the OSE GA build, watchman consumes 42% CPU.
[root@node ~]# ps auxww --cumulative | grep watchman
root      23276  42  0.1 13263832 184336 ?     Sl   17:17   8:31 watchman                                 
root     110942  0.0  0.0 103256   856 pts/1    S+   17:25   0:00 grep watchman

2) On puddle-2-1-2014-07-15, only 11.5% CPU.
[root@node ~]# ps auxww --cumulative | grep watchman
root      2683 11.5  0.3 13001500 163292 ?     Sl   20:05  12:45 watchman        
root     18410  0.0  0.0 103256   888 pts/1    S+   21:55   0:00 grep watchman

3) After adding the following configuration:
STATE_CHANGE_DELAY=60
STATE_CHECK_PERIOD=60
[root@node ~]# ps auxww --cumulative | grep watchman
root     10021 10.8  0.1 12905248 82308 ?      Sl   22:00   0:56 watchman                                   
root     24596  0.0  0.0 103256   852 pts/2    S+   22:08   0:00 grep watchman
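
The three readings above can be compared as relative reductions. The short Python sketch below only redoes the arithmetic on the reported %CPU values (42, 11.5 and 10.8); the function name relative_reduction is hypothetical. Treat the result as a rough comparison, since ps reports CPU time divided by the time the process has been running and the three processes had run for different lengths of time.

```python
def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


ga_build = 42.0       # %CPU on the OSE GA build
puddle = 11.5         # %CPU on puddle-2-1-2014-07-15
puddle_tuned = 10.8   # %CPU with STATE_CHANGE_DELAY=60, STATE_CHECK_PERIOD=60

print(f"puddle vs GA: {relative_reduction(ga_build, puddle):.1f}% lower")
print(f"tuned vs GA:  {relative_reduction(ga_build, puddle_tuned):.1f}% lower")
```

Most of the drop comes from the updated packages; the extra tuning accounts for less than one percentage point of CPU in this sample.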
Comment 9 errata-xmlrpc 2014-08-04 09:27:06 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html
