Description of problem: Recently, we've seen watchman CPU utilization increase dramatically, and I'm still looking for causes. Most recently I tracked this down to the grep call in libcgroup, which now takes as much as 0.6 seconds on our productions nodes ... and runs every 6 seconds, translating to 10% of a CPU all on its own. This appears to be at least double the CPU time of the old code. Please review and revise this. For reference, the commit that introduced this is here: https://github.com/openshift/origin-server/commit/1d040104ddf739053278c20915162807bc8dfde6 Example time difference of just the grep: [root ~]# time grep -H "" /cgroup/all/openshift/*/{cpu.stat,cpuacct.usage,cpu.cfs_quota_us} > /dev/null real 0m0.580s user 0m0.021s sys 0m0.319s [root ~]# time grep -H ^ /cgroup/*/openshift/*/cpu* >/dev/null real 0m1.227s user 0m0.159s sys 0m0.694s I know there are other considerations, such as that more files are now being read, but we need to bring this back to a reasonable level (that could mean we just need to poll the data less often, but I think there are other potential optimizations as well).
https://github.com/openshift/origin-server/pull/5418 provides THROTTLER_CHECK_PERIOD element in /etc/sysconfig/watchman to control how frequently the cgroup counters are read
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/dbc9cfadb7c82eba7b17638e7f79e2c0a01bdf8e Bug 1097959 - Add THROTTLER_CHECK_PERIOD to detune Throttler * Add THROTTLER_CHECK_PERIOD element to /etc/sysconfig/watchman to allow Operator to set period for checking cgroup counters
Checked on devenv_4797, with the option THROTTLER_CHECK_PERIOD=20 added into /etc/sysconfig/watchman. Checking the cgroup.log and cgroup-trace.log on node, it will check the gears cgroup setting each 20s. [root@ip-10-47-144-172 ~]# tailf cgroup.log May 19 02:41:26 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED] May 19 02:41:43 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED] May 19 02:41:46 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED] May 19 02:41:47 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED] May 19 02:42:03 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED] May 19 02:42:07 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED] May 19 02:42:07 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED] May 19 02:42:24 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED] May 19 02:42:27 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED] May 19 02:42:27 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED] [root@ip-10-47-144-172 ~]# tailf cgroup-trace.log| grep INFO May 19 02:43:24 INFO oo_spawn running set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 24>, :err=>#<IO:fd 22>} May 19 02:43:24 INFO oo_spawn buffer(23/) 5379a209fe6a161373000001/cpu.cfs_period_us:100000 May 19 02:43:24 INFO oo_spawn buffer(23/) 1613730000e2/cpu.stat:nr_periods 2675 May 19 02:43:24 INFO oo_spawn buffer(23/) cpuacct.stat:user 1129 May 19 02:43:27 INFO oo_spawn running /bin/grep ' killed as a result of limit of ' /var/log/messages: {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>} May 19 02:43:27 INFO oo_spawn running ps ax --format 'uid,pid=,ppid=': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>} May 19 02:43:27 INFO oo_spawn buffer(27/) UID May 19 02:43:27 INFO oo_spawn buffer(27/) 8 1 May 19 02:43:44 INFO [] oo_spawn running set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 24>, :err=>#<IO:fd 22>} May 19 02:43:44 INFO [] oo_spawn buffer(23/) 5379a209fe6a161373000001/cpu.cfs_period_us:100000 May 19 02:43:44 INFO [] oo_spawn buffer(23/) 1613730000e2/cpu.stat:nr_periods 2715 May 19 02:43:44 INFO [] oo_spawn buffer(23/) cpuacct.stat:user 1139 May 19 02:43:47 INFO [] oo_spawn running /bin/grep ' killed as a result of limit of ' /var/log/messages: {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>} May 19 02:43:47 INFO [] oo_spawn running ps ax --format 'uid,pid=,ppid=': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>} May 19 02:43:47 INFO [] oo_spawn buffer(27/) UID May 19 02:43:47 INFO [] oo_spawn buffer(27/) 5 1 Move bug to verified.