Bug 1097959 - [PERF] libcgroup usage function needs optimization
Summary: [PERF] libcgroup usage function needs optimization
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 2.x
Assignee: Jhon Honce
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks: 1096863
 
Reported: 2014-05-15 00:56 UTC by Andy Grimm
Modified: 2016-11-08 03:47 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-07-15 10:28:58 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2014-05-15 00:56:55 UTC
Description of problem:

Recently we've seen watchman CPU utilization increase dramatically, and I'm still looking into the causes.  Most recently I tracked this down to the grep call in libcgroup, which now takes as much as 0.6 seconds on our production nodes and runs every 6 seconds, which translates to 10% of a CPU all on its own.  This appears to be at least double the CPU time of the old code.  Please review and optimize this code path.

For reference, the commit that introduced this is here:

https://github.com/openshift/origin-server/commit/1d040104ddf739053278c20915162807bc8dfde6

Example time difference of just the grep:

[root ~]# time grep -H "" /cgroup/all/openshift/*/{cpu.stat,cpuacct.usage,cpu.cfs_quota_us} > /dev/null

real	0m0.580s
user	0m0.021s
sys	0m0.319s
[root ~]# time grep -H ^ /cgroup/*/openshift/*/cpu* >/dev/null

real	0m1.227s
user	0m0.159s
sys	0m0.694s

I know there are other considerations, such as the fact that more files are now being read, but we need to bring this back down to a reasonable level.  That could mean simply polling the data less often, but I think there are other potential optimizations as well; one possibility is sketched below.
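For illustration only, and not the change that was ultimately merged: one option is to read the handful of counters the throttler needs directly from the cgroup filesystem instead of spawning a grep | sed pipeline on every check. A minimal sketch in Ruby, assuming the /cgroup/all/openshift/<gear>/ layout and the three counter files used in the first timing above (the function and variable names here are hypothetical):

#!/usr/bin/env ruby
# Illustration only (not the shipped fix): read per-gear cgroup CPU counters
# directly instead of spawning a `grep ... | sed ...` pipeline.
# Assumes the /cgroup/all/openshift layout shown in the timings above.

COUNTERS = %w(cpu.stat cpuacct.usage cpu.cfs_quota_us)

def read_gear_counters(root = "/cgroup/all/openshift")
  Dir.glob(File.join(root, "*")).each_with_object({}) do |gear_dir, out|
    uuid = File.basename(gear_dir)
    COUNTERS.each do |name|
      path = File.join(gear_dir, name)
      next unless File.readable?(path)
      # Keep the same "gear/counter:line" shape as the grep -H | sed output.
      File.read(path).each_line do |line|
        (out[uuid] ||= []) << "#{uuid}/#{name}:#{line.chomp}"
      end
    end
  end
end

puts read_gear_counters.values.flatten if __FILE__ == $0

Reading the files in-process avoids the fork/exec and pipeline overhead on every poll, which may account for part of the sys time in the measurements above.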

Comment 1 Jhon Honce 2014-05-15 21:03:44 UTC
https://github.com/openshift/origin-server/pull/5418 provides a THROTTLER_CHECK_PERIOD element in /etc/sysconfig/watchman to control how frequently the cgroup counters are read.
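For example, the value used in the verification below raises the period to 20 seconds:

# /etc/sysconfig/watchman
# Read the per-gear cgroup counters every 20 seconds instead of the
# roughly 6-second cadence measured in the description above.
THROTTLER_CHECK_PERIOD=20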

Comment 2 openshift-github-bot 2014-05-15 21:59:15 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/dbc9cfadb7c82eba7b17638e7f79e2c0a01bdf8e
Bug 1097959 - Add THROTTLER_CHECK_PERIOD to detune Throttler

* Add THROTTLER_CHECK_PERIOD element to /etc/sysconfig/watchman to
  allow Operator to set period for checking cgroup counters

Comment 3 Meng Bo 2014-05-19 02:44:31 UTC
Checked on devenv_4797, with the option THROTTLER_CHECK_PERIOD=20 added to /etc/sysconfig/watchman.

Checking cgroup.log and cgroup-trace.log on the node, it checks the gears' cgroup settings every 20 seconds.


[root@ip-10-47-144-172 ~]# tailf cgroup.log
May 19 02:41:26 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED]
May 19 02:41:43 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED]
May 19 02:41:46 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED]
May 19 02:41:47 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED]
May 19 02:42:03 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED]
May 19 02:42:07 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED]
May 19 02:42:07 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED]
May 19 02:42:24 INFO Shell command 'set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||'' ran. rc=0 out=[SILENCED]
May 19 02:42:27 INFO Shell command '/bin/grep ' killed as a result of limit of ' /var/log/messages' ran. rc=1 out=[SILENCED]
May 19 02:42:27 INFO Shell command 'ps ax --format 'uid,pid=,ppid='' ran. rc=0 out=[SILENCED]




[root@ip-10-47-144-172 ~]# tailf cgroup-trace.log| grep INFO
May 19 02:43:24 INFO oo_spawn running set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 24>, :err=>#<IO:fd 22>}
May 19 02:43:24 INFO oo_spawn buffer(23/) 5379a209fe6a161373000001/cpu.cfs_period_us:100000
May 19 02:43:24 INFO oo_spawn buffer(23/) 1613730000e2/cpu.stat:nr_periods 2675
May 19 02:43:24 INFO oo_spawn buffer(23/) cpuacct.stat:user 1129
May 19 02:43:27 INFO oo_spawn running /bin/grep ' killed as a result of limit of ' /var/log/messages: {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>}
May 19 02:43:27 INFO oo_spawn running ps ax --format 'uid,pid=,ppid=': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>}
May 19 02:43:27 INFO oo_spawn buffer(27/)   UID            
May 19 02:43:27 INFO oo_spawn buffer(27/) 8     1
May 19 02:43:44 INFO [] oo_spawn running set -e -o pipefail; grep -H ^ /cgroup/*/openshift/*/cpu* |sed 's|^/cgroup/[^/]*/openshift/||': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 24>, :err=>#<IO:fd 22>}
May 19 02:43:44 INFO [] oo_spawn buffer(23/) 5379a209fe6a161373000001/cpu.cfs_period_us:100000
May 19 02:43:44 INFO [] oo_spawn buffer(23/) 1613730000e2/cpu.stat:nr_periods 2715
May 19 02:43:44 INFO [] oo_spawn buffer(23/) cpuacct.stat:user 1139
May 19 02:43:47 INFO [] oo_spawn running /bin/grep ' killed as a result of limit of ' /var/log/messages: {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>}
May 19 02:43:47 INFO [] oo_spawn running ps ax --format 'uid,pid=,ppid=': {:unsetenv_others=>false, :close_others=>true, :in=>"/dev/null", :out=>#<IO:fd 28>, :err=>#<IO:fd 26>}
May 19 02:43:47 INFO [] oo_spawn buffer(27/)   UID            
May 19 02:43:47 INFO [] oo_spawn buffer(27/) 5     1


Moving bug to VERIFIED.

