+++ This bug was initially created as a clone of Bug #1100518 +++

Description of problem:

There are two significant errors in the math in the MonitoredGear class used by the throttler plugin.

1) The code calculates the total possible CPU time by multiplying nr_periods by cfs_quota_us. However, nr_periods only counts the periods in which the cgroup actually ran something (or at least wanted to run); it is _not_ a count of elapsed time.

2) The number calculated with (nr_periods * cfs_quota_us) is in microseconds, but the throttled time and cpuacct.usage values against which it is compared are in nanoseconds. The percentage calculation would thus be off by a factor of 1000, except that the conversion from a raw ratio to a percentage was commented out, so the two mistakes partially cancel (1000 / 100) and the result is off by a less noticeable factor of 10.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.23.4-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:

Problem 1:
1. On a mostly-idle node, run a compute-intensive process in a gear. Here's one:
ruby -e 'z=0; 1.upto(1_000_000_000_000) { |x| 1.upto(x) { |y| z += y } }'
2. Watch the gear get throttled, and note that the logs say things like:
May 22 17:24:53 ex-std-node1 watchman[345260]: Throttler: REFUSED restore => 537b749b2587c85c1c0001ec (still over threshold (999.98))

Problem 2:
1. On a mostly-idle node, run a process in a gear that is compute-intensive, but in bursts. E.g.:
while true; do time ruby -e 'z=0; 1.upto(1000) { |x| 1.upto(x) { |y| z += y } }'; sleep 5; done
I specifically timed this so the command uses about 1 second of CPU time in my environment; with the 5-second sleep, that is roughly 1 CPU-second per 6 wall-clock seconds, so the throttler should measure it at about 16% utilization. Because of its use of nr_periods, it typically reports 80-90% instead.

Actual results:

Problem 1: The gear appears to be using more than 100% of its CPU allocation, which 1) suggests that cgroups don't actually work, and 2) prevents gears from being unthrottled when they should be.

Problem 2: Gears with bursty traffic but low overall CPU utilization are often penalized and throttled.

Expected results:
The throttler's calculations should accurately reflect the actual percentage of CPU consumed.

Additional info:
I have a small patch on the way for this, and a script for testing as well.

--- Additional comment from Andy Grimm on 2014-05-22 22:55:55 EDT ---

PR: https://github.com/openshift/origin-server/pull/5443
Upstream commits:

commit e214410228adb20034abbed0c9bedf7a99baa917
Author: Andy Grimm <agrimm>
Date:   Thu May 22 22:44:07 2014 -0400

    Bug 1100518 - Correct throttler's CPU usage math

commit 337d8994ec90f1fd82507f014e79d29e0205ece3
Author: Andy Grimm <agrimm>
Date:   Sun May 25 14:53:08 2014 -0400

    Move cgroup sample timestamp insertion and fix unit test

commit 32b245340f2eb8d58b3cc9ea47a012e3c200986c
Author: Andy Grimm <agrimm>
Date:   Wed May 28 21:45:17 2014 -0400

    Fix throttler math in monitored_gear_test
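To make the fix concrete, here is a minimal Ruby sketch of the corrected math. This is an illustration written for this report, not code taken from the PR or commits above, and all names (utilization_percent, usage_delta_ns, elapsed_seconds, and so on) are assumptions:

  # Minimal sketch of the corrected math; names are illustrative, not
  # the identifiers actually used in MonitoredGear.
  MICROSECONDS_PER_SECOND     = 1_000_000
  NANOSECONDS_PER_MICROSECOND = 1_000

  # usage_delta_ns : growth of cpuacct.usage over the window (nanoseconds)
  # elapsed_seconds: wall-clock length of the window (fix for error 1:
  #                  use elapsed time, not nr_periods)
  # cfs_period_us, cfs_quota_us: the cgroup's CFS settings (microseconds)
  def utilization_percent(usage_delta_ns, elapsed_seconds, cfs_period_us, cfs_quota_us)
    periods_elapsed = (elapsed_seconds * MICROSECONDS_PER_SECOND) / cfs_period_us.to_f
    allowed_us      = periods_elapsed * cfs_quota_us
    used_us         = usage_delta_ns / NANOSECONDS_PER_MICROSECOND.to_f  # fix for error 2: ns -> us
    100.0 * used_us / allowed_us
  end

  # The bursty workload from Problem 2: ~1 CPU-second every ~6 wall-clock
  # seconds, with an assumed 100ms period and 100ms quota:
  puts utilization_percent(1_000_000_000, 6, 100_000, 100_000)  # => ~16.7

The key points are that the denominator is derived from wall-clock elapsed time rather than nr_periods, and that both sides of the division are converted to the same unit before comparing.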
Verified and passed on puddle-2-1-2014-07-15. The bug can be recreated on the OSE 2.1Z GA build.

Recreation steps for Problem 1:
1. Run a compute-intensive process in a gear. Here's one:
ruby -e 'z=0; 1.upto(1_000_000_000_000) { |x| 1.upto(x) { |y| z += y } }'
2. Watch the gear get throttled, and note that the logs say things like:

Jul 16 01:29:37 node watchman[1421]: Throttler: REFUSED restore => 53c4a1a64cfeffdd11000023 (still over threshold (980.512))
Jul 16 01:29:37 node watchman[1421]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (1016.848))
Jul 16 01:30:00 node watchman[1421]: Throttler: REFUSED restore => 53c4a1a64cfeffdd11000023 (still over threshold (979.731))
Jul 16 01:30:00 node watchman[1421]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (902.147))

Verification steps, after updating to puddle-2-1-2014-07-15 (hint: oo-cgroup-disable/enable all containers):
1. Run the same compute-intensive process in a gear:
ruby -e 'z=0; 1.upto(1_000_000_000_000) { |x| 1.upto(x) { |y| z += y } }'
2. Watch the gear get throttled, and note that the reported utilization is now in the normal range:

Jul 16 05:31:30 node watchman[24846]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (77.901))
Jul 16 05:31:51 node watchman[24846]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (95.201))
<snip--->
Jul 16 05:32:11 node watchman[24846]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (100.571))
Jul 16 05:32:13 node dhclient[1016]: bound to 192.168.55.38 -- renewal in 51 seconds.
Jul 16 05:32:31 node watchman[24846]: Throttler: REFUSED restore => 53c5fe864cfeffdd11000038 (still over threshold (100.556))

Recreation steps for Problem 2:
1. On a mostly-idle node, run a process in a gear that is compute-intensive, but in bursts. E.g.:
while true; do time ruby -e 'z=0; 1.upto(1000) { |x| 1.upto(x) { |y| z += y } }'; sleep 5; done

Jul 16 01:47:54 node watchman[28143]: Throttler: throttle => 53c5fe964cfeffdd1100004c (42.545)
Jul 16 01:47:54 node watchman[28143]: Throttler: throttle => 53c5fe964cfeffdd1100004c (42.545)
<snip--->
Jul 16 01:48:14 node watchman[28143]: Throttler: REFUSED restore => 53c5fe964cfeffdd1100004c (still over threshold (125.138))
Jul 16 01:48:35 node watchman[28143]: Throttler: REFUSED restore => 53c5fe964cfeffdd1100004c (still over threshold (207.461))

Verified with the same steps after updating to puddle-2-1-2014-07-15: no throttle or "Throttler: REFUSED restore" messages are reported against gear 53c5fe964cfeffdd1100004c.
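As an independent sanity check during verification, raw CPU consumption can be sampled straight from cpuacct.usage and compared against the percentages the throttler logs. The snippet below is a rough illustration only; the cgroup path is an assumption and must be adjusted to the node's actual cgroup layout:

  #!/usr/bin/env ruby
  # Rough cross-check of a gear's raw CPU use over a 10-second window.
  # The cgroup path below is an assumption -- adjust it for your node's
  # cgroup mount and hierarchy.
  gear_uuid  = ARGV[0] or abort "usage: #{$0} <gear-uuid>"
  usage_file = "/cgroup/all/openshift/#{gear_uuid}/cpuacct.usage"

  before  = File.read(usage_file).to_i   # cumulative CPU time, nanoseconds
  started = Time.now
  sleep 10
  after   = File.read(usage_file).to_i
  elapsed = Time.now - started

  percent = 100.0 * (after - before) / (elapsed * 1_000_000_000)
  puts format("gear used %.1f%% of one CPU over %.1fs", percent, elapsed)

For the bursty Problem 2 workload, this should report a figure near the expected ~16%, which can then be compared with the values in the watchman log lines above.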
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html