Description of problem:
There are two significant errors in the math in the MonitoredGear class used by the throttler plugin.

1) The code calculates the total possible CPU time by multiplying nr_periods by cfs_quota_us. However, nr_periods only counts the periods during which the cgroup actually ran something (or at least was runnable); it is _not_ a count of elapsed time.

2) The value calculated as (nr_periods * cfs_quota_us) is in microseconds, but the throttled time and cpuacct.usage values it is compared against are in nanoseconds. The percentage calculation would therefore be off by a factor of 1000, except that the conversion from a raw ratio to a percent was commented out, so it is off by a less noticeable factor of 10.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.23.4-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:

Problem 1:
1. On a mostly idle node, run a compute-intensive process in a gear. Here's one:
   ruby -e 'z=0; 1.upto(1_000_000_000_000) { |x| 1.upto(x) { |y| z += y } }'
2. Watch the gear get throttled, and note that the logs say things like:
   May 22 17:24:53 ex-std-node1 watchman[345260]: Throttler: REFUSED restore => 537b749b2587c85c1c0001ec (still over threshold (999.98))

Problem 2:
1. On a mostly idle node, run a compute-intensive process in a gear, but in bursts. E.g.:
   while true; do time ruby -e 'z=0; 1.upto(1000) { |x| 1.upto(x) { |y| z += y } }'; sleep 5; done
   I timed this so the command uses about 1 second of CPU in my environment (roughly 1 second of CPU per 6 seconds of wall time), so the throttler should measure it at about 16% utilization. Because of its use of nr_periods, it typically reports 80-90% instead.

Actual results:

Problem 1: The gear appears to be using more than 100% of its CPU allocation, which 1) suggests that cgroups don't actually work, and 2) prevents gears from being unthrottled when they should be.

Problem 2: Gears with bursty traffic but low overall CPU utilization are often penalized and throttled.

Expected results:
The throttler's calculations should accurately reflect the actual percentage of CPU consumed.

Additional info:
I have a small patch on the way for this, and a script for testing as well.
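For reference, here is a minimal sketch of the kind of correction intended (not the actual patch; the method and variable names below are illustrative only): derive the allowed CPU time from elapsed wall-clock time and cfs_period_us instead of nr_periods, and convert cpuacct.usage from nanoseconds before comparing.

# Sketch only -- names are illustrative, not taken from MonitoredGear.
# Allowed CPU time over the sampling interval comes from elapsed wall-clock
# time and the cgroup period (not nr_periods), and cpuacct.usage (ns) is
# converted to microseconds before the comparison.
def cpu_percent(usage_delta_ns, elapsed_seconds, cfs_quota_us, cfs_period_us)
  periods_elapsed = (elapsed_seconds * 1_000_000.0) / cfs_period_us
  allowed_us      = periods_elapsed * cfs_quota_us
  used_us         = usage_delta_ns / 1000.0
  (used_us / allowed_us) * 100.0
end

# Example: 1 second of CPU used over a 20-second window with a quota of
# 30000us per 100000us period => roughly 16.7% of the allowed CPU.
cpu_percent(1_000_000_000, 20, 30_000, 100_000)  # => ~16.667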
PR: https://github.com/openshift/origin-server/pull/5443
Fixed in https://github.com/openshift/origin-server/pull/5443
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/e214410228adb20034abbed0c9bedf7a99baa917
Bug 1100518 - Correct throttler's CPU usage math
Checked on devenv-stage_861. The throttle log now shows the CPU usage as a normal value. Also ran regression testing for watchman; no regression issues were found. Moving bug to VERIFIED.

May 30 08:19:58 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (42.458)
May 30 08:20:18 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (61.1)
May 30 08:20:38 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (77.148)
May 30 08:20:58 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (95.213))
May 30 08:21:18 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (101.146))
...
...
May 30 08:24:18 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (100.024))
May 30 08:24:38 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (100.046))
May 30 08:24:58 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (84.915))
May 30 08:25:18 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (67.979)
May 30 08:25:39 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (50.971)
May 30 08:25:59 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (34.025)
I just ran into this bug. How can I fix it on Enterprise 2.1?