Description of problem:
There are two significant errors in the math in the MonitoredGear class used by the throttler plugin.

1) The code calculates the total possible CPU time by multiplying nr_periods by cfs_quota_us. However, nr_periods only counts the periods during which the cgroup actually ran something (or at least was runnable); it is _not_ a count of elapsed time.

2) The value calculated as (nr_periods * cfs_quota_us) is in microseconds, but the throttled time and cpuacct.usage values it is compared against are in nanoseconds. The percentage calculation would therefore be off by a factor of 1000, except that the conversion from a raw ratio to a percent was commented out, so it is off by a less noticeable factor of 10.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.23.4-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:

Problem 1:
1. On a mostly idle node, run a compute-intensive process in a gear. Here's one:
   ruby -e 'z=0; 1.upto(1_000_000_000_000) { |x| 1.upto(x) { |y| z += y } }'
2. Watch the gear get throttled, and note that the logs say things like:
   May 22 17:24:53 ex-std-node1 watchman[345260]: Throttler: REFUSED restore => 537b749b2587c85c1c0001ec (still over threshold (999.98))

Problem 2:
1. On a mostly idle node, run a compute-intensive process in a gear, but in bursts. E.g.:
   while true; do time ruby -e 'z=0; 1.upto(1000) { |x| 1.upto(x) { |y| z += y } }'; sleep 5; done
   I timed this so the command uses about 1 second of CPU in my environment (roughly 1 second of CPU per 6 seconds of wall time), so the throttler should measure it at about 16% utilization. Because of its use of nr_periods, it typically reports 80-90% instead.

Actual results:

Problem 1: The gear appears to be using more than 100% of its CPU allocation, which 1) suggests that cgroups don't actually work, and 2) prevents gears from being unthrottled when they should be.

Problem 2: Gears with bursty traffic but low overall CPU utilization are often penalized and throttled.

Expected results:
The throttler's calculations should accurately reflect the actual percentage of CPU consumed.

Additional info:
I have a small patch on the way for this, and a script for testing as well.
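For reference, here is a minimal sketch of the kind of correction intended (not the actual patch; the method and variable names below are illustrative only): derive the allowed CPU time from elapsed wall-clock time and cfs_period_us instead of nr_periods, and convert cpuacct.usage from nanoseconds before comparing.

# Sketch only -- names are illustrative, not taken from MonitoredGear.
# Allowed CPU time over the sampling interval comes from elapsed wall-clock
# time and the cgroup period (not nr_periods), and cpuacct.usage (ns) is
# converted to microseconds before the comparison.
def cpu_percent(usage_delta_ns, elapsed_seconds, cfs_quota_us, cfs_period_us)
  periods_elapsed = (elapsed_seconds * 1_000_000.0) / cfs_period_us
  allowed_us      = periods_elapsed * cfs_quota_us
  used_us         = usage_delta_ns / 1000.0
  (used_us / allowed_us) * 100.0
end

# Example: 1 second of CPU used over a 20-second window with a quota of
# 30000us per 100000us period => roughly 16.7% of the allowed CPU.
cpu_percent(1_000_000_000, 20, 30_000, 100_000)  # => ~16.667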
PR: https://github.com/openshift/origin-server/pull/5443
Fixed in https://github.com/openshift/origin-server/pull/5443
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/e214410228adb20034abbed0c9bedf7a99baa917
Bug 1100518 - Correct throttler's CPU usage math
Checked on devenv-stage_861. The throttle log now shows the CPU usage as a normal value. Also ran regression testing for watchman; no regression issues were found. Moving bug to VERIFIED.

May 30 08:19:58 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (42.458)
May 30 08:20:18 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (61.1)
May 30 08:20:38 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (77.148)
May 30 08:20:58 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (95.213))
May 30 08:21:18 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (101.146))
...
...
May 30 08:24:18 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (100.024))
May 30 08:24:38 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (100.046))
May 30 08:24:58 ip-10-169-41-37 watchman[25807]: Throttler: REFUSED restore => 538874914b53dc2679000001 (still over threshold (84.915))
May 30 08:25:18 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (67.979)
May 30 08:25:39 ip-10-169-41-37 watchman[25807]: Throttler: throttle => 538874914b53dc2679000001 (50.971)
May 30 08:25:59 ip-10-169-41-37 watchman[25807]: Throttler: restore => 538874914b53dc2679000001 (34.025)
I just ran into this bug. How can I fix it on Enterprise 2.1?