Bug 1057734 - watchman won't unthrottle gears with no CPU usage
Summary: watchman won't unthrottle gears with no CPU usage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
: ---
Assignee: Jhon Honce
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 1062573
Blocks:
Reported: 2014-01-24 17:10 UTC by Andy Grimm
Modified: 2016-11-08 03:47 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-26 19:10:26 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2014-01-24 17:10:23 UTC
Description of problem:

We often see watchman logs like this:

Jan 24 08:04:13 ex-std-node43 rhc-watchman[186754]: Throttler: REFUSED restore => <UUID> (still over threshold (NaN))

The problem is that when a gear has no CPU utilization at all, nr_periods in cpu.stat does not change between samples. This leads to a division by zero in MonitoredGear.elapsed_usage.
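
To illustrate the failure mode, here is a minimal Ruby sketch. It is not the origin-server code: the method names, the arguments, and the "restore if usage <= threshold" check are assumptions made for illustration only. It shows why a zero nr_periods delta yields NaN with float division, why a NaN value can never pass a threshold comparison, and what a simple guard might look like.

```ruby
# Hypothetical sketch; these names are illustrative, not the watchman API.

# Usage per elapsed period, as a percentage, with no protection.
# With floats, 0.0 / 0.0 is NaN instead of raising ZeroDivisionError.
def elapsed_usage_unguarded(usage_delta, periods_delta)
  usage_delta.to_f / periods_delta.to_f * 100
end

# Same calculation, but an idle gear (no elapsed periods) reports 0.0.
def elapsed_usage_guarded(usage_delta, periods_delta)
  return 0.0 if periods_delta.zero?
  usage_delta.to_f / periods_delta * 100
end

# Assumed restore check: every comparison against NaN is false,
# so a NaN usage is refused forever.
def restorable?(usage, threshold)
  usage <= threshold
end

idle = elapsed_usage_unguarded(0, 0)
puts idle.nan?                                       # true
puts restorable?(idle, 30.0)                         # false
puts restorable?(elapsed_usage_guarded(0, 0), 30.0)  # true
```

Treating an idle gear as 0.0 usage is one possible choice; the actual fix may handle the case differently.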

Version-Release number of selected component (if applicable):

rhc-node-1.18.4-1.el6oso.x86_64 (but the oo-watchman rewrite also appears to have this problem)

How reproducible:

always

Steps to Reproduce:
1. do a load test against a gear to cause it to be throttled
2. kill the processes in the gear
3. observe messages like the one above in the log

Actual results:

"still over threshold (NaN)" indicates a division by zero

Expected results:

The gear should be unthrottled

Additional info:

This is a minor issue, because usually when this appears in the logs, the gear actually has no processes running.  However, this could be very confusing to a systems administrator, so it should be fixed.

Comment 1 openshift-github-bot 2014-01-31 17:57:40 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/8a3056e19daa1678617648ff4f217e34a1598023
Bug 1057734 - Protect against divide by zero

Comment 2 Meng Bo 2014-02-11 06:08:57 UTC
Checked on devenv_4357. After killing the process that was consuming the CPU, watchman unthrottles the gear after a while.

Feb 11 00:43:45 ip-10-16-155-161 watchman[1969]: Throttler: throttle => 52f99837c7ca5d728c00003e (158.478)
Feb 11 00:44:05 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (392.087))
Feb 11 00:44:25 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (392.888))
Feb 11 00:44:45 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (393.526))
Feb 11 00:45:05 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (388.051))
Feb 11 00:45:25 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (368.415))
Feb 11 00:45:45 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (118.894))
Feb 11 00:46:05 ip-10-16-155-161 watchman[1969]: Throttler: restore => 52f99837c7ca5d728c00003e (9.476)


Move bug to verified.

