Bug 1057734

Summary: watchman won't unthrottle gears with no CPU usage
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: low
Docs Contact:
Priority: medium
Version: 2.x
CC: bmeng, jgoulding
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-26 19:10:26 UTC
Type: Bug
Bug Depends On: 1062573    
Bug Blocks:    

Description Andy Grimm 2014-01-24 17:10:23 UTC
Description of problem:

We often see watchman logs like this:

Jan 24 08:04:13 ex-std-node43 rhc-watchman[186754]: Throttler: REFUSED restore => <UUID> (still over threshold (NaN))

The problem is that when a gear has no CPU utilization at all, nr_periods in cpu.stat does not change, so the difference between two consecutive samples is zero. This leads to a division by zero in MonitoredGear.elapsed_usage.
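
To illustrate the failure mode, here is a minimal Python sketch of a guarded per-period calculation. It is not the watchman code (which is Ruby), and the names per_period_rate and may_restore, the argument layout, and the threshold value are all hypothetical. The second helper assumes the restore decision is a plain numeric comparison; every ordered comparison against NaN is false, which would explain a refusal that never clears.

```python
import math


def per_period_rate(prev_value, curr_value, prev_periods, curr_periods):
    """Change in a counter per elapsed period; 0.0 if no period elapsed.

    Hypothetical helper: the *_periods arguments stand in for two samples
    of nr_periods from cpu.stat.
    """
    delta_periods = curr_periods - prev_periods
    if delta_periods <= 0:
        # Idle gear: nr_periods did not advance, so there is nothing to
        # divide by. (Ruby float division 0.0 / 0.0 yields NaN here;
        # Python would raise ZeroDivisionError instead.)
        return 0.0
    return (curr_value - prev_value) / delta_periods


def may_restore(rate, threshold):
    """Allow a restore only for a real number at or below the threshold."""
    return not math.isnan(rate) and rate <= threshold


# An idle gear: neither counter moved between the two samples.
idle_rate = per_period_rate(5000.0, 5000.0, 120, 120)
print(idle_rate, may_restore(idle_rate, 30.0))
```

With the guard in place an idle gear reports a rate of zero and becomes eligible for restore, instead of carrying a NaN that fails the comparison on every check.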

Version-Release number of selected component (if applicable):

rhc-node-1.18.4-1.el6oso.x86_64 (but the oo-watchman rewrite also appears to have this problem)

How reproducible:

always

Steps to Reproduce:
1. Run a load test against a gear so that it gets throttled.
2. Kill the processes in the gear.
3. Observe messages like the one above in the log.

Actual results:

"still over threshold (NaN)" indicates a division by zero

Expected results:

The gear should be unthrottled

Additional info:

This is a minor issue, because usually when this appears in the logs, the gear actually has no processes running. However, it could be very confusing to a systems administrator, so it should be fixed.

Comment 1 openshift-github-bot 2014-01-31 17:57:40 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/8a3056e19daa1678617648ff4f217e34a1598023
Bug 1057734 - Protect against divide by zero

Comment 2 Meng Bo 2014-02-11 06:08:57 UTC
Checked on devenv_4357. After killing the process that was consuming the CPU, watchman unthrottles the gear after a while.

Feb 11 00:43:45 ip-10-16-155-161 watchman[1969]: Throttler: throttle => 52f99837c7ca5d728c00003e (158.478)
Feb 11 00:44:05 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (392.087))
Feb 11 00:44:25 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (392.888))
Feb 11 00:44:45 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (393.526))
Feb 11 00:45:05 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (388.051))
Feb 11 00:45:25 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (368.415))
Feb 11 00:45:45 ip-10-16-155-161 watchman[1969]: Throttler: REFUSED restore => 52f99837c7ca5d728c00003e (still over threshold (118.894))
Feb 11 00:46:05 ip-10-16-155-161 watchman[1969]: Throttler: restore => 52f99837c7ca5d728c00003e (9.476)


Moving bug to verified.
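
For anyone repeating this verification, the Throttler lines quoted in this report can be checked mechanically. The following Python sketch is based only on the three line shapes shown here (throttle, REFUSED restore, restore); the regular expression and the parse_throttler_line helper are hypothetical and may not cover other watchman messages.

```python
import math
import re

# Matches the three Throttler message shapes quoted in this report.
LINE_RE = re.compile(
    r"Throttler: (?P<action>throttle|restore|REFUSED restore) => (?P<gear>\S+) "
    r"\((?:still over threshold \()?(?P<value>[^()]+)\)"
)


def parse_throttler_line(line):
    """Return (action, gear, value) for a Throttler line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # float() accepts "NaN", so pre-fix log lines parse as well.
    return match.group("action"), match.group("gear"), float(match.group("value"))


before_fix = parse_throttler_line(
    "Jan 24 08:04:13 ex-std-node43 rhc-watchman[186754]: "
    "Throttler: REFUSED restore => <UUID> (still over threshold (NaN))"
)
after_fix = parse_throttler_line(
    "Feb 11 00:46:05 ip-10-16-155-161 watchman[1969]: "
    "Throttler: restore => 52f99837c7ca5d728c00003e (9.476)"
)
print(before_fix, after_fix)
```

A REFUSED restore whose value parses as NaN corresponds to the behavior originally reported; a finite value followed eventually by a restore line matches the verified behavior.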