Bug 998704

Summary: [Watchman] Exception: NoMethodError
Product: OpenShift Online
Component: Containers
Version: 1.x
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Kenny Woodson <kwoodson>
Assignee: Fotios Lindiakos <fotios>
QA Contact: libra bugs <libra-bugs>
CC: bmeng, jkeck
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-08-29 12:53:07 UTC

Description Kenny Woodson 2013-08-19 20:04:00 UTC
Description of problem:

When debugging a few issues I noticed these messages in rsyslog:

Aug 19 15:51:21 ex-std-node43 rhc-watchman[19981]: watchman caught #<NoMethodError: undefined method `>=' for nil:NilClass>: undefined method `>=' for nil:NilClass. Retries left: 1
Aug 19 15:53:13 ex-std-node43 rhc-watchman[19981]: watchman caught #<NoMethodError: undefined method `>=' for nil:NilClass>: undefined method `>=' for nil:NilClass. Retries left: 0


Version-Release number of selected component (if applicable):
Current

How reproducible:
Quite a few of our nodes are seeing this issue, so it should be reproducible, but I'm not sure which value is nil.

Steps to Reproduce:

Actual results:
Watchman throws an exception.

Expected results:

Watchman should be hardened and should not throw an exception for this specific issue.

Additional info:

We depend on watchman to manage resources and currently see that it fails often.  We have even written a restart handler for the service.  Let's harden it.
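
For illustration only, here is a minimal Ruby sketch of the failure mode in the log above and one way to guard against it. The method name `over_threshold?` and the threshold value are hypothetical and are not taken from the rhc-watchman source; the actual nil value was not identified in this report.

```ruby
# Hypothetical helper: the name and the threshold are illustrative,
# not taken from rhc-watchman.
def over_threshold?(utilization, threshold)
  # A missing sample is treated as "unknown" instead of being compared,
  # because nil does not respond to >= and would raise NoMethodError.
  return false if utilization.nil?
  utilization >= threshold
end

# Unguarded comparison: reproduces the class of error seen in rsyslog.
begin
  nil >= 90.0
rescue NoMethodError => e
  puts "unguarded: #{e.class}"
end

# Guarded comparison: no exception for a missing value.
puts over_threshold?(nil, 90.0)
puts over_threshold?(95.0, 90.0)
```

Treating a missing value as "not over threshold" is one possible policy; logging it and skipping the entry would be another.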

Comment 1 Fotios Lindiakos 2013-08-20 21:59:33 UTC
A fix is in this PR, which is undergoing review and testing.

https://github.com/openshift/origin-server/pull/3443

Comment 2 Meng Bo 2013-08-22 10:22:30 UTC
Aug 22 01:54:40 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:54:40 ip-10-196-51-239 rhc-watchman[1928]: Throttler: REFUSED restore => 342483957415281973788672 (unknown utilization)
Aug 22 01:55:00 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:55:00 ip-10-196-51-239 rhc-watchman[1928]: Throttler: REFUSED restore => 342483957415281973788672 (unknown utilization)
Aug 22 01:55:20 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:55:20 ip-10-196-51-239 rhc-watchman[1928]: Throttler: REFUSED restore => 342483957415281973788672 (unknown utilization)
Aug 22 01:55:40 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:55:40 ip-10-196-51-239 rhc-watchman[1928]: Throttler: REFUSED restore => 342483957415281973788672 (unknown utilization)
Aug 22 01:56:00 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:56:00 ip-10-196-51-239 rhc-watchman[1928]: Throttler: REFUSED restore => 342483957415281973788672 (unknown utilization)
Aug 22 01:56:20 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 22 01:56:20 ip-10-196-51-239 rhc-watchman[1928]: watchman caught #<ArgumentError: comparison of String with Float failed>: comparison of String with Float failed. Retries left: 9
Aug 22 01:56:40 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 9
Aug 22 01:56:40 ip-10-196-51-239 rhc-watchman[1928]: watchman caught #<ArgumentError: comparison of String with Float failed>: comparison of String with Float failed. Retries left: 8
Aug 22 01:57:00 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 8
Aug 22 01:57:00 ip-10-196-51-239 rhc-watchman[1928]: watchman caught #<ArgumentError: comparison of String with Float failed>: comparison of String with Float failed. Retries left: 7
Aug 22 01:57:20 ip-10-196-51-239 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 7
Aug 22 01:57:20 ip-10-196-51-239 rhc-watchman[1928]: watchman caught #<ArgumentError: comparison of String with Float failed>: comparison of String with Float failed. Retries left: 6




I hit the above ArgumentError during my testing, and I am not sure how to reproduce it.
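
As a hedged illustration of how this class of error can arise in Ruby, the sketch below compares a String reading against a Float and then shows one possible guard that coerces the value first. The helpers `to_utilization` and `exceeds?` are hypothetical and are not taken from rhc-watchman; which value was actually a String is not established in this report.

```ruby
# Hypothetical helper, not taken from rhc-watchman: coerce a raw reading to a
# Float, or return nil when it cannot be parsed.
def to_utilization(raw)
  Float(raw)
rescue ArgumentError, TypeError
  # Float("n/a") raises ArgumentError, Float(nil) raises TypeError.
  nil
end

# Hypothetical check: only compare once the reading is known to be numeric.
def exceeds?(raw, threshold)
  value = to_utilization(raw)
  return false if value.nil? # unknown utilization: do not act on it
  value > threshold
end

# Uncoerced comparison of a String with a Float raises ArgumentError.
begin
  "42.5" > 10.0
rescue ArgumentError => e
  puts "uncoerced: #{e.class}"
end

puts exceeds?("42.5", 10.0)
puts exceeds?("n/a", 10.0)
puts exceeds?(nil, 10.0)
```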

Comment 3 Fotios Lindiakos 2013-08-22 17:36:23 UTC
Tried another fix: https://github.com/openshift/origin-server/pull/3470

I have not been able to reproduce this, but I added some additional logging information. This should no longer fail, but please check /var/log/messages for "Throttler: problem in find for ..." and attach the log if it's found. I will watch the logs from some Jenkins runs as well. Hopefully with this information I can find the root cause of the problem.
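
For anyone checking their nodes, a small Ruby sketch for pulling the relevant lines out of a log is given below. It matches only on the fixed prefix quoted above. The method name `matching_lines` is hypothetical, and the second sample line is made up for the demonstration; it is not real watchman output.

```ruby
require "stringio"

# Fixed prefix quoted in the comment above; the rest of the message is unknown.
NEEDLE = "Throttler: problem in find for"

# Hypothetical helper: return every line of the given IO that contains the prefix.
def matching_lines(io)
  io.each_line.select { |line| line.include?(NEEDLE) }
end

# Demonstration on an in-memory sample. The first line is from this report;
# the second is a made-up placeholder, not real watchman output.
sample = StringIO.new(<<~LOG)
  rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
  rhc-watchman[1928]: Throttler: problem in find for EXAMPLE
LOG

puts matching_lines(sample)
```

On a node, the same method could be applied to the real file, for example `File.open("/var/log/messages") { |f| puts matching_lines(f) }`, assuming the process has permission to read it.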

Comment 4 openshift-github-bot 2013-08-22 23:17:50 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/8c575e912dd486a5b04bfcd99b126ba6f9db1547
Merge pull request #3470 from fotioslindiakos/Bug998704

Merged by openshift-bot

Comment 5 Meng Bo 2013-08-23 09:46:59 UTC
Checked on devenv-stage_452; did not see this error in /var/log/messages.

Moving bug to verified.