Bug 1002293

Summary: [rhc-watchman] Watchman fails silently
Product: OpenShift Online
Reporter: Kenny Woodson <kwoodson>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: high
Priority: high
Version: 2.x
CC: jhonce, pmorie, qiuzhang
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2013-09-19 16:48:28 UTC
Type: Bug

Description Kenny Woodson 2013-08-28 20:14:15 UTC
Description of problem:

When starting libra-watchman, watchman dies silently.  
------------------------
[ex-std-node222.prod ~]$ sudo service libra-watchman status
rhc-watchman dead but pid file exists

The pid file exists.
-rw-r--r--. 1 root    root      5 Aug 28 15:21 rhc-watchman.pid

When I manually remove the pid file or call restart, the watchman process says it has started:
[ex-std-node222.prod ~]$ sudo service libra-watchman restart
Stopping Watchman Services:                                [FAILED]
Starting Watchman Services:                                [  OK  ]

The problem is that it is not started.  It died silently and created the pid file again.

Upon further investigation, I traced the problem to the call to daemon(), on or near the line below, which forks the process:
    daemon() if daemon

The problem here is that there is no exception handling around this block.  By wrapping it in a simple begin/rescue/end and printing exception.backtrace with puts, I was able to get the real error message.  The code in question:

Syslog.warning('Fork from parent process failed') if (pid = fork) == -1
exit unless pid.nil?


This exit was being called, but because my debug code prints the values and continues, execution went on to the real problem (line numbers for rhc-watchman won't match exactly, as I have inserted debug statements):

/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:58:in `rescue in config_val': /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group (ArgumentError)
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:53:in `config_val'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:27:in `initialize'
	from ./rhc-watchman:68:in `new'
	from ./rhc-watchman:68:in `initialize'
	from ./rhc-watchman:239:in `new'
	from ./rhc-watchman:239:in `<main>'
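
The debugging wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the actual rhc-watchman code: `report_startup_failure` is a hypothetical helper name, and the error is raised by hand to mimic the Throttler failure shown in the backtrace.

```ruby
# Hypothetical helper: run a startup block and report any exception
# instead of letting the process die without output.
def report_startup_failure(io = $stderr)
  yield
  true
rescue StandardError => e
  # Print the message and backtrace so the real cause is visible.
  io.puts "#{e.class}: #{e.message}"
  io.puts e.backtrace.join("\n") if e.backtrace
  false
end

# Stand-in for the failing startup code; the message mirrors the one
# reported in the backtrace above.
report_startup_failure do
  raise ArgumentError,
        "/etc/openshift/resource_limits.conf requires 'apply_period' " \
        "in '[cg_template_throttled]' group"
end
```

Returning true or false lets the caller decide whether to write the pid file, which would avoid the "dead but pid file exists" state.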

Version-Release number of selected component (if applicable):
Current

How reproducible:
Very reproducible.  We have a resource_limits.conf.small that is not the correct version.  With this version watchman cannot find certain values it needs to run.

Steps to Reproduce:
1.  Remove apply_period from the resource_limits.conf.small file.
2.  Restart watchman.
3.
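
The missing setting can also be checked for before restarting watchman. A minimal sketch, assuming the file uses INI-style `[group]` headers followed by `key=value` lines (the group syntax appears in the error message; the rest is an assumption). `key_in_group?` and the sample `cpu_shares` line are illustrative only and not taken from the real file.

```ruby
# Hypothetical check: report whether `key` is set inside `[group]`
# of an INI-style configuration text.
def key_in_group?(text, group, key)
  current = nil
  text.each_line do |line|
    line = line.strip
    # Skip blank lines and comments.
    next if line.empty? || line.start_with?('#', ';')
    if (m = line.match(/\A\[(.+)\]\z/))
      current = m[1]                      # entering a new group
    elsif current == group && line.split('=', 2).first.strip == key
      return true                         # key found in the wanted group
    end
  end
  false
end

# Made-up sample that lacks apply_period, as in the reproduction steps.
sample = <<~CONF
  [cg_template_throttled]
  cpu_shares=128
CONF

puts key_in_group?(sample, 'cg_template_throttled', 'apply_period')
```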

Actual results:
Watchman dies silently and will never start.

Expected results:
Watchman should fail gracefully and report the error, for example when run with a -v flag.  When attempting to debug the problem, there are no error messages in /var/log/messages and no output when running rhc-watchman.  There is literally no way of knowing what the problem is.

Additional info:

The md5sum of the bad resource_limits.conf.small is e868dfe0e0df12c99fb9621013d53ddc.

The version it should be is 8a1b5299ff3ad09fc43110087d506925.

We will make sure the proper version is in place.
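
For anyone who wants to confirm which version of the file a node has, the checksum comparison can be scripted. A minimal sketch: the default digest is the known-good value quoted above, while the file path and the `config_matches?` name are assumptions, since this report does not say where resource_limits.conf.small is deployed.

```ruby
require 'digest/md5'

# Returns true when the file at `path` has the expected MD5 digest.
# The default is the known-good checksum quoted in this report.
def config_matches?(path, expected_md5 = '8a1b5299ff3ad09fc43110087d506925')
  Digest::MD5.file(path).hexdigest == expected_md5
end

# Assumed location; adjust to wherever the file lives on the node.
path = '/etc/openshift/resource_limits.conf.small'
puts config_matches?(path) if File.exist?(path)
```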

Please add a debug flag to watchman so we can run with a -v or -d to verify that it starts properly.  We rely on watchman to handle idling and watching over the applications.

Comment 1 openshift-github-bot 2013-09-12 00:15:34 UTC
Commits pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/bafa3f582dc4223f2bd31097d64bf075c13fe14d
Bug 1002293 - Protect Watchman from Throttler

https://github.com/openshift/li/commit/7e6cb41955280a8223ac4d8e975101c719d58888
Bug 1002293 - Protect Watchman from Throttler

* add throttler status to Watchman status message

Comment 2 Qiushui Zhang 2013-09-12 09:29:54 UTC
Tested on devenv_3776.

Watchman can restart successfully after modifying "apply_period".

Failures are now reported in /var/log/messages (checked with tailf /var/log/messages):

Sep 12 05:21:27 ip-10-147-175-80 rhc-watchman[20369]: Failed to create Throttler: /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group
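
The behavior verified here (watchman keeps running and logs the Throttler failure) is consistent with a pattern like the one below. This is only a sketch under assumptions, not the code from the commits above: `build_throttler` and the lambdas are hypothetical names, and stderr stands in for syslog.

```ruby
# Build a throttler via `factory`, or log the failure and return nil
# so the caller can keep running without throttling.
def build_throttler(factory, log)
  factory.call
rescue StandardError => e
  log.call("Failed to create Throttler: #{e.message}")
  nil
end

# Hypothetical factory that fails the way the misconfigured node did.
failing_factory = lambda do
  raise ArgumentError,
        "/etc/openshift/resource_limits.conf requires 'apply_period' " \
        "in '[cg_template_throttled]' group"
end

# A real daemon would log through syslog; stderr keeps the sketch portable.
log = ->(msg) { warn(msg) }

throttler = build_throttler(failing_factory, log)
warn "throttler enabled: #{!throttler.nil?}"
```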


Marking the bug as verified.