Bug 1002293 - [rhc-watchman] Watchman fails silently
Summary: [rhc-watchman] Watchman fails silently
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
: ---
Assignee: Jhon Honce
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-08-28 20:14 UTC by Kenny Woodson
Modified: 2015-05-14 23:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-19 16:48:28 UTC
Target Upstream Version:



Description Kenny Woodson 2013-08-28 20:14:15 UTC
Description of problem:

When starting libra-watchman, watchman dies silently.  
------------------------
[ex-std-node222.prod ~]$ sudo service libra-watchman status
rhc-watchman dead but pid file exists

The pid file exists.
-rw-r--r--. 1 root    root      5 Aug 28 15:21 rhc-watchman.pid
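
The "dead but pid file exists" state can be detected from Ruby as well. The following is only a sketch: the helper names pid_alive? and stale_pid_file? are hypothetical and are not part of rhc-watchman or the init script. It relies on Process.kill with signal 0, which checks whether a process exists without actually signalling it.

```ruby
# Hypothetical helpers, not part of rhc-watchman.

# Returns true if a process with the given pid currently exists.
def pid_alive?(pid)
  Process.kill(0, pid) # signal 0 only performs the existence/permission check
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # process exists but belongs to another user
end

# Returns true if the pid file exists but does not point at a live process.
def stale_pid_file?(path)
  return false unless File.exist?(path)
  content = File.read(path).strip
  return true unless content =~ /\A\d+\z/ # unreadable content counts as stale
  !pid_alive?(content.to_i)
end
```

A start script could call stale_pid_file? before trusting the pid file, and remove the file if it is stale.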

When I manually remove the pid file or call restart, the watchman service reports that it has started:
[ex-std-node222.prod ~]$ sudo service libra-watchman restart
Stopping Watchman Services:                                [FAILED]
Starting Watchman Services:                                [  OK  ]

The problem is that it is not started.  It died silently and created the pid file again.

Upon further investigation, I discovered that the line below (or one near it) calls daemon() to fork:
    daemon() if daemon

The problem here is that there is no exception handling around this call.  By wrapping it in a simple begin/rescue/end and printing exception.backtrace with puts, I was able to get the real error message.  The fork code involved:

Syslog.warning('Fork from parent process failed') if (pid = fork) == -1
exit unless pid.nil?


This exit was being called, but since I print the values and continue, execution went on to the real problem (line numbers for rhc-watchman won't match exactly because I have inserted debug statements):

/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:58:in `rescue in config_val': /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group (ArgumentError)
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:53:in `config_val'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:27:in `initialize'
	from ./rhc-watchman:68:in `new'
	from ./rhc-watchman:68:in `initialize'
	from ./rhc-watchman:239:in `new'
	from ./rhc-watchman:239:in `<main>'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:58:in `rescue in config_val': /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group (ArgumentError)
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:53:in `config_val'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:27:in `initialize'
	from ./rhc-watchman:68:in `new'
	from ./rhc-watchman:68:in `initialize'
	from ./rhc-watchman:239:in `new'
	from ./rhc-watchman:239:in `<main>'
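
The begin/rescue/end debugging approach described above can be sketched roughly as follows. The helper name report_startup_errors is hypothetical, and the block passed to it stands in for whatever rhc-watchman does at startup. Note that it rescues StandardError only, so a call to exit (which raises SystemExit) still ends the process as usual.

```ruby
# Hypothetical debugging helper, not the actual rhc-watchman code.
# Runs the given block and, if it raises, prints the error and backtrace
# to `out` instead of letting the process die without any output.
def report_startup_errors(out = $stderr)
  yield
rescue StandardError => e
  out.puts "#{e.class}: #{e.message}"
  out.puts Array(e.backtrace).join("\n")
  nil # signal to the caller that startup failed
end
```

Wrapping the startup code in this helper (for example, report_startup_errors { daemon() if daemon }) would print the ArgumentError shown above rather than failing silently.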

Version-Release number of selected component (if applicable):
Current

How reproducible:
Very reproducible.  We have a resource_limits.conf.small that is not the correct version.  With this version watchman cannot find certain values it needs to run.
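
A pre-flight check for missing settings could look something like the sketch below. It assumes the file uses INI-style "[group]" headers and "key=value" lines, which is inferred only from the error message above. The function names parse_groups and missing_keys are hypothetical and are not taken from throttler.rb.

```ruby
# Hypothetical sketch; assumes INI-style "[group]" and "key=value" lines.

# Parses the text into { "group" => { "key" => "value" } }.
def parse_groups(text)
  groups = Hash.new { |hash, name| hash[name] = {} }
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    if (m = line.match(/\A\[(.+)\]\z/))
      current = m[1]
      groups[current] # register the group even if it has no keys
    elsif current && (m = line.match(/\A([^=]+)=(.*)\z/))
      groups[current][m[1].strip] = m[2].strip
    end
  end
  groups
end

# Returns the required keys that are absent from the given group.
def missing_keys(groups, group, required)
  present = groups.key?(group) ? groups[group] : {}
  required.reject { |key| present.key?(key) }
end
```

For example, missing_keys(parse_groups(File.read(path)), 'cg_template_throttled', ['apply_period']) would return ['apply_period'] for a file like the bad one described here, which gives the operator something concrete to report.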

Steps to Reproduce:
1.  Remove apply_period from the resource_limits.conf.small file.
2.  Restart watchman.

Actual results:
Watchman dies silently and will never start.

Expected results:
Watchman should fail gracefully and report the error when run with a -v flag.  Currently, when attempting to debug the problem, there are no error messages in /var/log/messages, and running rhc-watchman directly produces no output.  There is literally no way of knowing what the problem is.

Additional info:

The md5sum of the bad resource_limits.conf.small is e868dfe0e0df12c99fb9621013d53ddc.

The version it should be is 8a1b5299ff3ad09fc43110087d506925.

We will make sure the proper version is in place.

Please add a debug flag to watchman so we can run with a -v or -d to verify that it starts properly.  We rely on watchman to handle idling and watching over the applications.
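
As an illustration of the request, option handling for such flags might look like the sketch below, using Ruby's standard OptionParser. The flag names, their meanings, and the function parse_watchman_options are assumptions for illustration only; they do not describe the options rhc-watchman actually accepts.

```ruby
require 'optparse'

# Hypothetical option parsing; -v and -d are the flags requested above,
# and their exact semantics here are an assumption.
def parse_watchman_options(argv)
  options = { verbose: false, daemon: true }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: rhc-watchman [options]'
    opts.on('-v', '--verbose', 'Print startup errors to stderr') do
      options[:verbose] = true
    end
    opts.on('-d', '--debug', 'Stay in the foreground (implies --verbose)') do
      options[:verbose] = true
      options[:daemon] = false
    end
  end
  parser.parse(argv) # non-destructive: leaves argv untouched
  options
end
```

With something like this in place, running the script with -d would skip the fork, so any exception raised during startup would reach the terminal.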

Comment 1 openshift-github-bot 2013-09-12 00:15:34 UTC
Commits pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/bafa3f582dc4223f2bd31097d64bf075c13fe14d
Bug 1002293 - Protect Watchman from Throttler

https://github.com/openshift/li/commit/7e6cb41955280a8223ac4d8e975101c719d58888
Bug 1002293 - Protect Watchman from Throttler

* add throttler status to Watchman status message

Comment 2 Qiushui Zhang 2013-09-12 09:29:54 UTC
Tested on devenv_3776.

The watchman can restart successfully after modifying "apply_period"

Failures are now reported in /var/log/messages (tailf /var/log/messages):

Sep 12 05:21:27 ip-10-147-175-80 rhc-watchman[20369]: Failed to create Throttler: /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group
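
The behaviour seen in this log line is consistent with guarding the Throttler construction and logging the failure rather than letting it abort startup. The sketch below shows that general pattern only; it is not the code from the commits above, and build_optional and its logger argument are hypothetical names.

```ruby
# Hypothetical sketch of the pattern, not the committed fix.
# Builds an optional component via the block; on failure, logs a message
# through `logger` (any callable taking a string) and returns nil.
def build_optional(name, logger)
  yield
rescue StandardError => e
  logger.call("Failed to create #{name}: #{e.message}")
  nil
end

# In a daemon the logger could forward to syslog, for example:
#   logger = ->(msg) { Syslog.warning('%s', msg) }
```

A caller would then check for nil and keep running without throttling, which matches the tester's observation that Watchman restarts while the failure is recorded in /var/log/messages.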


Marking the bug as verified.

