Description of problem:
When starting libra-watchman, watchman dies silently.

------------------------
[ex-std-node222.prod ~]$ sudo service libra-watchman status
rhc-watchman dead but pid file exists

The pid file exists.
-rw-r--r--. 1 root root 5 Aug 28 15:21 rhc-watchman.pid

When I manually remove the pid file or call restart, the watchman process says it's started:

[ex-std-node222.prod ~]$ sudo service libra-watchman restart
Stopping Watchman Services: [FAILED]
Starting Watchman Services: [ OK ]

The problem is that it is not started. It died silently and created the pid file again.

Upon further investigation, I discovered that the code on or near this line (below) calls daemon() to fork:

daemon() if daemon

The problem here is that there is no exception handling for this block. By surrounding it in a simple begin/rescue/end with a puts of exception.backtrace, I was able to get the real error message:

Syslog.warning('Fork from parent process failed') if (pid = fork) == -1
exit unless pid.nil?
This exit was being called, but since I print the values and continue, it went on to the real problem (line numbers for rhc-watchman won't match perfectly as I have inserted debug statements):

/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:58:in `rescue in config_val': /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group (ArgumentError)
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:53:in `config_val'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:27:in `initialize'
    from ./rhc-watchman:68:in `new'
    from ./rhc-watchman:68:in `initialize'
    from ./rhc-watchman:239:in `new'
    from ./rhc-watchman:239:in `<main>'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:58:in `rescue in config_val': /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group (ArgumentError)
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:53:in `config_val'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.13.12/lib/openshift-origin-node/utils/cgroups/throttler.rb:27:in `initialize'
    from ./rhc-watchman:68:in `new'
    from ./rhc-watchman:68:in `initialize'
    from ./rhc-watchman:239:in `new'
    from ./rhc-watchman:239:in `<main>'

Version-Release number of selected component (if applicable):
Current

How reproducible:
Very reproducible. We have a resource_limits.conf.small that is not the correct version. With this version, watchman cannot find certain values it needs to run.

Steps to Reproduce:
1. Remove apply_period from the resource_limits.conf.small file.
2. Restart watchman.
3.

Actual results:
Watchman dies silently and will never start.

Expected results:
Watchman should die gracefully when run with a -v flag. When attempting to debug or figure out the problem, there are no error messages in /var/log/messages. There is no output when running rhc-watchman. There is literally no way of knowing what the problem is.

Additional info:
The md5sum of the bad resource_limits.conf.small is e868dfe0e0df12c99fb9621013d53ddc. The version it should be is 8a1b5299ff3ad09fc43110087d506925. We will make sure the proper version is in place.

Please add a debug flag to watchman so we can run it with -v or -d to verify that it starts properly. We rely on watchman to handle idling and to watch over the applications.
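
For anyone trying to reproduce the diagnosis, here is a minimal sketch of the begin/rescue/end technique described above. It is not the rhc-watchman code: fake_daemon, fake_throttler and report_and_continue are hypothetical stand-ins invented for illustration. The point it shows is that exit raises SystemExit, which a bare rescue (StandardError) does not catch, so the rescue has to name Exception to see both the exit and the ArgumentError that follows it.

```ruby
# Hypothetical stand-in for the daemonizing step. In the report, the parent
# process runs `exit unless pid.nil?`; here we just call exit directly.
def fake_daemon
  exit # raises SystemExit
end

# Hypothetical stand-in for a constructor that rejects an incomplete config.
def fake_throttler
  raise ArgumentError,
        "/etc/openshift/resource_limits.conf requires 'apply_period' " \
        "in '[cg_template_throttled]' group"
end

# Run the block, print whatever it raises, and keep going.
# Returns the exception that was caught, or nil if nothing was raised.
def report_and_continue(label)
  yield
  nil
rescue Exception => e # deliberately broad: SystemExit is not a StandardError
  puts "#{label}: #{e.class}: #{e.message}"
  puts e.backtrace.join("\n") if e.backtrace
  e
end

caught = [
  report_and_continue('daemon')    { fake_daemon },
  report_and_continue('throttler') { fake_throttler }
]
```

A rescue this broad is only suitable as a temporary debugging aid, since it also swallows the exit that the parent process is supposed to perform after forking.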
Commits pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/bafa3f582dc4223f2bd31097d64bf075c13fe14d
Bug 1002293 - Protect Watchman from Throttler

https://github.com/openshift/li/commit/7e6cb41955280a8223ac4d8e975101c719d58888
Bug 1002293 - Protect Watchman from Throttler
* add throttler status to Watchman status message
Tested on devenv_3776. Watchman can restart successfully after modifying "apply_period".

Failures are now reported (tailf /var/log/messages):

Sep 12 05:21:27 ip-10-147-175-80 rhc-watchman[20369]: Failed to create Throttler: /etc/openshift/resource_limits.conf requires 'apply_period' in '[cg_template_throttled]' group

Marking the bug as verified.
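
As a rough illustration of the behavior verified above, the sketch below shows one way a constructor can survive a Throttler failure and log it instead of dying. This is an assumption about the general shape of the fix, not the code from the linked commits: FakeThrottler, FakeWatchman, the log parameter and the wording of the status string are all hypothetical. Only the "Failed to create Throttler: ..." message format is taken from the log line above. The real service writes to syslog; a lambda is used here so the example stays self-contained.

```ruby
# Hypothetical stand-in for the Throttler: it refuses to start when a
# required setting is missing, as in the backtrace from the report.
class FakeThrottler
  def initialize(config)
    return if config.key?('apply_period')

    raise ArgumentError,
          "/etc/openshift/resource_limits.conf requires 'apply_period' " \
          "in '[cg_template_throttled]' group"
  end
end

# Hypothetical stand-in for Watchman: a Throttler failure is logged and
# remembered, and the rest of the service carries on without throttling.
class FakeWatchman
  attr_reader :throttler, :throttler_error

  # log is any callable taking a message; it defaults to stderr here.
  def initialize(config, log = ->(msg) { warn msg })
    @throttler = FakeThrottler.new(config)
  rescue StandardError => e
    @throttler = nil
    @throttler_error = e.message
    log.call("Failed to create Throttler: #{e.message}")
  end

  # Invented status format, loosely following "add throttler status to
  # Watchman status message" from the commit summary.
  def status
    if @throttler
      'Watchman running, throttler running'
    else
      "Watchman running, throttler disabled: #{@throttler_error}"
    end
  end
end
```

Calling FakeWatchman.new({}) logs the failure and still returns a usable object, while FakeWatchman.new('apply_period' => '120') logs nothing.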