Description of problem:
rhc-watchman is dying with this exception:

Aug 1 23:10:59 ex-std-node2 rhc-watchman[20785]: watchman caught #<RuntimeError: User does not exist in cgroups: 51fb21c32587c8d8b3000127>: User does not exist in cgroups: 51fb21c32587c8d8b3000127. Retries left: 0

Since watchman is now used to throttle gears, it's extremely important for it to not die.

This pull request attempts to fix it, but doesn't seem to:
https://github.com/openshift/origin-server/commit/271ba09e3292c7cecc67e9e01fc3b0ec66079c80

Version-Release number of selected component (if applicable):
rhc-node-1.12.5-1.el6oso.x86_64

How reproducible:
unsure, just saw it happening in STG.

Steps to Reproduce:
1. unknown

Actual results:
watchman dies

Expected results:
watchman should not die
This was caused by a race condition: the throttler tried to throttle, or retrieve the profile for, a gear that had already been deleted.

Fixed by handling these errors properly in the throttler:
https://github.com/openshift/origin-server/pull/3280

Also submitted a PR to the stage branch for immediate inclusion.
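For illustration, a minimal sketch of the kind of per-gear error handling described above. This is not the code from the pull request: FakeCgroups, SketchThrottler and apply are hypothetical names, and the cgroups lookup is replaced by an in-memory stand-in. The only things taken from this bug are the error text and the shape of the "FAILED" log message.

```ruby
# Hypothetical stand-in for the cgroups layer; NOT the origin-server API.
class FakeCgroups
  def initialize(uuids)
    @uuids = uuids
  end

  # Simulates a gear being deleted while watchman is running.
  def delete(uuid)
    @uuids.delete(uuid)
  end

  # Raises the same kind of error seen in this bug when the gear is gone.
  def restore(uuid)
    unless @uuids.include?(uuid)
      raise RuntimeError, "User does not exist in cgroups: #{uuid}"
    end
    :restored
  end
end

# Hypothetical throttler: a failure on one gear is logged and swallowed,
# so the error never propagates up to the caller's main loop.
class SketchThrottler
  attr_reader :log

  def initialize(cgroups)
    @cgroups = cgroups
    @log = []
  end

  # Returns true on success, false if the gear vanished underneath us.
  def apply(action, uuid)
    @cgroups.public_send(action, uuid)
    @log << "Throttler: #{action} => #{uuid}"
    true
  rescue RuntimeError => e
    @log << "Throttler: FAILED #{action} => #{uuid} (#{e.message})"
    false
  end
end

cgroups = FakeCgroups.new(%w[gear-a gear-b])
throttler = SketchThrottler.new(cgroups)

# The race: gear-a is deleted after it was selected but before it is restored.
cgroups.delete('gear-a')
results = %w[gear-a gear-b].map { |uuid| throttler.apply(:restore, uuid) }

puts throttler.log
```

In this sketch the rescue is scoped to a single gear, so a deleted gear costs one log line while the remaining gears are still processed and the exception never reaches the daemon's retry counter.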
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2a01c0ce2019cef486f26fb39c8a43f4bf232093
Merge pull request #3280 from fotioslindiakos/Bug991480

Merged by openshift-bot
Checked on devenv-stage_437.

Deleted the gear while it was being throttled or restored. The following messages were found in /var/log/messages:

Aug 5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: throttle => 279855988575760340221952 (973.23)
Aug 5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: over_threshold => 279855988575760340221952 (973.23)
Aug 5 03:05:34 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug 5 03:05:54 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug 5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Throttler: FAILED restore => 279855988575760340221952 (User does not exist in cgroups: 279855988575760340221952)
Aug 5 03:06:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:06:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:07:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:07:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:07:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug 5 03:08:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10

The log shows that when the throttler fails to find the gear, it records a "FAILED restore" message and rhc-watchman keeps running.

Moving the bug to VERIFIED.