Bug 991480 - rhc-watchman dying with "User does not exist in cgroups"
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: Fotios Lindiakos
QA Contact: libra bugs
Reported: 2013-08-02 09:48 EDT by Thomas Wiest
Modified: 2015-05-14 19:25 EDT

Doc Type: Bug Fix
Last Closed: 2013-08-07 18:58:58 EDT
Type: Bug

Description Thomas Wiest 2013-08-02 09:48:20 EDT
Description of problem:
rhc-watchman is dying with this exception:

Aug  1 23:10:59 ex-std-node2 rhc-watchman[20785]: watchman caught #<RuntimeError: User does not exist in cgroups: 51fb21c32587c8d8b3000127>: User does not exist in cgroups: 51fb21c32587c8d8b3000127. Retries left: 0

Since watchman is now used to throttle gears, it's extremely important for it to not die.
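To illustrate the failure mode, here is a minimal sketch in Ruby (hypothetical names, not the actual rhc-watchman source): if the gear's cgroup entry has already been removed by the time the throttler looks it up, the lookup raises, and an unhandled exception like this escapes the daemon's main loop and kills the process.

# Hypothetical sketch of the failure mode; cgroup_entries, gears and the
# method names are illustrative, not the real watchman/throttler code.
def throttle_gear(uuid)
  entry = cgroup_entries[uuid]   # gear may have been deleted concurrently
  raise "User does not exist in cgroups: #{uuid}" if entry.nil?
  entry.apply_throttle
end

loop do
  # Any exception raised here propagates out and the daemon exits.
  gears.each { |uuid| throttle_gear(uuid) }
  sleep 20
end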

This commit attempts to fix it, but doesn't seem to:
https://github.com/openshift/origin-server/commit/271ba09e3292c7cecc67e9e01fc3b0ec66079c80


Version-Release number of selected component (if applicable):
rhc-node-1.12.5-1.el6oso.x86_64

How reproducible:
Unsure; we just saw it happening in STG.


Steps to Reproduce:
1. unknown


Actual results:
watchman dies


Expected results:
watchman should not die
Comment 1 Fotios Lindiakos 2013-08-02 16:55:09 EDT
This was caused by a race condition when trying to throttle or retrieve the profile for a gear that had been deleted.

https://github.com/openshift/origin-server/pull/3280

Fixed by handling errors properly in the throttler. Also submitted a PR to the stage branch for immediate inclusion.
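For illustration, a minimal sketch of the kind of error handling described above (hypothetical names, not the actual PR #3280 diff): the failure is rescued per gear and logged in the same form seen in the verification logs below, so the watchman loop keeps running.

# Hypothetical sketch; cgroup_entries and the method names are illustrative.
def restore_gear(uuid)
  entry = cgroup_entries[uuid]
  raise "User does not exist in cgroups: #{uuid}" if entry.nil?
  entry.remove_throttle
rescue RuntimeError => e
  # Log the failure and continue instead of letting it kill the daemon.
  logger.info "Throttler: FAILED restore => #{uuid} (#{e.message})"
end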
Comment 2 openshift-github-bot 2013-08-02 17:23:53 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2a01c0ce2019cef486f26fb39c8a43f4bf232093
Merge pull request #3280 from fotioslindiakos/Bug991480

Merged by openshift-bot
Comment 3 Meng Bo 2013-08-05 03:12:42 EDT
Checked on devenv-stage_437.

Deleted the gear while it was being throttled or restored. The following messages were found in /var/log/messages:

Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: throttle => 279855988575760340221952 (973.23)
Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: over_threshold => 279855988575760340221952 (973.23)
Aug  5 03:05:34 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug  5 03:05:54 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug  5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Throttler: FAILED restore => 279855988575760340221952 (User does not exist in cgroups: 279855988575760340221952)
Aug  5 03:06:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:06:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:08:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10

We can see that if the throttler fails to find the gear, rhc-watchman keeps running.

Moving bug to VERIFIED.
