Bug 991480 - rhc-watchman dying with "User does not exist in cgroups"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Fotios Lindiakos
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-08-02 13:48 UTC by Thomas Wiest
Modified: 2015-05-14 23:25 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-07 22:58:58 UTC
Target Upstream Version:
Embargoed:



Description Thomas Wiest 2013-08-02 13:48:20 UTC
Description of problem:
rhc-watchman is dying with this exception:

Aug  1 23:10:59 ex-std-node2 rhc-watchman[20785]: watchman caught #<RuntimeError: User does not exist in cgroups: 51fb21c32587c8d8b3000127>: User does not exist in cgroups: 51fb21c32587c8d8b3000127. Retries left: 0

Since watchman is now used to throttle gears, it's extremely important for it to not die.

This pull request attempts to fix it, but doesn't seem to:
https://github.com/openshift/origin-server/commit/271ba09e3292c7cecc67e9e01fc3b0ec66079c80


Version-Release number of selected component (if applicable):
rhc-node-1.12.5-1.el6oso.x86_64

How reproducible:
unsure, just saw it happening in STG.


Steps to Reproduce:
1. unknown


Actual results:
watchman dies


Expected results:
watchman should not die

Comment 1 Fotios Lindiakos 2013-08-02 20:55:09 UTC
This was caused by a race condition when trying to throttle or retrieve the profile for a gear that had been deleted.

https://github.com/openshift/origin-server/pull/3280

Fixed by handling errors properly in throttler. Also submitted a PR to the stage branch for immediate inclusion.
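
A minimal sketch of the kind of per-gear error handling described above, not the actual origin-server code. The names MissingCgroupError, FakeCgroups, and throttle_all are hypothetical stand-ins introduced for illustration only. The idea is that a gear deleted between listing and lookup is logged and skipped rather than raising out to the watchman main loop:

```ruby
# Hypothetical error type standing in for "User does not exist in cgroups".
class MissingCgroupError < RuntimeError; end

# Hypothetical in-memory stand-in for the cgroups layer: gear UUID => profile.
class FakeCgroups
  def initialize(profiles)
    @profiles = profiles
  end

  # Raises if the gear no longer exists (e.g. it was deleted concurrently).
  def profile(uuid)
    @profiles.fetch(uuid) do
      raise MissingCgroupError, "User does not exist in cgroups: #{uuid}"
    end
  end

  # Applies a new profile, failing the same way if the gear is gone.
  def apply(uuid, new_profile)
    profile(uuid)
    @profiles[uuid] = new_profile
  end
end

# Apply new_profile to every gear. A missing gear is logged and skipped,
# so one deleted gear cannot take the whole daemon down.
# Returns the list of UUIDs that were actually updated.
def throttle_all(cgroups, uuids, new_profile, log)
  uuids.each_with_object([]) do |uuid, done|
    begin
      cgroups.apply(uuid, new_profile)
      done << uuid
    rescue MissingCgroupError => e
      log << "Throttler: FAILED throttle => #{uuid} (#{e.message})"
    end
  end
end

# "gear-b" plays the role of a gear deleted after the UUID list was built.
cgroups = FakeCgroups.new({ "gear-a" => :default, "gear-c" => :default })
log = []
throttled = throttle_all(cgroups, %w[gear-a gear-b gear-c], :throttled, log)
```

Rescuing inside the per-gear loop, rather than around the whole pass, means the remaining gears are still processed and the failure does not count against the daemon-level retry limit.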

Comment 2 openshift-github-bot 2013-08-02 21:23:53 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2a01c0ce2019cef486f26fb39c8a43f4bf232093
Merge pull request #3280 from fotioslindiakos/Bug991480

Merged by openshift-bot

Comment 3 Meng Bo 2013-08-05 07:12:42 UTC
Checked on devenv-stage_437.

Deleted the gear while it was being throttled or restored. The following messages were found in /var/log/messages:

Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: throttle => 279855988575760340221952 (973.23)
Aug  5 03:05:29 ip-10-164-76-49 rhc-watchman[1928]: Throttler: over_threshold => 279855988575760340221952 (973.23)
Aug  5 03:05:34 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug  5 03:05:54 ip-10-164-76-49 CGRE[1011]: Reloading rules configuration
Aug  5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:06:09 ip-10-164-76-49 rhc-watchman[1928]: Throttler: FAILED restore => 279855988575760340221952 (User does not exist in cgroups: 279855988575760340221952)
Aug  5 03:06:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:06:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:29 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:07:49 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10
Aug  5 03:08:09 ip-10-164-76-49 rhc-watchman[1928]: Running rhc-watchman => delay: 20s, exception threshold: 10

This shows that when the throttler fails to find the gear, rhc-watchman keeps running.

Moving bug to VERIFIED.

