Red Hat Bugzilla – Bug 969528
[oo-accept-node] Processes not in cgroups
Last modified: 2013-11-17 19:48:05 EST
Description of problem:
We recently added functionality that detects gear processes that are not inside their cgroups. We have seen this issue quite regularly when running oo-accept-node. Here is an example:
FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2131
FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2309
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2193
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2196
ps -u af964010ea314db19ff93c989153f648
PID TTY TIME CMD
2193 ? 00:00:13 httpd
2196 ? 00:00:00 rotatelogs
2197 ? 00:00:00 rotatelogs
2232 ? 00:00:00 httpd
2585 ? 00:00:13 httpd
2588 ? 00:00:00 rotatelogs
2589 ? 00:00:00 rotatelogs
2601 ? 00:00:00 httpd
2679 ? 00:00:00 standalone.sh
2899 ? 00:07:01 java
30374 ? 00:11:26 mongod
~]$ cat /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs
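The check oo-accept-node performs can be reproduced by hand: compare the gear's process list against its cgroup.procs file. A minimal sketch of that comparison (the helper name and the parameterized procs-file path are ours; on a live node the file is /cgroup/all/openshift/&lt;gear-uuid&gt;/cgroup.procs and the PID list comes from `ps -u <gear-uuid>`):

```shell
# check_gear_cgroup PROCS_FILE PID... -- print any PID that is not
# listed in the given cgroup.procs file (i.e. missing from the cgroup).
check_gear_cgroup() {
  procs_file=$1; shift
  for pid in "$@"; do
    grep -qx "$pid" "$procs_file" || echo "missing from cgroup: $pid"
  done
}

# On a node, something like:
#   check_gear_cgroup "/cgroup/all/openshift/$GEAR/cgroup.procs" \
#       $(ps -u "$GEAR" -o pid=)
```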
I know there was some effort to create a repair option for oo-admin-ctl-cgroups, but we would like to get to the root cause of this issue so that we do not have these accept-node failures.
This is currently happening across our production environment.
Version-Release number of selected component (if applicable):
How reproducible:
Very. I don't have the exact steps, but our production environment has these occurring.
Steps to Reproduce:
Exact steps unknown; observed across production.
Actual results:
oo-accept-node detects processes that are out of their cgroups.
Expected results:
Gear processes stay in their cgroups.
Additional info:
Let's look into the root cause of this problem.
From previous efforts to debug this issue: cgrulesengd gets stuck in prod in ways that are difficult to diagnose and that have not been reproducible on a devenv, where we can more safely pick the system apart.
I'm going to see if we can add pam_cgroup to the various methods of starting user tasks (ex: runuser) so that there's less reliance on cgrulesengd.
Ok, if I add pam_cgroup to the proper pam.d files and completely turn off cgrulesengd, then gear start properly classifies the processes.
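For reference, the change amounts to adding the pam_cgroup session module to the relevant pam.d service files. A minimal sketch, assuming the stock pam_cgroup.so shipped with libcgroup (which files to edit — e.g. runuser, su, sshd — depends on how gear processes are launched on the node):

```
# /etc/pam.d/runuser (illustrative excerpt; not the full file)
# Classify the session's processes per /etc/cgrules.conf at session
# start, instead of relying on cgrulesengd to move them afterwards.
session    optional    pam_cgroup.so
```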
We will still want cgrulesengd fixed and there's still a need for a cleanup task to re-classify processes.
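Such a cleanup task could re-classify a stray process by writing its PID into the gear's cgroup.procs file, which under cgroup v1 migrates the task into that cgroup. A sketch under that assumption (the function name is ours, and the procs-file path is parameterized so it can be exercised against a plain file; on a node it would be /cgroup/all/openshift/&lt;gear-uuid&gt;/cgroup.procs):

```shell
# reclassify PROCS_FILE PID -- move a task into a cgroup by writing its
# PID to the cgroup's cgroup.procs file (cgroup v1 semantics).
reclassify() {
  procs_file=$1
  pid=$2
  echo "$pid" > "$procs_file"
}
```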
Passing to Q/E:
1. Turn off cgrulesengd:
service cgred stop
2. Create an app
3. Examine the app processes to ensure they are in the correct cgroup.
*** Bug 963895 has been marked as a duplicate of this bug. ***
Commit pushed to master at https://github.com/openshift/li
Bug 969528 - cgrulesengd has become very flaky, adding this to pam causes processes to be properly classified when oo-spawn launches them.
Verified the issue on devenv_3336:
1. Stop the cgred service on the instance
2. Create app
3. Run oo-accept-node
[root@ip-10-154-161-46 openshift]# oo-accept-node
FAIL: service cgred not running
4. Check the process in cgroup for the gear
[root@ip-10-154-161-46 openshift]# cat /cgroup/all/openshift/1e07981ed0c911e2970622000a9aa12e/cgroup.procs
5. Check process for the gear
[root@ip-10-154-161-46 openshift]# ps -u 1e07981ed0c911e2970622000a9aa12e
PID TTY TIME CMD
13669 ? 00:00:00 httpd
13671 ? 00:00:00 rotatelogs
13672 ? 00:00:00 rotatelogs
13692 ? 00:00:00 httpd
Moving bug to verified.