Description of problem:
We recently added functionality that detects gear processes that are not inside their cgroups. We have seen this issue quite regularly when running oo-accept-node. Here is an example:

FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2131
FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2309
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2193
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2196

ps -u af964010ea314db19ff93c989153f648
  PID TTY          TIME CMD
 2193 ?        00:00:13 httpd
 2196 ?        00:00:00 rotatelogs
 2197 ?        00:00:00 rotatelogs
 2232 ?        00:00:00 httpd
 2585 ?        00:00:13 httpd
 2588 ?        00:00:00 rotatelogs
 2589 ?        00:00:00 rotatelogs
 2601 ?        00:00:00 httpd
 2679 ?        00:00:00 standalone.sh
 2899 ?        00:07:01 java
30374 ?        00:11:26 mongod

~]$ cat /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs
~]$

Note that cgroup.procs is empty, so every process in this gear is running outside its cgroup.

I know there was some effort to create a repair option for oo-admin-ctl-cgroups, but we would like to get to the root cause of this issue so that we do not have these accept-node failures. This is currently happening across our production environment.

Version-Release number of selected component (if applicable):
2.0.27.1

How reproducible:
Very. I don't have the exact steps, but these failures occur throughout our production environment.

Steps to Reproduce:
1.
2.
3.

Actual results:
oo-accept-node detects processes that are outside their gears' cgroups.

Expected results:
Processes should stay in their gears' cgroups.

Additional info:
Let's look into the root cause of this problem.
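For reference, the failure above amounts to a PID that belongs to the gear user but does not appear in the gear's cgroup.procs. A rough by-hand equivalent of the check (a sketch, not the actual oo-accept-node implementation), using the gear from the transcript:

# PIDs owned by the gear user
ps -u af964010ea314db19ff93c989153f648 -o pid= | tr -d ' ' | sort > /tmp/ps.pids

# PIDs the kernel has classified into the gear's cgroup
sort /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs > /tmp/cg.pids

# Anything in the first list but not the second is "missing from cgroups"
comm -23 /tmp/ps.pids /tmp/cg.pids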
From previous efforts to debug this issue, we know that cgrulesengd gets stuck in production in ways that are difficult to diagnose and that have not been reproducible on a devenv, where we could more safely pick the system apart. I'm going to see if we can add pam_cgroup to the various methods of starting user tasks (e.g., runuser) so that there's less reliance on cgrulesengd.
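For background, cgrulesengd and pam_cgroup both act on the same rules file, /etc/cgrules.conf. A gear's rule looks roughly like this (a sketch using the gear UUID from the report above; the exact controller list OpenShift writes may differ):

# <user>                           <controllers>                        <destination>
af964010ea314db19ff93c989153f648   cpu,cpuacct,memory,freezer,net_cls   openshift/af964010ea314db19ff93c989153f648

cgrulesengd applies these rules asynchronously by watching for process events, which is exactly the moving part that has been getting stuck in production; pam_cgroup applies the same rules synchronously when the session opens.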
OK: if I add pam_cgroup to the proper pam.d files and completely turn off cgrulesengd, then gear start properly classifies the processes. We will still want cgrulesengd fixed, and there's still a need for a cleanup task to re-classify stray processes.

https://github.com/openshift/li/pull/1590
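The change boils down to adding the libcgroup PAM module's session hook to the pam.d files for the services that spawn gear processes. A minimal sketch, assuming the stock pam_cgroup module (the exact files touched in the PR may differ):

# /etc/pam.d/runuser  (and similarly for other entry points, e.g. sshd)
session    optional    pam_cgroup.so

With this in place, processes are classified against /etc/cgrules.conf at session open, rather than relying on cgrulesengd to move them after the fact.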
Passing to Q/E. To verify:

1. Turn off cgrulesengd: service cgred stop
2. Create an app.
3. Examine the app's processes to ensure they are in the correct cgroup (a scripted version of this check is sketched below).
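Step 3 can be run across every gear on the node at once. A minimal sketch (assumes gear UUIDs are both the usernames and the directory names under /cgroup/all/openshift, as in the transcripts above):

for g in /cgroup/all/openshift/*/; do
    gear=$(basename "$g")
    missing=$(comm -23 <(ps -u "$gear" -o pid= 2>/dev/null | tr -d ' ' | sort) \
                       <(sort "$g/cgroup.procs"))
    [ -n "$missing" ] && echo "$gear has unclassified PIDs: $missing"
done

Empty output means every gear process is classified, which is what step 3 should show even with cgred stopped.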
*** Bug 963895 has been marked as a duplicate of this bug. ***
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/c8b9d58795c66a0e6077d74ae3194a925cf9b70f

Bug 969528 - cgrulesengd has become very flaky, adding this to pam causes processes to be properly classified when oo-spawn launches them.
Checked this issue on devenv_3336:

1. Stop the cgred service on the instance:

service cgred stop

2. Create an app.

3. Run oo-accept-node:

[root@ip-10-154-161-46 openshift]# oo-accept-node
FAIL: service cgred not running
1 ERRORS

4. Check the processes in the cgroup for the gear:

[root@ip-10-154-161-46 openshift]# cat /cgroup/all/openshift/1e07981ed0c911e2970622000a9aa12e/cgroup.procs
13669
13671
13672
13692

5. Check the processes for the gear:

[root@ip-10-154-161-46 openshift]# ps -u 1e07981ed0c911e2970622000a9aa12e
  PID TTY          TIME CMD
13669 ?        00:00:00 httpd
13671 ?        00:00:00 rotatelogs
13672 ?        00:00:00 rotatelogs
13692 ?        00:00:00 httpd

The PIDs in cgroup.procs match the gear's processes, so the new app's processes are classified correctly even with cgred stopped. Moving the bug to verified.