Bug 969528 - [oo-accept-node] Processes not in cgroups
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 1.x
Hardware: x86_64 Linux
Priority: high
Severity: high
Assigned To: Rob Millner
QA Contact: libra bugs
Duplicates: 963895
Reported: 2013-05-31 13:33 EDT by Kenny Woodson
Modified: 2013-11-17 19:48 EST

Doc Type: Bug Fix
Last Closed: 2013-06-11 00:15:04 EDT
Type: Bug

Description Kenny Woodson 2013-05-31 13:33:55 EDT
Description of problem:

We recently added a check that detects gear processes running outside of their cgroups.  We now see this failure quite regularly when we run oo-accept-node.  Here is an example:

FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups:  2131
FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups:  2309
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups:  2193
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups:  2196

 ps -u af964010ea314db19ff93c989153f648
  PID TTY          TIME CMD
 2193 ?        00:00:13 httpd
 2196 ?        00:00:00 rotatelogs
 2197 ?        00:00:00 rotatelogs
 2232 ?        00:00:00 httpd
 2585 ?        00:00:13 httpd
 2588 ?        00:00:00 rotatelogs
 2589 ?        00:00:00 rotatelogs
 2601 ?        00:00:00 httpd
 2679 ?        00:00:00 standalone.sh
 2899 ?        00:07:01 java
30374 ?        00:11:26 mongod

~]$ cat /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs 
~]$ 
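
The comparison above (PIDs reported by ps versus the contents of cgroup.procs) can be sketched as a small shell helper. This is only an illustration, not the check that oo-accept-node itself runs; the function name missing_from_cgroup and its two file arguments are hypothetical.

```shell
# Hypothetical helper: print every PID found in a ps listing that is
# absent from a cgroup.procs file.
#   $1: file with one PID per line (leading spaces are tolerated)
#   $2: the gear's cgroup.procs file (may be empty, as in this report)
missing_from_cgroup() {
    ps_pids=$1
    cgroup_procs=$2
    # Comparing FILENAME against ARGV[1] (rather than the NR==FNR idiom)
    # keeps the logic correct when cgroup.procs is empty.
    awk '
        FILENAME == ARGV[1] { if (NF) in_cgroup[$1] = 1; next }
        NF && !($1 in in_cgroup) { print $1 }
    ' "$cgroup_procs" "$ps_pids"
}

# Possible usage on a node (paths taken from this report):
#   ps -u "$uuid" -o pid= > /tmp/gear.pids
#   missing_from_cgroup /tmp/gear.pids "/cgroup/all/openshift/$uuid/cgroup.procs"
```

With an empty cgroup.procs, as shown above, every PID of the gear would be printed.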


I know there was some effort to create a repair option for oo-admin-ctl-cgroups, but we would like to get to the root cause of this issue so that we no longer see these oo-accept-node failures.

This is happening across our production environment currently.

Version-Release number of selected component (if applicable):

2.0.27.1

How reproducible:

Very.  I don't have the exact steps, but these failures are occurring in our production environment.

Steps to Reproduce:
1.
2.
3.

Actual results:

oo-accept-node detects processes that are out of their cgroups.

Expected results:

We would like these processes to stay in their cgroups.  

Additional info:

Let's look into the root cause of this problem.
Comment 1 Rob Millner 2013-06-07 15:21:22 EDT
Based on previous efforts to debug this issue, cgrulesengd gets stuck in production in ways that are difficult to diagnose, and the problem has not been reproducible on a devenv, where we can more safely pick the system apart.

I'm going to see if we can add pam_cgroup to the various methods of starting user tasks (e.g. runuser) so that there's less reliance on cgrulesengd.
Comment 2 Rob Millner 2013-06-07 17:02:31 EDT
Ok, if I add pam_cgroup to the proper pam.d files and completely turn off cgrulesengd, then gear start properly classifies the processes.

We will still want cgrulesengd fixed and there's still a need for a cleanup task to re-classify processes.
https://github.com/openshift/li/pull/1590
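
For illustration only, adding pam_cgroup to a PAM service file might look like the sketch below. The exact files and lines changed in the pull request are not reproduced here; the helper name ensure_pam_cgroup is hypothetical, the file name /etc/pam.d/runuser is an assumption based on the runuser example in comment 1, and the "session optional pam_cgroup.so" line follows the form given in the libcgroup documentation.

```shell
# Hypothetical helper: append a pam_cgroup session entry to a PAM service
# file unless one is already present (safe to run more than once).
#   $1: path to the PAM service file, e.g. /etc/pam.d/runuser (assumed)
ensure_pam_cgroup() {
    pam_file=$1
    # Skip the append if a session line already references pam_cgroup.so.
    if grep -Eq '^[[:space:]]*session[[:space:]].*pam_cgroup\.so' "$pam_file"; then
        return 0
    fi
    printf 'session\t\toptional\tpam_cgroup.so\n' >> "$pam_file"
}

# Possible usage (as root, on a test node):
#   ensure_pam_cgroup /etc/pam.d/runuser
```

pam_cgroup classifies the session's processes according to /etc/cgrules.conf, so the rules that cgrulesengd would have applied still need to be in place.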
Comment 3 Rob Millner 2013-06-07 18:52:47 EDT
Passing to Q/E:

To verify:

1. Turn off cgrulesengd:
  service cgred stop

2. Create an app
3. Examine the app processes to ensure they are in the correct cgroup.
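
Step 3 can also be checked per process through /proc/<pid>/cgroup, whose lines have the form hierarchy-ID:controller-list:cgroup-path. The sketch below is an assumption about how one might script that check; the function name pid_in_gear_cgroup is hypothetical, and the expected path /openshift/<gear uuid> is inferred from the /cgroup/all/openshift/<gear uuid> mount layout shown in this bug.

```shell
# Hypothetical helper: succeed if at least one hierarchy line of a
# /proc/<pid>/cgroup-style file places the process in the expected cgroup.
#   $1: file to inspect, e.g. /proc/13669/cgroup
#   $2: expected cgroup path, e.g. /openshift/<gear uuid>
pid_in_gear_cgroup() {
    cgroup_file=$1
    expected=$2
    # Field 3 is the cgroup path relative to the hierarchy's mount point.
    awk -F: -v want="$expected" '
        $3 == want { found = 1 }
        END { exit !found }
    ' "$cgroup_file"
}

# Possible usage for every process of a gear:
#   for pid in $(ps -u "$uuid" -o pid=); do
#       pid_in_gear_cgroup "/proc/$pid/cgroup" "/openshift/$uuid" \
#           || echo "PID $pid is not in /openshift/$uuid"
#   done
```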
Comment 4 Rob Millner 2013-06-07 18:54:28 EDT
*** Bug 963895 has been marked as a duplicate of this bug. ***
Comment 5 openshift-github-bot 2013-06-07 20:53:03 EDT
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/c8b9d58795c66a0e6077d74ae3194a925cf9b70f
Bug 969528 - cgrulesengd has become very flaky, adding this to pam causes processes to be properly classified when oo-spawn launches them.
Comment 6 Meng Bo 2013-06-08 04:49:19 EDT
Checked the issue on devenv_3336:

1. Stop the cgred service on the instance
2. Create app
3. Run oo-accept-node
[root@ip-10-154-161-46 openshift]# oo-accept-node 
FAIL: service cgred not running
1 ERRORS
4. Check the processes in the cgroup for the gear
[root@ip-10-154-161-46 openshift]# cat /cgroup/all/openshift/1e07981ed0c911e2970622000a9aa12e/cgroup.procs 
13669
13671
13672
13692

5. Check the processes for the gear
[root@ip-10-154-161-46 openshift]# ps -u 1e07981ed0c911e2970622000a9aa12e
  PID TTY          TIME CMD
13669 ?        00:00:00 httpd
13671 ?        00:00:00 rotatelogs
13672 ?        00:00:00 rotatelogs
13692 ?        00:00:00 httpd

All gear processes are in the gear's cgroup. Moving the bug to verified.
