Description of problem:
We recently added functionality that detects gear processes that are not inside their cgroups. We have seen this issue quite regularly when running oo-accept-node. Here is an example:

FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2131
FAIL: 9550c6fe411f45078a57d7ff6def4556 has a process missing from cgroups: 2309
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2193
FAIL: af964010ea314db19ff93c989153f648 has a process missing from cgroups: 2196

ps -u af964010ea314db19ff93c989153f648
  PID TTY          TIME CMD
 2193 ?        00:00:13 httpd
 2196 ?        00:00:00 rotatelogs
 2197 ?        00:00:00 rotatelogs
 2232 ?        00:00:00 httpd
 2585 ?        00:00:13 httpd
 2588 ?        00:00:00 rotatelogs
 2589 ?        00:00:00 rotatelogs
 2601 ?        00:00:00 httpd
 2679 ?        00:00:00 standalone.sh
 2899 ?        00:07:01 java
30374 ?        00:11:26 mongod

~]$ cat /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs
~]$

Note that cgroup.procs is empty, so every process in this gear is running outside its cgroup.

I know there was some effort to create a repair option for oo-admin-ctl-cgroups, but we would like to get to the root cause of this issue so that we do not have these accept-node failures. This is currently happening across our production environment.

Version-Release number of selected component (if applicable):
2.0.27.1

How reproducible:
Very. I don't have the exact steps, but these failures occur throughout our production environment.

Steps to Reproduce:
1.
2.
3.

Actual results:
oo-accept-node detects processes that are outside their gears' cgroups.

Expected results:
Processes should stay in their gears' cgroups.

Additional info:
Let's look into the root cause of this problem.
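For reference, the failure above amounts to a PID that belongs to the gear user but does not appear in the gear's cgroup.procs. A rough by-hand equivalent of the check (a sketch, not the actual oo-accept-node implementation), using the gear from the transcript:

# PIDs owned by the gear user
ps -u af964010ea314db19ff93c989153f648 -o pid= | tr -d ' ' | sort > /tmp/ps.pids

# PIDs the kernel has classified into the gear's cgroup
sort /cgroup/all/openshift/af964010ea314db19ff93c989153f648/cgroup.procs > /tmp/cg.pids

# Anything in the first list but not the second is "missing from cgroups"
comm -23 /tmp/ps.pids /tmp/cg.pids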
From previous efforts to debug this issue, we know that cgrulesengd gets stuck in production in ways that are difficult to diagnose and that have not been reproducible on a devenv, where we could more safely pick the system apart. I'm going to see if we can add pam_cgroup to the various methods of starting user tasks (e.g., runuser) so that there's less reliance on cgrulesengd.
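For background, cgrulesengd and pam_cgroup both act on the same rules file, /etc/cgrules.conf. A gear's rule looks roughly like this (a sketch using the gear UUID from the report above; the exact controller list OpenShift writes may differ):

# <user>                           <controllers>                        <destination>
af964010ea314db19ff93c989153f648   cpu,cpuacct,memory,freezer,net_cls   openshift/af964010ea314db19ff93c989153f648

cgrulesengd applies these rules asynchronously by watching for process events, which is exactly the moving part that has been getting stuck in production; pam_cgroup applies the same rules synchronously when the session opens.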
OK: if I add pam_cgroup to the proper pam.d files and completely turn off cgrulesengd, then gear start properly classifies the processes. We will still want cgrulesengd fixed, and there's still a need for a cleanup task to re-classify stray processes.

https://github.com/openshift/li/pull/1590
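The change boils down to adding the libcgroup PAM module's session hook to the pam.d files for the services that spawn gear processes. A minimal sketch, assuming the stock pam_cgroup module (the exact files touched in the PR may differ):

# /etc/pam.d/runuser  (and similarly for other entry points, e.g. sshd)
session    optional    pam_cgroup.so

With this in place, processes are classified against /etc/cgrules.conf at session open, rather than relying on cgrulesengd to move them after the fact.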
Passing to Q/E. To verify:

1. Turn off cgrulesengd: service cgred stop
2. Create an app.
3. Examine the app's processes to ensure they are in the correct cgroup (a scripted version of this check is sketched below).
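Step 3 can be run across every gear on the node at once. A minimal sketch (assumes gear UUIDs are both the usernames and the directory names under /cgroup/all/openshift, as in the transcripts above):

for g in /cgroup/all/openshift/*/; do
    gear=$(basename "$g")
    missing=$(comm -23 <(ps -u "$gear" -o pid= 2>/dev/null | tr -d ' ' | sort) \
                       <(sort "$g/cgroup.procs"))
    [ -n "$missing" ] && echo "$gear has unclassified PIDs: $missing"
done

Empty output means every gear process is classified, which is what step 3 should show even with cgred stopped.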
*** Bug 963895 has been marked as a duplicate of this bug. ***
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/c8b9d58795c66a0e6077d74ae3194a925cf9b70f

Bug 969528 - cgrulesengd has become very flaky, adding this to pam causes processes to be properly classified when oo-spawn launches them.
Checked this issue on devenv_3336:

1. Stop the cgred service on the instance:

service cgred stop

2. Create an app.

3. Run oo-accept-node:

[root@ip-10-154-161-46 openshift]# oo-accept-node
FAIL: service cgred not running
1 ERRORS

4. Check the processes in the cgroup for the gear:

[root@ip-10-154-161-46 openshift]# cat /cgroup/all/openshift/1e07981ed0c911e2970622000a9aa12e/cgroup.procs
13669
13671
13672
13692

5. Check the processes for the gear:

[root@ip-10-154-161-46 openshift]# ps -u 1e07981ed0c911e2970622000a9aa12e
  PID TTY          TIME CMD
13669 ?        00:00:00 httpd
13671 ?        00:00:00 rotatelogs
13672 ?        00:00:00 rotatelogs
13692 ?        00:00:00 httpd

The PIDs in cgroup.procs match the gear's processes, so the new app's processes are classified correctly even with cgred stopped. Moving the bug to verified.