Bug 1001151 - [oo-accept-node] Fails to check if users are properly in cgroups
Summary: [oo-accept-node] Fails to check if users are properly in cgroups
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 1.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jason DeTiberus
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 1000174
Blocks:
 
Reported: 2013-08-26 14:49 UTC by Jason DeTiberus
Modified: 2017-03-08 17:35 UTC
CC List: 8 users

Fixed In Version: openshift-origin-node-util-1.9.9.4-1.el6op
Doc Type: Bug Fix
Doc Text:
Due to bugs in the oo-accept-node script, processes that were running outside of the Cgroup environment were not detected correctly. This issue has been fixed in the current release of OpenShift Enterprise.
Clone Of: 1000174
Environment:
Last Closed: 2013-09-25 15:30:40 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2013:1275
Private: 0   Priority: normal   Status: SHIPPED_LIVE
Summary: OpenShift Enterprise 1.2.3 bug fix and enhancement update
Last Updated: 2013-09-25 19:26:23 UTC

Description Jason DeTiberus 2013-08-26 14:49:21 UTC
+++ This bug was initially created as a clone of Bug #1000174 +++

Description of problem:

While doing a little research into checking processes' memory, I stumbled upon a process that was _not_ being confined in cgroups, yet oo-accept-node did not report it.

After some investigation I found a definite bug in oo-accept-node that prevents it from correctly detecting whether processes are in their cgroups.

Version-Release number of selected component (if applicable):
Current

How reproducible:
Easy

Steps to Reproduce:
1. Start a process.
2. Remove the gear's PIDs from /cgroup/all/openshift/UUID/cgroup.procs
3. Run oo-accept-node -v

Actual results:
The check completely fails to detect processes that are missing from the cgroup.procs file but are still running in the ps table.

Expected results:
Properly detect processes that are running but not in cgroups.

Additional info:

There are a few problems here.

The ENV['GEAR_MIN_UID'] is '500'.
-This is a string.
-This is _also_ incorrect, as the minimum gear UID should be 1000.  There should never be a gear UID less than 1000.

The ENV['GEAR_MAX_UID'] is '6500'.
-This is a string.
-This is _also_ incorrect, as it does not match the default district value for the maximum gear UID.

FIX:
min_uid = ENV['GEAR_MIN_UID'].to_i
max_uid = ENV['GEAR_MAX_UID'].to_i
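
A minimal standalone illustration of why the .to_i conversion matters (the values below are stand-ins, not the real node.conf settings): Ruby compares strings lexicographically, so a range check on the raw ENV strings can give the wrong answer.

    # Hypothetical values; in oo-accept-node these come from node.conf via ENV.
    ENV['GEAR_MIN_UID'] ||= '1000'
    ENV['GEAR_MAX_UID'] ||= '6999'

    puts "999" > "6500"          # => true  -- lexicographic string comparison

    min_uid = ENV['GEAR_MIN_UID'].to_i
    max_uid = ENV['GEAR_MAX_UID'].to_i
    in_range = (min_uid..max_uid).cover?(999)
    puts in_range                # => false -- numeric comparison, as intended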



uid and pid were strings.
Fix:
    all_user_procs.each do |line|
        uid, pid = line.split
        uid = uid.to_i   # convert both fields from strings so numeric comparisons work
        pid = pid.to_i
        # ... existing per-process checks ...
    end

Let's also keep in mind that some of our nodes have 3000+ users on them, and we need this script to achieve decent performance.
It would be nice if $USERS were a hash (see the sketch below):
$USERS['uuid'] = #old user data

passwd_lines = $USERS.select { |u| u.uid == uid }
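
A quick sketch of that suggestion (the User struct and sample data are stand-ins; the real $USERS structure in oo-accept-node may differ). It is keyed by UID here to serve the lookup shown above, but keying by UUID works the same way: the point is one hash lookup per process instead of scanning 3000+ users every time.

    # Stand-in for however $USERS is currently populated.
    User = Struct.new(:uuid, :uid)
    users = [User.new('521b31aaddde1c0acd000003', 1001),
             User.new('52302ac5aeb9055fdd000006', 1002)]

    # Build the index once...
    users_by_uid = users.each_with_object({}) { |u, h| h[u.uid] = u }

    # ...then each "uid pid" line needs a single O(1) lookup instead of a select scan.
    gear_user = users_by_uid[1001]
    puts gear_user.uuid   # => 521b31aaddde1c0acd000003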

--- Additional comment from Jason DeTiberus on 2013-08-23 11:24:41 EDT ---

https://github.com/openshift/origin-server/pull/3483

--- Additional comment from openshift-github-bot on 2013-08-23 18:59:48 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2003bc01f12cccd54b9e61390e8ea3931f889a2c
<oo-accept-node> Bug 1000174 - oo-accept-node fixes

https://bugzilla.redhat.com/show_bug.cgi?id=1000174

In check_cgroups_procs: Convert uid string values to integer before
comparisons, test for all defined cgroups controllers (not just all or
memory)

Remove unnecessary call to $USERS.dup

Fix an issue where 3 digit uids would not be verified in
check_cgroups_procs (this is the case for a non-district node with the
default node.conf)

Update default node.conf values to match the default district values for
min/max uids
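
For reference, a rough sketch of the per-controller check the commit describes. The controller names come from the oo-accept-node output later in this bug, while the /cgroup mount layout, function names, and structure are illustrative assumptions, not the actual oo-accept-node code.

    CONTROLLERS = %w[cpu cpuacct memory net_cls freezer]

    # PIDs owned by the gear's UID, taken from the process table.
    def gear_pids(uid)
      `ps -o pid= -u #{uid}`.split.map(&:to_i)
    end

    # PIDs listed in the controller's cgroup.procs file for this gear.
    def confined_pids(uuid, controller)
      path = "/cgroup/#{controller}/openshift/#{uuid}/cgroup.procs"
      File.exist?(path) ? File.readlines(path).map(&:to_i) : []
    end

    # Report any running PID that is missing from any controller's cgroup.procs.
    def check_gear(uuid, uid)
      running = gear_pids(uid)
      CONTROLLERS.each do |ctl|
        (running - confined_pids(uuid, ctl)).each do |pid|
          puts "FAIL: #{uuid} has a process missing from cgroups: #{pid} cgroups controller: #{ctl}"
        end
      end
    end

    # check_gear('521b31aaddde1c0acd000003', 1001)   # example invocation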

--- Additional comment from Hou Jianwei on 2013-08-26 05:58:39 EDT ---

Tested on devenv-stage_457
The node.conf file has not been updated in this environment, as I still see:

GEAR_MIN_UID=500                                             # Lower bound of UID used to create gears
GEAR_MAX_UID=6500


In my test, I found that /cgroup/all/openshift/UUID/cgroup.procs is not writable; whenever I tried to update the file, the write was rejected. Is there any way to achieve step 2 in the bug description?

Please also help move the bug to ON_QA, thanks!

--- Additional comment from xiaoli on 2013-08-26 06:55:21 EDT ---

Tested it on devenv-stage_457. After stopping and starting the cgconfig service, the existing processes are removed from the cgroup.procs files:

[root@ip-10-40-54-111 ~]# service cgconfig stop
Stopping cgconfig service:                                 [  OK  ]
[root@ip-10-40-54-111 ~]# service cgconfig start
Starting cgconfig service:                                 [  OK  ]
[root@ip-10-40-54-111 ~]# cat  /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs 
[root@ip-10-40-54-111 ~]# 

If a process is missing from /cgroup/all/openshift/UUID/cgroup.procs but still exists in the ps table, oo-accept-node will report the error:

[root@ip-10-40-54-111 ~]# oo-accept-node 
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16951 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16952 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16953 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16967 cgroups controller: all

After running the following command, the cgroup configuration comes back to normal:
[root@ip-10-40-54-111 ~]# oo-cgroup-enable --with-all-containers
[root@ip-10-40-54-111 ~]# cat  /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs
16951
16952
16953
16967
[root@ip-10-40-54-111 ~]# oo-accept-node 
PASS

The only remaining issue in this bug is that the updated max and min gear UIDs are not built into the devenv-stage image; not sure why.

[root@ip-10-40-54-111 ~]# cat /etc/openshift/node.conf|grep GEAR_M
GEAR_MIN_UID=500                      # Lower bound of UID used to create gears
GEAR_MAX_UID=6500                     # Upper bound of UID used to create gears

The package version is
rubygem-openshift-origin-node-1.13.12-1.el6oso.noarch

--- Additional comment from Jason DeTiberus on 2013-08-26 09:23:16 EDT ---

Step 2 can also be replicated by using cgclassify: 'cgclassify -g cpu,cpuacct,memory,net_cls,freezer:/ <pidlist>'

The node.conf file is listed as noreplace in the spec file, so it will not be updated by just updating the RPMs.  

Also, the devenv RPM copies node.conf.libra (in the li repo) to node.conf, submitted PR: https://github.com/openshift/li/pull/1857 to address this.

For the other environments, Ops will need to make any changes needed to the config files that already exist in production.

Comment 2 Jason DeTiberus 2013-09-06 17:43:02 UTC
https://github.com/openshift/enterprise-server/pull/131

Comment 3 Jason DeTiberus 2013-09-06 20:04:31 UTC
Missed some origin/enterprise differences in the first go round: https://github.com/openshift/enterprise-server/pull/132

Comment 4 Gaoyun Pei 2013-09-11 11:30:24 UTC
Verified this bug on puddle: 1.2/2013-09-10.2

Steps:
1. Create an app
2. Remove the pid list in /cgroup/all/openshift/UUID/cgroup.procs
cgclassify -g cpu,cpuacct,memory,net_cls,freezer:/ $(</cgroup/memory/openshift/52302ac5aeb9055fdd000006/cgroup.procs)
3. Run "oo-accept-node"
[root@node2 ~]# oo-accept-node -v
INFO: using default accept-node extensions
INFO: loading node configuration file /etc/openshift/node.conf
INFO: loading resource limit file /etc/openshift/resource_limits.conf
INFO: checking node public hostname resolution
INFO: checking selinux status
INFO: checking selinux openshift-origin policy
INFO: checking selinux booleans
INFO: checking package list
INFO: checking services
INFO: checking kernel semaphores >= 512
INFO: checking cgroups configuration
INFO: checking cgroups processes
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18456 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18457 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18458 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18459 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18460 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18461 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18456 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18457 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18458 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18459 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18460 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18461 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18456 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18457 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18458 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18459 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18460 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18461 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18456 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18457 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18458 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18459 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18460 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18461 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18456 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18457 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18458 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18459 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18460 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups:  18461 cgroups controller: cpuacct
INFO: checking filesystem quotas
INFO: checking quota db file selinux label
INFO: checking 3 user accounts
INFO: checking application dirs
INFO: checking system httpd configs
30 ERRORS

Comment 7 errata-xmlrpc 2013-09-25 15:30:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1275.html

