+++ This bug was initially created as a clone of Bug #1000174 +++

Description of problem:
While doing a little research into checking process memory, I stumbled upon a process that was _not_ being confined in cgroups, but oo-accept-node did not report on this. After some investigation I found a definite bug in oo-accept-node: it does not detect cgroup processes correctly.

Version-Release number of selected component (if applicable):
Current

How reproducible:
Easy

Steps to Reproduce:
1. Start a process.
2. Remove the pids from /cgroup/all/openshift/UUID/cgroup.procs
3. Run oo-accept-node -v

Actual results:
The check completely fails to detect when processes are missing from the cgroup.procs file but are running in the ps table.

Expected results:
Properly detect processes that are running but not in cgroups.

Additional info:
There are a few problems here.

ENV['GEAR_MIN_UID'] is '500'.
- This is a string.
- This is _also_ incorrect, as the minimum gear UID should be 1000. There should never be a gear UID less than 1000.

ENV['GEAR_MAX_UID'] is '6500'.
- This is a string.

FIX:

  min_uid = ENV['GEAR_MIN_UID'].to_i
  max_uid = ENV['GEAR_MAX_UID'].to_i

uid and pid were also strings. Fix:

  all_user_procs.each do |line|
    uid, pid = line.split
    uid = uid.to_i
    pid = pid.to_i
    # ... rest of the per-process checks ...
  end

Let's also keep in mind that some of our nodes have 3000+ users on them, and we need this script to achieve decent performance.
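A minimal sketch of the corrected parsing, using hypothetical sample data (in the real script the bounds come from /etc/openshift/node.conf via ENV, and the "uid pid" lines come from listing user processes):

```ruby
# Hypothetical stand-ins for the real inputs.
ENV['GEAR_MIN_UID'] = '1000'
ENV['GEAR_MAX_UID'] = '7000'

# Convert the string config values to integers before any comparison.
min_uid = ENV['GEAR_MIN_UID'].to_i
max_uid = ENV['GEAR_MAX_UID'].to_i

# Sample "uid pid" lines, as something like `ps -e -o uid,pid` would produce.
all_user_procs = ['1001 2345', '6499 2346', '500 2347']

gear_pids = []
all_user_procs.each do |line|
  uid, pid = line.split
  uid = uid.to_i   # without .to_i these are string comparisons
  pid = pid.to_i
  gear_pids << pid if uid.between?(min_uid, max_uid)
end
```

With the sample data above, only the first two pids fall in the gear UID range; the uid-500 process is correctly excluded once the bounds are integers.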
Would be nice if $USERS was a hash:

  $USERS['uuid'] = # old user data
  passwd_lines = $USERS.select { |u| u.uid == uid }

--- Additional comment from Jason DeTiberus on 2013-08-23 11:24:41 EDT ---

https://github.com/openshift/origin-server/pull/3483

--- Additional comment from openshift-github-bot on 2013-08-23 18:59:48 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server
https://github.com/openshift/origin-server/commit/2003bc01f12cccd54b9e61390e8ea3931f889a2c

<oo-accept-node> Bug 1000174 - oo-accept-node fixes
https://bugzilla.redhat.com/show_bug.cgi?id=1000174

In check_cgroups_procs: convert uid string values to integers before comparisons; test for all defined cgroups controllers (not just all or memory).
Remove unnecessary call to $USERS.dup.
Fix an issue where 3-digit uids would not be verified in check_cgroups_procs (this is the case for a non-district node with the default node.conf).
Update default node.conf values to match the default district values for min/max uids.

--- Additional comment from Hou Jianwei on 2013-08-26 05:58:39 EDT ---

Tested on devenv-stage_457.

The node.conf file is not merged in the env, for I still get:

GEAR_MIN_UID=500 # Lower bound of UID used to create gears
GEAR_MAX_UID=6500

In my test, I found that /cgroup/all/openshift/UUID/cgroup.procs is not writable; whenever I tried to update the file, I got rejected. Is there any way to achieve step 2 in the bug description? Please also help to move the bug to ON_QA, thanks!
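The hash-based $USERS lookup suggested earlier could look roughly like this; the record fields and data are illustrative, not the script's actual structure. Building the hash once turns each per-process lookup from a linear scan into constant time, which matters on nodes with 3000+ users:

```ruby
# Illustrative user record; the real script parses passwd entries.
User = Struct.new(:uuid, :uid, :home)

users_list = [
  User.new('aaa111', 1001, '/var/lib/openshift/aaa111'),
  User.new('bbb222', 1002, '/var/lib/openshift/bbb222'),
]

# Linear scan: O(n) per lookup, repeated once per process checked.
by_select = users_list.select { |u| u.uid == 1002 }.first

# Hash keyed by uid: one O(n) build pass, then O(1) per lookup.
users_by_uid = users_list.each_with_object({}) { |u, h| h[u.uid] = u }
by_hash = users_by_uid[1002]
```

Both lookups return the same record; only the cost per lookup changes.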
--- Additional comment from xiaoli on 2013-08-26 06:55:21 EDT ---

Tested on devenv-stage_457. After stopping and starting the cgconfig service, the existing processes are removed from the cgroup.procs files:

[root@ip-10-40-54-111 ~]# service cgconfig stop
Stopping cgconfig service:  [  OK  ]
[root@ip-10-40-54-111 ~]# service cgconfig start
Starting cgconfig service:  [  OK  ]
[root@ip-10-40-54-111 ~]# cat /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs
[root@ip-10-40-54-111 ~]#

If a process is missing from /cgroup/all/openshift/UUID/cgroup.procs but present in the ps table, oo-accept-node reports the error:

[root@ip-10-40-54-111 ~]# oo-accept-node
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16951 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16952 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16953 cgroups controller: all
FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16967 cgroups controller: all

After running the following commands, the cgroup configuration comes back to normal:

[root@ip-10-40-54-111 ~]# oo-cgroup-enable --with-all-containers
[root@ip-10-40-54-111 ~]# cat /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs
16951
16952
16953
16967
[root@ip-10-40-54-111 ~]# oo-accept-node
PASS

The only remaining issue in this bug is that the max and min gear UIDs are not built into the devenv-stage image; not sure why.
[root@ip-10-40-54-111 ~]# cat /etc/openshift/node.conf | grep GEAR_M
GEAR_MIN_UID=500 # Lower bound of UID used to create gears
GEAR_MAX_UID=6500 # Upper bound of UID used to create gears

The package version is rubygem-openshift-origin-node-1.13.12-1.el6oso.noarch

--- Additional comment from Jason DeTiberus on 2013-08-26 09:23:16 EDT ---

Step 2 can also be replicated by using cgclassify:

cgclassify -g cpu,cpuacct,memory,net_cls,freezer:/ <pidlist>

The node.conf file is listed as noreplace in the spec file, so it will not be updated by just updating the RPMs. Also, the devenv RPM copies node.conf.libra (in the li repo) to node.conf; submitted PR https://github.com/openshift/li/pull/1857 to address this. For the other environments, Ops will need to make any changes needed to the config files that already exist in production.
https://github.com/openshift/enterprise-server/pull/131
Missed some origin/enterprise differences in the first go-round: https://github.com/openshift/enterprise-server/pull/132
Verified this bug on puddle: 1.2/2013-09-10.2

Steps:
1. Create an app
2. Remove the pid list from /cgroup/all/openshift/UUID/cgroup.procs:
   cgclassify -g cpu,cpuacct,memory,net_cls,freezer:/ $(</cgroup/memory/openshift/52302ac5aeb9055fdd000006/cgroup.procs)
3. Run "oo-accept-node"

[root@node2 ~]# oo-accept-node -v
INFO: using default accept-node extensions
INFO: loading node configuration file /etc/openshift/node.conf
INFO: loading resource limit file /etc/openshift/resource_limits.conf
INFO: checking node public hostname resolution
INFO: checking selinux status
INFO: checking selinux openshift-origin policy
INFO: checking selinux booleans
INFO: checking package list
INFO: checking services
INFO: checking kernel semaphores >= 512
INFO: checking cgroups configuration
INFO: checking cgroups processes
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18456 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18457 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18458 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18459 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18460 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18461 cgroups controller: memory
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18456 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18457 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18458 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18459 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18460 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18461 cgroups controller: cpu
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18456 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18457 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18458 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18459 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18460 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18461 cgroups controller: net_cls
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18456 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18457 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18458 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18459 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18460 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18461 cgroups controller: freezer
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18456 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18457 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18458 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18459 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18460 cgroups controller: cpuacct
FAIL: 52302ac5aeb9055fdd000006 has a process missing from cgroups: 18461 cgroups controller: cpuacct
INFO: checking filesystem quotas
INFO: checking quota db file selinux label
INFO: checking 3 user accounts
INFO: checking application dirs
INFO: checking system httpd configs
30 ERRORS
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1275.html