Description of problem:
While researching how to check a process's memory, I stumbled upon a process that was _not_ being confined in cgroups, yet oo-accept-node did not report it. After some investigation I found a definite bug in oo-accept-node: it does not detect cgroup processes correctly.

Version-Release number of selected component (if applicable):
Current

How reproducible:
Easy

Steps to Reproduce:
1. Start a process.
2. Remove the procs from /cgroup/all/openshift/UUID/cgroup.procs
3. Run oo-accept-node -v

Actual results:
The check completely fails to detect processes that are missing from the cgroup.procs file but are running in the ps table.

Expected results:
Properly detect processes that are running but not in cgroups.

Additional info:
There are a few problems here.

The ENV['GEAR_MIN_UID'] is '500'.
- This is a string.
- This is _also_ incorrect as the minimum gear UID should be 1000. There should never be a gear less than 1000.

The ENV['GEAR_MAX_UID'] is '6500'.
- This is a string.
- This is _also_ incorrect as the minimum gear UID should be 1000. There should never be a gear less than 1000.

FIX:
    min_uid = ENV['GEAR_MIN_UID'].to_i
    max_uid = ENV['GEAR_MAX_UID'].to_i

uid and pid were strings. Fix:
    all_user_procs.each do |line|
      uid, pid = line.split
      uid = uid.to_i
      pid = pid.to_i
      # ...
    end

Let's also keep in mind that some of our nodes have 3000+ users on them, and we need this script to achieve decent performance. It would be nice if $USERS were a hash:
    $USERS['uuid'] = #old user data
    passwd_lines = $USERS.select { |u| u.uid == uid }
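To illustrate the two points above (string-typed UIDs and a hash for user lookups), here is a minimal Ruby sketch. It is not the oo-accept-node code: GearUser, in_gear_range? and users_by_uid are hypothetical names made up for this example, and the sample UIDs are arbitrary. It only shows one way a range check can go wrong when the values are still strings, and how an index built once avoids scanning every user per process.

```ruby
# Hypothetical stand-in for a passwd entry; not the structure oo-accept-node uses.
GearUser = Struct.new(:name, :uid)

# Range check on the values exactly as given (strings or integers).
def in_gear_range?(uid, min_uid, max_uid)
  uid >= min_uid && uid <= max_uid
end

# As strings the comparison is lexicographic: "1000" sorts before "500".
string_result = in_gear_range?('1000', '500', '6500')

# After .to_i the comparison is numeric.
integer_result = in_gear_range?('1000'.to_i, '500'.to_i, '6500'.to_i)

# Index users by integer uid once, so each lookup is a hash access
# instead of a scan over every user.
users = [GearUser.new('gear-a', 1000), GearUser.new('gear-b', 1001)]
users_by_uid = users.each_with_object({}) { |user, index| index[user.uid] = user }

puts "string comparison:  #{string_result}"
puts "integer comparison: #{integer_result}"
puts "uid 1001 belongs to #{users_by_uid[1001].name}"
```

With 3000+ users on a node, the hash turns each per-process lookup into a constant-time access, at the cost of building the index once up front.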
https://github.com/openshift/origin-server/pull/3483
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2003bc01f12cccd54b9e61390e8ea3931f889a2c
<oo-accept-node> Bug 1000174 - oo-accept-node fixes

https://bugzilla.redhat.com/show_bug.cgi?id=1000174

In check_cgroups_procs:
  Convert uid string values to integer before comparisons, test for all defined cgroups controllers (not just all or memory)
Remove unnecessary call to $USERS.dup
Fix an issue where 3 digit uids would not be verified in check_cgroups_procs (this is the case for a non-district node with the default node.conf)
Update default node.conf values to match the default district values for min/max uids
Tested on devenv-stage_457.

The node.conf file is not merged into the env; I still get:
    GEAR_MIN_UID=500 # Lower bound of UID used to create gears
    GEAR_MAX_UID=6500

In my test, I found that /cgroup/all/openshift/UUID/cgroup.procs is not writable: whenever I tried to update the file, the write was rejected. Is there any way to achieve step 2 in the bug description?

Please also help to move the bug to on_qa, thanks!
Tested it on devenv-stage_457. After stopping and starting the cgconfig service, the existing processes are removed from the cgroup.procs files:

    [root@ip-10-40-54-111 ~]# service cgconfig stop
    Stopping cgconfig service: [ OK ]
    [root@ip-10-40-54-111 ~]# service cgconfig start
    Starting cgconfig service: [ OK ]
    [root@ip-10-40-54-111 ~]# cat /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs
    [root@ip-10-40-54-111 ~]#

If a process does not exist in /cgroup/all/openshift/UUID/cgroup.procs but does exist in the ps table, oo-accept-node reports the error:

    [root@ip-10-40-54-111 ~]# oo-accept-node
    FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16951 cgroups controller: all
    FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16952 cgroups controller: all
    FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16953 cgroups controller: all
    FAIL: 521b31aaddde1c0acd000003 has a process missing from cgroups: 16967 cgroups controller: all

After running the following script, the cgroup configuration returns to normal:

    [root@ip-10-40-54-111 ~]# oo-cgroup-enable --with-all-containers
    [root@ip-10-40-54-111 ~]# cat /cgroup/all/openshift/521b31aaddde1c0acd000003/cgroup.procs
    16951
    16952
    16953
    16967
    [root@ip-10-40-54-111 ~]# oo-accept-node
    PASS

The only remaining issue in this bug is that the new min and max gear UIDs are not built into the devenv-stage image; not sure why.

    [root@ip-10-40-54-111 ~]# cat /etc/openshift/node.conf|grep GEAR_M
    GEAR_MIN_UID=500 # Lower bound of UID used to create gears
    GEAR_MAX_UID=6500 # Upper bound of UID used to create gears

The package version is rubygem-openshift-origin-node-1.13.12-1.el6oso.noarch
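The behaviour verified above (running pids that are absent from cgroup.procs get reported, once per controller) can be sketched roughly as follows. This is an illustration only, not the check_cgroups_procs implementation: pids_missing_from_cgroup and cgroup_failures are hypothetical helpers, and the cgroup.procs contents are passed in as strings rather than read from /cgroup so the example runs anywhere.

```ruby
# Return the pids that are running but absent from a cgroup.procs listing.
# running_pids: Array of Integer pids taken from the process table.
# cgroup_procs_text: String contents of a cgroup.procs file (one pid per line).
def pids_missing_from_cgroup(running_pids, cgroup_procs_text)
  confined = cgroup_procs_text.split.map(&:to_i)
  running_pids - confined
end

# Build one failure message per missing pid, per controller.
# procs_by_controller: Hash of controller name => cgroup.procs contents.
def cgroup_failures(uuid, running_pids, procs_by_controller)
  procs_by_controller.flat_map do |controller, text|
    pids_missing_from_cgroup(running_pids, text).map do |pid|
      "FAIL: #{uuid} has a process missing from cgroups: #{pid} cgroups controller: #{controller}"
    end
  end
end

uuid = '521b31aaddde1c0acd000003'
running = [16951, 16952, 16953, 16967]

# Empty cgroup.procs, as after the cgconfig restart: every pid is reported.
after_restart = cgroup_failures(uuid, running, { 'all' => '' })

# Repopulated cgroup.procs: nothing left to report.
after_enable = cgroup_failures(uuid, running, { 'all' => "16951\n16952\n16953\n16967\n" })

puts after_restart
puts after_enable.empty? ? 'PASS' : after_enable
```

Converting the file contents to integers before the set difference matters for the same reason as in the original report: a pid read as the string "16951" never equals the integer 16951.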
Step 2 can also be replicated by using cgclassify:
'cgclassify -g cpu,cpuacct,memory,net_cls,freezer:/ <pidlist>'

The node.conf file is listed as noreplace in the spec file, so it will not be updated just by updating the RPMs. Also, the devenv RPM copies node.conf.libra (in the li repo) to node.conf. Submitted PR https://github.com/openshift/li/pull/1857 to address this. For the other environments, Ops will need to make any changes needed to the config files that already exist in production.
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/f15cc4308622ac1f86c7d93a393d9ab79840729b
Bug 1000174 - Update node.conf.libra for GEAR_MIN_UID and GEAR_MAX_UID

https://bugzilla.redhat.com/show_bug.cgi?id=1000174

Update default node.conf.libra values to match the default district values for GEAR_MIN_UID and GEAR_MAX_UID
    [root@ip-10-40-93-30 ~]# cat /etc/openshift/node.conf|grep GEAR|grep UID
    GEAR_MIN_UID=1000 # Lower bound of UID used to create gears
    GEAR_MAX_UID=6999 # Upper bound of UID used to create gears

The gear uid range has been updated on devenv_3734. Move bug to verified.
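For anyone scripting a check of these values, the following Ruby sketch reads KEY=VALUE settings with trailing comments, as in the grep output above, and converts the UID bounds to integers. parse_conf is a hypothetical helper written for this example, not how the node code loads node.conf, and it assumes values never contain a literal '#'.

```ruby
# Parse KEY=VALUE lines, dropping trailing "# ..." comments and blank lines.
# Assumption: values never contain a literal '#'.
def parse_conf(text)
  text.each_line.each_with_object({}) do |line, conf|
    setting = line.sub(/#.*/, '').strip
    next if setting.empty?

    key, value = setting.split('=', 2)
    next if value.nil?

    conf[key.strip] = value.strip.delete('"')
  end
end

sample = <<~CONF
  GEAR_MIN_UID=1000 # Lower bound of UID used to create gears
  GEAR_MAX_UID=6999 # Upper bound of UID used to create gears
CONF

conf = parse_conf(sample)

# Integer() raises on malformed input instead of silently returning 0 like to_i.
min_uid = Integer(conf.fetch('GEAR_MIN_UID'))
max_uid = Integer(conf.fetch('GEAR_MAX_UID'))

puts "gear uid range: #{min_uid}..#{max_uid}"
```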