Description of problem:
oo-accept-node randomly fails (I haven't figured out what causes it) during the cgroup check. The output looks like this:

INFO: loading node configuration file /etc/openshift/node.conf
INFO: loading resource limit file /etc/openshift/resource_limits.conf
INFO: finding external network device
INFO: checking node public hostname resolution
INFO: checking selinux status
INFO: checking selinux openshift-hosted policy
INFO: checking selinux booleans
INFO: checking selinux nodes
INFO: checking package list
INFO: checking services
INFO: checking kernel semaphores >= 512
INFO: checking cgroups configuration
INFO: checking cgroups processes
FAIL: 522a43f84382ec840c0001cc has a process missing from cgroups: 1855 cgroups controller: all
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
        from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:449:in `each'
        from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:446:in `each'
        from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:445:in `each'
        from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
        from /usr/sbin/oo-accept-node:841:in `<main>'

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.13.9-1.el6oso.noarch

How reproducible:
This happens in about 1 in 3 or 4 runs on machines with cgroup issues.

Steps to Reproduce:
1. Find a machine with cgroup issues.
2. Run oo-accept-node.
3. See if it throws the error.

Actual results:
Sometimes the above error is thrown, other times it is not.

Expected results:
No error; the run finishes.

Additional info:
I can try to troubleshoot this more, as we are seeing this issue often.
Interesting. It would appear that the bug is that:

    /bin/ps -p #{pid} -o uid,pid,ppid,etime,cmd

...is returning characters outside US-ASCII. The only field that should have a chance of returning non-ASCII would be "cmd" (the whole command and args). I'd be really curious which command has unicode args; that seems suspicious.

Next time you see the issue, save the output of the following command to a file:

    ps -e -o uid,pid,ppid,etime,cmd

I'll modify the function so that it does not rely on split. Thanks!
Putting in NEEDINFO to collect the output of the following command the next time the issue shows up. It needs to be written directly to a file instead of a pastebin to preserve the original encoding. Thanks!

    ps -e -o uid,pid,ppid,etime,cmd
Pull request to scrub non-ascii characters out of the command result... https://github.com/openshift/origin-server/pull/3587
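For readers following along, here is a minimal sketch of what scrubbing non-ASCII bytes out of a ps line could look like. It is not the code from the pull request; `strip_non_ascii` and the sample line are made up for illustration, and the sample is tagged US-ASCII by hand to mimic what Ruby does with command output under LANG=C.

```ruby
# Hypothetical helper for illustration only; not the code from the pull request.
# Drops every byte above 0x7F so the result is always valid US-ASCII.
def strip_non_ascii(text)
  ascii_bytes = text.bytes.reject { |byte| byte > 127 }
  ascii_bytes.pack("C*").force_encoding("US-ASCII")
end

# Made-up ps line whose command contains a UTF-8 encoded character,
# tagged US-ASCII by hand (an assumption that mimics running under LANG=C).
raw = "1002 15460 vim caf\xC3\xA9.txt".dup.force_encoding("US-ASCII")
clean = strip_non_ascii(raw)

puts raw.valid_encoding?            # the raw line is not valid US-ASCII
puts clean.valid_encoding?          # the scrubbed line is
puts clean.split(/\s+/, 3).inspect  # splitting the scrubbed line is safe
```

The trade-off is that the command name loses characters, which is acceptable for a check that only needs uid, pid and a rough command string.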
Skipping the needinfo.
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/70d7b0c718225f505be206cde27a5c9795f02d25 Bug 1005421 - the ps command was returning unicode characters, strip them out.
Hi Rob,

Any suggestion for how to verify this bug? I am not sure why it is using US-ASCII encoding for the cgroup check. During my test I used a command that contains Chinese characters, which are certainly outside ASCII, but the check passes even without the fix.

# ps -p 15460 -o uid,pid,cmd
  UID   PID CMD
 1002 15460 vim 测试.txt

And it does not throw the error during oo-accept-node:

INFO: checking cgroups processes
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 717 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14880 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14881 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 15460 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30480 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30481 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30482 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30501 cgroups controller: all
INFO: checking presence of tc qdisc
The problem is only reproducible with LANG="C", and devenv normally runs with LANG="en_US.UTF-8". This script does a lot of text parsing, so I believe a better fix would be to force ruby to use unicode internally for text processing, or to set LANG in its running environment. We should also determine what other scripts are running under LANG=C in INT/STG/PROD.
The issue of LANG being set improperly has been run into during gear moves. The issue in INT/STG/PROD likely has broader ramifications - chasing that down instead.
We haven't been able to root-cause why this script occasionally, spontaneously fails to handle unicode input properly. I'm hesitant to hard-code the workaround in the Origin source, but we should change how oo-accept-node runs in INT/STG/PROD to force unicode handling by default:

    /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node

The release ticket has been updated to make that change.
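To make it clearer why the -E UTF-8:UTF-8 workaround helps: the flag sets Ruby's default external and internal encodings, so command output is tagged UTF-8 instead of US-ASCII. A rough sketch follows, with a made-up `parse_fields` helper and sample bytes (neither is taken from oo-accept-node) showing the same bytes being unparseable under one tag and fine under the other.

```ruby
# Hypothetical helper for illustration; oo-accept-node does not contain this.
# Tags the raw bytes with the given encoding and splits them into
# uid, pid and command, or returns nil when the bytes are invalid in it.
def parse_fields(bytes, encoding)
  line = bytes.dup.force_encoding(encoding)
  return nil unless line.valid_encoding?
  line.split(/\s+/, 3)
end

# Made-up ps line; "C\xC3\xA9dric" is "Cédric" encoded as UTF-8.
ps_bytes = "1002 30977 C\xC3\xA9dric".dup.force_encoding("BINARY")

# Under LANG=C the output is effectively treated as US-ASCII: invalid.
p parse_fields(ps_bytes, "US-ASCII")

# With -E UTF-8:UTF-8 the same bytes are treated as UTF-8: parseable.
p parse_fields(ps_bytes, "UTF-8")
```

This only changes how Ruby labels the bytes it reads; it does not change the LANG that child processes see.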
Running Q/E on this will be difficult since it requires a set of coincidences, but this should work.

1. Create an app.
2. Log into the app and run the following ruby script:
       $0 = "Cédric"
       sleep
3. Run ps and observe that its process name changed to "Cédric".
4. Use cgclassify to move the process from step 2 into the root cgroup:
       cgclassify -g cpu,cpuacct,memory,freezer,net_cls:/ [pid]
5. Run the following to observe the failure:
       LANG="C" /usr/sbin/oo-accept-node
6. Run the following to observe it working even though the language is set incorrectly:
       LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node
Tested on devenv_3776 with the method in comment#11. Step 5 gives the expected error and step 6 gives the expected result.

# LANG="C" /usr/sbin/oo-accept-node
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
        from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:449:in `each'
        from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:446:in `each'
        from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
        from /usr/sbin/oo-accept-node:445:in `each'
        from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
        from /usr/sbin/oo-accept-node:841:in `<main>'

# LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node
FAIL: 523152a76849bb753100011c has a process missing from cgroups: 30977 cgroups controller: all

Moving the bug to verified.
It's facter. /opt/rh/ruby193/root/usr/share/ruby/vendor_ruby/facter.rb line 44 sets LANG to "C". Re-opening the ticket to review the implications for mcollective.
Loading facter does not appear to change the internal string management; however, it affects oo_spawn.

irb(main):001:0> require 'rubygems'
=> false
irb(main):002:0> require 'openshift-origin-node'
=> true
irb(main):003:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]
irb(main):004:0> require 'facter'
=> true
irb(main):005:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["C\n", "", 0]
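The session above comes down to child processes inheriting whatever LANG the parent's ENV holds at spawn time. Here is a small sketch of that mechanism, and of pinning the locale for a single child no matter what a library did to ENV. `child_lang` is a hypothetical helper, plain IO.popen stands in for oo_spawn, and it assumes a POSIX sh is available.

```ruby
# Hypothetical helper for illustration; IO.popen stands in for oo_spawn here.
# Returns the LANG value seen by a freshly spawned child process.
# Entries in env_overrides apply to that child only, not to the parent.
def child_lang(env_overrides = {})
  IO.popen([env_overrides, "sh", "-c", 'printf "%s" "$LANG"'], &:read)
end

# Simulate what loading the older facter did to the parent environment.
ENV["LANG"] = "C"

# Every child spawned from now on inherits the changed value...
puts child_lang

# ...unless the locale is passed explicitly for that one child.
puts child_lang("LANG" => "en_US.UTF-8")
```

Passing the environment explicitly at spawn time makes a helper immune to libraries that modify ENV, at the cost of having to choose a locale in code.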
Changing severity to low since a workaround was provided for the original problem. We don't currently know of any other issues related to facter changing LANG, but it seems like the sort of thing that could cause subtle issues in the future.
Puppet labs fixed the problem in facter 1.7.0 and 2.0.0. http://projects.puppetlabs.com/issues/12012
Built a new facter package for 1.7.3 which no longer sets LANG="C". https://brewweb.devel.redhat.com/buildinfo?buildID=296881 Waiting for it to be tagged into the release.
The newer facter package has been tagged for the release. ruby193-facter-1.7.3-4.el6oso
irb(main):003:0> require 'openshift-origin-node'
=> true
irb(main):004:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]
irb(main):005:0> require 'facter'
=> true
irb(main):006:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]

Tested on devenv_3837; the issue has been fixed. The facter package version is:
ruby193-facter-1.7.3-4.el6oso.x86_64