Bug 1005421

Summary: oo-accept-node throwing "split" error during cgroup check
Product: OpenShift Online
Reporter: Matt Woodson <mwoodson>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: low
Priority: low
Version: 2.x
CC: bmeng, chunchen, dmcphers, mfisher, mwoodson, rmillner, xtian
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2013-10-17 13:28:20 UTC
Type: Bug

Description Matt Woodson 2013-09-06 21:26:17 UTC
Description of problem:

oo-accept-node is randomly failing (I haven't figured out what causes it) during the cgroup check. The output looks like this:


INFO: loading node configuration file /etc/openshift/node.conf
INFO: loading resource limit file /etc/openshift/resource_limits.conf
INFO: finding external network device
INFO: checking node public hostname resolution
INFO: checking selinux status
INFO: checking selinux openshift-hosted policy
INFO: checking selinux booleans
INFO: checking selinux nodes
INFO: checking package list
INFO: checking services
INFO: checking kernel semaphores >= 512
INFO: checking cgroups configuration
INFO: checking cgroups processes
FAIL: 522a43f84382ec840c0001cc has a process missing from cgroups: 1855 cgroups controller: all
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:449:in `each'
	from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:446:in `each'
	from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:445:in `each'
	from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
	from /usr/sbin/oo-accept-node:841:in `<main>'

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.13.9-1.el6oso.noarch

How reproducible:

This happens in about 1 of every 3 or 4 runs on a machine with cgroup issues.

Steps to Reproduce:
1. Find a machine with cgroup issues.
2. Run oo-accept-node.
3. See whether it throws the error.

Actual results:

Sometimes the above error will be thrown, other times it will not

Expected results:

no error, finish the run

Additional info:

I can try to troubleshoot this more as we are seeing this issue often.

Comment 1 Rob Millner 2013-09-09 17:14:15 UTC
Interesting.  It would appear as though the bug is that:
/bin/ps -p #{pid} -o uid,pid,ppid,etime,cmd

...is returning characters outside US-ASCII.  

The only field that should have a chance to return non-ascii would be "cmd" (the whole command and args).

I'd be really curious what command has unicode args - that seems suspicious.  Next time you see the issue, save the output of the following command to a file.
ps -e -o uid,pid,ppid,etime,cmd

I'll modify the function so that it does not rely on split.

Thanks!
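
A minimal sketch of the failure mode described above, assuming the ps output carries UTF-8 bytes while Ruby labels the string as US-ASCII. The sample line and variable names are made up for illustration, and the exact call at line 450 of oo-accept-node is not shown in this report.

```ruby
# Hypothetical ps output line: the command name "Cédric" encoded as UTF-8
# (the bytes C3 A9 are the letter é). `.b` gives a binary copy of the bytes.
raw = "1002 15460 C\xC3\xA9dric".b

# Same bytes, two different encoding labels.
as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)

ascii_valid = as_ascii.valid_encoding?   # high bytes are not legal US-ASCII
utf8_valid  = as_utf8.valid_encoding?    # the same bytes are legal UTF-8

# Splitting on a regex has to interpret the bytes, so a string with an invalid
# encoding can raise ArgumentError. Captured here rather than left to crash;
# whether it raises may depend on the Ruby version and the separator used.
split_error =
  begin
    as_ascii.split(/\s+/)
    nil
  rescue ArgumentError => e
    e.message
  end

puts "US-ASCII valid: #{ascii_valid}"
puts "UTF-8 valid:    #{utf8_valid}"
puts "split error:    #{split_error.inspect}"
```

The point of the sketch is that the bytes themselves are fine; only the encoding label Ruby attaches to them decides whether text operations accept them.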

Comment 2 Rob Millner 2013-09-09 17:16:02 UTC
Putting in NEEDINFO to collect output of the following command next time the issue shows up.  It needs to be output directly to a file instead of a pastebin to preserve original encoding.  Thanks!

ps -e -o uid,pid,ppid,etime,cmd

Comment 3 Rob Millner 2013-09-09 18:02:23 UTC
Pull request to scrub non-ascii characters out of the command result...
https://github.com/openshift/origin-server/pull/3587
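
The pull request itself is not quoted in this report, so the following is only a sketch of one way to scrub non-ASCII bytes from command output before parsing it. The helper name `scrub_to_ascii` and the sample line are assumptions, not the code that was merged.

```ruby
# Replace every byte outside 7-bit ASCII with "?" so later text parsing
# cannot hit an invalid byte sequence. Hypothetical helper, not the PR's code.
def scrub_to_ascii(bytes)
  bytes.encode(Encoding::US_ASCII, Encoding::BINARY,
               invalid: :replace, undef: :replace, replace: "?")
end

# Hypothetical ps output line with a UTF-8 encoded command name.
raw = "1002 15460 C\xC3\xA9dric".b

clean  = scrub_to_ascii(raw)
fields = clean.split(/\s+/)

puts clean.encoding
puts fields.inspect
```

Scrubbing loses the original command text, which is acceptable when only the uid, pid and ppid columns matter but would be a poor fit if the command name were compared later.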

Comment 4 Rob Millner 2013-09-09 18:35:34 UTC
Skipping the needinfo.

Comment 5 openshift-github-bot 2013-09-10 05:05:46 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/70d7b0c718225f505be206cde27a5c9795f02d25
Bug 1005421 - the ps command was returning unicode characters, strip them out.

Comment 6 Meng Bo 2013-09-10 09:31:35 UTC
Hi Rob,

Any suggestion for how to verify this bug?

I am not sure why it is using US-ASCII encoding for the cgroup check.

During my test, I used a command containing Chinese characters, which are certainly outside ASCII. But the check passes even without the fix.

# ps -p 15460 -o uid,pid,cmd
  UID   PID CMD
 1002 15460 vim 测试.txt

And oo-accept-node does not throw an error:

INFO: checking cgroups processes
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 717 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14880 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14881 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 15460 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30480 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30481 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30482 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30501 cgroups controller: all
INFO: checking presence of tc qdisc

Comment 8 Rob Millner 2013-09-10 19:02:08 UTC
The problem is only reproducible with LANG="C" and devenv normally runs LANG="en_US.UTF-8".

This script does a lot of text parsing. I believe a better fix would be to force ruby to use unicode internally for text processing, or to set LANG in its running environment.

Also, determine what other scripts are running under LANG=C in INT/STG/PROD.

Comment 9 Rob Millner 2013-09-10 21:53:28 UTC
The issue of LANG being set improperly has been run into during gear moves.  The issue in INT/STG/PROD likely has broader ramifications - chasing that down instead.

Comment 10 Rob Millner 2013-09-11 18:54:47 UTC
We haven't been able to root-cause the reason why this script occasionally, spontaneously fails to properly handle unicode input.

I'm hesitant to hard-code the workaround in Origin source but we should change how oo-accept-node runs in INT/STG/PROD to force unicode handling by default.

/usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node

The release ticket has been updated to make that change.
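
As a rough illustration of what the external encoding changes, the sketch below reads the same bytes through a pipe labelled first as US-ASCII and then as UTF-8. The pipe stands in for the ps subprocess, and `read_with_external` is a hypothetical helper; it assumes Encoding.default_internal is left unset.

```ruby
# Read raw bytes through a pipe whose read end is labelled with the given
# external encoding, roughly what the first half of `-E ext:int` controls.
def read_with_external(bytes, encoding)
  reader, writer = IO.pipe
  reader.set_encoding(encoding)
  writer.binmode            # write the bytes untouched
  writer.write(bytes)
  writer.close
  reader.read
ensure
  reader.close if reader && !reader.closed?
  writer.close if writer && !writer.closed?
end

# Hypothetical ps output line with a UTF-8 encoded command name.
raw = "1002 15460 C\xC3\xA9dric\n".b

ascii_line = read_with_external(raw, Encoding::US_ASCII)
utf8_line  = read_with_external(raw, Encoding::UTF_8)

puts "#{ascii_line.encoding}: valid=#{ascii_line.valid_encoding?}"
puts "#{utf8_line.encoding}: valid=#{utf8_line.valid_encoding?}"
```

Passing -E on the command line sets these defaults for the whole interpreter, so the script source does not have to change.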

Comment 11 Rob Millner 2013-09-11 19:02:52 UTC
Running Q/E on this will be difficult since it requires a set of coincidences but this should work.

1. Create an app

2. Log into the app and run the following ruby script:
  $0 = "Cédric"
  sleep

3. Run ps and observe that its process name changed to "Cédric".

4. Use cgclassify to move the process in #2 into the root cgroup.

  cgclassify -g cpu,cpuacct,memory,freezer,net_cls:/ [pid]

5. Run the following to observe the failure:

LANG="C" /usr/sbin/oo-accept-node

6. Run the following to observe it working even though the language is set incorrectly.

LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node

Comment 13 Meng Bo 2013-09-12 06:19:34 UTC
Tested on devenv_3776 with the method in comment 11.

Step 5 produces the expected error and step 6 produces the expected result:

# LANG="C" /usr/sbin/oo-accept-node
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:449:in `each'
	from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:446:in `each'
	from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:445:in `each'
	from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
	from /usr/sbin/oo-accept-node:841:in `<main>'


# LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node
FAIL: 523152a76849bb753100011c has a process missing from cgroups: 30977 cgroups controller: all

Move bug to verified.

Comment 14 Rob Millner 2013-09-17 21:23:21 UTC
It's facter.

/opt/rh/ruby193/root/usr/share/ruby/vendor_ruby/facter.rb line 44 sets LANG to "C".


Re-opening the ticket to review the implications for mcollective.

Comment 15 Rob Millner 2013-09-20 00:36:35 UTC
Loading facter does not appear to change the internal string management; however, it affects oo_spawn.

irb(main):001:0> require 'rubygems'
=> false

irb(main):002:0> require 'openshift-origin-node'
=> true

irb(main):003:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]

irb(main):004:0> require 'facter'
=> true

irb(main):005:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["C\n", "", 0]
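
The transcript above can be approximated without OpenShift or facter installed. This sketch assumes only that some library changes ENV["LANG"] inside the running process; `child_lang` is a hypothetical stand-in for oo_spawn and spawns a plain sh.

```ruby
# Ask a child shell what LANG it sees. With an empty hash the child inherits
# this process's ENV; entries in `env` (hypothetical parameter) override it.
def child_lang(env = {})
  IO.popen(env, ["sh", "-c", 'echo "$LANG"'], &:read).chomp
end

saved = ENV["LANG"]
begin
  ENV["LANG"] = "en_US.UTF-8"
  before = child_lang

  # Stand-in for a library that resets LANG in-process when it is loaded.
  ENV["LANG"] = "C"
  after = child_lang

  # Pinning LANG per spawn keeps the child independent of the parent's ENV.
  pinned = child_lang("LANG" => "en_US.UTF-8")
ensure
  ENV["LANG"] = saved   # restore the caller's setting (nil removes the variable)
end

puts "before: #{before}"
puts "after:  #{after}"
puts "pinned: #{pinned}"
```

Strings already held by the parent keep their encoding labels, which matches the observation that internal string management is unaffected while spawned commands see the new locale.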

Comment 16 Rob Millner 2013-09-20 00:40:30 UTC
Changing severity to low since a workaround was provided for the original problem.

We don't currently know of any other issues related to facter changing LANG, but it seems like the sort of thing that will cause subtle issues in the future.

Comment 17 Rob Millner 2013-09-20 00:46:51 UTC
Puppet Labs fixed the problem in facter 1.7.0 and 2.0.0.

http://projects.puppetlabs.com/issues/12012

Comment 18 Rob Millner 2013-09-25 19:43:52 UTC
Built a new facter package for 1.7.3 which no longer sets LANG="C".
https://brewweb.devel.redhat.com/buildinfo?buildID=296881

Waiting for it to be tagged into the release.

Comment 19 Rob Millner 2013-09-26 17:06:08 UTC
The newer facter package has been tagged for the release.
ruby193-facter-1.7.3-4.el6oso

Comment 20 Meng Bo 2013-09-27 06:05:38 UTC
irb(main):003:0> require 'openshift-origin-node'
=> true
irb(main):004:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]
irb(main):005:0> require 'facter'
=> true
irb(main):006:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]


Tested on devenv_3837, issue has been fixed.

The facter package version is:
ruby193-facter-1.7.3-4.el6oso.x86_64