Bug 1005421 - oo-accept-node throwing "split" error during cgroup check
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified Linux
Priority: low Severity: low
Assigned To: Rob Millner
QA Contact: libra bugs
Reported: 2013-09-06 17:26 EDT by Matt Woodson
Modified: 2015-05-14 19:28 EDT (History)
Doc Type: Bug Fix
Last Closed: 2013-10-17 09:28:20 EDT
Type: Bug
Description Matt Woodson 2013-09-06 17:26:17 EDT
Description of problem:

oo-accept-node randomly fails (I haven't figured out what causes it) during the cgroup check. The output looks like this:


INFO: loading node configuration file /etc/openshift/node.conf
INFO: loading resource limit file /etc/openshift/resource_limits.conf
INFO: finding external network device
INFO: checking node public hostname resolution
INFO: checking selinux status
INFO: checking selinux openshift-hosted policy
INFO: checking selinux booleans
INFO: checking selinux nodes
INFO: checking package list
INFO: checking services
INFO: checking kernel semaphores >= 512
INFO: checking cgroups configuration
INFO: checking cgroups processes
FAIL: 522a43f84382ec840c0001cc has a process missing from cgroups: 1855 cgroups controller: all
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:449:in `each'
	from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:446:in `each'
	from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:445:in `each'
	from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
	from /usr/sbin/oo-accept-node:841:in `<main>'

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.13.9-1.el6oso.noarch

How reproducible:

This happens in about 1 out of every 3 or 4 runs on machines with cgroup issues.

Steps to Reproduce:
1. Find a machine with cgroup issues.
2. Run oo-accept-node.
3. See if it throws the error.

Actual results:

Sometimes the above error will be thrown, other times it will not

Expected results:

no error, finish the run

Additional info:

I can try to troubleshoot this more as we are seeing this issue often.
Comment 1 Rob Millner 2013-09-09 13:14:15 EDT
Interesting.  It would appear as though the bug is that:
/bin/ps -p #{pid} -o uid,pid,ppid,etime,cmd

...is returning characters outside US-ASCII.  

The only field that could plausibly return non-ASCII characters is "cmd" (the whole command and its args).

I'd be really curious what command has unicode args - that seems suspicious.  Next time you see the issue, save the output of the following command to a file.
ps -e -o uid,pid,ppid,etime,cmd

I'll modify the function so that it does not rely on split.

Thanks!
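
For illustration, the failure mode can be reproduced in isolation. This is a minimal sketch, not the code from oo-accept-node: the sample line and variable names are made up, force_encoding stands in for Ruby tagging the ps output as US-ASCII, and a regexp split is used here rather than whatever form of split line 450 actually calls.

```ruby
# Hypothetical ps output line; "\xC3\xA9" is the UTF-8 encoding of "é".
line = "1002 15460 C\xC3\xA9dric".dup.force_encoding("US-ASCII")

# The bytes above 0x7F make the string invalid for its declared encoding.
ascii_valid = line.valid_encoding?

# Splitting a string with a broken encoding raises ArgumentError.
begin
  line.split(/\s+/)
  split_error = nil
rescue ArgumentError => e
  split_error = e
end

# The same bytes, tagged as UTF-8, split without complaint.
utf8_fields = line.dup.force_encoding("UTF-8").split(/\s+/)

puts "valid as US-ASCII: #{ascii_valid}"
puts "split error: #{split_error.class}"
puts "fields as UTF-8: #{utf8_fields.inspect}"
```

The point is that the bytes themselves are fine; what breaks split is the encoding Ruby believes the string to be in.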
Comment 2 Rob Millner 2013-09-09 13:16:02 EDT
Putting in NEEDINFO to collect output of the following command next time the issue shows up.  It needs to be output directly to a file instead of a pastebin to preserve original encoding.  Thanks!

ps -e -o uid,pid,ppid,etime,cmd
Comment 3 Rob Millner 2013-09-09 14:02:23 EDT
Pull request to scrub non-ascii characters out of the command result...
https://github.com/openshift/origin-server/pull/3587
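
The pull request itself is not reproduced here. As a rough sketch of what scrubbing could look like, assuming it is acceptable to simply drop every byte above 0x7F: strip_non_ascii is a hypothetical helper name, not a function from origin-server, and the sample line is made up.

```ruby
# Hypothetical helper: keep only 7-bit bytes so the result is valid US-ASCII.
def strip_non_ascii(raw)
  ascii_bytes = raw.bytes.select { |b| b < 0x80 }
  ascii_bytes.pack("C*").force_encoding("US-ASCII")
end

# Made-up ps line whose command contains the UTF-8 bytes for "é".
raw = "1002 15460 vim C\xC3\xA9dric.txt".dup.force_encoding("US-ASCII")

clean = strip_non_ascii(raw)
fields = clean.split(/\s+/) # safe now that the string is valid US-ASCII

puts clean
puts fields.inspect
```

Dropping bytes alters the command text, which is tolerable when only the numeric columns are parsed but would be lossy anywhere the command name itself matters.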
Comment 4 Rob Millner 2013-09-09 14:35:34 EDT
Skipping the needinfo.
Comment 5 openshift-github-bot 2013-09-10 01:05:46 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/70d7b0c718225f505be206cde27a5c9795f02d25
Bug 1005421 - the ps command was returning unicode characters, strip them out.
Comment 6 Meng Bo 2013-09-10 05:31:35 EDT
Hi Rob,

Any suggestion for how to verify this bug?

I am not sure why it is using US-ASCII encoding for the cgroup check.

During my test, I used a command containing Chinese characters, which are certainly outside ASCII. But the check passes even without the fix.

# ps -p 15460 -o uid,pid,cmd
  UID   PID CMD
 1002 15460 vim 测试.txt

And oo-accept-node does not throw the error:

INFO: checking cgroups processes
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 717 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14880 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 14881 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 15460 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30480 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30481 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30482 cgroups controller: all
FAIL: 522ec6dab57b356b3a000001 has a process missing from cgroups: 30501 cgroups controller: all
INFO: checking presence of tc qdisc
Comment 8 Rob Millner 2013-09-10 15:02:08 EDT
The problem is only reproducible with LANG="C" and devenv normally runs LANG="en_US.UTF-8".

This script does a lot of text parsing. I believe a better fix would be to force Ruby to use Unicode internally for text processing, or to set LANG in its running environment.

Also, determine what other scripts are running under LANG=C in INT/STG/PROD.
Comment 9 Rob Millner 2013-09-10 17:53:28 EDT
The issue of LANG being set improperly has been run into during gear moves.  The issue in INT/STG/PROD likely has broader ramifications - chasing that down instead.
Comment 10 Rob Millner 2013-09-11 14:54:47 EDT
We haven't been able to root-cause the reason why this script occasionally, spontaneously fails to properly handle unicode input.

I'm hesitant to hard-code the workaround in Origin source but we should change how oo-accept-node runs in INT/STG/PROD to force unicode handling by default.

/usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node

The release ticket has been updated to make that change.
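
The -E UTF-8:UTF-8 switch sets Ruby's default external and internal encodings. A minimal sketch of the effect, under the assumption that the ps output reaches Ruby as bytes read from an IO: a temp file stands in for the ps pipe, the sample line is made up, and the Encoding setters are the in-process counterpart of the command-line switch.

```ruby
require "tempfile"

# Stand-in for ps output: raw bytes containing the UTF-8 encoding of "é".
sample = Tempfile.new("ps-sample")
sample.close
File.binwrite(sample.path, "1002 15460 C\xC3\xA9dric\n")

original = [Encoding.default_external, Encoding.default_internal]

# Roughly what a LANG=C environment gives Ruby by default.
Encoding.default_external = Encoding::US_ASCII
Encoding.default_internal = nil
as_ascii = File.read(sample.path)

# In-process counterpart of "-E UTF-8:UTF-8".
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
as_utf8 = File.read(sample.path)

# Restore the process-wide defaults and clean up.
Encoding.default_external, Encoding.default_internal = original
sample.unlink

puts "US-ASCII default: valid=#{as_ascii.valid_encoding?}"
puts "UTF-8 default:    valid=#{as_utf8.valid_encoding?} fields=#{as_utf8.split(/\s+/).inspect}"
```

Passing -E on the command line keeps the choice in the deployment rather than in the script, which matches the reluctance above to hard-code a workaround in the Origin source.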
Comment 11 Rob Millner 2013-09-11 15:02:52 EDT
Running Q/E on this will be difficult since it requires a set of coincidences but this should work.

1. Create an app

2. Log into the app and run the following ruby script:
  $0 = "Cédric"
  sleep

3. Run ps and observe that its process name changed to "Cédric".

4. Use cgclassify to move the process from step 2 into the root cgroup.

  cgclassify -g cpu,cpuacct,memory,freezer,net_cls:/ [pid]

5. Run the following to observe the failure:

LANG="C" /usr/sbin/oo-accept-node

6. Run the following to observe it working even though the language is set incorrectly.

LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node
Comment 13 Meng Bo 2013-09-12 02:19:34 EDT
Tested on devenv_3776 with the method in comment 11.

Step 5 produces the expected error and step 6 produces the expected result:

# LANG="C" /usr/sbin/oo-accept-node
/usr/sbin/oo-accept-node:450:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/sbin/oo-accept-node:450:in `block (3 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:449:in `each'
	from /usr/sbin/oo-accept-node:449:in `block (2 levels) in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:446:in `each'
	from /usr/sbin/oo-accept-node:446:in `block in check_cgroup_procs'
	from /usr/sbin/oo-accept-node:445:in `each'
	from /usr/sbin/oo-accept-node:445:in `check_cgroup_procs'
	from /usr/sbin/oo-accept-node:841:in `<main>'


# LANG="C" /usr/bin/env oo-ruby -E UTF-8:UTF-8 /usr/sbin/oo-accept-node
FAIL: 523152a76849bb753100011c has a process missing from cgroups: 30977 cgroups controller: all

Move bug to verified.
Comment 14 Rob Millner 2013-09-17 17:23:21 EDT
It's facter.

/opt/rh/ruby193/root/usr/share/ruby/vendor_ruby/facter.rb line 44 sets LANG to "C".


Re-opening the ticket to review the implications for mcollective.
Comment 15 Rob Millner 2013-09-19 20:36:35 EDT
Loading facter does not appear to change the internal string management; however, it affects oo_spawn.

irb(main):001:0> require 'rubygems'
=> false

irb(main):002:0> require 'openshift-origin-node'
=> true

irb(main):003:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]

irb(main):004:0> require 'facter'
=> true

irb(main):005:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["C\n", "", 0]
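
The transcript above is consistent with how environment variables are inherited: changing ENV in a running Ruby process does not alter the default encoding that process has already chosen, but every child spawned afterwards sees the new value. A minimal sketch, where assigning ENV["LANG"] stands in for what loading the old facter did, a child Ruby process stands in for oo_spawn, and child_lang is a hypothetical helper:

```ruby
require "rbconfig"

# Hypothetical helper: start a child Ruby and return its view of LANG.
# An optional hash overrides environment variables for that child only.
def child_lang(env = {})
  IO.popen(env, [RbConfig.ruby, "-e", "print ENV['LANG']"], &:read)
end

saved_lang = ENV["LANG"]
encoding_before = Encoding.default_external

ENV["LANG"] = "C" # stand-in for the old facter's side effect
encoding_after = Encoding.default_external

inherited = child_lang                                # child inherits LANG=C
overridden = child_lang({ "LANG" => "en_US.UTF-8" })  # per-call override

ENV["LANG"] = saved_lang # restore; assigning nil removes the variable

puts "parent encoding unchanged: #{encoding_before == encoding_after}"
puts "child LANG inherited:  #{inherited}"
puts "child LANG overridden: #{overridden}"
```

Passing the environment explicitly at spawn time is one way to shield child processes from a library that rewrites LANG, at the cost of having to do so at every call site.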
Comment 16 Rob Millner 2013-09-19 20:40:30 EDT
Changing severity to low since a workaround was provided for the original problem.

We don't currently know of any other issues related to facter changing LANG, but it seems like the sort of thing that will cause subtle issues in the future.
Comment 17 Rob Millner 2013-09-19 20:46:51 EDT
Puppet Labs fixed the problem in facter 1.7.0 and 2.0.0.

http://projects.puppetlabs.com/issues/12012
Comment 18 Rob Millner 2013-09-25 15:43:52 EDT
Built a new facter package for 1.7.3 which no longer sets LANG="C".
https://brewweb.devel.redhat.com/buildinfo?buildID=296881

Waiting for it to be tagged into the release.
Comment 19 Rob Millner 2013-09-26 13:06:08 EDT
The newer facter package has been tagged for the release.
ruby193-facter-1.7.3-4.el6oso
Comment 20 Meng Bo 2013-09-27 02:05:38 EDT
irb(main):003:0> require 'openshift-origin-node'
=> true
irb(main):004:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]
irb(main):005:0> require 'facter'
=> true
irb(main):006:0> ::OpenShift::Runtime::Utils::oo_spawn('echo $LANG')
=> ["en_US.UTF-8\n", "", 0]


Tested on devenv_3837, issue has been fixed.

The facter package version is:
ruby193-facter-1.7.3-4.el6oso.x86_64
