Bug 1014768 - [PERF] node_utilization function is very expensive
Summary: [PERF] node_utilization function is very expensive
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: ---
Assignee: Rob Millner
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-10-02 17:50 UTC by Andy Grimm
Modified: 2018-12-03 20:08 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-17 13:33:29 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Andy Grimm 2013-10-02 17:50:49 UTC
On OpenShift nodes, /usr/libexec/mcollective/update_yaml.rb regularly collects information from facter including "node utilization" stats (active gears, idle gears, etc.).  It uses OpenShift::Runtime::Node.node_utilization to collect these gear stats.  This function builds an ApplicationContainer object for each gear on the system, which is quite expensive -- more than 20 seconds of CPU time on a system with 1500 gears.  We need to find a way to make this perform better.

When I was writing very similar code for OpenShift Online, I was told to use the GearDB and ApacheIdlerDB directly instead of initializing ApplicationContainer objects.  I would think that the same advice applies here.

Comment 1 Rob Millner 2013-10-02 18:11:19 UTC
Now that GearDB and ApacheIdlerDB have moved out of node and into a plugin, it may not be a good idea to access these directly.  The plugins will change.

We should find out what's taking so much time in ApplicationContainer and see if it can be made faster.

Comment 2 Rob Millner 2013-10-02 20:10:36 UTC
Pull request:
https://github.com/openshift/origin-server/pull/3763

Comment 3 openshift-github-bot 2013-10-02 22:20:58 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/fea2d3d5523b6cfff5be53ed2c84b0bdd9c6ab55
Bug 1014768 - Loading the environ dominated the cost to enumerate ApplicationContainers.  Narrowing down the list of files to load shaves 50%.

Comment 4 Andy Grimm 2013-10-03 14:07:56 UTC
Rob and I discussed yesterday that there are two other things that still impact ApplicationContainer initialization time.

1) Loading the container_plugin object is expensive.  Switching to lazy loading makes a huge difference for things where it's not needed.

2) Even when a uid is passed in, we still do Etc.getpwnam(@uuid) to get uid, gid, gecos, and container_dir.  We don't need the lookup if we can assume the following:
  * gid = uid
  * gecos = @config.get("GEAR_GECOS")
  * container_dir = "#{@base_dir}/#{@uuid}"

I told him I was okay with calling this bug "fixed" with the one-line patch, but I wanted to mention the other two things here, because patches for those will be coming soon.

I've been testing these changes by timing this simple script:

#!/usr/bin/env oo-ruby
require 'openshift-origin-node/model/node'
results = OpenShift::Runtime::Node.node_utilization
puts results

The cpu times I see on a system with 1500 gears (each line assuming the prior patches are also applied) are:

Original code:       user: 6.0s, system: 13.9s
w/ Environ patch:    user: 2.4s, system:  2.6s
w/ lazy load:        user: 1.4s, system:  1.7s
w/o getpwnam call:   user: 1.0s, system:  1.4s

Comment 5 Rob Millner 2013-10-03 17:01:46 UTC
Nice results!

I believe item #2 is a reasonable assertion; however, there are contexts where ApplicationContainer.new is called with uid=nil.  Etc.getpwnam is required when that happens.

Keeping this ticket open to work with the lazy load and getpwnam changes.


For Q/E, the most important thing to test is whether anything is broken by these changes.

Comment 6 Andy Grimm 2013-10-03 17:55:12 UTC
Sure, I knew we couldn't completely avoid the getpwnam call; it just seems silly to call it again when uid is passed in.

FWIW, I also looked at one other thing, which was to allow a config to be passed in instead calling Config.new everywhere.  I got rid of four "stat" calls per gear that way, but it turns out those are extremely cheap, so the performance difference was negligible.

Comment 7 Rob Millner 2013-10-03 18:10:10 UTC
Yea, I added code to the Config class a few months back to cache the parsed configuration for every file its called on and test the mtime for changes.  

There's two files that account for almost every invocation of Config, both in /etc/openshift. The relevant inodes are very likely to be cached by the OS.


Within an instance of a class, it makes sense to call Config once in initialize and store commonly used results.

Comment 8 Rob Millner 2013-10-03 21:48:33 UTC
We now allow a pwent to be passed as the uid.  

Determined that the mcs_label generator dominated the cost of the container plugin and the context of how its used is much clearer so I lazy loaded that instead.

Removed an unnecessary call to OpenShift::Config.

Pull request:
https://github.com/openshift/origin-server/pull/3769

Comment 9 Andy Grimm 2013-10-04 13:01:56 UTC
Ha!  The mcs_label generator that I already spent a ton of time optimizing.  So of course I had to go look at it.  This code is the problem, and it's actually not even correct:

          begin
            uid = Etc.getpwnam(name.to_s).uid
          rescue ArgumentError, TypeError, NoMethodError
            uid = name.to_i
          end

The reason it's not correct is that we have usernames like '5432abcdef', and if the user doesn't exist on the system, an ArgumentError is raised, and we then set uid to name.to_i, which evaluates to 5432.  That very possibly _is_ a valid uid on the system, which is a big problem.

Better would be something like:

  if name.class < Integer or name.to_i.to_s == name
    uid = name.to_i
  else
    # Wrap a begin/rescue here if you want
    uid = Etc.getpwnam(name.to_s).uid
  end

Then passing uids into get_mcs_label becomes very cheap.  Sorry that this is a bit of a tangent from the original issue; I can put that in a separate bug if you want.

Comment 10 Rob Millner 2013-10-04 18:18:51 UTC
Fixed the '5432abcdef' issue and moved the getpwnam call to after testing for a numeric argument.

Pull request:
https://github.com/openshift/origin-server/pull/3771

Comment 11 openshift-github-bot 2013-10-04 19:40:52 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/885b7d7c28ca9ca44b94ec2edf75642e70bdcf00
Bug 1014768 - the previous logic could not properly distinguish usernames that begin with numbers and called Etc.getpwnam too often.

Comment 12 Rob Millner 2013-10-04 20:28:16 UTC
Looked around the code for additional calls to Etc that could be optimised.
https://github.com/openshift/origin-server/pull/3774

Comment 13 openshift-github-bot 2013-10-04 21:54:14 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6264e8f38d0476501636ba50497ced4958c9d0a3
Bug 1014768 - Audit additional uses of password calls.

Comment 14 Rob Millner 2013-10-07 21:02:25 UTC
The fix to get_mcs_label didn't make stage cut, sending a stage pull request.
https://github.com/openshift/origin-server/pull/3779

Comment 15 Rob Millner 2013-10-08 23:37:52 UTC
Moving this to Q/E for testing.

Comment 16 Meng Bo 2013-10-10 11:42:58 UTC
Do regression testing on devenv-stage_491 and devenv_3880, no related regression issue found.

Move bug to verified.


Note You need to log in before you can comment on or make changes to this bug.