Bug 846445

Summary: Updating rhc-node rpm takes over an hour on nodes with a lot of gears
Product: OKD
Reporter: Thomas Wiest <twiest>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: high
Version: 2.x
CC: jialiu, mfisher, mmcgrath
Target Milestone: ---
Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libra_ami #2027
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-09-17 21:29:42 UTC
Type: Bug

Description Thomas Wiest 2012-08-07 20:07:39 UTC
Description of problem:
Updating the rhc-node rpm on nodes with a lot of gears is extremely slow. I didn't clock it, but it was definitely over an hour, possibly as much as two.

The output showed a lot of chcons were run, and while it was running, pstree showed that it was running libra-tc.

|      |-sshd---ruby---yum---sh---service---libra-tc


Version-Release number of selected component (if applicable):
rhc-node-0.96.14-1.el6_3.x86_64


How reproducible:
Very reproducible on nodes with a high number of gears.


Steps to Reproduce:
1. Create over 3k gears on an instance
2. Upgrade the rhc-node rpm (it'd be a good idea to run this using 'time')
3. Notice how long it takes to upgrade

  
Actual results:
Takes a _very_ long time to update the rpm.


Expected results:
It shouldn't take that long to update the rpm.

Comment 1 Thomas Wiest 2012-08-07 20:20:41 UTC
Here's an example of a chcon that was run during the update:

chcon -t libra_var_lib_t -l s0:c2,c560 -R /var/lib/stickshift/088b2f0e4a764ee7a492254406d7f657/[^.]*

Comment 2 Rob Millner 2012-08-07 20:54:08 UTC
A patch went in today which should fix the issue where restorecon sets the wrong selinux context.

The libra-tc script sets traffic control limits on the gears.  It's unclear where or why it calls chcon; I'll have a look.

Comment 3 Rob Millner 2012-08-08 00:23:49 UTC
The rhc-node %post script does an rhc-restorecon to fix selinux permissions in /var/lib/stickshift.

Libra-tc sets up traffic control.

Both iterate over each gear and are likely to suffer when there's a large number of gears on the node.

Comment 4 Mike McGrath 2012-08-10 18:46:20 UTC
One thing worth looking at is if these scripts even need to be run as a result of a simple update.

Comment 5 Rob Millner 2012-08-17 18:17:09 UTC
Created a node with 3000 gears and ran through the restarts in node's %post by hand with the following results:

cgconfig restart:          fast, but wipes out libra-cgroups
libra-cgroups restart:     2021 seconds
libra-tc restart:           226 seconds
rhc-restorecon:             749 seconds
rhc-ip-prep:               already controlled, didn't measure, takes a long time.
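
For a rough sense of scale, the three measured restarts above can be summed. This is only a sketch of the arithmetic: cgconfig and rhc-ip-prep were not timed, so the real total for the %post script is presumably higher. The helper name total_seconds is made up for this example.

```shell
#!/bin/sh
# Sum the measured restart times from the table above (in seconds).
# cgconfig and rhc-ip-prep are left out because they were not measured.
total_seconds() {
    total=0
    for t in "$@"; do
        total=$((total + t))
    done
    echo "$total"
}

# libra-cgroups + libra-tc + rhc-restorecon
measured=$(total_seconds 2021 226 749)
echo "measured total: ${measured}s (about $((measured / 60)) minutes)"
```

On the figures above this comes to 2996 seconds, about 49 minutes, which is in line with the "over an hour" reported once the unmeasured steps are included.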


The conundrum is that changes to the cgroups and tc rules must take effect on upgrade, even on C9 nodes.

Similarly, either rhc-restorecon or a fixfiles should be run if the selinux file configuration policy is updated (libra.fc).

Maybe the correct path is to touch /.autorelabel if libra.fc is new or has changed.
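
A minimal sketch of that idea, assuming the decision is made by comparing a checksum of libra.fc against one saved on the previous run. The function name, the stamp file, and all paths are hypothetical and not taken from the rhc-node spec file. Note that /.autorelabel only takes effect at the next boot, so this defers the relabel rather than performing it during the update.

```shell
#!/bin/sh
# Hypothetical helper: request a relabel only when the policy file changed.
# Arguments (all paths are illustrative assumptions):
#   $1  file-context policy file (for example an installed copy of libra.fc)
#   $2  stamp file holding the checksum recorded on the previous run
#   $3  flag file to create (/.autorelabel on a real node)
relabel_if_changed() {
    fc_file=$1
    stamp_file=$2
    flag_file=$3

    # Checksum via stdin so the file name is not part of the output.
    new_sum=$(cksum < "$fc_file") || return 1

    old_sum=""
    if [ -f "$stamp_file" ]; then
        old_sum=$(cat "$stamp_file")
    fi

    if [ "$new_sum" != "$old_sum" ]; then
        # New or changed policy: ask for a relabel and remember the checksum.
        touch "$flag_file"
        printf '%s\n' "$new_sum" > "$stamp_file"
        echo "changed"
    else
        echo "unchanged"
    fi
}
```

In a %post script this might be called with the installed libra.fc, a stamp file under some state directory, and /.autorelabel as the flag; a missing stamp file is treated as "new".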

Comment 6 Rob Millner 2012-08-17 18:33:07 UTC
rhc-restorecon does the wrong thing anyway; we should remove the automatic invocation.

Comment 7 Rob Millner 2012-08-18 00:12:47 UTC
Commit 34af28a removes rhc-restorecon, and only runs cgconfig/libra-cgroups and libra-tc if they are not already initialized.
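
The general shape of "only run if not already initialized" can be sketched as below. This is not the content of commit 34af28a; the function name and the idea of passing the check and setup steps as command strings are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: run an expensive per-gear setup step only when a cheap
# check reports that it has not been done yet.
#   $1  label used in the log messages
#   $2  check command; exit status 0 means "already initialized"
#   $3  setup command to run otherwise
run_unless_initialized() {
    name=$1
    check_cmd=$2
    setup_cmd=$3

    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "$name: already initialized, skipping"
    else
        echo "$name: not initialized, running setup"
        sh -c "$setup_cmd"
    fi
}

# Illustrative call only. Whether the libra-tc init script offers a usable
# "status" action is an assumption, so the line is left commented out:
# run_unless_initialized libra-tc "service libra-tc status" "service libra-tc restart"
```

The point of the design is that the check is cheap and constant-time, while the setup step iterates over every gear.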

Comment 8 Rob Millner 2012-08-18 00:14:37 UTC
Pull request: https://github.com/openshift/li/pull/265

Comment 9 Rob Millner 2012-08-18 17:19:08 UTC
Pull request merged.

Comment 10 Johnny Liu 2012-08-21 11:25:25 UTC
Verified this bug with rhc-node-0.97.6-1.el6_3.x86_64.rpm: PASS.

1. Start an old instance (devenv-stage_232)
2. Create an app
3. Run the following command to create a dummy test environment with about 2000 gears on this node.
$ for i in `seq 1 2000`; do useradd -b /var/lib/stickshift -c "libra guest" user$i; runuser -l user$i -s /bin/sh -c "cp -r /var/lib/stickshift/655c12fb14b14d9c820adb105e3c76e2/* /var/lib/stickshift/user${i}/"; done
4. Re-install rhc-node
# time yum -y reinstall rhc-node
<--snip-->
Running Transaction
  Installing : rhc-node-0.96.14-1.el6_3.x86_64                                                                                                                                       1/1
Stopping system message bus: [  OK  ]
Starting system message bus: [  OK  ]
Shutting down oddjobd: [  OK  ]
Starting oddjobd: [  OK  ]
<--snip-->
chcon -t libra_var_lib_t -l s0:c2,c200 -R /var/lib/stickshift/user1742/[^.]*
chcon -t libra_tmp_t -l s0:c2,c200 -R /var/lib/stickshift/user1742/.tmp/*
<--snip-->
Stopping stickshift-proxy: [  OK  ]
Starting stickshift-proxy: [  OK  ]
Stopping stickshift-proxy: [  OK  ]
Starting stickshift-proxy: [  OK  ]
  Verifying  : rhc-node-0.96.14-1.el6_3.x86_64         
<--snip-->
real    8m9.938s
user    0m47.861s
sys     5m6.053s


5. Download the latest rhc-node package, then re-install it.
# time yum install -y rhc-node-0.97.6-1.el6_3.x86_64.rpm
Stopping system message bus: [  OK  ]
Starting system message bus: [  OK  ]
Shutting down oddjobd: [  OK  ]
Starting oddjobd: [  OK  ]
Stopping stickshift-proxy: [  OK  ]
Starting stickshift-proxy: [  OK  ]
Stopping stickshift-proxy: [  OK  ]
Starting stickshift-proxy: [  OK  ]
  Cleanup    : rhc-node-0.96.14-1.el6_3.x86_64                                                                                                                                       2/2
  Verifying  : rhc-node-0.97.6-1.el6_3.x86_64                                                                                                                                        1/2

  Verifying  : rhc-node-0.96.14-1.el6_3.x86_64                                                                                                                                       2/2

<--snip-->
Updated:
  rhc-node.x86_64 0:0.97.6-1.el6_3
<--snip-->
real    0m27.186s
user    0m15.123s
sys     0m4.418s

The elapsed time is obviously much shorter than before: about 27 seconds with the new package versus more than 8 minutes with the old one.
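
To put a number on the difference, the two "real" times from steps 4 and 5 can be converted to seconds and compared. This is a small sketch; to_seconds is a made-up helper that only handles the NmS.SSSs form printed by time above.

```shell
#!/bin/sh
# Convert a `time`-style duration such as 8m9.938s to seconds.
# Splitting on "m" and "s" yields the minutes in $1 and the seconds in $2.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

before=$(to_seconds 8m9.938s)   # step 4: reinstall of rhc-node-0.96.14
after=$(to_seconds 0m27.186s)   # step 5: install of rhc-node-0.97.6

awk -v b="$before" -v a="$after" \
    'BEGIN { printf "before=%ss after=%ss speedup=%.1fx\n", b, a, b / a }'
```

On these two measurements the new package is roughly 18 times faster, keeping in mind that this test node had about 2000 dummy gears rather than the 3000 used in comment 5.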