Bug 964205

Summary: oo-admin-ctl-cgroups restartall fails with no such file or directory
Product: OpenShift Online
Reporter: Kenny Woodson <kwoodson>
Component: Containers
Assignee: John W. Lamb <jolamb>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.x
CC: bmeng, chunchen, jolamb, xtian
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-11 04:05:56 UTC
Type: Bug
Attachments:
Description: output from oo-admin-ctl-cgroups restartall
Flags: none

Description Kenny Woodson 2013-05-17 14:32:03 UTC
Created attachment 749358
output from oo-admin-ctl-cgroups restartall

Description of problem:

Lately we have had a large number of apps that are not in cgroups, and I am trying to repair them using oo-admin-ctl-cgroups.

Running oo-admin-ctl-cgroups with the restartall command produces failures like the following:
-----
stopping cgroups for 8fdbd8ae6554400fa9a7538dc74665df...cgdelete: cannot remove group '/openshift/8fdbd8ae6554400fa9a7538dc74665df': No such file or directory
 [OK] 
stopping cgroups for 5e88f0517ab941439469d47f34d8cc20...cgdelete: cannot remove group '/openshift/5e88f0517ab941439469d47f34d8cc20': No such file or directory
 [OK] 
stopping cgroups for 7cb9616e40684e47954adec9db9d08f6...cgdelete: cannot remove group '/openshift/7cb9616e40684e47954adec9db9d08f6': No such file or directory
 [OK] 
stopping cgroups for bcc2673217764812b116c2bdd3f0c736...cgdelete: cannot remove group '/openshift/bcc2673217764812b116c2bdd3f0c736': No such file or directory
----

I believe this could be optimized to skip gears that are idled, since they are not running or consuming any system resources. The script is very slow and can take 60+ minutes to complete on systems with > 1000 users.


Version-Release number of selected component (if applicable):

2.0.27.1

How reproducible:

Very.

Steps to Reproduce:
1. Remove a user from cgroups by removing their /cgroups/all/openshift/<uuid> directory.
2. Run oo-accept-node and view any processes without cgroups.
3. Run oo-admin-ctl-cgroups restartall and note how long it takes.
4. Gears that aren't running are placed into cgroups, and errors like the above are shown.
  
Actual results:

"No such file or directory" errors are shown.

Expected results:

Skip users whose directories are already properly in cgroups. This should be fast and efficient.

Additional info:

We depend on cgroups to ensure that our resources are managed properly. It would be a big win for us if we could trim the time these runs take and get rid of any errors or problems that occur while they run.

Comment 1 John W. Lamb 2013-05-24 18:13:17 UTC
Spoke with rmillner, kwoodson, markllama to clarify the goals of this ticket:
1) oo-admin-ctl-cgroups restartall should restart all cgroups more cleanly than it does (capture/prevent the cgdelete errors for missing cgroups)
2) oo-admin-ctl-cgroups restartall is currently being used to repair missing cgroups, but since it iterates across all cgroups, even working ones, this takes too long. A new command needs to be added - "repair" - that only starts missing cgroups
3) there are a number of logical errors that may be preventing this script from working properly in all instances - these need to be addressed.

I will fix these issues in reverse order - code fixes for existing functionality need to be in place and tested before adding new features.

Comment 2 John W. Lamb 2013-05-24 21:10:11 UTC
Created pull request to address this bug: https://github.com/openshift/origin-server/pull/2640

Comment 3 openshift-github-bot 2013-05-29 17:01:08 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6a9616617ee30ac0e757d6f21bcd2e2cd3f1b729
<oo-admin-ctl-cgroups> Bug 964205 - add "repair" command

This commit adds a new command to oo-admin-ctl-cgroups called
"repair". This command identifies users which lack a cgroup and
creates a cgroup for them. It can be called with an optional username
argument (or a quoted space-separated list of usernames) to restrict
the action to a specific subset.

The "restartall" command remains functionally unchanged - this will be
addressed in the next commit
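
The behaviour described in this commit message could be sketched roughly as below. This is not the code from the commit: the function names, the `CGROUP_ROOT` and `GEAR_HOME` variables, the default value of `GEAR_HOME`, the idea of taking user names from directory names, and the use of `mkdir` as a stand-in for the real cgroup creation step are all assumptions made for illustration, so that the sketch can be tried against a scratch directory.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the code from the commit.
# CGROUP_ROOT default follows the path quoted in this report.
# GEAR_HOME and its default are assumptions for illustration.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"
GEAR_HOME="${GEAR_HOME:-/var/lib/openshift}"

# Succeed if a cgroup directory exists for the given user.
cgroup_exists_sketch() {
    [ -d "${CGROUP_ROOT}/$1" ]
}

# Stand-in for the real cgroup creation step (assumption: the actual
# script would invoke the cgroup tools here rather than mkdir).
start_cgroup_sketch() {
    mkdir -p "${CGROUP_ROOT}/$1"
}

# Print all user names, assumed here to be the directory names under GEAR_HOME.
all_users_sketch() {
    local dir
    for dir in "${GEAR_HOME}"/*/; do
        if [ -d "$dir" ]; then
            basename "$dir"
        fi
    done
}

# repair_sketch [ "user1 user2 ..." ]
# With no argument every user is considered; otherwise only those listed.
repair_sketch() {
    local users user
    if [ -n "${1:-}" ]; then
        users="$1"
    else
        users="$(all_users_sketch)"
    fi
    # Unquoted on purpose: the list is split on whitespace.
    for user in $users; do
        if cgroup_exists_sketch "$user"; then
            continue    # cgroup already present, nothing to repair
        fi
        echo "starting cgroups for ${user}..."
        start_cgroup_sketch "$user"
    done
}
```

Under these assumptions, `repair_sketch` with no argument touches only users that lack a cgroup directory, and `repair_sketch "uuid1 uuid2"` restricts the check to the listed users. Users that already have a cgroup are skipped without any output.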

https://github.com/openshift/origin-server/commit/38cd1a799f79ad445c000cf9fd451aca8074a2d9
<oo-admin-ctl-cgroups> Bug 964205 - prevent stopping already stopped cgroups

Modified stopuser() to check if the cgroup for the specified user has
already been stopped, and echo a useful message if that is the case
instead of running cgdelete and generating "file not found" noise
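
A guard of the kind this commit message describes might look like the sketch below. It is not the actual stopuser() from the commit: the function name, the `CGROUP_REMOVE_CMD` hook (defaulting to `rmdir` as a stand-in for cgdelete, so the sketch runs without the cgroup tools) and the `[FAILED]` message are assumptions for illustration.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the stopuser() from the commit.
# CGROUP_ROOT default follows the path quoted in this report.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"
# Hypothetical hook for the removal step; the real script would call
# cgdelete at this point. rmdir is used so the sketch is runnable anywhere.
CGROUP_REMOVE_CMD="${CGROUP_REMOVE_CMD:-rmdir}"

stopuser_sketch() {
    local user="$1"
    printf 'stopping cgroups for %s...' "$user"
    if [ ! -d "${CGROUP_ROOT}/${user}" ]; then
        # Nothing to delete: say so instead of invoking the removal tool,
        # which would otherwise report "No such file or directory".
        echo " cgroup already stopped [OK]"
        return 0
    fi
    if "$CGROUP_REMOVE_CMD" "${CGROUP_ROOT}/${user}"; then
        echo " [OK]"
    else
        echo " [FAILED]"    # assumed failure message, not from the source
        return 1
    fi
}
```

Checking for the directory first means the removal command is only invoked for cgroups that exist, so a second stop of the same user prints an informational message rather than an error.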

https://github.com/openshift/origin-server/commit/09c88bc4591b2590fee705f6793c45694d1e0d6b
<oo-admin-ctl-groups> Bug 964205 - fix set_blkio function comment to be more accurate

https://github.com/openshift/origin-server/commit/a9114ba7c657869ad38aa3bf8a7f433a4ef97c6d
<oo-admin-ctl-cgroups> Bug 964205 - amend comments for cgroup_exists function

Made function comment more "standard" according to advice from @markllama

Comment 4 Meng Bo 2013-06-03 12:44:59 UTC
Checked on devenv-stage_356: oo-admin-ctl-cgroups restartall no longer shows the "file not found" errors.

The new oo-admin-ctl-cgroups repair command restores the cgroups of stopped users.

# oo-admin-ctl-cgroups restartall
Removing Openshift guest control groups: 
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
[ OK ]Openshift cgroups uninitialized
Initializing Openshift guest control groups: 
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[ OK ]Openshift cgroups initialized

WARNING !!! WARNING !!! WARNING !!!
Cgroups may have just restarted.  It's important to confirm all the openshift apps are actively running.
It's suggested you run service openshift restart now
WARNING !!! WARNING !!! WARNING !!!




[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 2f98d53acc4a11e28c4c22000a9708e9
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 4e5f64b6cc4a11e28c4c22000a9708e9
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups repair
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]