Bug 964205 - oo-admin-ctl-cgroups restartall fails with no such file or directory
Summary: oo-admin-ctl-cgroups restartall fails with no such file or directory
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: John W. Lamb
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-17 14:32 UTC by Kenny Woodson
Modified: 2015-05-14 23:18 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-11 04:05:56 UTC
Target Upstream Version:
Embargoed:


Attachments:
output from oo-admin-ctl-cgroups restartall (165.40 KB, text/plain)
2013-05-17 14:32 UTC, Kenny Woodson

Description Kenny Woodson 2013-05-17 14:32:03 UTC
Created attachment 749358
output from oo-admin-ctl-cgroups restartall

Description of problem:

Lately we have had a large number of apps that are not in cgroups, and I am trying to repair them using oo-admin-ctl-cgroups.

When using oo-admin-ctl-cgroups with the restartall flag it has failures like the following:
-----
stopping cgroups for 8fdbd8ae6554400fa9a7538dc74665df...cgdelete: cannot remove group '/openshift/8fdbd8ae6554400fa9a7538dc74665df': No such file or directory
 [OK] 
stopping cgroups for 5e88f0517ab941439469d47f34d8cc20...cgdelete: cannot remove group '/openshift/5e88f0517ab941439469d47f34d8cc20': No such file or directory
 [OK] 
stopping cgroups for 7cb9616e40684e47954adec9db9d08f6...cgdelete: cannot remove group '/openshift/7cb9616e40684e47954adec9db9d08f6': No such file or directory
 [OK] 
stopping cgroups for bcc2673217764812b116c2bdd3f0c736...cgdelete: cannot remove group '/openshift/bcc2673217764812b116c2bdd3f0c736': No such file or directory
----

I believe this could be optimized to skip idled gears, since they are not running or consuming any system resources.  The script is very slow and can take 60+ minutes to complete on systems with > 1000 users.


Version-Release number of selected component (if applicable):

2.0.27.1

How reproducible:

Very.

Steps to Reproduce:
1.  Remove a user from cgroups by removing their /cgroups/all/openshift/<uuid> directory.  
2. Run oo-accept-node and view any processes without cgroups.
3. Run oo-admin-ctl-cgroups restartall and see the time pass by.
4. Gears that aren't running are getting placed into cgroups and errors like the above are shown.
  
Actual results:

Errors of "no such file or directory" are shown.

Expected results:

Skip gears that are already properly placed in cgroups.  This should be fast and efficient.

Additional info:

We depend on cgroups to ensure that our resources are managed properly.  It would be a big win for us to cut the time these commands take to run and to eliminate the errors they produce while running.

Comment 1 John W. Lamb 2013-05-24 18:13:17 UTC
Spoke with rmillner, kwoodson, markllama to clarify the goals of this ticket:
1) oo-admin-ctl-cgroups restartall should restart all cgroups more cleanly than it does (capture/prevent the cgdelete errors for missing cgroups)
2) oo-admin-ctl-cgroups restartall is currently being used to repair missing cgroups, but since it iterates across all cgroups, even working ones, this takes too long. A new command needs to be added - "repair" - that only starts missing cgroups
3) there are a number of logical errors that may be preventing this script from working properly in all instances - these need to be addressed.

I will fix these issues in reverse order - code fixes for existing functionality need to be in place and tested before adding new features.

Comment 2 John W. Lamb 2013-05-24 21:10:11 UTC
Created pull request to address this bug: https://github.com/openshift/origin-server/pull/2640

Comment 3 openshift-github-bot 2013-05-29 17:01:08 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6a9616617ee30ac0e757d6f21bcd2e2cd3f1b729
<oo-admin-ctl-cgroups> Bug 964205 - add "repair" command

This commit adds a new command to oo-admin-ctl-cgroups called
"repair". This command identifies users which lack a cgroup and
creates a cgroup for them. It can be called with an optional username
argument (or a quoted space-separated list of usernames) to restrict
the action to a specific subset.

The "restartall" command remains functionally unchanged - this will be
addressed in the next commit
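A minimal sketch of the "repair" behaviour described above, for illustration only; it is not the code from the commit. Assumptions: the cgroup root is /cgroups/all/openshift (taken from the reproduction steps), users are discovered by listing a hypothetical gear directory (GEAR_BASE_DIR), the subsystem list passed to cgcreate is a guess, and the list_users, create_cgroup and startuser helpers are named for illustration.

```shell
#!/bin/bash
# Illustrative sketch only. Paths, subsystem list and helper names are assumptions.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"
GEAR_BASE_DIR="${GEAR_BASE_DIR:-/var/lib/openshift}"
SUBSYSTEMS="${SUBSYSTEMS:-cpu,cpuacct,memory,net_cls,freezer}"

# True if a cgroup directory already exists for the given user.
cgroup_exists() {
    [ -d "${CGROUP_ROOT}/$1" ]
}

# Hypothetical user discovery: one directory per gear under GEAR_BASE_DIR.
list_users() {
    local d
    for d in "${GEAR_BASE_DIR}"/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Create the cgroup for one user (cgcreate -g <controllers>:<path>).
create_cgroup() {
    cgcreate -g "${SUBSYSTEMS}:/openshift/$1"
}

startuser() {
    printf 'starting cgroups for %s...' "$1"
    if create_cgroup "$1"; then
        echo " [OK]"
    else
        echo " [FAILED]"
        return 1
    fi
}

# repair [ "user1 user2 ..." ]
# Starts cgroups only for users that lack one; working cgroups are skipped.
repair() {
    local users="${1:-$(list_users)}"
    local uuid
    for uuid in $users; do
        cgroup_exists "$uuid" || startuser "$uuid"
    done
}
```

Under these assumptions, calling repair with no argument covers every discovered user, while repair "uuid1 uuid2" restricts the action to the quoted list. Because existing cgroups are skipped, the cost scales with the number of missing cgroups rather than the total number of users.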

https://github.com/openshift/origin-server/commit/38cd1a799f79ad445c000cf9fd451aca8074a2d9
<oo-admin-ctl-cgroups> Bug 964205 - prevent stopping already stopped cgroups

Modified stopuser() to check if the cgroup for the specified user has
already been stopped, and echo a useful message if that is the case
instead of running cgdelete and generating "file not found" noise
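The guard described in this commit message might look roughly like the sketch below; it is not the committed code. The stopuser and cgroup_exists names come from the commit messages in this bug. The cgroup root (taken from the reproduction steps), the subsystem list and the delete_cgroup helper are assumptions made for illustration.

```shell
#!/bin/bash
# Illustrative sketch only. Path, subsystem list and delete_cgroup are assumptions.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"
SUBSYSTEMS="${SUBSYSTEMS:-cpu,cpuacct,memory,net_cls,freezer}"

# True if a cgroup directory exists for the given user.
cgroup_exists() {
    [ -d "${CGROUP_ROOT}/$1" ]
}

# Remove the cgroup for one user (cgdelete <controllers>:<path>).
delete_cgroup() {
    cgdelete "${SUBSYSTEMS}:/openshift/$1"
}

stopuser() {
    local uuid="$1"
    printf 'stopping cgroups for %s...' "$uuid"
    if ! cgroup_exists "$uuid"; then
        # Nothing to delete: report it instead of letting cgdelete fail.
        printf ' cgroup already stopped'
    else
        delete_cgroup "$uuid" || return 1
    fi
    echo " [OK]"
}
```

Checking for the directory first means cgdelete is only invoked when there is something to remove, so a user whose cgroup is already gone produces an informational message rather than a "No such file or directory" error.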

https://github.com/openshift/origin-server/commit/09c88bc4591b2590fee705f6793c45694d1e0d6b
<oo-admin-ctl-groups> Bug 964205 - fix set_blkio function comment to be more accurate

https://github.com/openshift/origin-server/commit/a9114ba7c657869ad38aa3bf8a7f433a4ef97c6d
<oo-admin-ctl-cgroups> Bug 964205 - amend comments for cgroup_exists function

Made function comment more "standard" according to advice from @markllama

Comment 4 Meng Bo 2013-06-03 12:44:59 UTC
Checked on devenv-stage_356: oo-admin-ctl-cgroups restartall no longer shows the "file not found" errors.

The new command oo-admin-ctl-cgroups repair restores the cgroups for stopped users.

# oo-admin-ctl-cgroups restartall
Removing Openshift guest control groups: 
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
[ OK ]Openshift cgroups uninitialized
Initializing Openshift guest control groups: 
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[ OK ]Openshift cgroups initialized

WARNING !!! WARNING !!! WARNING !!!
Cgroups may have just restarted.  It's important to confirm all the openshift apps are actively running.
It's suggested you run service openshift restart now
WARNING !!! WARNING !!! WARNING !!!




[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 2f98d53acc4a11e28c4c22000a9708e9
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 4e5f64b6cc4a11e28c4c22000a9708e9
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups repair
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]

