Bug 964205 - oo-admin-ctl-cgroups restartall fails with no such file or directory
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Hardware: Unspecified, OS: Unspecified
Priority: medium, Severity: medium
Assigned To: John W. Lamb
QA Contact: libra bugs
Keywords: FutureFeature
Reported: 2013-05-17 10:32 EDT by Kenny Woodson
Modified: 2015-05-14 19:18 EDT
Doc Type: Enhancement
Last Closed: 2013-06-11 00:05:56 EDT
Type: Bug

Attachments
output from oo-admin-ctl-cgroups restartall (165.40 KB, text/plain)
2013-05-17 10:32 EDT, Kenny Woodson

Description Kenny Woodson 2013-05-17 10:32:03 EDT
Created attachment 749358
output from oo-admin-ctl-cgroups restartall

Description of problem:

Lately we have had a large number of apps that are not in cgroups, and I am trying to repair them using oo-admin-ctl-cgroups.

When oo-admin-ctl-cgroups is run with the restartall flag, it reports failures like the following:
stopping cgroups for 8fdbd8ae6554400fa9a7538dc74665df...cgdelete: cannot remove group '/openshift/8fdbd8ae6554400fa9a7538dc74665df': No such file or directory
stopping cgroups for 5e88f0517ab941439469d47f34d8cc20...cgdelete: cannot remove group '/openshift/5e88f0517ab941439469d47f34d8cc20': No such file or directory
stopping cgroups for 7cb9616e40684e47954adec9db9d08f6...cgdelete: cannot remove group '/openshift/7cb9616e40684e47954adec9db9d08f6': No such file or directory
stopping cgroups for bcc2673217764812b116c2bdd3f0c736...cgdelete: cannot remove group '/openshift/bcc2673217764812b116c2bdd3f0c736': No such file or directory

I believe this could be optimized to skip idled gears, since they are not running or consuming any system resources. The script is very slow and can take 60+ minutes to complete on systems with > 1000 users.

Version-Release number of selected component (if applicable):

How reproducible:


Steps to Reproduce:
1. Remove a user from cgroups by removing their /cgroups/all/openshift/<uuid> directory.
2. Run oo-accept-node and view any processes without cgroups.
3. Run oo-admin-ctl-cgroups restartall and note how long it takes.
4. Gears that aren't running are placed into cgroups, and errors like the above are shown.

Actual results:

Errors of "no such file or directory" are shown.

Expected results:

Skip gears that are already properly in cgroups. The run should be fast and efficient.

Additional info:

We depend on cgroups to ensure that our resources are managed properly. It would be a big win for us if we could trim the time these runs take and get rid of any errors or problems while they run.
Comment 1 John W. Lamb 2013-05-24 14:13:17 EDT
Spoke with rmillner, kwoodson, markllama to clarify the goals of this ticket:
1) oo-admin-ctl-cgroups restartall should restart all cgroups more cleanly than it does (capture/prevent the cgdelete errors for missing cgroups)
2) oo-admin-ctl-cgroups restartall is currently being used to repair missing cgroups, but since it iterates across all cgroups, even working ones, this takes too long. A new command needs to be added - "repair" - that only starts missing cgroups
3) there are a number of logical errors that may be preventing this script from working properly in all instances - these need to be addressed.

I will fix these issues in reverse order - code fixes for existing functionality need to be in place and tested before adding new features.
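
As an illustration of goal 2, the core of a "repair" pass is finding the users that lack a cgroup without touching the working ones. The following is a minimal sketch, not the code from the eventual pull request. The CGROUP_ROOT default is taken from the path in the reproduction steps, the body of cgroup_exists is an assumption, and missing_cgroups is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: CGROUP_ROOT and the bodies of both functions are assumptions
# for illustration, not the actual oo-admin-ctl-cgroups implementation.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"

# Succeed if a cgroup directory exists for the given gear UUID.
cgroup_exists() {
    [ -d "${CGROUP_ROOT}/$1" ]
}

# Print, one per line, the UUIDs from the argument list that lack a cgroup.
# (Hypothetical helper; a repair command would start cgroups only for these.)
missing_cgroups() {
    local uuid
    for uuid in "$@"; do
        cgroup_exists "$uuid" || echo "$uuid"
    done
}
```

Calling missing_cgroups with a list of gear UUIDs prints only those that need a cgroup started, so the cost of a repair scales with the number of broken gears rather than the total number of gears on the node.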
Comment 2 John W. Lamb 2013-05-24 17:10:11 EDT
Created pull request to address this bug: https://github.com/openshift/origin-server/pull/2640
Comment 3 openshift-github-bot 2013-05-29 13:01:08 EDT
Commits pushed to master at https://github.com/openshift/origin-server

<oo-admin-ctl-cgroups> Bug 964205 - add "repair" command

This commit adds a new command to oo-admin-ctl-cgroups called
"repair". This command identifies users which lack a cgroup and
creates a cgroup for them. It can be called with an optional username
argument (or a quoted space-separated list of usernames) to restrict
the action to a specific subset.

The "restartall" command remains functionally unchanged - this will be
addressed in the next commit

<oo-admin-ctl-cgroups> Bug 964205 - prevent stopping already stopped cgroups

Modified stopuser() to check if the cgroup for the specified user has
already been stopped, and echo a useful message if that is the case
instead of running cgdelete and generating "file not found" noise

<oo-admin-ctl-groups> Bug 964205 - fix set_blkio function comment to be more accurate

<oo-admin-ctl-cgroups> Bug 964205 - amend comments for cgroup_exists function

Made function comment more "standard" according to advice from @markllama
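
The stopuser() change described in the commits above amounts to checking for the cgroup before deleting it. The sketch below is a hedged illustration of that guard, not the committed code. CGROUP_ROOT (taken from the path in the reproduction steps), the CGDELETE override variable, and the controller list passed to cgdelete are all assumptions.

```shell
#!/bin/bash
# Sketch only: the real stopuser() is not shown in this report. CGROUP_ROOT,
# CGDELETE, and the controller list below are assumptions for illustration.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"
CGDELETE="${CGDELETE:-cgdelete}"

stopuser() {
    local uuid="$1"
    printf 'stopping cgroups for %s... ' "$uuid"
    if [ ! -d "${CGROUP_ROOT}/${uuid}" ]; then
        # Nothing to delete: say so instead of letting cgdelete fail
        # with "No such file or directory".
        echo "cgroup already stopped [OK]"
        return 0
    fi
    # Assumed controller list; the function returns cgdelete's exit status.
    "$CGDELETE" "cpu,cpuacct,memory,net_cls,freezer:/openshift/${uuid}" && echo "[OK]"
}
```

Treating an already-missing cgroup as success keeps restartall going without error noise, while a genuine cgdelete failure still surfaces through the return status.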
Comment 4 Meng Bo 2013-06-03 08:44:59 EDT
Checked on devenv-stage_356: oo-admin-ctl-cgroups restartall no longer shows the "file not found" errors.

The new command, oo-admin-ctl-cgroups repair, restores cgroups for the stopped users.

# oo-admin-ctl-cgroups restartall
Removing Openshift guest control groups: 
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... cgroup already stopped [OK] 
[ OK ]Openshift cgroups uninitialized
Initializing Openshift guest control groups: 
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[ OK ]Openshift cgroups initialized

Cgroups may have just restarted.  It's important to confirm all the openshift apps are actively running.
It's suggested you run service openshift restart now

[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 2f98d53acc4a11e28c4c22000a9708e9
stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 4e5f64b6cc4a11e28c4c22000a9708e9
stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK] 
[root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups repair
starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK] 
starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]
