Bug 964205
Summary: | oo-admin-ctl-cgroups restartall fails with no such file or directory | |
---|---|---|---
Product: | OpenShift Online | Reporter: | Kenny Woodson <kwoodson>
Component: | Containers | Assignee: | John W. Lamb <jolamb>
Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 2.x | CC: | bmeng, chunchen, jolamb, xtian
Target Milestone: | --- | Keywords: | FutureFeature
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Enhancement
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-06-11 04:05:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Spoke with rmillner, kwoodson, markllama to clarify the goals of this ticket:

1) oo-admin-ctl-cgroups restartall should restart all cgroups more cleanly than it does (capture/prevent the cgdelete errors for missing cgroups)
2) oo-admin-ctl-cgroups restartall is currently being used to repair missing cgroups, but since it iterates across all cgroups, even working ones, this takes too long. A new command needs to be added - "repair" - that only starts missing cgroups
3) there are a number of logical errors that may be preventing this script from working properly in all instances - these need to be addressed.

I will fix these issues in reverse order - code fixes for existing functionality need to be in place and tested before adding new features.

Created pull request to address this bug:
https://github.com/openshift/origin-server/pull/2640

Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6a9616617ee30ac0e757d6f21bcd2e2cd3f1b729
<oo-admin-ctl-cgroups> Bug 964205 - add "repair" command

This commit adds a new command to oo-admin-ctl-cgroups called "repair". This command identifies users which lack a cgroup and creates a cgroup for them. It can be called with an optional username argument (or a quoted space-separated list of usernames) to restrict the action to a specific subset.

The "restartall" command remains functionally unchanged - this will be addressed in the next commit

https://github.com/openshift/origin-server/commit/38cd1a799f79ad445c000cf9fd451aca8074a2d9
<oo-admin-ctl-cgroups> Bug 964205 - prevent stopping already stopped cgroups

Modified stopuser() to check if the cgroup for the specified user has already been stopped, and echo a useful message if that is the case instead running cgdelete and generating "file not found" noise

https://github.com/openshift/origin-server/commit/09c88bc4591b2590fee705f6793c45694d1e0d6b
<oo-admin-ctl-groups> Bug 964205 - fix set_blkio function comment to be more accurate

https://github.com/openshift/origin-server/commit/a9114ba7c657869ad38aa3bf8a7f433a4ef97c6d
<oo-admin-ctl-cgroups> Bug 964205 - amend comments for cgroup_exists function

Made function comment more "standard" according to advice from @markllama

Checked on devenv-stage_356: oo-admin-ctl-cgroups restartall no longer shows the "file not found" errors, and the new command oo-admin-ctl-cgroups repair can fix the stopped users.

    # oo-admin-ctl-cgroups restartall
    Removing Openshift guest control groups:
    stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... cgroup already stopped [OK]
    stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... cgroup already stopped [OK]
    [ OK ]Openshift cgroups uninitialized
    Initializing Openshift guest control groups:
    starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK]
    starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]
    [ OK ]Openshift cgroups initialized
    WARNING !!! WARNING !!! WARNING !!!
    Cgroups may have just restarted. It's important to confirm all the openshift apps are actively running.
    It's suggested you run service openshift restart now
    WARNING !!! WARNING !!! WARNING !!!
    [root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 2f98d53acc4a11e28c4c22000a9708e9
    stopping cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK]
    [root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups stopuser 4e5f64b6cc4a11e28c4c22000a9708e9
    stopping cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]
    [root@ip-10-151-8-233 all]# oo-admin-ctl-cgroups repair
    starting cgroups for 2f98d53acc4a11e28c4c22000a9708e9... [OK]
    starting cgroups for 4e5f64b6cc4a11e28c4c22000a9708e9... [OK]

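For illustration, the stopuser() guard described in the commits above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the pull request: the CGROUP_ROOT variable, the stop_user_cgroup name, the body of cgroup_exists, and the use of rmdir in place of cgdelete are all hypothetical, chosen so the example runs without libcgroup. Only the cgroup_exists name and the output messages come from this ticket.

```shell
# Hypothetical sketch; not the actual oo-admin-ctl-cgroups implementation.
# CGROUP_ROOT is an assumed location, taken from the path in the bug description.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"

# Return success if a cgroup directory exists for the given user.
# (Assumed body; the ticket only mentions that a cgroup_exists function exists.)
cgroup_exists() {
    [ -d "${CGROUP_ROOT}/$1" ]
}

# Stop the cgroup for one user, skipping users whose cgroup is already gone.
stop_user_cgroup() {
    local user="$1"
    printf 'stopping cgroups for %s... ' "$user"
    if ! cgroup_exists "$user"; then
        # Nothing to delete, so avoid the "No such file or directory" noise.
        echo "cgroup already stopped [OK]"
        return 0
    fi
    # Assumption: rmdir stands in for cgdelete here. On a mounted cgroup
    # filesystem, removing the directory removes the cgroup.
    if rmdir "${CGROUP_ROOT}/${user}"; then
        echo "[OK]"
    else
        echo "[FAILED]"
        return 1
    fi
}
```

Calling stop_user_cgroup twice for the same user prints the "cgroup already stopped" message on the second call instead of attempting a second delete.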
Created attachment 749358
output from oo-admin-ctl-cgroups restartall

Description of problem:
Lately we have a large number of apps that are not in cgroups, and I am trying to repair them using oo-admin-ctl-cgroups. When oo-admin-ctl-cgroups is run with the restartall command, it produces failures like the following:

-----
stopping cgroups for 8fdbd8ae6554400fa9a7538dc74665df...cgdelete: cannot remove group '/openshift/8fdbd8ae6554400fa9a7538dc74665df': No such file or directory
[OK]
stopping cgroups for 5e88f0517ab941439469d47f34d8cc20...cgdelete: cannot remove group '/openshift/5e88f0517ab941439469d47f34d8cc20': No such file or directory
[OK]
stopping cgroups for 7cb9616e40684e47954adec9db9d08f6...cgdelete: cannot remove group '/openshift/7cb9616e40684e47954adec9db9d08f6': No such file or directory
[OK]
stopping cgroups for bcc2673217764812b116c2bdd3f0c736...cgdelete: cannot remove group '/openshift/bcc2673217764812b116c2bdd3f0c736': No such file or directory
----

I believe that this could be optimized to skip the gears that are idled, as they are not running or consuming any system resources. This script is very slow and can take 60+ minutes to complete on systems with > 1000 users.

Version-Release number of selected component (if applicable):
2.0.27.1

How reproducible:
Very.

Steps to Reproduce:
1. Remove a user from cgroups by removing their /cgroups/all/openshift/<uuid> directory.
2. Run oo-accept-node and view any processes without cgroups.
3. Run oo-admin-ctl-cgroups restartall and see the time pass by.
4. Gears that aren't running are getting placed into cgroups and errors like the above are shown.

Actual results:
Errors of "no such file or directory" are shown.

Expected results:
Skip directories that are already properly in cgroups. This should be fast and efficient.

Additional info:
We depend on cgroups to ensure that our resources are managed properly. It would be a big win for us if we could trim down the time it takes to run this and get rid of any errors or problems while it runs.
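
The behavior requested here, acting only on users whose cgroup is missing, is roughly what the "repair" command is described as doing. A minimal sketch, assuming a cgroup counts as missing when its directory is absent: CGROUP_ROOT, users_missing_cgroups, and repair_cgroups are hypothetical names, and mkdir stands in for whatever the real script does to create and configure a cgroup.

```shell
# Hypothetical sketch; not the actual oo-admin-ctl-cgroups implementation.
# CGROUP_ROOT is an assumed location, taken from step 1 of the reproduction steps.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroups/all/openshift}"

# Print each user named in the arguments that has no cgroup directory.
users_missing_cgroups() {
    local user
    for user in "$@"; do
        [ -d "${CGROUP_ROOT}/${user}" ] || echo "$user"
    done
}

# Start cgroups only for users that lack one; users with a cgroup are skipped,
# so the cost scales with the number of broken users, not all users.
# Word splitting on the command substitution is acceptable here because
# gear UUIDs contain no whitespace.
repair_cgroups() {
    local user
    for user in $(users_missing_cgroups "$@"); do
        printf 'starting cgroups for %s... ' "$user"
        # Assumption: mkdir stands in for cgroup creation and limit setup.
        # On a mounted cgroup filesystem, mkdir creates a new cgroup.
        if mkdir "${CGROUP_ROOT}/${user}"; then
            echo "[OK]"
        else
            echo "[FAILED]"
        fi
    done
}
```

Running repair_cgroups a second time over the same users prints nothing, since every user already has a cgroup directory by then.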