Bug 984609 - Node fails to completely remove users with processes in the freezer
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Priority: medium, Severity: medium
Assigned To: Rob Millner
libra bugs
: UpcomingRelease
Depends On:
Reported: 2013-07-15 10:29 EDT by Sten Turpin
Modified: 2015-05-14 19:23 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-08-07 18:55:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Sten Turpin 2013-07-15 10:29:27 EDT
Description of problem: When an application is to be destroyed, processes belonging to the application remain in the freezer, and thus are not killed, and prevent the gear directory from being removed. 

Version-Release number of selected component (if applicable): rubygem-openshift-origin-node-1.10.7-1

How reproducible: sometimes

Steps to Reproduce:
1. Delete an application
2. Check /cgroup/all/openshift/<uuid>/freezer.state. If the state is FROZEN, some of the application's processes and files may remain.
3. To remove the application, echo THAWED > /cgroup/all/openshift/<uuid>/freezer.state - this will allow the user's processes to finish, then manually remove the gear files. 
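The manual check and thaw from steps 2 and 3 could be wrapped in a small helper. This is only a sketch: the function names and the CGROUP_ROOT variable are made up for illustration, the default path is the one quoted in this report, and it assumes it runs as root on the node.

```shell
#!/bin/sh
# Sketch only: function and variable names are illustrative, not part of OpenShift.
# CGROUP_ROOT defaults to the cgroup path quoted in this report.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup/all/openshift}"

# Print the freezer state of a gear (for example THAWED, FREEZING or FROZEN).
gear_freezer_state() {
    cat "$CGROUP_ROOT/$1/freezer.state"
}

# Thaw the gear's cgroup if it is frozen (or still freezing); otherwise do nothing.
# Returns 1 if the gear has no freezer.state file at all.
thaw_gear_if_frozen() {
    state_file="$CGROUP_ROOT/$1/freezer.state"
    [ -f "$state_file" ] || return 1
    case "$(cat "$state_file")" in
        FROZEN|FREEZING) echo THAWED > "$state_file" ;;
    esac
}
```

With the functions loaded, `thaw_gear_if_frozen <uuid>` would replace the manual echo; the leftover gear files still have to be removed by hand afterwards.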

Actual results: Parts of gear remain

Expected results: Gear removed completely

Additional info:
Comment 1 Rob Millner 2013-07-15 20:54:28 EDT
We appear to have been vulnerable to this issue for 8 or 9 months (perhaps longer), but something changed in how the specific partner is using the service that made it more prominent. The pattern for the partner gear experiencing this issue is as follows:

ssh into the gear

ssh into the gear

ssh into the gear

destroy is called on the gear

ssh into the gear

sshd forks, transitions into the cgroup as root and then forks again to setuid.

destroy kills all processes owned by the user (but not the root owned sshd)

destroy freezes pam and cgroups

destroy kills again (but not the root owned sshd)

userdel fails because directory polydir prevents deletion

destroy raises an exception instead of thawing and cleaning up cgroups

It's a narrow window, but they may be doing this thousands of times and occasionally get lucky.
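The last step of the sequence above (a failure that leaves the cgroup frozen) can be illustrated with a wrapper that thaws the gear whenever a cleanup command fails. This is a sketch under assumptions, not the node's actual destroy logic: the function name and CGROUP_ROOT are hypothetical, and the default path is the one from the bug description.

```shell
#!/bin/sh
# Sketch only: illustrative names; not the real gear destroy code.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup/all/openshift}"

# Usage: run_cleanup_then_thaw_on_failure <gear-uuid> <command> [args...]
# Runs the cleanup command. If it fails, the gear's cgroup is thawed before
# the failure is reported, so no process is left stuck in the freezer.
run_cleanup_then_thaw_on_failure() {
    gear_uuid=$1
    shift
    "$@" && return 0
    status=$?    # exit status of the failed cleanup command
    state_file="$CGROUP_ROOT/$gear_uuid/freezer.state"
    [ -f "$state_file" ] && echo THAWED > "$state_file"
    return "$status"
}
```

For example, `run_cleanup_then_thaw_on_failure "$uuid" userdel "$uuid"` would still return userdel's failure status, but the gear would no longer be left in the FROZEN state.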

Doing some testing to see what we can do to prevent this issue.  One of the simple things to try is SIGKILL every process in the gear cgroup tasks file as well as processes owned by the gear uid (both, in case any leaked out of the gear cgroup somehow).  We may need to make bigger changes to the gear destroy logic.
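The "SIGKILL everything in the tasks file plus everything owned by the gear uid" idea might look roughly like the following. This is a sketch, not the code from the eventual patch: the function names and CGROUP_ROOT are hypothetical, and it assumes the gear's Unix user name is the same as the gear UUID.

```shell
#!/bin/sh
# Sketch only: illustrative names; not the code from the actual fix.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup/all/openshift}"

# List every PID in the gear cgroup's tasks file, one per line.
gear_cgroup_pids() {
    tasks_file="$CGROUP_ROOT/$1/tasks"
    [ -r "$tasks_file" ] && cat "$tasks_file"
}

# SIGKILL every task in the gear cgroup regardless of owner (this catches the
# root-owned sshd), then everything owned by the gear user in case a process
# leaked out of the cgroup.
kill_gear_processes() {
    gear_uuid=$1
    for pid in $(gear_cgroup_pids "$gear_uuid"); do
        # The process may already have exited, so ignore failures.
        kill -KILL "$pid" 2>/dev/null || true
    done
    # Assumption: the gear user is named after the gear UUID.
    pkill -KILL -u "$gear_uuid" 2>/dev/null || true
}
```

Note that tasks in a frozen cgroup generally do not act on the signal until the cgroup is thawed, so a kill pass like this would need to be paired with a thaw.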
Comment 2 Rob Millner 2013-07-16 14:45:49 EDT
Pull request:
Comment 3 openshift-github-bot 2013-07-16 20:06:51 EDT
Commit pushed to master at https://github.com/openshift/origin-server

Bug 984609 - fix a narrow condition where sshd leaves a root owned process in the frozen gear cgroup causing gear delete to fail and stale processes/
Comment 4 Rob Millner 2013-07-19 20:15:26 EDT
I'm not sure if it's feasible for Q/E to test this case, as the time window for the coincidence between ssh and gear destroy is pretty narrow.

One option would be to verify that gear destroy operates as expected (nothing broke with the patch) and then see if Operations is still having this issue on the partner nodes a week after 2.0.30 makes it to production.
Comment 5 Meng Bo 2013-08-01 06:51:00 EDT
To verify this bug, use the oo-cgroup-template tool.

1. Create app
2. Set the app to FROZEN
# oo-cgroup-template -t frozen -c 8a8bc378fa9711e2a82412313b0a6df4
# cat /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/freezer.state 
3. Delete the app from client
4. Check that no cgroup files are left on the node
# ls /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/
ls: cannot access /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/: No such file or directory

As for regression testing, nothing appears to have broken with this patch.

Move the bug to verified.
