Bug 984609

Summary: Node fails to completely remove users with processes in the freezer
Product: OpenShift Online
Reporter: Sten Turpin <sten>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: bmeng, mfisher, rmillner, twiest
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-08-07 22:55:19 UTC
Type: Bug

Description Sten Turpin 2013-07-15 14:29:27 UTC
Description of problem: When an application is destroyed, processes belonging to it can remain frozen in the cgroup freezer. Because they are frozen they are not killed, and they prevent the gear directory from being removed.


Version-Release number of selected component (if applicable): rubygem-openshift-origin-node-1.10.7-1


How reproducible: sometimes


Steps to Reproduce:
1. Delete an application
2. Check /cgroup/all/openshift/<uuid>/freezer.state. If the state is FROZEN, some of the application's processes and files may remain.
3. To remove the application, echo THAWED > /cgroup/all/openshift/<uuid>/freezer.state - this will allow the user's processes to finish, then manually remove the gear files. 
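
Steps 2 and 3 can be sketched as a pair of small shell helpers. This is only an illustration, assuming the cgroup layout shown in this report; the function names and the CGROUP_ROOT variable are hypothetical and are not part of any OpenShift tool.

```shell
# Hypothetical helpers; CGROUP_ROOT defaults to the path used in this report.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup/all/openshift}"

# Print the freezer state (for example FROZEN or THAWED) for a gear UUID.
gear_freezer_state() {
    cat "$CGROUP_ROOT/$1/freezer.state"
}

# Write THAWED to the gear's freezer.state unless it is already THAWED.
thaw_gear_if_frozen() {
    state=$(gear_freezer_state "$1") || return 1
    if [ "$state" != "THAWED" ]; then
        echo THAWED > "$CGROUP_ROOT/$1/freezer.state"
    fi
}
```

For example, `thaw_gear_if_frozen <uuid>` followed by the manual removal of the gear files would correspond to step 3.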

Actual results: Parts of the gear remain


Expected results: Gear is removed completely


Additional info:

Comment 1 Rob Millner 2013-07-16 00:54:28 UTC
We appear to have been vulnerable to this issue for 8 or 9 months (perhaps longer), but something changed in how the specific partner is using the service that makes it more prominent. The pattern for the partner gear experiencing this issue is as follows:

ssh into the gear

ssh into the gear

ssh into the gear

destroy is called on the gear

ssh into the gear

sshd forks, transitions into the cgroup as root and then forks again to setuid.

destroy kills all processes owned by the user (but not the root owned sshd)

destroy freezes pam and cgroups

destroy kills again (but not the root owned sshd)

userdel fails because directory polydir prevents deletion

destroy raises an exception instead of thawing and cleaning up cgroups
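
The last step in the sequence above is what leaves the gear stuck: once an exception is raised, nothing thaws the cgroup. A minimal shell sketch of the alternative, thawing whether or not the wrapped step succeeds, could look like the following. It is not the code from the node's destroy logic; run_frozen and its arguments are hypothetical names used only for illustration.

```shell
# Hypothetical wrapper: freeze, run one command, and always thaw afterwards.
# $1 is the path to the gear's freezer.state; the remaining arguments are
# the command to run while the gear is frozen (for example a userdel call).
run_frozen() {
    freezer_state_file=$1
    shift
    echo FROZEN > "$freezer_state_file"
    status=0
    "$@" || status=$?                    # remember a failure, do not abort
    echo THAWED > "$freezer_state_file"  # thaw even if the command failed
    return "$status"
}
```

The caller still sees the failure through the return status, but the gear is no longer left in the FROZEN state.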


It's a narrow window, but they may be doing this thousands of times and occasionally get lucky.


Doing some testing to see what we can do to prevent this issue.  One of the simple things to try is SIGKILL every process in the gear cgroup tasks file as well as processes owned by the gear uid (both, in case any leaked out of the gear cgroup somehow).  We may need to make bigger changes to the gear destroy logic.
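
The "simple thing to try" can be sketched in shell as follows. This is an illustrative sketch, not the change that was eventually merged: the function names are hypothetical, the tasks file path is passed in by the caller, and the uid lookup assumes a procps-style ps that accepts -U and -o pid=.

```shell
# Hypothetical helper: print the union of the PIDs listed in a cgroup tasks
# file ($1) and the PIDs owned by a uid ($2, optional), one per line.
gear_pids() {
    tasks_file=$1
    gear_uid=$2
    {
        if [ -r "$tasks_file" ]; then
            cat "$tasks_file"
        fi
        if [ -n "$gear_uid" ]; then
            # Assumes procps-style ps; prints nothing if no process matches.
            ps -o pid= -U "$gear_uid" 2>/dev/null
        fi
    } | awk 'NF { print $1 }' | sort -un
}

# Send SIGKILL to every PID found by gear_pids, ignoring PIDs that have
# already exited.
kill_gear_processes() {
    for pid in $(gear_pids "$1" "$2"); do
        kill -KILL "$pid" 2>/dev/null || true
    done
}
```

Reading the tasks file catches processes such as the root-owned sshd that a kill by uid would miss, while the uid lookup covers anything that leaked out of the gear cgroup.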

Comment 2 Rob Millner 2013-07-16 18:45:49 UTC
Pull request:
https://github.com/openshift/origin-server/pull/3100

Comment 3 openshift-github-bot 2013-07-17 00:06:51 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a9b4663ee55eabd3daff163f9d7dbb02cfbe935b
Bug 984609 - fix a narrow condition where sshd leaves a root owned process in the frozen gear cgroup causing gear delete to fail and stale processes/

Comment 4 Rob Millner 2013-07-20 00:15:26 UTC
I'm not sure if it's feasible to Q/E test this case, as it's a pretty narrow time window for the coincidence between ssh and gear destroy.

One option would be to verify that gear destroy operates as expected (nothing broke with the patch) and then see if Operations is still having this issue on the partner nodes a week after 2.0.30 makes it to production.

Comment 5 Meng Bo 2013-08-01 10:51:00 UTC
To verify this bug, use the oo-cgroup-template tool.

1. Create app
2. Set the app to FROZEN
# oo-cgroup-template -t frozen -c 8a8bc378fa9711e2a82412313b0a6df4
# cat /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/freezer.state 
FROZEN
3. Delete the app from the client
4. Check that no cgroup files are left on the node
# ls /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/
ls: cannot access /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/: No such file or directory


As for regression testing, nothing appears to have broken with this patch.

Move the bug to verified.