Description of problem:
When an application is destroyed, processes belonging to the application can remain in the freezer. They are therefore not killed, and they prevent the gear directory from being removed.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.10.7-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Delete an application.
2. Check /cgroup/all/openshift/<uuid>/freezer.state. If the state is FROZEN, some of the application's processes and files may remain.
3. To remove the application, run "echo THAWED > /cgroup/all/openshift/<uuid>/freezer.state". This allows the user's processes to finish. Then manually remove the gear files.

Actual results:
Parts of gear remain

Expected results:
Gear removed completely

Additional info:
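The manual workaround in step 3 could be scripted roughly as follows. This is only a sketch: the function name thaw_gear and the CGROUP_ROOT variable are hypothetical, and the default path is simply the one quoted in this report. It does not remove the gear files.

```shell
#!/bin/bash
# Hypothetical helper: thaw a gear's cgroup if destroy left it FROZEN.
# CGROUP_ROOT defaults to the path quoted in this report; override it to
# try the function against a scratch directory.
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup/all/openshift}"

thaw_gear() {
    local uuid="$1"
    local state_file="$CGROUP_ROOT/$uuid/freezer.state"
    local state

    # Nothing to do if the gear's cgroup is already gone.
    if [ ! -f "$state_file" ]; then
        echo "no cgroup for $uuid"
        return 0
    fi

    state="$(cat "$state_file")"
    if [ "$state" = "FROZEN" ]; then
        # Same write as the manual workaround in step 3.
        echo THAWED > "$state_file"
        echo "thawed $uuid"
    else
        echo "$uuid is $state"
    fi
}
```

Usage would be "thaw_gear <uuid>" as root on the node, followed by the manual removal of any leftover gear files.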
We appear to have been vulnerable to this issue for 8 or 9 months (perhaps longer), but something changed in how the specific partner is using the service that made it more prominent. The pattern for the partner gear experiencing this issue is as follows:

  ssh into the gear
  ssh into the gear
  ssh into the gear
  destroy is called on the gear
  ssh into the gear
  sshd forks, transitions into the cgroup as root and then forks again to setuid
  destroy kills all processes owned by the user (but not the root owned sshd)
  destroy freezes pam and cgroups
  destroy kills again (but not the root owned sshd)
  userdel fails because directory polydir prevents deletion
  destroy raises an exception instead of thawing and cleaning up cgroups

It's a narrow window, but they may be doing this thousands of times and occasionally get lucky.

Doing some testing to see what we can do to prevent this issue. One of the simple things to try is to SIGKILL every process in the gear cgroup tasks file as well as every process owned by the gear uid (both, in case any leaked out of the gear cgroup somehow). We may need to make bigger changes to the gear destroy logic.
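A rough sketch of that "kill both ways" idea, to make the proposal concrete. The function name kill_gear_processes and its arguments are hypothetical, and this is not the code from the eventual patch. As the description notes, processes in a frozen cgroup are not killed while it stays frozen, so the cgroup still has to be thawed afterwards.

```shell
#!/bin/bash
# Hypothetical sketch: SIGKILL everything listed in the gear cgroup's tasks
# file, then everything owned by the gear user.
kill_gear_processes() {
    local tasks_file="$1"   # e.g. /cgroup/all/openshift/<uuid>/tasks
    local gear_user="$2"    # gear login name or numeric uid; empty to skip
    local pid

    # 1. Every task in the gear cgroup, regardless of owner. This is the
    #    step that would catch the root owned sshd.
    if [ -r "$tasks_file" ]; then
        while read -r pid; do
            if [ -n "$pid" ]; then
                kill -KILL "$pid" 2>/dev/null || true
            fi
        done < "$tasks_file"
    fi

    # 2. Every process owned by the gear user, in case any leaked out of
    #    the gear cgroup.
    if [ -n "$gear_user" ]; then
        pkill -KILL -U "$gear_user" 2>/dev/null || true
    fi
    return 0
}
```

Killing by cgroup membership first is the important part here: the uid-based kill alone is what misses the sshd process that has entered the cgroup but has not yet called setuid.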
Pull request: https://github.com/openshift/origin-server/pull/3100
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a9b4663ee55eabd3daff163f9d7dbb02cfbe935b
Bug 984609 - fix a narrow condition where sshd leaves a root owned process in the frozen gear cgroup causing gear delete to fail and stale processes/
I'm not sure if it's feasible to Q/E test this case, as it's a pretty narrow time window for the coincidence between ssh and gear destroy. One option would be to verify that gear destroy operates as expected (nothing broke with the patch) and then see if Operations is still having this issue on the partner nodes a week after 2.0.30 makes it to production.
To verify this bug, use the oo-cgroup-template tool.

1. Create app
2. Set the app to FROZEN
# oo-cgroup-template -t frozen -c 8a8bc378fa9711e2a82412313b0a6df4
# cat /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/freezer.state
FROZEN
3. Delete the app from client
4. Check that no cgroup files are left on the node
# ls /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/
ls: cannot access /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/: No such file or directory

As for regression testing, nothing appears to have broken with this patch. Moving the bug to verified.