984609 – Node fails to completely remove users with processes in the freezer

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Rob Millner
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-07-15 14:29 UTC by Sten Turpin
Modified:	2015-05-14 23:23 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-08-07 22:55:19 UTC
Target Upstream Version:
Embargoed:

Description Sten Turpin 2013-07-15 14:29:27 UTC

Description of problem: When an application is destroyed, processes belonging to the application can remain frozen in the cgroup freezer. Frozen processes are not killed, and they prevent the gear directory from being removed.


Version-Release number of selected component (if applicable): rubygem-openshift-origin-node-1.10.7-1


How reproducible: sometimes


Steps to Reproduce:
1. Delete an application
2. Check /cgroup/all/openshift/<uuid>/freezer.state. If the state is FROZEN, some of the application's processes and files may remain.
3. To remove the application, echo THAWED > /cgroup/all/openshift/<uuid>/freezer.state. This allows the user's processes to finish. Then manually remove the gear files.
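
The manual workaround in step 3 can be sketched as a small shell helper. This is only an illustration, assuming the cgroup v1 layout shown above; the function name thaw_gear and its argument are hypothetical and not part of OpenShift.

```shell
#!/bin/sh
# Hypothetical helper (not an OpenShift tool): thaw a gear's freezer cgroup.
# $1 is the gear's cgroup directory, e.g. /cgroup/all/openshift/<uuid>
thaw_gear() {
    state_file="$1/freezer.state"
    if [ ! -f "$state_file" ]; then
        echo "no freezer.state under $1" >&2
        return 1
    fi
    state=$(cat "$state_file")
    case "$state" in
        FROZEN|FREEZING)
            # Writing THAWED lets the remaining processes run and exit.
            echo THAWED > "$state_file"
            echo "thawed $1"
            ;;
        *)
            echo "state is $state, nothing to do"
            ;;
    esac
}
```

For example, thaw_gear /cgroup/all/openshift/<uuid> would be run as root on the node before removing the leftover gear files by hand.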

Actual results: Parts of the gear remain


Expected results: Gear removed completely


Additional info:

Comment 1 Rob Millner 2013-07-16 00:54:28 UTC

We appear to have been vulnerable to this issue for 8 or 9 months (perhaps longer), but something changed in how the specific partner is using the service that makes it more prominent. The pattern for the partner gear experiencing this issue is as follows:

ssh into the gear

ssh into the gear

ssh into the gear

destroy is called on the gear

ssh into the gear

sshd forks, transitions into the cgroup as root and then forks again to setuid.

destroy kills all processes owned by the user (but not the root owned sshd)

destroy freezes pam and cgroups

destroy kills again (but not the root owned sshd)

userdel fails because directory polydir prevents deletion

destroy raises an exception instead of thawing and cleaning up cgroups


It's a narrow window, but they may be doing this thousands of times and occasionally get lucky.
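
The last step in the pattern is what leaves the gear stuck: the exception skips the thaw. One way to guard against that, sketched here under assumptions, is to always write THAWED back once the frozen step finishes, whether or not it succeeded. The helper name run_frozen is hypothetical, and this is not the code in the node's destroy logic.

```shell
#!/bin/sh
# Hypothetical helper: run a command while the gear cgroup is frozen and
# always thaw afterwards, even when the command fails.
# $1 is the gear's cgroup directory; the remaining arguments are the command.
run_frozen() {
    cgdir="$1"
    shift
    echo FROZEN > "$cgdir/freezer.state"
    if "$@"; then
        rc=0
    else
        rc=$?
    fi
    # Thaw unconditionally so a failed step cannot leave the gear frozen.
    echo THAWED > "$cgdir/freezer.state"
    return "$rc"
}
```

The command's exit status is preserved, so the caller can still report the failure after the cgroup has been thawed.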


Doing some testing to see what we can do to prevent this issue.  One of the simple things to try is to SIGKILL every process in the gear cgroup tasks file as well as processes owned by the gear uid (both, in case any leaked out of the gear cgroup somehow).  We may need to make bigger changes to the gear destroy logic.
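
The "simple thing to try" could look roughly like the following. It is a sketch, not the code from the eventual fix: the helper names are hypothetical, the tasks file location assumes the /cgroup/all/openshift/<uuid> layout from this report, and pkill is assumed to be available on the node.

```shell
#!/bin/sh
# Hypothetical helper: SIGKILL every PID listed in a cgroup's tasks file.
# This catches processes in the gear cgroup regardless of owner, including
# a root-owned sshd child that has not yet called setuid.
# $1 is the gear's cgroup directory.
kill_cgroup_tasks() {
    tasks_file="$1/tasks"
    [ -f "$tasks_file" ] || return 1
    while read -r pid; do
        [ -n "$pid" ] || continue
        # A task may exit between the read and the kill; ignore that.
        kill -KILL "$pid" 2>/dev/null || true
    done < "$tasks_file"
}

# Hypothetical helper: SIGKILL every process owned by the gear user,
# in case any leaked out of the gear cgroup.
# $1 is the gear's user name or numeric uid.
kill_user_procs() {
    # pkill exits non-zero when nothing matched, which is fine here.
    pkill -KILL -u "$1" || true
}
```

Calling both helpers covers the two cases named above: processes inside the gear cgroup that are not owned by the gear uid, and gear-owned processes that are outside the cgroup.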

Comment 2 Rob Millner 2013-07-16 18:45:49 UTC

Pull request:
https://github.com/openshift/origin-server/pull/3100

Comment 3 openshift-github-bot 2013-07-17 00:06:51 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a9b4663ee55eabd3daff163f9d7dbb02cfbe935b
Bug 984609 - fix a narrow condition where sshd leaves a root owned process in the frozen gear cgroup causing gear delete to fail and stale processes/

Comment 4 Rob Millner 2013-07-20 00:15:26 UTC

I'm not sure if it's feasible for Q/E to test this case, as it's a pretty narrow time window for the coincidence between ssh and gear destroy.

One option would be to verify that gear destroy operates as expected (nothing broke with the patch) and then see if Operations is still having this issue on the partner nodes a week after 2.0.30 makes it to production.

Comment 5 Meng Bo 2013-08-01 10:51:00 UTC

To verify this bug, use the oo-cgroup-template tool.

1. Create app
2. Set the app to FROZEN
# oo-cgroup-template -t frozen -c 8a8bc378fa9711e2a82412313b0a6df4
# cat /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/freezer.state 
FROZEN
3. Delete the app from client
4. Check that no cgroup files are left on the node
# ls /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/
ls: cannot access /cgroup/all/openshift/8a8bc378fa9711e2a82412313b0a6df4/: No such file or directory
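
Step 4 can also be scripted so the result is easy to record. The helper below is a hypothetical sketch that only checks whether the gear's cgroup directory still exists; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical check for step 4: succeed only if the gear's cgroup
# directory is gone after the app was deleted.
# $1 is the gear's cgroup directory, e.g. /cgroup/all/openshift/<uuid>
check_gear_cgroup_gone() {
    if [ -e "$1" ]; then
        echo "FAIL: $1 still exists" >&2
        return 1
    fi
    echo "OK: $1 removed"
}
```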


Regression testing also shows that nothing broke with this patch.

Moving the bug to VERIFIED.
