Bug 853582

Summary:	app-destroy is failing to remove all parts of gears sometimes
Product:	OKD	Reporter:	Thomas Wiest <twiest>
Component:	Containers	Assignee:	Jhon Honce <jhonce>
Status:	CLOSED CURRENTRELEASE	QA Contact:	libra bugs <libra-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	2.x	CC:	jhonce, jhou, mmcgrath, mpatel, nicholas_schuetz
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	devenv_2280+	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-11-06 18:48:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Thomas Wiest 2012-09-01 00:06:27 UTC

Description of problem:
When a user calls for a gear to be destroyed, app-destroy doesn't always remove all parts of the gear. This in turn causes our gear consistency checks to alert, which makes it so that we have to go and manually clean up these gears, which is tedious and time consuming.

Here is an example of an app-destroy that left around the home directory (it removed the unix user and proxypass file).

NOTE: This happened on the current release in PROD (it just happened today).


I, [2012-08-31T01:00:34.308953 #4643]  INFO -- : stickshift.rb:315:in `cartridge_do_action' cartridge_do_action call / request = #<MCollective::RPC::Request:0x7f1a258e71f0
 @action="cartridge_do",
 @agent="stickshift",
 @caller="uid=431",
 @data=
  {:process_results=>true,
   :cartridge=>"stickshift-node",
   :args=>
    {"--with-namespace"=>"REDACTED",
     "--with-app-uuid"=>"1f3349d26d1440dab56e23e59a9a2b95",
     "--with-container-name"=>"REDACTED",
     "--with-app-name"=>"REDACTED",
     "--with-container-uuid"=>"1f3349d26d1440dab56e23e59a9a2b95"},
   :action=>"app-destroy"},
 @sender="REDACTED",
 @time=1346389234,
 @uniqid="312e28efac3cf0da851b6a8566acf8f4">

I, [2012-08-31T01:00:34.310188 #4643]  INFO -- : stickshift.rb:316:in `cartridge_do_action' cartridge_do_action validation = stickshift-node app-destroy --with-namespaceREDACTED--with-app-uuid1f3349d26d1440dab56e23e59a9a2b95--with-container-nameREDACTED--with-app-nameREDACTED--with-container-uuid1f3349d26d1440dab56e23e59a9a2b95
I, [2012-08-31T01:00:34.310609 #4643]  INFO -- : stickshift.rb:54:in `ss_app_destroy' COMMAND: ss-app-destroy
I, [2012-08-31T01:00:39.721505 #4643]  INFO -- : stickshift.rb:336:in `cartridge_do_action' cartridge_do_action (0)
------

------)


Only the gear directory and it's ~/.tmp were left around in this case (this was possibly caused by the polyinstantiated /tmp dir).

We manually removed the rest of the gear with these two commands:
   57  rmdir /var/lib/stickshift/1f3349d26d1440dab56e23e59a9a2b95/.tmp/
   58  rmdir /var/lib/stickshift/1f3349d26d1440dab56e23e59a9a2b95


Version-Release number of selected component (if applicable):
rhc-node-0.97.7-1.el6_3.x86_64


How reproducible:
Sporadic, but happens to quite a few gears on a daily basis in PROD.


Steps to Reproduce:
1. Unknown.

  
Actual results:
Parts of gears are left around.


Expected results:
Gears should always be completely removed.

Comment 1 John Poelstra 2012-09-11 17:05:52 UTC

researching more today, will probably make FutureFeature by the end of day because unable to reproduce

Comment 2 Jhon Honce 2012-09-27 18:36:47 UTC

twiest monitoring syslogs, will bounce back to me with any results.

Comment 3 Jhon Honce 2012-09-27 21:01:39 UTC

ls -la /var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611
total 16
drwxr-x---.  4 root  546 4096 Sep 27 16:19 .
drwxr-x--x. 49 root root 4096 Sep 27 16:20 ..
d---------.  2 root root 4096 Sep 27 16:19 .sandbox
d---------.  2 root root 4096 Sep 27 16:19 .tmp

Sep 27 16:19:49 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration.
Sep 27 16:19:51 domU-12-31-39-09-32-32 python: user 5f0120cf46c943e9a2987497c9e57611: putting process 10712 in cgroup /cgroup/all/libra
Sep 27 16:19:51 domU-12-31-39-09-32-32 python: ctl_all stop
Sep 27 16:19:54 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration.
Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: 1st attempt to remove '/var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611/' from filesystem failed.
Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: Dirs  5f0120cf46c943e9a2987497c9e57611 => ruby-1.8, app-root, git
Sep 27 16:19:55 domU-12-31-39-09-32-32 python: user 5f0120cf46c943e9a2987497c9e57611: putting process 11161 in cgroup /cgroup/all/libra
Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: Procs 5f0120cf46c943e9a2987497c9e57611 =>   PID TTY          TIME CMD#01211158 ?        00:00:00 sshd
Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: 2nd attempt to remove '/var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611/' from filesystem failed.
Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11541, UID: 510, GID: 510, PROCNAME: /bin/grep FAILED! (Error Code: 50002)
Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11545, UID: 510, GID: 510, PROCNAME: /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/haproxy_ctld_daemon FAILED! (Error Code: 50002)
Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11545, UID: 510, GID: 510, PROCNAME: /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/haproxy_ctld_daemon FAILED! (Error Code: 50002)
Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration.

Comment 4 John Poelstra 2012-09-28 18:46:22 UTC

Jhon able to repro and working on a fix

Comment 5 Jhon Honce 2012-10-01 17:38:38 UTC

Status update:

The root cause of the issue is an existing process in a state where pkill -9 doesn't kill said process.  This was determined using PAM debugging on sshd and runuser.

pam_namespace is "protecting" the mount...

E.g: (output edited to be brief)

pam_namespace(sshd:session): Setting poly ns for user 549 for dir /var/tmp
pam_namespace(sshd:session): instance_dir /var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/.tmp/13a21350770e47b483c66f3cb1d19187
pam_namespace(sshd:session): namespace setup ok for pid 29020
pkill -u 549 (exit code 0)
ps -u 549    (exit code 0)
PID S CMD
29120 S sshd: 13a21350770e47b483c66f3cb1d19187@notty
29121 R /bin/bash -c . /usr/bin/rhcsh > /dev/null ; ctl_all stop
libra-cgroups[29297]: Freezing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state
libra-cgroups[29300]: Timeout freezing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state (30 second timeout)
userdel[29391]: delete user '13a21350770e47b483c66f3cb1d19187'
userdel[29391]: removed group '13a21350770e47b483c66f3cb1d19187' owned by '13a21350770e47b483c66f3cb1d19187'
stickshift-node[11391]: 1st attempt to remove '/var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/' from filesystem failed.
stickshift-node[11391]: Dir(before)   13a21350770e47b483c66f3cb1d19187/549 => .env, .npm, ., .node-gyp, app-root, .tmp, .., git, .npmrc, .gitconfig, node_modules, .sandbox, nodejs-0.6, .ssh
stickshift-node[11391]: Dir(after)    13a21350770e47b483c66f3cb1d19187/549 => ., .tmp, .., .sandbox
stickshift-node[11391]: 2nd attempt to remove '/var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/' from filesystem failed.
libra-cgroups[29457]: Thawing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state
sshd[29020]: pam_unix(sshd:session): session closed for user 13a21350770e47b483c66f3cb1d19187
sshd[29020]: pam_namespace(sshd:session): close_session - start
sshd[29020]: pam_namespace(sshd:session): close_session - sucessful

Current patch WIP
 - serializing deletes and creates (helps with scale-up and destroy event contention)
 - Freezing cgroups (As seen above doesn't always help)
 - setting nprocs to 0 (prevents new logins from happening)

These help but problem still exists.

Comment 6 John Poelstra 2012-10-02 17:26:55 UTC

Jhon has a patch and is getting it reviewed

Comment 7 Jhon Honce 2012-10-03 17:46:30 UTC

Patch on typeless gear branch as the extended testing on that branch causes the bug to be reproduced in the old code.

That branch is scheduled to be merged onto master 2012/10/03.

Comment 8 Jhon Honce 2012-10-03 23:55:08 UTC

Fixed in

crankcase commit#9cf7d0517bc77d8c04790ebd0a67ba8b55d6851d
li commit#4e0a3532eabb4874240860e4ea23703d9700edf9

The most reliable method to reproduce the issue is running:
  devenv --include_extended=runtime

Comment 9 Jianwei Hou 2012-10-09 10:15:10 UTC

Verified on devenv_2301

Since this problem can not always be reproduced, verified this bug using regression test
Steps:
1. Create 94 apps
2. Run 'rhc-admin-chk'
3. Destroy all created apps
4. Run 'rhc-admin-chk' again

There were no errors encountered, and gear's directories are removed successfully.