Description of problem: When a user calls for a gear to be destroyed, app-destroy doesn't always remove all parts of the gear. This in turn causes our gear consistency checks to alert, which makes it so that we have to go and manually clean up these gears, which is tedious and time consuming. Here is an example of an app-destroy that left around the home directory (it removed the unix user and proxypass file). NOTE: This happened on the current release in PROD (it just happened today). I, [2012-08-31T01:00:34.308953 #4643] INFO -- : stickshift.rb:315:in `cartridge_do_action' cartridge_do_action call / request = #<MCollective::RPC::Request:0x7f1a258e71f0 @action="cartridge_do", @agent="stickshift", @caller="uid=431", @data= {:process_results=>true, :cartridge=>"stickshift-node", :args=> {"--with-namespace"=>"REDACTED", "--with-app-uuid"=>"1f3349d26d1440dab56e23e59a9a2b95", "--with-container-name"=>"REDACTED", "--with-app-name"=>"REDACTED", "--with-container-uuid"=>"1f3349d26d1440dab56e23e59a9a2b95"}, :action=>"app-destroy"}, @sender="REDACTED", @time=1346389234, @uniqid="312e28efac3cf0da851b6a8566acf8f4"> I, [2012-08-31T01:00:34.310188 #4643] INFO -- : stickshift.rb:316:in `cartridge_do_action' cartridge_do_action validation = stickshift-node app-destroy --with-namespaceREDACTED--with-app-uuid1f3349d26d1440dab56e23e59a9a2b95--with-container-nameREDACTED--with-app-nameREDACTED--with-container-uuid1f3349d26d1440dab56e23e59a9a2b95 I, [2012-08-31T01:00:34.310609 #4643] INFO -- : stickshift.rb:54:in `ss_app_destroy' COMMAND: ss-app-destroy I, [2012-08-31T01:00:39.721505 #4643] INFO -- : stickshift.rb:336:in `cartridge_do_action' cartridge_do_action (0) ------ ------) Only the gear directory and it's ~/.tmp were left around in this case (this was possibly caused by the polyinstantiated /tmp dir). We manually removed the rest of the gear with these two commands: 57 rmdir /var/lib/stickshift/1f3349d26d1440dab56e23e59a9a2b95/.tmp/ 58 rmdir /var/lib/stickshift/1f3349d26d1440dab56e23e59a9a2b95 Version-Release number of selected component (if applicable): rhc-node-0.97.7-1.el6_3.x86_64 How reproducible: Sporadic, but happens to quite a few gears on a daily basis in PROD. Steps to Reproduce: 1. Unknown. Actual results: Parts of gears are left around. Expected results: Gears should always be completely removed.
researching more today, will probably make FutureFeature by the end of day because unable to reproduce
twiest monitoring syslogs, will bounce back to me with any results.
ls -la /var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611 total 16 drwxr-x---. 4 root 546 4096 Sep 27 16:19 . drwxr-x--x. 49 root root 4096 Sep 27 16:20 .. d---------. 2 root root 4096 Sep 27 16:19 .sandbox d---------. 2 root root 4096 Sep 27 16:19 .tmp Sep 27 16:19:49 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration. Sep 27 16:19:51 domU-12-31-39-09-32-32 python: user 5f0120cf46c943e9a2987497c9e57611: putting process 10712 in cgroup /cgroup/all/libra Sep 27 16:19:51 domU-12-31-39-09-32-32 python: ctl_all stop Sep 27 16:19:54 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration. Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: 1st attempt to remove '/var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611/' from filesystem failed. Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: Dirs 5f0120cf46c943e9a2987497c9e57611 => ruby-1.8, app-root, git Sep 27 16:19:55 domU-12-31-39-09-32-32 python: user 5f0120cf46c943e9a2987497c9e57611: putting process 11161 in cgroup /cgroup/all/libra Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: Procs 5f0120cf46c943e9a2987497c9e57611 => PID TTY TIME CMD#01211158 ? 00:00:00 sshd Sep 27 16:19:55 domU-12-31-39-09-32-32 stickshift-node[1417]: 2nd attempt to remove '/var/lib/stickshift/5f0120cf46c943e9a2987497c9e57611/' from filesystem failed. Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11541, UID: 510, GID: 510, PROCNAME: /bin/grep FAILED! (Error Code: 50002) Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11545, UID: 510, GID: 510, PROCNAME: /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/haproxy_ctld_daemon FAILED! (Error Code: 50002) Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Cgroup change for PID: 11545, UID: 510, GID: 510, PROCNAME: /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/haproxy_ctld_daemon FAILED! (Error Code: 50002) Sep 27 16:19:59 domU-12-31-39-09-32-32 CGRE[1087]: Reloading rules configuration.
Jhon able to repro and working on a fix
Status update: The root cause of the issue is an existing process in a state where pkill -9 doesn't kill said process. This was determined using PAM debugging on sshd and runuser. pam_namespace is "protecting" the mount... E.g: (output edited to be brief) pam_namespace(sshd:session): Setting poly ns for user 549 for dir /var/tmp pam_namespace(sshd:session): instance_dir /var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/.tmp/13a21350770e47b483c66f3cb1d19187 pam_namespace(sshd:session): namespace setup ok for pid 29020 pkill -u 549 (exit code 0) ps -u 549 (exit code 0) PID S CMD 29120 S sshd: 13a21350770e47b483c66f3cb1d19187@notty 29121 R /bin/bash -c . /usr/bin/rhcsh > /dev/null ; ctl_all stop libra-cgroups[29297]: Freezing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state libra-cgroups[29300]: Timeout freezing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state (30 second timeout) userdel[29391]: delete user '13a21350770e47b483c66f3cb1d19187' userdel[29391]: removed group '13a21350770e47b483c66f3cb1d19187' owned by '13a21350770e47b483c66f3cb1d19187' stickshift-node[11391]: 1st attempt to remove '/var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/' from filesystem failed. stickshift-node[11391]: Dir(before) 13a21350770e47b483c66f3cb1d19187/549 => .env, .npm, ., .node-gyp, app-root, .tmp, .., git, .npmrc, .gitconfig, node_modules, .sandbox, nodejs-0.6, .ssh stickshift-node[11391]: Dir(after) 13a21350770e47b483c66f3cb1d19187/549 => ., .tmp, .., .sandbox stickshift-node[11391]: 2nd attempt to remove '/var/lib/stickshift/13a21350770e47b483c66f3cb1d19187/' from filesystem failed. libra-cgroups[29457]: Thawing /cgroup/all//libra/13a21350770e47b483c66f3cb1d19187/freezer.state sshd[29020]: pam_unix(sshd:session): session closed for user 13a21350770e47b483c66f3cb1d19187 sshd[29020]: pam_namespace(sshd:session): close_session - start sshd[29020]: pam_namespace(sshd:session): close_session - sucessful Current patch WIP - serializing deletes and creates (helps with scale-up and destroy event contention) - Freezing cgroups (As seen above doesn't always help) - setting nprocs to 0 (prevents new logins from happening) These help but problem still exists.
Jhon has a patch and is getting it reviewed
Patch on typeless gear branch as the extended testing on that branch causes the bug to be reproduced in the old code. That branch is scheduled to be merged onto master 2012/10/03.
Fixed in crankcase commit#9cf7d0517bc77d8c04790ebd0a67ba8b55d6851d li commit#4e0a3532eabb4874240860e4ea23703d9700edf9 The most reliable method to reproduce the issue is running: devenv --include_extended=runtime
Verified on devenv_2301 Since this problem can not always be reproduced, verified this bug using regression test Steps: 1. Create 94 apps 2. Run 'rhc-admin-chk' 3. Destroy all created apps 4. Run 'rhc-admin-chk' again There were no errors encountered, and gear's directories are removed successfully.