Description of problem: app-destroy is not cleaning up the gear dirs. There are remnants leftover after the app-destroy is run. oo-accept-node then picks up the error and reports it. # oo-accept-node FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .ssh directory FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .env directory FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .sandbox directory FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .tmp directory FAIL: directory 517e897c4382ec5edb0000ef doesn't have an associated user 5 ERRORS # ls -la /var/lib/openshift/517e897c4382ec5edb0000ef/ total 20 drwxr-xr-x. 3 root root 4096 Apr 29 11:03 . drwxr-x--x. 133 root root 12288 Apr 29 14:46 .. drwxr-xr-x. 2 root root 4096 Apr 29 11:03 git # ls -la /var/lib/openshift/517e897c4382ec5edb0000ef/git total 8 drwxr-xr-x. 2 root root 4096 Apr 29 11:03 . drwxr-xr-x. 3 root root 4096 Apr 29 11:03 .. From the mcollective.log: ======================================================================== I, [2013-04-29T10:54:03.689177 #6595] INFO -- : openshift.rb:43:in `cartridge_do_action' cartridge_do_action validation = php-5.3 configure {"--with-app-uuid"=>"517e 897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespace"=>"787x", "- -with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"php-5.3", "--with-template-git-url"=>"git://github.com/openshift/wordpress- example.git"} I, [2013-04-29T10:54:03.689581 #6595] INFO -- : openshift.rb:82:in `execute_action' Executing action [configure] using method oo_configure with args [{"--with-app-uu id"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespace"=> "787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"php-5.3", "--with-template-git-url"=>"git://github.com/openshift/ wordpress-example.git"}] I, [2013-04-29T10:57:12.298005 #6595] INFO -- : openshift.rb:42:in `cartridge_do_action' cartridge_do_action call / action: cartridge_do, agent=openshift, data={:car tridge=>"openshift-origin-node", :action=>"app-destroy", :args=> {"--with-app-uuid"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespace"=>"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"openshift-origin-node"}, :process_results=>true} I, [2013-04-29T10:57:12.298591 #6595] INFO -- : openshift.rb:43:in `cartridge_do_action' cartridge_do_action validation = openshift-origin-node app-destroy {"--with- app-uuid"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namesp ace"=>"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"openshift-origin-node"} I, [2013-04-29T10:57:12.298983 #6595] INFO -- : openshift.rb:82:in `execute_action' Executing action [app-destroy] using method oo_app_destroy with args [{"--with-ap p-uuid"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespac e"=>"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"openshift-origin-node"}] I, [2013-04-29T10:57:15.736344 #6595] INFO -- : openshift.rb:91:in `execute_action' Finished executing action [app-destroy] (0) I, [2013-04-29T10:57:15.736873 #6595] INFO -- : openshift.rb:62:in `cartridge_do_action' cartridge_do_action reply (0): ------ ------) I, [2013-04-29T10:58:03.695511 #6595] INFO -- : openshift.rb:658:in `rescue in oo_configure' execution expired I, [2013-04-29T10:58:03.696262 #6595] INFO -- : openshift.rb:659:in `rescue in oo_configure' ["/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/ lib/openshift-origin-node/model/v1_cart_model.rb:257:in `waitpid2'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/ model/v1_cart_model.rb:257:in `complete_process_gracefully'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v 1_cart_model.rb:249:in `handle_cartridge_action'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v1_cart_mode l.rb:218:in `do_control'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v1_cart_model.rb:188:in `configure'" , "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/application_container.rb:97:in `configure'", "/opt/rh/ruby193 /root/usr/libexec/mcollective/mcollective/agent/openshift.rb:654:in `oo_configure'", "/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:84:in `execute_action'", "/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:56:in `cartridge_do_action'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/rpc/agent.rb:86:in `handlemsg'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'", "/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:68:in `timeout'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:125:in `block in dispatch'"] I, [2013-04-29T10:58:03.696381 #6595] INFO -- : openshift.rb:91:in `execute_action' Finished executing action [configure] (-1) I, [2013-04-29T10:58:03.696503 #6595] INFO -- : openshift.rb:64:in `cartridge_do_action' cartridge_do_action failed (-1) ================================================================================ How reproducible: I have about 20 nodes that this is on right now. Steps to Reproduce: I believe that the app-destroy is not cleaning up everything. Actual results: gear dir left with the empty git directory Expected results: Remove the gear dir and everything in it. Additional info: I can help debug if you need help.
From the logs and discussion, it appears as though the pattern is if a gear create and a gear destroy happen at the same time they blow each other up.
Further analysis shows the following pattern appears to trigger the issue. 1. gear create 2. gear configure called 3. 3 minutes later, broker sends destroy 4. 4 minutes after configure is called, it times out and returns "execution expired". I believe the configure is still running when destroy is called and its still creating files.
It was determined that the "git clone" step is causing the timeout when accessing a password protected https based repository. Git forks off a process (git-remote-https) to access the https repository. This process will attempt to ask for a password in two ways: 1. Open the tty. - If mcollective happens to have a tty, then git will hang for ever waiting for a password from it. 2. Use stdin that it gets from git. - Even if we close stdin or redirect it from /dev/null, git-remote-https still gets an open pipe for stdin and will hang for ever waiting for a password on it. This triggers the sequence in comment 2. While it was seen with https repositories, we believe it can happen on other types (eg: ssh) as well. The best option seems to be to timeout the call in 150 seconds and kill all child processes associated with it to free the gear before broker decides the whole operation has hung in 300 seconds.
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/ede54489fb4142b599878cbc08206e3c29a23b74 Bug 957883 - git clone was getting stuck asking for a password in remote repository and no amount of redirecting or closing stdin prevented a deadlock.
Ops will need to manually clean up any dangling gear configure hooks running. To Q/E this request, create an app and specify the https url to a password protected github repository. The app creation will fail, but will not leave a half-populated gear or running configure hook.
Checked on devenv_3196, with the steps which rob mentioned in comment#5. Create app with private repo which need password. [bmeng@localhost openshift_testdir]$ rhc app create app1 php-5.3 --from-code https://bmeng@bitbucket.org/bmeng/testrepo.git Please sign in to start a new session to ec2-204-236-242-11.compute-1.amazonaws.com. Password: ****** Application Options ------------------- Namespace: bmeng1dev Cartridges: php-5.3 Source Code: https://bmeng@bitbucket.org/bmeng/testrepo.git Gear Size: default Scaling: no Creating application 'app1' ... Reference ID: 5005997291f5d510497241e0cd2d55da App creation got failed. App dir appears during the app creation and been removed after app creation failed. Following event found in development.log 2013-05-08 05:58:05.324 [DEBUG] Rollback create_gear (pid:2186) 2013-05-08 05:58:05.326 [DEBUG] DEBUG: rpc_client.custom_request('cartridge_do', {:cartridge=>"openshift-origin-node", :action=>"app-destroy", :args=>{"--with-app-uuid"=>"7626bcfeb7c511e2becd22000a91e3fc", "--with-app-name"=>"app1", "--with-container-uuid"=>"7626bcfeb7c511e2becd22000a91e3fc", "--with-container-name"=>"app1", "--with-namespace"=>"bmeng1dev", "--with-request-id"=>"5005997291f5d510497241e0cd2d55da", "--cart-name"=>"openshift-origin-node"}}, ip-10-145-227-252, {'identity' => ip-10-145-227-252}) (Request ID: 5005997291f5d510497241e0cd2d55da) (pid:2186) Mark bug as VERIFIED.