Bug 957883 - Gear remnants (<gear dir>/git) are being leftover from gear deletes (NEW 4/29/13)
Summary: Gear remnants (<gear dir>/git) are being leftover from gear deletes (NEW 4/29...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: ---
Assignee: Rob Millner
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-04-29 19:20 UTC by Matt Woodson
Modified: 2015-05-14 23:10 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-11 20:15:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 963895 0 medium CLOSED defunct process not in cgroups 2021-02-22 00:41:40 UTC

Internal Links: 963895

Description Matt Woodson 2013-04-29 19:20:12 UTC
Description of problem:

app-destroy is not cleaning up the gear dirs.  There are remnants leftover after the app-destroy is run.  oo-accept-node then picks up the error and reports it.

# oo-accept-node 
FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .ssh directory
FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .env directory
FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .sandbox directory
FAIL: directory 517e897c4382ec5edb0000ef doesn't have a .tmp directory
FAIL: directory 517e897c4382ec5edb0000ef doesn't have an associated user
5 ERRORS

# ls -la /var/lib/openshift/517e897c4382ec5edb0000ef/
total 20
drwxr-xr-x.   3 root root  4096 Apr 29 11:03 .
drwxr-x--x. 133 root root 12288 Apr 29 14:46 ..
drwxr-xr-x.   2 root root  4096 Apr 29 11:03 git

# ls -la /var/lib/openshift/517e897c4382ec5edb0000ef/git
total 8
drwxr-xr-x. 2 root root 4096 Apr 29 11:03 .
drwxr-xr-x. 3 root root 4096 Apr 29 11:03 ..


From the mcollective.log:

========================================================================

I, [2013-04-29T10:54:03.689177 #6595]  INFO -- : openshift.rb:43:in `cartridge_do_action' cartridge_do_action validation = php-5.3 configure {"--with-app-uuid"=>"517e
897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespace"=>"787x", "-
-with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"php-5.3", "--with-template-git-url"=>"git://github.com/openshift/wordpress-
example.git"}
I, [2013-04-29T10:54:03.689581 #6595]  INFO -- : openshift.rb:82:in `execute_action' Executing action [configure] using method oo_configure with args [{"--with-app-uu
id"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespace"=>
"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"php-5.3", "--with-template-git-url"=>"git://github.com/openshift/
wordpress-example.git"}]
I, [2013-04-29T10:57:12.298005 #6595]  INFO -- : openshift.rb:42:in `cartridge_do_action' cartridge_do_action call / action: cartridge_do, agent=openshift, data={:car
tridge=>"openshift-origin-node",
 :action=>"app-destroy",
 :args=>
  {"--with-app-uuid"=>"517e897c4382ec5edb0000ef",
   "--with-app-name"=>"blog",
   "--with-container-uuid"=>"517e897c4382ec5edb0000ef",
   "--with-container-name"=>"blog",
   "--with-namespace"=>"787x",
   "--with-uid"=>2934,
   "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6",
   "--cart-name"=>"openshift-origin-node"},
 :process_results=>true}

I, [2013-04-29T10:57:12.298591 #6595]  INFO -- : openshift.rb:43:in `cartridge_do_action' cartridge_do_action validation = openshift-origin-node app-destroy {"--with-
app-uuid"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namesp
ace"=>"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"openshift-origin-node"}
I, [2013-04-29T10:57:12.298983 #6595]  INFO -- : openshift.rb:82:in `execute_action' Executing action [app-destroy] using method oo_app_destroy with args [{"--with-ap
p-uuid"=>"517e897c4382ec5edb0000ef", "--with-app-name"=>"blog", "--with-container-uuid"=>"517e897c4382ec5edb0000ef", "--with-container-name"=>"blog", "--with-namespac
e"=>"787x", "--with-uid"=>2934, "--with-request-id"=>"eabf1b14715d533c97f323ab4eb4a1f6", "--cart-name"=>"openshift-origin-node"}]
I, [2013-04-29T10:57:15.736344 #6595]  INFO -- : openshift.rb:91:in `execute_action' Finished executing action [app-destroy] (0)
I, [2013-04-29T10:57:15.736873 #6595]  INFO -- : openshift.rb:62:in `cartridge_do_action' cartridge_do_action reply (0):
------

------)
I, [2013-04-29T10:58:03.695511 #6595]  INFO -- : openshift.rb:658:in `rescue in oo_configure' execution expired
I, [2013-04-29T10:58:03.696262 #6595]  INFO -- : openshift.rb:659:in `rescue in oo_configure' ["/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/
lib/openshift-origin-node/model/v1_cart_model.rb:257:in `waitpid2'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/
model/v1_cart_model.rb:257:in `complete_process_gracefully'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v
1_cart_model.rb:249:in `handle_cartridge_action'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v1_cart_mode
l.rb:218:in `do_control'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/v1_cart_model.rb:188:in `configure'"
, "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.7.32/lib/openshift-origin-node/model/application_container.rb:97:in `configure'", "/opt/rh/ruby193
/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:654:in `oo_configure'", "/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:84:in `execute_action'", "/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:56:in `cartridge_do_action'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/rpc/agent.rb:86:in `handlemsg'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'", "/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:68:in `timeout'", "/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:125:in `block in dispatch'"]
I, [2013-04-29T10:58:03.696381 #6595]  INFO -- : openshift.rb:91:in `execute_action' Finished executing action [configure] (-1)
I, [2013-04-29T10:58:03.696503 #6595]  INFO -- : openshift.rb:64:in `cartridge_do_action' cartridge_do_action failed (-1)
================================================================================


How reproducible:

I have about 20 nodes that this is on right now.

Steps to Reproduce:

I believe that the app-destroy is not cleaning up everything.
  
Actual results:

gear dir left with the empty git directory

Expected results:

Remove the gear dir and everything in it.

Additional info:

I can help debug if you need help.

Comment 1 Rob Millner 2013-04-29 19:42:32 UTC
From the logs and discussion, it appears as though the pattern is if a gear create and a gear destroy happen at the same time they blow each other up.

Comment 2 Rob Millner 2013-04-29 20:13:07 UTC
Further analysis shows the following pattern appears to trigger the issue.

1. gear create
2. gear configure called
3. 3 minutes later, broker sends destroy
4. 4 minutes after configure is called, it times out and returns "execution expired".

I believe the configure is still running when destroy is called and its still creating files.

Comment 3 Rob Millner 2013-04-30 00:39:53 UTC
It was determined that the "git clone" step is causing the timeout when accessing a password protected https based repository.

Git forks off a process (git-remote-https) to access the https repository.  This process will attempt to ask for a password in two ways:

1. Open the tty.
  - If mcollective happens to have a tty, then git will hang for ever waiting for a password from it.

2. Use stdin that it gets from git.
  - Even if we close stdin or redirect it from /dev/null, git-remote-https still gets an open pipe for stdin and will hang for ever waiting for a password on it.


This triggers the sequence in comment 2.

While it was seen with https repositories, we believe it can happen on other types (eg: ssh) as well.

The best option seems to be to timeout the call in 150 seconds and kill all child processes associated with it to free the gear before broker decides the whole operation has hung in 300 seconds.

Comment 4 openshift-github-bot 2013-04-30 04:05:44 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/ede54489fb4142b599878cbc08206e3c29a23b74
Bug 957883 - git clone was getting stuck asking for a password in remote repository and no amount of redirecting or closing stdin prevented a deadlock.

Comment 5 Rob Millner 2013-04-30 18:38:42 UTC
Ops will need to manually clean up any dangling gear configure hooks running.

To Q/E this request, create an app and specify the https url to a password protected github repository.  The app creation will fail, but will not leave a half-populated gear or running configure hook.

Comment 6 Meng Bo 2013-05-08 10:03:19 UTC
Checked on devenv_3196, with the steps which rob mentioned in comment#5.

Create app with private repo which need password.

[bmeng@localhost openshift_testdir]$ rhc app create app1 php-5.3 --from-code https://bmeng@bitbucket.org/bmeng/testrepo.git
Please sign in to start a new session to
ec2-204-236-242-11.compute-1.amazonaws.com.
Password: ******

Application Options
-------------------
  Namespace:   bmeng1dev
  Cartridges:  php-5.3
  Source Code: https://bmeng@bitbucket.org/bmeng/testrepo.git
  Gear Size:   default
  Scaling:     no

Creating application 'app1' ... 
Reference ID: 5005997291f5d510497241e0cd2d55da

App creation got failed. 

App dir appears during the app creation and been removed after app creation failed.

Following event found in development.log
2013-05-08 05:58:05.324 [DEBUG] Rollback create_gear (pid:2186)
2013-05-08 05:58:05.326 [DEBUG] DEBUG: rpc_client.custom_request('cartridge_do', {:cartridge=>"openshift-origin-node", :action=>"app-destroy", :args=>{"--with-app-uuid"=>"7626bcfeb7c511e2becd22000a91e3fc", "--with-app-name"=>"app1", "--with-container-uuid"=>"7626bcfeb7c511e2becd22000a91e3fc", "--with-container-name"=>"app1", "--with-namespace"=>"bmeng1dev", "--with-request-id"=>"5005997291f5d510497241e0cd2d55da", "--cart-name"=>"openshift-origin-node"}}, ip-10-145-227-252, {'identity' => ip-10-145-227-252}) (Request ID: 5005997291f5d510497241e0cd2d55da) (pid:2186)

Mark bug as VERIFIED.


Note You need to log in before you can comment on or make changes to this bug.