Bug 1265183

Summary: gear at nproc limit blocks the pending_ops queue
Product: OpenShift Online
Reporter: clemens
Component: oc
Assignee: Timothy Williams <tiwillia>
Status: CLOSED WONTFIX
QA Contact: Wei Sun <wsun>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.x
CC: abhgupta, aos-bugs, dmcphers, jgoulding, jokerman, mmccomas, mwhittin, nutrilord0, rthrashe
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-31 18:22:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1277547    

Description clemens 2015-09-22 10:42:27 UTC
Description of problem:
Deletion of app failed. 

Deleting application 'test' ... Resources unavailable for operation. You may need to run 'rhc force-stop-app -a test' and retry.
/sbin/runuser: cannot set user id: Resource temporarily unavailable

I ran 'rhc force-stop-app -a test' several times; it failed the same way each time.

Failed to delete application "test"
Could not request https://openshift.redhat.com/broker/rest/domain/sweep/application/test: Shell command '/sbin/runuser -s /bin/sh 560114817628e139ef0000d3 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c5,c458' /bin/sh -c \"/var/lib/openshift/560114817628e139ef0000d3/jenkins-client/bin/setup --version 1\""' returned an error. rc=125

Unhandled exception reference #471fb9944704a895b8280080e43e2f00: Failed.  Response code = 422.  Response message = Unprocessable Entity.
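
For context: "cannot set user id: Resource temporarily unavailable" from runuser is the usual symptom of the gear's UID sitting at its nproc (RLIMIT_NPROC) ceiling, which matches the bug summary. A minimal one-shot diagnostic sketch on the node (not an official tool; it assumes the gear username equals the gear UUID, as in the runuser command above):

  GEAR=560114817628e139ef0000d3
  # Count tasks (processes plus threads) owned by the gear user; RLIMIT_NPROC counts both.
  ps -L -u "$GEAR" --no-headers | wc -l
  # Compare against the soft "Max processes" limit of any of its processes.
  PID=$(pgrep -u "$GEAR" | head -n 1)
  grep 'Max processes' /proc/"$PID"/limits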

Comment 1 Max Whittingham 2015-09-25 18:38:18 UTC
Seems related to bug 1160494.

Attempts to force-stop the gear were unsuccessful:
sudo oo-admin-ctl-gears forcestopgear 560114817628e139ef0000d3
Stopping gear 560114817628e139ef0000d3 ... [  FAILED ]
 CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy


Relevant logs from the node:
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Shell command '/sbin/runuser -s /bin/sh 560114817628e139ef0000d3 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c5,c458' /bin/sh -c \"set -e; /var/lib/openshift/560114817628e139ef0000d3/jenkins-client/bin/control stop \""' ran. rc=125 out=
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Disconnecting frontend mapping for 560114817628e139ef0000d3/jenkins-client: []
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Deleting cartridge directory for 560114817628e139ef0000d3/jenkins-client
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Deleted cartridge directory for 560114817628e139ef0000d3/jenkins-client
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Shell command 'quota -p --always-resolve -w 560114817628e139ef0000d3' ran. rc=0 out=Disk quotas for user 560114817628e139ef0000d3 (uid 5558): 
September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] openshift-agent: request end: action=cartridge_do, requestid=12712adb60ff5191845ba510938877de, senderid=mcollect.cloud.redhat.com, statuscode=1, data={:time=>nil, :output=>"\nCLIENT_ERROR: /sbin/runuser: [HIDDEN] set user id: Resource temporarily unavailable\n\nCLIENT_MESSAGE: Resources unavailable for operation. You may need to run 'rhc force-stop-app -a test' and retry.\n", :exitcode=>222, :addtl_params=>nil}
September 25 14:10:40 INFO [] AdminGearsControl: initialized for gear(s) 560114817628e139ef0000d3
September 25 14:10:40 INFO [] 560114817628e139ef0000d3 disable-server against 'haproxy'
September 25 14:10:40 INFO [] Shell command '/sbin/runuser -s /bin/sh 560114817628e139ef0000d3 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c5,c458' /bin/sh -c \"set -e; /var/lib/openshift/560114817628e139ef0000d3/haproxy/bin/control disable-server 560114817628e139ef0000d3\""' ran. rc=125 out=
September 25 14:10:40 ERROR [] (295221) Stopping gear 560114817628e139ef0000d3 ... [ FAILED ]
  CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
September 25 14:10:40 ERROR [] Gear: 560114817628e139ef0000d3 failed, Error: CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 failed, Exception: #<OpenShift::Runtime::Utils::ShellExecutionException: CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy>
September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 failed, Backtrace: ["/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1380:in `block in do_control_with_directory'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1170:in `process_cartridges'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1343:in `do_control_with_directory'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1192:in `do_control'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1522:in `update_proxy_status_for_gear'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1589:in `update_local_proxy_status'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1540:in `update_remote_proxy_status'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1669:in `block in update_proxy_status'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/utils/threads.rb:21:in `call'", "/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/utils/threads.rb:21:in `block in map'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:345:in `call'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:345:in `call_with_index'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:188:in `block (3 levels) in work_in_threads'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:352:in `with_instrumentation'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:186:in `block (2 levels) in work_in_threads'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:180:in `loop'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:180:in `block in work_in_threads'", "/opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:65:in `block (2 levels) in in_threads'"]
September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 output, CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
September 25 14:15:44 INFO [app_uuid=560114817628e139ef0000d3] openshift-agent: request start: action=cartridge_do requestid=c9daf8413721546ab7b06a2d7462b061, senderid=mcollect.cloud.redhat.com, data={:time=>nil, :output=>nil, :exitcode=>nil, :addtl_params=>nil}


Only after running pkill -9 against the gear user was I able to successfully force-stop the gear.
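
In command form, the workaround amounts to something like the following (a sketch of what was done in this incident, not a recommended procedure; killing every process owned by the gear user is disruptive):

  # Clear the gear user's task count so setuid can succeed again...
  sudo pkill -9 -u 560114817628e139ef0000d3
  # ...then the force stop can go through.
  sudo oo-admin-ctl-gears forcestopgear 560114817628e139ef0000d3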

Comment 2 Timothy Williams 2015-09-25 18:40:04 UTC
clemens, you should be able to delete your application now. Please try again.

Comment 3 clemens 2015-09-25 18:40:59 UTC
Thanks for solving it :)

Comment 6 Andy Grimm 2015-10-15 18:01:43 UTC
I am reopening this bug because we are seeing this regularly with aerogear applications, and I believe it is somehow different from the nproc-limit handling issues we have addressed previously.

Comment 7 Abhishek Gupta 2015-10-19 20:35:05 UTC
Not urgent, since the original issue has been resolved and aerogear applications on existing small gears are expected to become resource-starved.

Comment 8 Rory Thrasher 2016-02-02 22:11:49 UTC
As per our discussion, we are looking for more information on how often this is occurring with the new nproc limits.

Comment 9 Andy Grimm 2016-02-03 15:25:21 UTC
Across our environment, we are seeing about 160 different gears per week hit this issue (and that's not the same 160 every week; it's about 50% repeats and 50% new).  For gears where this is a problem, it's typical to see it happen about once per day on average, but at the high end, it's 4 to 5 times per day.

In the vast majority of cases, the gear is running the wildfly-10 cartridge with a colocated database.  I have made a comment in https://github.com/openshift-cartridges/openshift-wildfly-cartridge/issues/32 mentioning this.

The worst gear, however, is running only the mongodb cartridge, as part of a scaled app.  We restart it literally every ten minutes (the current frequency of our cron job), and I imagine it hits the limit almost immediately after the restart.

My concerns here are:

1) the owner of the application still has no way to get their own app out of this state; they just have to wait for our cron job to run, and then try to push code / stop the gear / etc. before it hits the problem again.  It would be nice if they could issue a "force stop" that would essentially bypass the pending ops queue somehow.

2) we currently are not distinguishing between apps "stuck" at the nproc limit (due to thread leaks or misconfiguration of a cartridge) and those that just briefly hit the limit and would have returned to normal on their own.
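
For the second concern, one hypothetical way (not anything in the product; the interval and sample count below are arbitrary assumptions) to separate gears that are stuck at the limit from ones that only spike briefly is to sample the gear's task count against its nproc limit several times before flagging it:

  GEAR=560114817628e139ef0000d3
  PID=$(pgrep -u "$GEAR" | head -n 1)
  [ -z "$PID" ] && exit 0
  LIMIT=$(awk '/Max processes/ {print $3}' /proc/"$PID"/limits)
  STUCK=0
  for i in 1 2 3; do
      COUNT=$(ps -L -u "$GEAR" --no-headers | wc -l)
      [ "$COUNT" -ge "$LIMIT" ] && STUCK=$((STUCK + 1))
      sleep 60
  done
  # Only flag the gear as stuck if it was at the limit on every sample.
  [ "$STUCK" -eq 3 ] && echo "$GEAR appears stuck at its nproc limit"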

Comment 16 Rory Thrasher 2016-04-04 20:36:33 UTC
*** Bug 1299198 has been marked as a duplicate of this bug. ***

Comment 17 Eric Paris 2017-05-31 18:22:11 UTC
We apologize, however, we do not plan to address this report at this time. The majority of our active development is for the v3 version of OpenShift. If you would like for Red Hat to reconsider this decision, please reach out to your support representative. We are very sorry for any inconvenience this may cause.