Bug 1265183
| Summary: | gear at nproc limit blocks the pending_ops queue | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | clemens |
| Component: | oc | Assignee: | Timothy Williams <tiwillia> |
| Status: | CLOSED WONTFIX | QA Contact: | Wei Sun <wsun> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.x | CC: | abhgupta, aos-bugs, dmcphers, jgoulding, jokerman, mmccomas, mwhittin, nutrilord0, rthrashe |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-31 18:22:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1277547 | | |
Description
clemens
2015-09-22 10:42:27 UTC
Seems related to bug 1160494. Attempts to forcestop the gear were unsuccessful:

    sudo oo-admin-ctl-gears forcestopgear 560114817628e139ef0000d3
    Stopping gear 560114817628e139ef0000d3 ... [ FAILED ]
    CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy

Relevant logs from the node:

    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Shell command '/sbin/runuser -s /bin/sh 560114817628e139ef0000d3 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c5,c458' /bin/sh -c \"set -e; /var/lib/openshift/560114817628e139ef0000d3/jenkins-client/bin/control stop \""' ran. rc=125 out=
    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Disconnecting frontend mapping for 560114817628e139ef0000d3/jenkins-client: []
    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Deleting cartridge directory for 560114817628e139ef0000d3/jenkins-client
    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Deleted cartridge directory for 560114817628e139ef0000d3/jenkins-client
    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] Shell command 'quota -p --always-resolve -w 560114817628e139ef0000d3' ran. rc=0 out=Disk quotas for user 560114817628e139ef0000d3 (uid 5558):
    September 25 14:10:23 INFO [app_uuid=560114817628e139ef0000d3] openshift-agent: request end: action=cartridge_do, requestid=12712adb60ff5191845ba510938877de, senderid=mcollect.cloud.redhat.com, statuscode=1, data={:time=>nil, :output=>"\nCLIENT_ERROR: /sbin/runuser: [HIDDEN] set user id: Resource temporarily unavailable\n\nCLIENT_MESSAGE: Resources unavailable for operation. You may need to run 'rhc force-stop-app -a test' and retry.\n", :exitcode=>222, :addtl_params=>nil}
    September 25 14:10:40 INFO [] AdminGearsControl: initialized for gear(s) 560114817628e139ef0000d3
    September 25 14:10:40 INFO [] 560114817628e139ef0000d3 disable-server against 'haproxy'
    September 25 14:10:40 INFO [] Shell command '/sbin/runuser -s /bin/sh 560114817628e139ef0000d3 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c5,c458' /bin/sh -c \"set -e; /var/lib/openshift/560114817628e139ef0000d3/haproxy/bin/control disable-server 560114817628e139ef0000d3\""' ran. rc=125 out=
    September 25 14:10:40 ERROR [] (295221) Stopping gear 560114817628e139ef0000d3 ... [ FAILED ] CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
    September 25 14:10:40 ERROR [] Gear: 560114817628e139ef0000d3 failed, Error: CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
    September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 failed, Exception: #<OpenShift::Runtime::Utils::ShellExecutionException: CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy>
    September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 failed, Backtrace:
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1380:in `block in do_control_with_directory'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1170:in `process_cartridges'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1343:in `do_control_with_directory'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/v2_cart_model.rb:1192:in `do_control'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1522:in `update_proxy_status_for_gear'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1589:in `update_local_proxy_status'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1540:in `update_remote_proxy_status'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/model/application_container_ext/cartridge_actions.rb:1669:in `block in update_proxy_status'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/utils/threads.rb:21:in `call'
      /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.36.3/lib/openshift-origin-node/utils/threads.rb:21:in `block in map'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:345:in `call'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:345:in `call_with_index'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:188:in `block (3 levels) in work_in_threads'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:352:in `with_instrumentation'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:186:in `block (2 levels) in work_in_threads'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:180:in `loop'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:180:in `block in work_in_threads'
      /opt/rh/ruby193/root/usr/share/gems/gems/parallel-0.8.0/lib/parallel.rb:65:in `block (2 levels) in in_threads'
    September 25 14:10:40 INFO [] Gear: 560114817628e139ef0000d3 output, CLIENT_ERROR: Failed to execute: 'control disable-server' for /var/lib/openshift/560114817628e139ef0000d3/haproxy
    September 25 14:15:44 INFO [app_uuid=560114817628e139ef0000d3] openshift-agent: request start: action=cartridge_do requestid=c9daf8413721546ab7b06a2d7462b061, senderid=mcollect.cloud.redhat.com, data={:time=>nil, :output=>nil, :exitcode=>nil, :addtl_params=>nil}

Only after a `pkill -9` was run for the user was I able to successfully forcestop the gear.

clemens, you should be able to delete your application now. Please try again.

Thanks for solving :)

I am reopening this bug because we are seeing this regularly with aerogear applications, and I believe it is somehow different from the bugs we've addressed previously around handling of the nproc limit. Not urgent, since the original issue has been resolved, but aerogear applications on existing small gears can get resource starved.

As per our discussion, looking for more information on how often this is occurring with the new nproc limits.

Across our environment, we are seeing about 160 different gears per week hit this issue (and it's not the same 160 every week; it's about 50% repeats and 50% new). For gears where this is a problem, it typically happens about once per day on average, but at the high end, 4 to 5 times per day. In the vast majority of cases, the gear is running the wildfly-10 cartridge with a colocated database. I have made a comment in https://github.com/openshift-cartridges/openshift-wildfly-cartridge/issues/32 mentioning this. The worst gear, however, is running only the mongodb cartridge, as part of a scaled app. We restart it literally every ten minutes (the current frequency of our cron job), and I imagine it hits the limit almost immediately after the restart.

My concerns here are:

1. The owner of the application still has no way to get their own app out of this state; they just have to wait for our cron job to run, and then try to push code / stop the gear / etc. before it hits the problem again. It would be nice if they could issue a "force stop" that would essentially bypass the pending ops queue somehow.
2. We currently are not distinguishing between apps "stuck" at the nproc limit (due to thread leaks or misconfiguration of a cartridge) and those that just briefly hit the limit and would have returned to normal on their own.

*** Bug 1299198 has been marked as a duplicate of this bug. ***

We apologize; however, we do not plan to address this report at this time. The majority of our active development is for the v3 version of OpenShift. If you would like Red Hat to reconsider this decision, please reach out to your support representative. We are very sorry for any inconvenience this may cause.
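The "Resource temporarily unavailable" error from runuser in the logs is what fork/setuid report (EAGAIN) when the gear's system user is already at its RLIMIT_NPROC. A minimal sketch of that failure mode, assuming a Linux host; the function name is illustrative and not part of OpenShift. Note that root typically bypasses RLIMIT_NPROC, so the fork succeeds there:

```python
import errno
import os
import resource

def fork_at_nproc_limit():
    """Lower our own soft nproc limit below current usage and try to fork,
    roughly what happens when runuser enters a gear that is at its limit.
    Returns "EAGAIN" if the fork fails the way the node logs show, or
    "forked" if the kernel lets it through (e.g. when running as root)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    resource.setrlimit(resource.RLIMIT_NPROC, (1, hard))  # we already have >1 process
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)       # child: exit immediately
        os.waitpid(pid, 0)
        return "forked"
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return "EAGAIN"   # "Resource temporarily unavailable"
        raise
    finally:
        resource.setrlimit(resource.RLIMIT_NPROC, (soft, hard))

print(fork_at_nproc_limit())
```

This also explains why only `pkill -9` unblocks the gear: delivering SIGKILL from outside requires no new process in the gear user's context, whereas every control command must first fork and setuid into it.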
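Concern (2) above, distinguishing gears pinned at the nproc limit from gears that briefly spike, could be approximated node-side by sampling each gear user's process count over time. A sketch under stated assumptions: a Linux node with /proc, and `nproc_usage`/`is_stuck` as hypothetical helper names, not existing OpenShift tooling:

```python
import os
import pwd

def nproc_usage(username):
    """Count processes owned by `username` by scanning /proc (Linux only).
    Threads also count toward RLIMIT_NPROC, so for heavily threaded gears
    this is a lower bound on actual usage."""
    uid = pwd.getpwnam(username).pw_uid
    count = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            if os.stat("/proc/" + entry).st_uid == uid:
                count += 1
        except FileNotFoundError:
            pass  # process exited mid-scan
    return count

def is_stuck(samples, limit, threshold=0.9):
    """Treat a gear as 'stuck' if every sampled process count stays at or
    above `threshold` of its nproc limit; a transient spike will show at
    least one sample below that."""
    return all(s >= threshold * limit for s in samples)
```

A cron job could then sample each gear user every few minutes and restart only gears where `is_stuck` holds across the window, instead of restarting on a fixed schedule as the report describes.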