Description of problem:
50X errors occur quite often when performing operations against INT via both RHC and the REST API. Moreover, the following operations are especially likely to fail (almost always):
1. Delete apps via RHC or the REST API
2. Create scaling apps
3. Create Jenkins servers
4. Enable a Jenkins client

Version-Release number of selected component (if applicable):
INT (devenv_3948)

How reproducible:
80%

Steps to Reproduce:
1. Delete apps via RHC or the REST API
2. Create scaling apps
3. Create Jenkins servers
4. Enable a Jenkins client

Actual results:
The responses of the failed operations include (not exhaustive):

* Creating application 'seejxmhn' ... Server returned an unexpected error code: 504

* [22:41:48] INFO> Shell Command: rhc app create j1 jenkins-1 -g small -l zhewang+2 -p 'redhat' --insecure --server int.openshift.redhat.com
  Application Options
  -------------------
  Domain:     ny4yg7ygnf
  Cartridges: jenkins-1
  Gear Size:  small
  Scaling:    no

  Creating application 'j1' ...
  An error occurred while communicating with the server. This problem may only be temporary.
  Check that you have correctly specified your OpenShift server 'https://int.openshift.redhat.com/broker/rest/domain/ny4yg7y

* [21:52:13] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/nodejsakn8lwx {}
  [21:57:14] ERROR> Unable to perform REST request: 502 Bad Gateway
  [21:57:14] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/nodejsakn8lwx {}
  [21:58:05] ERROR> Unable to perform REST request: 503 Service Unavailable
  [21:58:05] ERROR> Failed to delete the 'nodejsakn8lwx' app in the zcyhp0buoo domain
  [21:58:05] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/rubye8cm84u {}
  [22:02:09] ERROR> Unable to perform REST request: 500 Internal Server Error
  [22:02:09] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/rubye8cm84u {}
  [22:06:13] ERROR> Unable to perform REST request: 500 Internal Server Error
  [22:06:13] ERROR> Failed to delete the 'rubye8cm84u' app in the zcyhp0buoo domain

Expected results:
Should not return 50X errors.

Additional info:
Got an Execution Expired:
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb:125:in `lock'
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb:125:in `initialize'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-frontend-apachedb-0.2.1/lib/openshift/runtime/frontend/http/plugins/apachedb.rb:257:in `new'
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb:257:in `open'
…hift/runtime/frontend/http/plugins/nodejs-websocket.rb:35:in `destroy'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb:546:in `block in call_plugins'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb:542:in `map'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb:542:in `call_plugins'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb:193:in `destroy'
…penshift/runtime/containerization/selinux_container.rb:115:in `destroy'
…b/openshift-origin-node/model/application_container.rb:260:in `block in destroy'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:38:in `block in open'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:36:in `open'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:36:in `open'
…b/openshift-origin-node/model/application_container.rb:245:in `destroy'
…usr/libexec/mcollective/mcollective/agent/openshift.rb:351:in `oo_app_destroy'
…usr/libexec/mcollective/mcollective/agent/openshift.rb:140:in `execute_action'
…usr/libexec/mcollective/mcollective/agent/openshift.rb:105:in `cartridge_do_action'
…h/ruby193/root/usr/share/ruby/mcollective/rpc/agent.rb:86:in `handlemsg'
…t/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'
/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:69:in `timeout'
…t/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:125:in `block in dispatch'
The mutex for nodejs-websocket.rb is being held by one of the tasks in mcollective. Stopping mcollective, ensuring that its process is really dead, and restarting it should make the problem go away; however, it's unclear what caused the block in the first place.
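The hang above has the classic shape of a blocking file lock with no bound on the wait. As a rough illustration only (this is not the actual apachedb.rb code; the `with_locked_file` helper and its default wait are invented here), a polling flock with a deadline turns an indefinite hang behind a wedged lock holder into a fast, diagnosable failure:

```ruby
# Sketch: take an exclusive lock on a file, but give up after `wait`
# seconds instead of blocking forever behind a stuck holder (such as a
# wedged mcollective task). Hypothetical helper, not apachedb.rb itself.
def with_locked_file(path, wait: 2.0)
  File.open(path, File::RDWR | File::CREAT, 0o640) do |f|
    deadline = Time.now + wait
    # Poll with a non-blocking lock request; flock returns false on
    # EWOULDBLOCK when LOCK_NB is set.
    until f.flock(File::LOCK_EX | File::LOCK_NB)
      raise "timed out waiting for lock on #{path}" if Time.now > deadline
      sleep 0.1
    end
    begin
      yield f
    ensure
      f.flock(File::LOCK_UN)
    end
  end
end
```

With a bound like this, a stuck holder would have surfaced as an immediate, specific error in the node logs rather than an opaque "execution expired" thirty frames up the stack.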
Can't currently reproduce.

$ rhc app create rm1 php-5.3
Application Options
-------------------
Domain:     rmillnerint
Cartridges: php-5.3
Gear Size:  default
Scaling:    no

Creating application 'rm1' ... done

Waiting for your DNS name to be available ... done

Cloning into 'rm1'...
The authenticity of host 'rm1-rmillnerint.int.rhcloud.com (23.20.252.31)' can't be established.
RSA key fingerprint is cf:ee:77:cb:0e:fc:02:d7:72:7e:ae:80:c0:90:88:a7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'rm1-rmillnerint.int.rhcloud.com,23.20.252.31' (RSA) to the list of known hosts.

Your application 'rm1' is now available.

  URL:        http://rm1-rmillnerint.int.rhcloud.com/
  SSH to:     526ea3d503ef64df6b00013a.rhcloud.com
  Git remote: ssh://526ea3d503ef64df6b00013a.rhcloud.com/~/git/rm1.git/
  Cloned to:  /home/rmillner/rm1

Run 'rhc show-app rm1' for more details about your app.
Correction: it's reproducible right now.

$ rhc app create rm1 php-5.3 --no-git --no-dns
Application Options
-------------------
Domain:     rmillnerint
Cartridges: php-5.3
Gear Size:  default
Scaling:    no

Creating application 'rm1' ... Server returned an unexpected error code: 504
This appears to be due to one of the disks on INT developing serious IO errors and the RAID-1 code attempting to deal with it.

Following one sample failure for app chkexsrv1, here's a summary of its events:
14:07:30.713647 configure php-5.3 on chkexsrv1
14:07:30.715116 Executing action [configure]
14:13:30.721892 execution expired

At the same time, zabbix was reporting the following processes in D state:
USER  PID    PPID   STAT %CPU WCHAN                 ELAPSED      COMMAND
root  681    2      D    0.0  wait_barrier          10-11:59:52  jbd2/dm-0-8
root  23198  23194  Dl   5.0  jbd2_log_wait_commit  00:32        ruby
root  31114  2      DN   0.2  raise_barrier         1-13:12:00   md127_resync

Throughout the day, there were log messages about one of the disks getting IO errors:
I/O error, dev xvdk, sector 194523704
*** Bug 1023857 has been marked as a duplicate of this bug. ***
Ops is dealing with the disk issues and that should resolve this problem.
The problem is still reproducible in INT (devenv_3957), but not as frequently as yesterday. Here are two failures:

1) Returned 500 when deleting an app via the REST API:
[23:59:52] ERROR> Unable to perform REST request DELETE https://int.openshift.redhat.com/broker/rest/application/526f529f03ef64ae77000003 {}: 500 Internal Server Error

2) Returned 504 when creating a non-scaling app:
Running Command - rhc app create 15pzeb3my4 php-5.3 -l xtian+test2 -p 123456 --insecure --timeout 360
Application Options
-------------------
Domain:     oxqnt76t5x
Cartridges: php-5.3
Gear Size:  default
Scaling:    no

Creating application '15pzeb3my4' ... Server returned an unexpected error code: 504

Since I don't have access to the server logs in INT, I would appreciate it if you could do some investigation in the context of this request.

Thanks,
Zhe Wang
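The earlier QE logs show the harness re-issuing the idempotent DELETE after each 50X response. That retry pattern can be factored out; the following is a generic sketch under my own naming (the `retry_on_5xx` helper is invented here, not part of the rhc test harness), where the block performs one attempt and returns the HTTP status code:

```ruby
# Sketch: retry a block that returns an HTTP status code, retrying on
# any 50X with linear backoff. Only safe for idempotent requests such
# as DELETE, which is what the QE harness was re-issuing.
def retry_on_5xx(attempts: 3, backoff: 0)
  last = nil
  attempts.times do |i|
    last = yield
    return last unless (500..599).cover?(last)
    sleep backoff * (i + 1) if backoff > 0 && i < attempts - 1
  end
  raise "gave up after #{attempts} attempts (last status #{last})"
end
```

In the failed runs above, both attempts returned 500, which is exactly the "gave up" case: retries only help with transient 50X conditions, not a node that is wedged for minutes at a time.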
Analysis of the logs for 15pzeb3my4 shows that mcollective was restarted while the gear configure was running. SIGTERM was received at 21:18:45, the last log message out of mcollective was at 21:18:47, the node API was still executing for the gear at 21:18:48, and the restart happened at 21:18:58. The broker never received its response.

Platform logs:
October 28 21:18:48 INFO Connecting frontend mapping for 526f0c7803ef647141000070/php: [] => [127.10.218.1:8080] with options: {"websocket"=>true, "protocols"=>["http", "ws"]}

Mcollective logs:
Oct 28 21:18:45 ex-std-node2 mcollectived[25013]: runner.rb:60:in `rescue in block in run' Exiting after signal: SIGTERM
Oct 28 21:18:47 ex-std-node2 mcollectived[25013]: openshift.rb:150:in `execute_action' Finished executing action [app-destroy] (0)
Oct 28 21:18:47 ex-std-node2 mcollectived[25013]: openshift.rb:112:in `cartridge_do_action' cartridge_do_action reply (0): ------ NOTIFY_ENDPOINT_DELETE: 23.20.252.31 61346 ------)
Oct 28 21:18:58 ex-std-node2 mcollectived[3442]: mcollectived:31:in `<main>' The Marionette Collective 2.2.3 started logging at info level
Gear 526f529f03ef64ae77000003 seems to have failed for a related reason. It looks like mcollective crashed in the middle of the post-configure step for the gear.

The sequence in the mcollective logs:
02:16:45 post-configure starts
02:16:48 rsync runs successfully
02:16:49 last mcollective log before restart
02:46:57 mcollective restart (I don't see what killed mcollective; it was down for a long time)
02:59:42 deconfigure is called
02:59:45 deconfigure is successful
02:59:47 app-destroy is called
02:59:53 app-destroy is successful

The sequence in the broker logs:
02:27:50.173 DELETE request
02:59:41.210 DELETE request -> exception leading to a 500 error
The logs for ex-std-node3.int don't reveal a cause for mcollective to go down. It just stops outputting logs at 02:16:49 and goes through startup 30 minutes later. There's no related peripheral activity at the time.
The broker exception happens when a prior operation is in the middle of a rollback. My reading of the logs is that the first delete call triggered a rollback, which waited for mcollective to come back online. The second delete operation arrived while the first one was still busy rolling back (deleting the app), so it raised an exception, resulting in a 500 error. The exception is intentional; however, the broker devs agree that a 500 error is not as good as a message stating that the broker is in the middle of a rollback for that gear.
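To make the suggested improvement concrete, here is a minimal sketch of the shape the broker devs are describing: the second DELETE for a gear still mid-rollback gets an explicit "rollback in progress" response rather than a generic 500. Everything here (`RollbackInProgress`, `handle_delete`, the 409 code choice) is hypothetical and not the actual broker code:

```ruby
# Hypothetical error type for "a rollback is already running for this gear".
class RollbackInProgress < StandardError; end

# Sketch of a delete handler: if the app is already being rolled back,
# answer with a specific conflict message instead of letting the
# exception bubble up as 500 Internal Server Error.
def handle_delete(app, pending_rollbacks)
  raise RollbackInProgress, "rollback already running for #{app}" if pending_rollbacks.include?(app)
  [200, "Deleted #{app}"]
rescue RollbackInProgress => e
  [409, e.message]  # Conflict: retry once the rollback completes
end
```

A 409-style response also tells clients like rhc that the request is retryable once the rollback finishes, which a bare 500 does not.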
So far, each of the failures was due to some system issue:
1. A disk failed; the RAID rebuild was saturating the node with disk IO, causing OpenShift operations to take too long and time out.
2. Mcollective was deliberately restarted, causing executing operations to fail.
3. Mcollective crashed, causing executing operations to fail.

Unfortunately, we weren't able to determine the root cause for issue #3. I'm passing this back to Q/E to see if the problems repeat tonight. I've verified that the disk issues were dealt with; the mcollective restart happened unusually late yesterday.
Dropping severity since this appears to be specific to system conditions on INT nodes rather than code, so it should not affect STG or PROD.
INT (devenv_3965): when I create an app with --from-code on INT, it always shows "Server returned an unexpected error code: 504".
Re-raising severity since this new symptom is not known to be constrained to INT.
Hi, on INT yesterday the failure rate for creating apps with the '--from-code' flag or with multiple cartridges was about 80%. Today it is about 30%. When creation fails, 'rhc domain show' still lists the failed app, but deleting that app reports that it was not found.

1. [wxj@wangxiuj .openshift]$ rhc app create railsapp1 ruby-1.9 postgresql-9.2 --from-code https://github.com/BanzaiMan/openshift-rails-example-postgis.git
Application Options
-------------------
Domain:      wxj
Cartridges:  ruby-1.9, postgresql-9.2
Source Code: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
Gear Size:   default
Scaling:     no

Creating application 'railsapp1' ... Server returned an unexpected error code: 504

2. [wxj@wangxiuj .openshift]$ rhc domain show
Domain wxj (owned by xiuwang+1)
------------------------------------------
Created:            Oct 25 11:09 AM
Allowed Gear Sizes: small

railsapp1 @ http://railsapp1-wxj.int.rhcloud.com/ (uuid: 5271c6bd6cec0e51a900004e)
----------------------------------------------------------------------------------
Domain:          wxj
Created:         10:55 AM
Gears:           1 (defaults to small)
Git URL:         ssh://5271c6bd6cec0e51a900004e.rhcloud.com/~/git/railsapp1.git/
Initial Git URL: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
SSH:             5271c6bd6cec0e51a900004e.rhcloud.com
Deployment:      auto (on git push)

  ruby-1.9 (Ruby 1.9)
  -------------------
    Gears: Located with postgresql-9.2

  postgresql-9.2 (PostgreSQL 9.2)
  -------------------------------
    Gears:          Located with ruby-1.9
    Connection URL: postgresql://$OPENSHIFT_POSTGRESQL_DB_HOST:$OPENSHIFT_POSTGRESQL_DB_PORT
    Database Name:  railsapp1
    Password:       vcyMaZe1yypw
    Username:       admins9u9jvw

You have 1 applications in your domain.

3. [wxj@wangxiuj .openshift]$ rhc app delete railsapp1 --confirm
Deleting application 'railsapp1' ...
Application '5271c6bd6cec0e51a900004e' not found.
Gear 5271c6bd6cec0e51a900004e failed the post-configure step with a timeout at around 23:02:44 (EDT). At 23:02:50, rollback began deconfiguring postgres, then ruby at 23:03:03. The gear destroy call was made at 23:03:27 and finished successfully at 23:03:42. At that point, the broker should not have shown the app, since it had been successfully deleted.

The cause of the gear timeout was that the app took too long to build, likely related to system load at the time; it is a reminder of the fundamental flaws in how we use timeouts. There may also be a bug in the broker related to rollback, since the application still shows up after deletion. Checking those logs next.
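The "execution expired" failures throughout this bug come from the fixed-deadline pattern in Ruby's timeout.rb, which appears directly in the earlier traceback. A minimal sketch of that pattern follows (the `run_action` wrapper and the 360-second figure are illustrative, not the actual mcollective agent code); the flaw is that a wall-clock cap kills operations that are slow but still making progress under IO load:

```ruby
require 'timeout'

# Sketch of the mcollective-style action wrapper: the whole action gets
# one wall-clock budget via Timeout.timeout. A build that is progressing
# slowly (e.g. behind a RAID rebuild) dies exactly like a hung one.
def run_action(limit = 360)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error => e
  [:failed, e.message]  # surfaces to the broker as a 50X response
end
```

A progress-aware deadline (reset the clock whenever the step emits output or advances a phase) would distinguish the slow-but-healthy build above from a genuinely wedged one; the fixed cap cannot.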
A new ticket was created to manage the broker issue with stale application data.
The remaining fixes for the latest symptoms are tracked in Bug 1025482.

*** This bug has been marked as a duplicate of bug 1025482 ***