Bug 1023854 - Frequently meet 50X errors (500, 503 and 504) in INT(devenv_3948)
Summary: Frequently meet 50X errors (500, 503 and 504) in INT(devenv_3948)
Keywords:
Status: CLOSED DUPLICATE of bug 1025482
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Rob Millner
QA Contact: libra bugs
URL:
Whiteboard:
Duplicates: 1023857
Depends On:
Blocks:
 
Reported: 2013-10-28 05:58 UTC by Zhe Wang
Modified: 2015-05-14 23:31 UTC (History)
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-31 20:42:21 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 1025482 (status: CLOSED, priority: unspecified): Application timedout in post_configure, rolled back, but was still in broker mongodb (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1025482

Description Zhe Wang 2013-10-28 05:58:13 UTC
Description of problem:
We frequently encounter 50X errors when performing operations against INT via both RHC and the REST API. Moreover, the following operations are especially likely to fail (almost always):

1. Deleting apps via RHC or the REST API
2. Creating scaling apps
3. Creating Jenkins servers
4. Enabling a Jenkins client

Version-Release number of selected component (if applicable):
INT(devenv_3948)

How reproducible:
80%

Steps to Reproduce:
1. Delete apps via RHC or the REST API
2. Create scaling apps
3. Create Jenkins servers
4. Enable a Jenkins client
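
For illustration, the rhc invocations for these steps look roughly like the following (a sketch only; app and domain names are placeholders, and cartridge names mirror what is used elsewhere in this bug, so the Jenkins client cartridge name may differ):

$ rhc app delete myapp --confirm                 # 1. delete an app
$ rhc app create scaled1 php-5.3 -s              # 2. create a scaling app
$ rhc app create j1 jenkins-1 -g small           # 3. create a Jenkins server
$ rhc cartridge add jenkins-client-1 -a myapp    # 4. enable the Jenkins client on an app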

Actual results:
The responses of the failed operations include (not exhaustive):

* Creating application 'seejxmhn' ... Server returned an unexpected error code: 504

* [22:41:48] INFO> Shell Command: rhc app create j1 jenkins-1 -g small -l zhewang+2 -p 'redhat' --insecure --server int.openshift.redhat.com
      Application Options
      -------------------
        Domain:     ny4yg7ygnf
        Cartridges: jenkins-1
        Gear Size:  small
        Scaling:    no
      
      Creating application 'j1' ... An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server 'https://int.openshift.redhat.com/broker/rest/domain/ny4yg7y

* [21:52:13] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/nodejsakn8lwx {}

[21:57:14] ERROR> Unable to perform REST request: 502 Bad Gateway
[21:57:14] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/nodejsakn8lwx {}

[21:58:05] ERROR> Unable to perform REST request: 503 Service Unavailable
[21:58:05] ERROR> Failed to delete the 'nodejsakn8lwx' app in the zcyhp0buoo domain
[21:58:05] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/rubye8cm84u {}

[22:02:09] ERROR> Unable to perform REST request: 500 Internal Server Error
[22:02:09] INFO> Rest Request: DELETE https://int.openshift.redhat.com/broker/rest/domain/zcyhp0buoo/application/rubye8cm84u {}

[22:06:13] ERROR> Unable to perform REST request: 500 Internal Server Error
[22:06:13] ERROR> Failed to delete the 'rubye8cm84u' app in the zcyhp0buoo domain

Expected results:
Should not return 50X errors.

Additional info:

Comment 1 Dan McPherson 2013-10-28 14:28:19 UTC
Got an Execution Expired:

…ib/openshift/runtime/frontend/http/plugins/apachedb.rb: 125:in `lock'
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb: 125:in `initialize'
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb: 257:in `new'
…ib/openshift/runtime/frontend/http/plugins/apachedb.rb: 257:in `open'
…hift/runtime/frontend/http/plugins/nodejs-websocket.rb:  35:in `destroy'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb: 546:in `block in call_plugins'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb: 542:in `map'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb: 542:in `call_plugins'
…16.6/lib/openshift-origin-node/model/frontend_httpd.rb: 193:in `destroy'
…penshift/runtime/containerization/selinux_container.rb: 115:in `destroy'
…b/openshift-origin-node/model/application_container.rb: 260:in `block in destroy'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:  38:in `block in open'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:  36:in `open'
…2/lib/openshift-origin-common/utils/file_needs_sync.rb:  36:in `open'
…b/openshift-origin-node/model/application_container.rb: 245:in `destroy'
…usr/libexec/mcollective/mcollective/agent/openshift.rb: 351:in `oo_app_destroy'
…usr/libexec/mcollective/mcollective/agent/openshift.rb: 140:in `execute_action'
…usr/libexec/mcollective/mcollective/agent/openshift.rb: 105:in `cartridge_do_action'
…h/ruby193/root/usr/share/ruby/mcollective/rpc/agent.rb:  86:in `handlemsg'
…t/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb: 126:in `block (2 levels) in dispatch'
         /opt/rh/ruby193/root/usr/share/ruby/timeout.rb:  69:in `timeout'
…t/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb: 125:in `block in dispatch'

Comment 2 Rob Millner 2013-10-28 17:05:41 UTC
The mutex for nodejs-websocket.rb is being held by one of the tasks in mcollective.

Stopping mcollective, ensuring that its process is really dead, and restarting it should make the problem go away; however, it's unclear what caused the block in the first place.
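
For reference, a minimal recovery sketch along those lines, assuming the standard mcollective init script and the mcollectived daemon name seen in the logs below (exact service and process names may differ on these nodes):

$ service mcollective stop
$ pgrep -fl mcollectived          # anything still listed is a leftover process
$ pkill -9 -f mcollectived        # only if a leftover process was found
$ service mcollective start
$ service mcollective status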

Comment 3 Rob Millner 2013-10-28 18:12:38 UTC
Can't currently reproduce.

$ rhc app create rm1 php-5.3
Application Options
-------------------
  Domain:     rmillnerint
  Cartridges: php-5.3
  Gear Size:  default
  Scaling:    no

Creating application 'rm1' ... done


Waiting for your DNS name to be available ... done

Cloning into 'rm1'...
The authenticity of host 'rm1-rmillnerint.int.rhcloud.com (23.20.252.31)' can't be established.
RSA key fingerprint is cf:ee:77:cb:0e:fc:02:d7:72:7e:ae:80:c0:90:88:a7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'rm1-rmillnerint.int.rhcloud.com,23.20.252.31' (RSA) to the list of known hosts.

Your application 'rm1' is now available.

  URL:        http://rm1-rmillnerint.int.rhcloud.com/
  SSH to:     526ea3d503ef64df6b00013a.rhcloud.com
  Git remote: ssh://526ea3d503ef64df6b00013a.rhcloud.com/~/git/rm1.git/
  Cloned to:  /home/rmillner/rm1

Run 'rhc show-app rm1' for more details about your app.

Comment 4 Rob Millner 2013-10-28 18:56:55 UTC
Correction: it's reproducible right now.

$ rhc app create rm1 php-5.3 --no-git --no-dns
Application Options
-------------------
  Domain:     rmillnerint
  Cartridges: php-5.3
  Gear Size:  default
  Scaling:    no

Creating application 'rm1' ... Server returned an unexpected error code: 504

Comment 5 Rob Millner 2013-10-28 21:24:41 UTC
This appears to be due to one of the disks on INT developing serious IO errors and the RAID-1 code attempting to deal with it.

Following one sample failure for app chkexsrv1, here's a summary of its events.


14:07:30.713647   configure php-5.3 on chkexsrv1
14:07:30.715116      Executing action [configure]
14:13:30.721892      execution expired


At the same time, zabbix was reporting the following processes in D state:

USER       PID  PPID STAT %CPU WCHAN           ELAPSED    COMMAND
root       681     2 D     0.0 wait_barrier   10-11:59:52 jbd2/dm-0-8
root     23198 23194 Dl    5.0 jbd2_log_wait_commit  00:32 ruby
root     31114     2 DN    0.2 raise_barrier  1-13:12:00 md127_resync


Throughout the day, there were log messages about one of the disks getting IO errors:
I/O error, dev xvdk, sector 194523704
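
A quick way to spot this condition on a node is to look for processes stuck in uninterruptible sleep and for an in-progress md resync (a sketch using standard procps/mdadm tooling; the column layout differs slightly from the zabbix output above):

$ ps -eo user,pid,ppid,stat,wchan:22,etime,comm | awk 'NR==1 || $4 ~ /^D/'
$ cat /proc/mdstat                      # shows md127 resync/recovery progress
$ dmesg | grep -i 'I/O error'           # recent block-device errors, e.g. on xvdk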

Comment 6 Rob Millner 2013-10-28 21:25:15 UTC
*** Bug 1023857 has been marked as a duplicate of this bug. ***

Comment 7 Rob Millner 2013-10-29 00:41:34 UTC
Ops is dealing with the disk issues and that should resolve this problem.

Comment 8 Zhe Wang 2013-10-29 07:35:53 UTC
The problem is still reproducible in INT(devenv_3957), but not as frequently as yesterday.

Here are two failures:

1) returned 500 when deleting an app via REST API:

[23:59:52] ERROR> Unable to perform REST request DELETE https://int.openshift.redhat.com/broker/rest/application/526f529f03ef64ae77000003 {}: 500 Internal Server Error

2) returned 504 when creating a non-scaling app

Running Command - rhc app create 15pzeb3my4 php-5.3 -l xtian+test2 -p 123456 --insecure --timeout 360
Application Options
-------------------
Domain: oxqnt76t5x
Cartridges: php-5.3
Gear Size: default
Scaling: no

Creating application '15pzeb3my4' ... Server returned an unexpected error code: 504

Since I don't have access to the server logs in INT, I would appreciate it if you could do some investigation in the context of this request.

Thanks,
Zhe Wang

Comment 9 Rob Millner 2013-10-29 18:29:36 UTC
Analysis of the logs for 15pzeb3my4 shows that mcollective was restarted while the gear configure was running.

SIGTERM was received at 21:18:45, the last log message out of mcollective was at 21:18:47, the node API was still executing for the gear at 21:18:48, and mcollective started again at 21:18:58.  The broker never received its response.


Platform logs:

October 28 21:18:48 INFO Connecting frontend mapping for 526f0c7803ef647141000070/php: [] => [127.10.218.1:8080] with options: {"websocket"=>true, "protocols"=>["http", "ws"]}



Mcollective logs:

Oct 28 21:18:45 ex-std-node2 mcollectived[25013]: runner.rb:60:in `rescue in block in run' Exiting after signal: SIGTERM

Oct 28 21:18:47 ex-std-node2 mcollectived[25013]: openshift.rb:150:in `execute_action' Finished executing action [app-destroy] (0)

Oct 28 21:18:47 ex-std-node2 mcollectived[25013]: openshift.rb:112:in `cartridge_do_action' cartridge_do_action reply (0):
------
NOTIFY_ENDPOINT_DELETE: 23.20.252.31 61346

------)

Oct 28 21:18:58 ex-std-node2 mcollectived[3442]: mcollectived:31:in `<main>' The Marionette Collective 2.2.3 started logging at info level

Comment 10 Rob Millner 2013-10-29 19:26:20 UTC
Gear 526f529f03ef64ae77000003 seems to have failed for a related reason.

It looks like mcollective crashed in the middle of the post-configure step for the gear.

The sequence in the mcollective logs:

02:16:45 post-configure starts
02:16:48    rsync runs successfully

*** I don't see what killed mcollective; it was down for a long time.
02:16:49 last mcollective log before restart
02:46:57 mcollective restart

02:59:42 deconfigure is called
02:59:45 deconfigure is successful

02:59:47 app-destroy is called
02:59:53 app-destroy is successful.


The sequence in the broker logs:
02:27:50.173 DELETE request
02:59:41.210 DELETE request -> exception leading to a 500 error.

Comment 11 Rob Millner 2013-10-29 20:56:15 UTC
The logs for ex-std-node3.int don't reveal a cause for mcollective to go down.  It just stops outputting logs at 02:16:49 and goes through startup 30 minutes later.  There's no related peripheral activity at the time.
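
If it happens again, it may be worth checking whether something killed the daemon externally, for example the OOM killer or a segfault (a sketch, assuming syslog on these nodes goes to /var/log/messages):

$ grep -iE 'out of memory|oom-killer|killed process|segfault' /var/log/messages
$ grep mcollectived /var/log/messages | grep 'Oct 29 02:'     # activity around the 02:16-02:47 gap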

Comment 12 Rob Millner 2013-10-29 22:21:01 UTC
The broker exception happens when a prior operation was in the middle of a rollback.

My reading of the logs is that the first delete call triggered a rollback, which waited for mcollective to come back online.  The second delete operation happened while the first one was still rolling back (deleting the app), and it raised an exception, resulting in a 500 error.

The exception is intentional; however, the broker devs agree that a 500 error is not as good as a message stating that the gear is in the middle of a rollback.

Comment 13 Rob Millner 2013-10-29 22:26:58 UTC
So far, each of the failures was due to some system issue:

1. Disk failed, the RAID rebuild was saturating the node with disk IO causing OpenShift operations to take too long and time out.

2. Mcollective was deliberately restarted, causing in-flight operations to fail.

3. Mcollective crashed, causing in-flight operations to fail.


Unfortunately, we weren't able to determine the root cause for issue #3.

I'm passing this back to Q/E to see if the problems repeat tonight.  I've verified that the disk issues were dealt with, and the mcollective restart happened unusually late yesterday.

Comment 14 Rob Millner 2013-10-29 22:55:40 UTC
Dropping severity since this appears to be specific to system conditions on INT nodes rather than code that would affect STG or PROD.

Comment 15 XiuJuan Wang 2013-10-30 09:18:18 UTC
INT(devenv_3965)

When I create an app with --from-code on INT, it always shows "Server returned an unexpected error code: 504".

Comment 16 Rob Millner 2013-10-31 02:03:12 UTC
Re-raising severity since this new symptom is not known to be constrained to INT.

Comment 17 XiuJuan Wang 2013-10-31 03:26:28 UTC
Hi,
On INT, the failure rate yesterday for creating apps with the '--from-code' flag or with multiple cartridges was about 80%, but today it is about 30%.

When app creation failed, 'rhc domain show' still listed the failed app, but attempting to delete that app only reported that it was not found.

1.[wxj@wangxiuj .openshift]$ rhc app create railsapp1 ruby-1.9 postgresql-9.2  --from-code https://github.com/BanzaiMan/openshift-rails-example-postgis.git
Application Options
-------------------
  Domain:      wxj
  Cartridges:  ruby-1.9, postgresql-9.2
  Source Code: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
  Gear Size:   default
  Scaling:     no

Creating application 'railsapp1' ... Server returned an unexpected error code: 504
2.[wxj@wangxiuj .openshift]$ rhc domain show
Domain wxj (owned by xiuwang+1)
------------------------------------------

  Created:            Oct 25 11:09 AM
  Allowed Gear Sizes: small
 railsapp1 @ http://railsapp1-wxj.int.rhcloud.com/ (uuid: 5271c6bd6cec0e51a900004e)
  ----------------------------------------------------------------------------------
    Domain:          wxj
    Created:         10:55 AM
    Gears:           1 (defaults to small)
    Git URL:         ssh://5271c6bd6cec0e51a900004e.rhcloud.com/~/git/railsapp1.git/
    Initial Git URL: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
    SSH:             5271c6bd6cec0e51a900004e.rhcloud.com
    Deployment:      auto (on git push)

    ruby-1.9 (Ruby 1.9)
    -------------------
      Gears: Located with postgresql-9.2

    postgresql-9.2 (PostgreSQL 9.2)
    -------------------------------
      Gears:          Located with ruby-1.9
      Connection URL: postgresql://$OPENSHIFT_POSTGRESQL_DB_HOST:$OPENSHIFT_POSTGRESQL_DB_PORT
      Database Name:  railsapp1
      Password:       vcyMaZe1yypw
      Username:       admins9u9jvw

You have 1 applications in your domain.
3.[wxj@wangxiuj .openshift]$ rhc app delete railsapp1 --confirm
Deleting application 'railsapp1' ... Application '5271c6bd6cec0e51a900004e' not found.

Comment 18 Rob Millner 2013-10-31 18:12:59 UTC
Gear 5271c6bd6cec0e51a900004e failed the post-configure step with a timeout at around 23:02:44 (EDT).

At 23:02:50, roll-back began deconfiguring postgres, then ruby at 23:03:03.  The gear destroy call was made at 23:03:27 and finished successfully at 23:03:42.

At that point, the broker should not have shown the app since it had been successfully deleted.

The gear timed out because the build took too long, likely due to system load at the time; it is a reminder of the fundamental flaws in how we use timeouts.

There may be a bug in the broker related to rollback, though, since the application still shows up after deletion.  Checking those logs next.

Comment 19 Rob Millner 2013-10-31 19:28:04 UTC
A new ticket was created to manage the broker issue with stale application data.

Comment 20 Rob Millner 2013-10-31 20:42:21 UTC
The remaining work to fix the latest symptoms is tracked in Bug 1025482.

*** This bug has been marked as a duplicate of bug 1025482 ***

