1055653 – If deployment registration fails in a scaled app, haproxy needs restarting

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Paul Morie
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1054944

Reported:	2014-01-20 17:21 UTC by Luke Meyer
Modified:	2015-05-14 23:33 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1054944
Environment:
Last Closed:	2014-03-12 03:06:06 UTC
Target Upstream Version:
Embargoed:

Description Luke Meyer 2014-01-20 17:21:16 UTC

+++ This bug was initially created as a clone of Bug #1054944 +++

Description of problem:
When a git push to a scaled app is done, the new deployment is registered with the broker. If this fails for any reason, activation of the deploy fails and HAproxy is left in a state where it returns 503 errors until restarted.

How reproducible:
100% so far

Steps to Reproduce:
1. Create a scaled app.
2. On the broker: service httpd stop
3. git push a change to the app.
4. Try to access the app. To clearly see what's happening, start broker httpd again, port-forward from the app, and curl -I each of the forwarded ports. The one from HAproxy will return 503 even though the framework cartridge itself is 200.

Actual results:
App is unavailable, error 503

Expected results:
App is available, even though the broker is not.


Additional info:

This doesn't appear to be a problem with non-scaled apps - deployment still fails but the app is available. It's just HAproxy that doesn't survive the deployment; perhaps there's some haproxy reconfigure step that's supposed to complete after the deployment is registered?

It occurs with Online too. Fortunately our brokers are never down.

Comment 1 Paul Morie 2014-01-21 21:20:58 UTC

PR submitted.  I made some of the post-receive output more readable and changed activation so that a failure to report deployments to the broker no longer fails the activation call for that deployment.  The entire list of deployments is transmitted to the broker each time a deployment is reported, so the system will recover on the next git push.  There will be a message in the post-receive output if this happens.
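
The recovery behavior described in this comment can be illustrated with a minimal Ruby sketch. It is not the origin-server code: `FakeBroker`, `report_deployments`, `activate`, and the deployment ids are hypothetical, and the broker is simulated in memory. It only shows why sending the full deployment list on every report lets a missed report be corrected by the next push.

```ruby
# Hypothetical in-memory stand-in for the broker; not the real broker API.
class FakeBroker
  attr_accessor :available
  attr_reader :deployments

  def initialize
    @available = true
    @deployments = []
  end

  # Replaces the broker's whole view with the list it is sent.
  def report_deployments(list)
    raise IOError, 'broker unavailable' unless @available

    @deployments = list.dup
  end
end

# Activates a deployment locally, then reports to the broker best-effort.
# A reporting failure adds a warning but does not fail the activation.
def activate(local_deployments, id, broker)
  local_deployments << id
  messages = []
  begin
    broker.report_deployments(local_deployments)
  rescue StandardError
    messages << 'Failed to report deployment to broker.  ' \
                'This will be corrected on the next git push.'
  end
  { status: 'success', messages: messages }
end

broker = FakeBroker.new
local = []

broker.available = false
first = activate(local, 'dep-1', broker)   # broker down: warning only

broker.available = true
second = activate(local, 'dep-2', broker)  # next push resends the full list
```

In this sketch the first activation still reports success and carries the warning, and after the second activation the simulated broker holds both deployments.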

Comment 2 openshift-github-bot 2014-01-22 03:10:49 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/19e2995306bff7bea037823675f5cf279bafe880
Fix bug 1055653 and improve post-receive output readability

Comment 3 Meng Bo 2014-01-22 08:23:29 UTC

Checked on devenv_4257. The behaviour is the same as in the description, but the git push output has changed as follows.


remote: Starting PHP 5.4 cartridge (Apache+mod_php)
remote: -------------------------
remote: Git Post-Receive Result: failure
remote: Activation status: failure
remote: Activation failed for the following gears:
remote: 52df7edc0454ff324a000020 (Error activating gear: Connection refused - connect(2))
remote: Deployment completed with status: failure
remote: postreceive failed
To ssh://52df7edc0454ff324a000020.rhcloud.com/~/git/app1s.git/


Move bug to verified.

Comment 4 Luke Meyer 2014-01-22 18:52:21 UTC

Paul indicated the behavior should be improved as well as the output.

I don't think this fix made it into the devenv. Mine has rubygem-openshift-origin-node-1.19.13 which was prior to the patch being merged, and the problem still exists, with the output unchanged as above. Adam just tagged rubygem-openshift-origin-node-1.19.14-1 which should show up in the next devenv.

Should be considered fixed if:
1. A scaled app is still reachable (e.g. by curl) after a deployment, even though that deployment wasn't successfully registered with the broker.
2. When the deployment fails to be registered, the message shows "Failed to report deployment to broker.  This will be corrected on the next git push." per https://github.com/openshift/origin-server/commit/19e2995306bff7bea037823675f5cf279bafe880#diff-3dd15e5626c3052b01cc71d63ea06fb5R587

Comment 5 Meng Bo 2014-01-23 12:32:32 UTC

According to comment #4, the issue is still present.

1. Create a scaled app.
2. Stop the httpd service on the broker.
3. Do a git push for the app.

The output is the same as in my comment #3, and the cartridge status is as follows:

To connect to a service running on OpenShift, use the Local address

Service Local               OpenShift
------- -------------- ---- ----------------
haproxy 127.0.0.1:8082  =>  127.1.244.2:8080
haproxy 127.0.0.1:8083  =>  127.1.244.3:8080
httpd   127.0.0.1:8084  =>  127.1.244.1:8080

Press CTRL-C to terminate port forwarding

[root@ip-10-239-20-157 app2s]# 
[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8082
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8083
HTTP/1.0 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8084
HTTP/1.1 200 OK
Date: Thu, 23 Jan 2014 12:31:08 GMT
Server: Apache/2.2.15 (Red Hat)
Connection: close
Content-Type: text/html

Comment 6 Andy Goldstein 2014-01-23 14:09:05 UTC

Per the RestClient code, it returns the response if the code is 200..206, follows the redirect if it is 301, 302, 303, or 307, and raises for all other codes. In this case, with the broker stopped, it will raise RestClient::ServiceUnavailable. I think we need to rescue exceptions here: https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/model/application_container.rb#L581-L586
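
A rough Ruby model of the status handling described in this comment, and of rescuing the resulting exception, might look like the following. It does not use or reproduce RestClient: `HttpStatusError`, `handle_status`, and `report_status` are hypothetical names, and the status ranges are taken only from the description above.

```ruby
# Hypothetical error type standing in for the exception an HTTP client
# raises on an unexpected status code.
class HttpStatusError < StandardError
  attr_reader :code

  def initialize(code)
    @code = code
    super("HTTP status #{code}")
  end
end

# Models the dispatch described above; this is not RestClient itself.
def handle_status(code)
  case code
  when 200..206 then :return_response
  when 301, 302, 303, 307 then :follow_redirect
  else raise HttpStatusError.new(code)
  end
end

# Returns nil when the report goes through, or a warning string when the
# status handling raises, so the caller can carry on with activation.
def report_status(code)
  handle_status(code)
  nil
rescue HttpStatusError => e
  "Failed to report deployment to broker (#{e.message})."
end
```

With the rescue in place, a 503 or 401 from the broker becomes a warning for the caller to print instead of an exception that aborts activation.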

Comment 7 Luke Meyer 2014-01-23 14:42:01 UTC

If it is relevant, aside from the obvious test case of the broker being unavailable (which could be connection refused, or a 502/503 from the proxy), the "in the wild" reason for hitting this was a 401 from the broker because broker auth was failing. In any of these cases, the app should remain available after a deployment.

Comment 8 openshift-github-bot 2014-01-23 17:23:31 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/1fa84300ec27093f0f7f10643f4d46ecd1ba8eec
Fix bug 1055653: handle exceptions from RestClient

Comment 9 Meng Bo 2014-01-24 08:54:41 UTC

I still get the same result as in comment #5, on devenv_4270.

Comment 10 Paul Morie 2014-01-24 14:36:26 UTC

I get the correct result on devenv_4270:

Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 306 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Stopping Mock cartridge
remote: Syncing git content to other proxy gears
remote: Building git ref 'master', commit 8a31ecb
remote: Building Mock cartridge
remote: Mock successfully built.
remote: Preparing build for deployment
remote: Deployment id is bb2f7d9e
remote: Distributing deployment to child gears
remote: Activating deployment
remote: HAProxy already running
remote: HAProxy instance is started
remote: Starting Mock cartridge
remote: Failed to report deployment to broker.  This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Distribution status: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e276b42d2888dd5f000020.rhcloud.com/~/git/mock.git/
   b988531..8a31ecb  master -> master
/home/pmorie/code/test_apps

Comment 11 Meng Bo 2014-01-26 06:27:07 UTC

Checked again on devenv_4278. With the rhc-broker service stopped on the broker, I get the following result.


remote: Stopping PHP 5.3 cartridge (Apache+mod_php)
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable SHELL was undefined
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable USER was undefined
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable LOGNAME was undefined
remote: Waiting for stop to finish
remote: Syncing git content to other proxy gears
remote: Building git ref 'master', commit 6bd4771
remote: Checking deplist.txt for PEAR dependency..
remote: Preparing build for deployment
remote: Deployment id is 33402440
remote: Activating deployment
remote: HAProxy already running
remote: HAProxy instance is started
remote: Starting PHP 5.3 cartridge (Apache+mod_php)
remote: Failed to report deployment to broker.  This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e481f241f06985da000011.rhcloud.com/~/git/php53s.git/
   e4e6b1f..6bd4771  master -> master

Both haproxy and the web cartridge return 200 via curl -I.


But when httpd is stopped on the broker, the result is still the same as in comment #5.

@lmeyer  Is this what you want? I will mark the bug as verified. If it is not acceptable, you can reopen this.

Comment 12 Luke Meyer 2014-01-27 13:43:35 UTC

(In reply to Meng Bo from comment #11)
> @lmeyer  Is this what you want? I will mark the bug as verified. If it is
> not acceptable, you can reopen this.

Agreed that the bug appears solved for HTTP errors (which would cover rhc-broker being down, and hopefully 401 auth errors too - can't get this to happen on a devenv so haven't tested). So that's an improvement.

But for the case of httpd being down, i.e. the original test case, it is still the original wrong behavior. I guess a connection refused sparks a different exception; I bet connection timed out and other such network problems would similarly escape the logic here. These are things that real customers run into, and while they need to fix their problems, we should not surprise them needlessly with unavailable apps. No matter *what* happens on the deployment registration, the app should continue to be available.
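
The gap described here can be sketched in Ruby using only the standard library. `BrokerHttpError`, `report_narrow`, and `report_broad` are hypothetical names and not the origin-server code. The sketch assumes that a refused connection surfaces as `Errno::ECONNREFUSED` and a timeout as `Timeout::Error`. Neither is an HTTP-level error, so a handler that rescues only HTTP errors lets both through.

```ruby
require 'timeout'

# Hypothetical HTTP-level error, standing in for what an HTTP client
# raises on a 4xx/5xx response.
class BrokerHttpError < StandardError; end

# Narrow handler: only HTTP-level errors become warnings.
def report_narrow
  yield
  :reported
rescue BrokerHttpError
  :warned
end

# Broad handler: any StandardError becomes a warning.
def report_broad
  yield
  :reported
rescue StandardError
  :warned
end

failures = {
  http_error: -> { raise BrokerHttpError, '503' },
  refused: -> { raise Errno::ECONNREFUSED },
  timed_out: -> { raise Timeout::Error, 'timed out' }
}

failures.each do |name, failure|
  narrow = begin
    report_narrow(&failure)
  rescue StandardError => e
    "escaped (#{e.class})"
  end
  puts "#{name}: narrow=#{narrow} broad=#{report_broad(&failure)}"
end
```

Under these assumptions the narrow handler absorbs only the HTTP error, while the broad handler turns all three failures into warnings, which matches the requirement that the app stay available no matter what happens during registration.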

Comment 13 openshift-github-bot 2014-01-27 17:31:39 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2a7ca5491b59bbcbbaa7504cd0c383215b28465a
Fix bug 1055653 for cases when httpd is down

Comment 14 Paul Morie 2014-01-27 18:11:19 UTC

Fixed the case where httpd is down.

Comment 15 Meng Bo 2014-01-28 03:18:27 UTC

Checked on devenv_4287. With httpd stopped on the broker, git push now also shows the meaningful warning message.

remote: Failed to report deployment to broker.  This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e71d8a6abc0c8336000007.rhcloud.com/~/git/app1.git/
   345c371..e4b9f91  master -> master


And all the cartridges are running.

To connect to a service running on OpenShift, use the Local address

Service Local               OpenShift
------- -------------- ---- ----------------
haproxy 127.0.0.1:8082  =>  127.1.244.2:8080
haproxy 127.0.0.1:8083  =>  127.1.244.3:8080
httpd   127.0.0.1:8084  =>  127.1.244.1:8080

Press CTRL-C to terminate port forwarding

[root@domU-12-31-39-0C-58-8E app1]# 
[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8082
HTTP/1.1 200 OK
Date: Tue, 28 Jan 2014 03:17:32 GMT
Server: Apache/2.2.15 (Red Hat)
Content-Type: text/html
Set-Cookie: GEAR=local-52e71d8a6abc0c8336000007; path=/
Cache-control: private

[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8083
HTTP/1.0 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8084
HTTP/1.1 200 OK
Date: Tue, 28 Jan 2014 03:17:36 GMT
Server: Apache/2.2.15 (Red Hat)
Connection: close
Content-Type: text/html
