+++ This bug was initially created as a clone of Bug #1054944 +++

Description of problem:
When a git push to a scaled app is done, the new deployment is registered with the broker. If this fails for any reason, activation of the deploy fails and HAproxy is left in a state where it returns 503 errors until restarted.

How reproducible:
100% so far

Steps to Reproduce:
1. Create a scaled app.
2. on broker: service httpd stop
3. git push a change to the app
4. Try to access the app.

To clearly see what's happening, start broker httpd again, port-forward from the app, and curl -I each of the forwarded ports. The one from HAproxy will return 503 even though the framework cartridge itself is 200.

Actual results:
App is unavailable, error 503

Expected results:
App is available, even though the broker is not.

Additional info:
This doesn't appear to be a problem with non-scaled apps - deployment still fails but the app is available. It's just HAproxy that doesn't survive the deployment; perhaps there's some haproxy reconfigure step that's supposed to complete after the deployment is registered? It occurs with Online too. Fortunately our brokers are never down.
PR submitted. I changed some of the post-receive output to be more readable and made a failure to report deployments to the broker no longer mean the failure of the activation call for that deployment. The entire list of deployments is transmitted to the broker each time a deployment is reported, so the system will recover on the next git push. There will be a message in the post-receive output if this happens.
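To illustrate why resending the entire list lets the system recover, here is a minimal Ruby sketch. It is not the origin-server code: `FakeBroker`, `report_deployments`, and the deployment ids are hypothetical names made up for the illustration, and the broker is modeled as an in-memory object.

```ruby
# Hypothetical in-memory stand-in for the broker's record of deployments.
class FakeBroker
  attr_reader :deployments
  attr_writer :available

  def initialize
    @deployments = []
    @available = true
  end

  # Replaces the stored list with whatever the node sends.
  def update_deployments(list)
    raise IOError, 'broker unavailable' unless @available

    @deployments = list.dup
  end
end

# Hypothetical node-side helper: always sends the full list, never a delta.
# Returns true when the report went through and false when it did not.
def report_deployments(broker, all_deployments)
  broker.update_deployments(all_deployments)
  true
rescue IOError
  false
end

broker = FakeBroker.new
node_deployments = ['dep-1']

broker.available = false
first_report = report_deployments(broker, node_deployments)   # this report is lost

node_deployments << 'dep-2'
broker.available = true
second_report = report_deployments(broker, node_deployments)  # full list is resent

puts "first report ok: #{first_report}, second report ok: #{second_report}"
puts "broker now knows: #{broker.deployments.inspect}"
```

Because the second report carries both entries, the broker ends up with the deployment it missed without any separate retry step.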
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/19e2995306bff7bea037823675f5cf279bafe880 Fix bug 1055653 and improve post-receive output readability
Checked on devenv_4257; the behaviour is the same as in the description, but the output of git push has changed as follows:

remote: Starting PHP 5.4 cartridge (Apache+mod_php)
remote: -------------------------
remote: Git Post-Receive Result: failure
remote: Activation status: failure
remote: Activation failed for the following gears:
remote: 52df7edc0454ff324a000020 (Error activating gear: Connection refused - connect(2))
remote: Deployment completed with status: failure
remote: postreceive failed
To ssh://52df7edc0454ff324a000020.rhcloud.com/~/git/app1s.git/

Move bug to verified.
Paul indicated the behavior should be improved as well as the output. I don't think this fix made it into the devenv. Mine has rubygem-openshift-origin-node-1.19.13 which was prior to the patch being merged, and the problem still exists, with the output unchanged as above. Adam just tagged rubygem-openshift-origin-node-1.19.14-1 which should show up in the next devenv. Should be considered fixed if: 1. A scaled app is still reachable (e.g. by curl) after a deployment, even though that deployment wasn't successfully registered with the broker. 2. When the deployment fails to be registered, the message shows "Failed to report deployment to broker. This will be corrected on the next git push." per https://github.com/openshift/origin-server/commit/19e2995306bff7bea037823675f5cf279bafe880#diff-3dd15e5626c3052b01cc71d63ea06fb5R587
According to comment#4, the issue should still be there.

1. Create a scaled app
2. Stop the httpd service on the broker
3. Do a git push for the app

The output is the same as in my comment#3, and the cartridge status is as follows:

To connect to a service running on OpenShift, use the Local address

Service Local               OpenShift
------- -------------- ---- ----------------
haproxy 127.0.0.1:8082  =>  127.1.244.2:8080
haproxy 127.0.0.1:8083  =>  127.1.244.3:8080
httpd   127.0.0.1:8084  =>  127.1.244.1:8080

Press CTRL-C to terminate port forwarding

[root@ip-10-239-20-157 app2s]#
[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8082
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8083
HTTP/1.0 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@ip-10-239-20-157 app2s]# curl -I 127.0.0.1:8084
HTTP/1.1 200 OK
Date: Thu, 23 Jan 2014 12:31:08 GMT
Server: Apache/2.2.15 (Red Hat)
Connection: close
Content-Type: text/html
Per the RestClient code, it will return the response if the code is 200..206, follow the redirect if the code is 301, 302, 303, or 307, and raise for all other codes. In this case, with the broker stopped, it will raise RestClient::ServiceUnavailable. I think we need to rescue exceptions here: https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/model/application_container.rb#L581-L586
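A rough sketch of what rescuing around the report call could look like, so that a failed report no longer decides the activation status. This is not the code at the link above: `BrokerHttpError` is a hypothetical stand-in for the exception an HTTP client raises on an error response, and `activate_and_report` and the `reporter` callable are invented for the illustration. Only the warning text is taken from this bug.

```ruby
# Hypothetical stand-in for the exception an HTTP client raises on an
# error response (for example a 503 from the broker).
class BrokerHttpError < StandardError; end

REPORT_FAILED_MESSAGE = 'Failed to report deployment to broker. ' \
                        'This will be corrected on the next git push.'

# Hypothetical activation step. `reporter` is any callable that talks to
# the broker; activation itself is assumed to have succeeded already.
def activate_and_report(deployment_id, reporter, output = [])
  output << "Activating deployment #{deployment_id}"
  activation_status = 'success'

  begin
    reporter.call(deployment_id)
  rescue BrokerHttpError
    # Reporting is best effort: note the problem, leave the status alone.
    output << REPORT_FAILED_MESSAGE
  end

  { status: activation_status, output: output }
end

failing_reporter = ->(_id) { raise BrokerHttpError, '503 Service Unavailable' }
result = activate_and_report('dep-1', failing_reporter)

puts result[:output]
puts "Activation status: #{result[:status]}"
```

Note that this sketch only rescues errors raised from an HTTP response; anything else the reporter raises would still propagate.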
If it is relevant, aside from the obvious test case of the broker being unavailable (which could be connection refused, or a 502/503 from the proxy), the "in the wild" reason for hitting this was a 401 from the broker because broker auth was failing. In any of these cases, the app should remain available after a deployment.
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/1fa84300ec27093f0f7f10643f4d46ecd1ba8eec Fix bug 1055653: handle exceptions from RestClient
Still getting the same result as in comment#5 on devenv_4270.
I get the correct result on devenv_4270:

Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 306 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Stopping Mock cartridge
remote: Syncing git content to other proxy gears
remote: Building git ref 'master', commit 8a31ecb
remote: Building Mock cartridge
remote: Mock successfully built.
remote: Preparing build for deployment
remote: Deployment id is bb2f7d9e
remote: Distributing deployment to child gears
remote: Activating deployment
remote: HAProxy already running
remote: HAProxy instance is started
remote: Starting Mock cartridge
remote: Failed to report deployment to broker. This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Distribution status: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e276b42d2888dd5f000020.rhcloud.com/~/git/mock.git/
   b988531..8a31ecb  master -> master
/home/pmorie/code/test_apps
Checked again on devenv_4278, and I get the following result with the rhc-broker service stopped on the broker:

remote: Stopping PHP 5.3 cartridge (Apache+mod_php)
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable SHELL was undefined
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable USER was undefined
remote: [Sat Jan 25 22:34:22 2014] [warn] PassEnv variable LOGNAME was undefined
remote: Waiting for stop to finish
remote: Syncing git content to other proxy gears
remote: Building git ref 'master', commit 6bd4771
remote: Checking deplist.txt for PEAR dependency..
remote: Preparing build for deployment
remote: Deployment id is 33402440
remote: Activating deployment
remote: HAProxy already running
remote: HAProxy instance is started
remote: Starting PHP 5.3 cartridge (Apache+mod_php)
remote: Failed to report deployment to broker. This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e481f241f06985da000011.rhcloud.com/~/git/php53s.git/
   e4e6b1f..6bd4771  master -> master

Both haproxy and the web cartridge return 200 via curl -I. But when httpd is stopped on the broker, the result is still the same as in Comment#5.

@lmeyer Is this what you want? I will mark the bug as verified. If it is not acceptable, you can reopen this.
(In reply to Meng Bo from comment #11)
> @lmeyer Is this what you want? I will mark the bug as verified. If it is
> not acceptable, you can reopen this.

Agreed that the bug appears solved for HTTP errors (which would cover rhc-broker being down, and hopefully 401 auth errors too - can't get this to happen on a devenv so haven't tested). So that's an improvement.

But for the case of httpd being down, i.e. the original test case, it is still the original wrong behavior. I guess a connection refused sparks a different exception; I bet connection timed out and other such network problems would similarly escape the logic here. These are things that real customers run into, and while they need to fix their problems, we should not surprise them needlessly with unavailable apps. No matter *what* happens on the deployment registration, the app should continue to be available.
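The guess about a different exception can be checked with plain Ruby. The sketch below is only an illustration under stated assumptions: `BrokerHttpError` is a hypothetical stand-in for an exception raised from an HTTP error response, and `try_report` is an invented helper. The network exception classes are the standard Ruby ones.

```ruby
require 'socket'
require 'timeout'

# Hypothetical stand-in for an exception raised from an HTTP error response.
class BrokerHttpError < RuntimeError; end

# Runs the block, rescuing only the given exception classes.
# Returns :reported, or :report_failed when a listed exception was rescued.
def try_report(*rescued)
  yield
  :reported
rescue *rescued
  :report_failed
end

network_errors = [Errno::ECONNREFUSED, Errno::ETIMEDOUT, SocketError, Timeout::Error]

# None of the network errors descend from the HTTP-level class...
escapes_narrow = network_errors.none? { |e| e <= BrokerHttpError }
# ...but all of them are StandardErrors.
caught_by_broad = network_errors.all? { |e| e <= StandardError }

# A rescue limited to the HTTP-level class lets "connection refused" through.
narrow = begin
  try_report(BrokerHttpError) { raise Errno::ECONNREFUSED }
rescue Errno::ECONNREFUSED
  :escaped
end

# A rescue of StandardError handles it like any other reporting failure.
broad = try_report(StandardError) { raise Errno::ECONNREFUSED }

puts "narrow rescue: #{narrow}, broad rescue: #{broad}"
```

Rescuing StandardError around the registration call is one way to cover connection refused, timeouts, and name resolution failures in a single place; whether that is the right breadth for the real code is a judgment call for the fix.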
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/2a7ca5491b59bbcbbaa7504cd0c383215b28465a Fix bug 1055653 for cases when httpd is down
Fixed the case where httpd is down.
Checked on devenv_4287: with httpd stopped on the broker, git push now also shows the meaningful warning message.

remote: Failed to report deployment to broker. This will be corrected on the next git push.
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://52e71d8a6abc0c8336000007.rhcloud.com/~/git/app1.git/
   345c371..e4b9f91  master -> master

And all the cartridges are running.

To connect to a service running on OpenShift, use the Local address

Service Local               OpenShift
------- -------------- ---- ----------------
haproxy 127.0.0.1:8082  =>  127.1.244.2:8080
haproxy 127.0.0.1:8083  =>  127.1.244.3:8080
httpd   127.0.0.1:8084  =>  127.1.244.1:8080

Press CTRL-C to terminate port forwarding

[root@domU-12-31-39-0C-58-8E app1]#
[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8082
HTTP/1.1 200 OK
Date: Tue, 28 Jan 2014 03:17:32 GMT
Server: Apache/2.2.15 (Red Hat)
Content-Type: text/html
Set-Cookie: GEAR=local-52e71d8a6abc0c8336000007; path=/
Cache-control: private

[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8083
HTTP/1.0 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/html

[root@domU-12-31-39-0C-58-8E app1]# curl -I 127.0.0.1:8084
HTTP/1.1 200 OK
Date: Tue, 28 Jan 2014 03:17:36 GMT
Server: Apache/2.2.15 (Red Hat)
Connection: close
Content-Type: text/html