Description of problem: OpenShift uses a "rolling" strategy when deploying new code on scalable applications. This means that backend gears are updated in batches of 20% at a time. To update each of these batches we do: 1.- rotate-out the gears to be updated (i.e. remove them from haproxy's configuration) 2.- activate the new code 3.- rotate-in the gears again Based on the observation it can happen that a gear is added back into rotation when it's still not ready to serve application content. It might happens when you have an application that takes some time to startup in EWS cartridge. In step 2 above is the EWS cartridge of the gears that are being updated is restarted (using the new code). A restart is a stop + a start. The control code for the 'start' operation makes sure that the servlet container starts up and starts listening to its port. This is the code that does this: # Check if the server http port is up function ishttpup() { count=0 while [ ${count} -lt 64 ] do if echo 2> /dev/null > "/dev/tcp/${OPENSHIFT_JBOSSEWS_IP}/${OPENSHIFT_JBOSSEWS_HTTP_PORT}"; then echo "Found ${OPENSHIFT_JBOSSEWS_IP}:${OPENSHIFT_JBOSSEWS_HTTP_PORT} listening port" return 0 fi ... called from the start_app function (https://github.com/openshift/origin-server/blob/openshift-origin-release-4/cartridges/openshift-origin-cartridge-jbossews/bin/control#L98) As you can see, this check succeeds as soon as the (tomcat) port is ready - after that the start operation is declared successful and we proceed to step 3 (rotate the gear in again). However, some application(s) might take a while to be ready to serve content and results in 503 responses. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. rhc app create bin jbossews-2.0 -s --no-git 2. rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4 3. rhc app configure -a bin --deployment-type binary 4. rhc deploy New.tar.gz -a bin 1)[WARNING] 282/031520 (19269) : Server express/local-gear is DOWN for maintenance. 2)[WARNING] 282/031539 (19269) : Server express/local-gear is UP (leaving maintenance). 3)[WARNING] 282/031540 (19269) : Server express/gear-replica-l91lwz3istg-2-replica is DOWN for maintenance. 4)[WARNING] 282/031550 (19269) : Server express/local-gear is DOWN, reason: Layer7 timeout, check duration: 10030ms. 0 active Actual results: The gear is marked up though the application is not ready to serve the request(rolled in) and the other up gears are made down or rolled out. Expected results: Before rolling out other gears it should check if the gear is ready to serve the request. Additional info: It would be helpful if the rolling update logic and haproxy check are in sync or tied in EWS.
Verify this bug with openshift-origin-cartridge-jbossews-1.35.3.2-1.el6op.noarch Steps: 1. Create a scaleble ews app rhc app create bin jbossews-2.0 -s --no-git 2. Scale-up this app rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4 3. Configure this app as binary deployment type rhc app configure -a bin --deployment-type binary 4. Run the binary deployment [root@broker ~]# rhc deploy new.tar.gz -a bin Deployment of file '/root/new.tar.gz' in progress for application bin ... Starting deploy for binary artifact Stopping gear Stopping jbossews cartridge Creating new deployment directory Preparing deployment Preparing build for deployment Deployment id is f9492a36 Distributing deployment Distributing deployment to child gears Distribution status: success Activating deployment HAProxy already running CLIENT_RESULT: HAProxy instance is started Starting jbossews cartridge Found 127.3.192.1:8080 listening port Activation status: success Deployment status: success Success [root@broker ~]# rhc app show bin --gears ID State Cartridges Size SSH URL --------- ------- ------------------------ ----- ----------------------------------------- yes-bin-1 started haproxy-1.4 jbossews-2.0 small yes-bin-1.com.cn yes-bin-2 started haproxy-1.4 jbossews-2.0 small yes-bin-2.com.cn In app-root/logs/haproxy.log: [WARNING] 322/032732 (32529) : Proxy stats stopped (FE: 1 conns, BE: 0 conns). [WARNING] 322/032732 (32529) : Proxy express stopped (FE: 0 conns, BE: 0 conns). [WARNING] 322/034851 (1544) : Server express/local-gear is DOWN for maintenance. [WARNING] 322/034911 (1544) : Server express/local-gear is UP (leaving maintenance). [WARNING] 322/034911 (1544) : Server express/gear-yes-bin-2-yes is DOWN for maintenance. [WARNING] 322/034913 (1544) : Server express/local-gear is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. [ALERT] 322/034913 (1544) : proxy 'express' has no server available! [WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is UP (leaving maintenance). [WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 2ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. [ALERT] 322/034929 (1544) : proxy 'express' has no server available! The app couldn't be accessed. [root@broker ~]# curl -I http://bin-yes.ose22-auto.com.cn/ HTTP/1.1 503 Service Unavailable Date: Thu, 19 Nov 2015 08:54:28 GMT Cache-Control: no-cache Content-Type: text/html; charset=UTF-8 Connection: close
QA: I apologize for the broken reproducer. Please follow the below steps to verify this bug is resolved: 1. Create a jbossews-2.0 scalable application: rhc app create jbtest jbossews-2.0 -s 2. Set a new environment variable on the application to extend the amount of tries jbossews will attempt to wait for a deployment to complete before starting. rhc set-env OPENSHIFT_JBOSSEWS_START_TRIES=30 jbtest 3. Scale the cartridge up to at least two: rhc cartridge scale jbossews-2.0 -a jbtest --min 2 --max 4 4. In the application's git repository, add the attached 'SlowStart.java' file to the './src/main/java/' directory mv SlowStart.java jbtest/src/main/java/ 5. Make the following changes to the src/main/webapp/WEB-INF/web.xml file: diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml index fa91269..84f53f9 100755 --- a/src/main/webapp/WEB-INF/web.xml +++ b/src/main/webapp/WEB-INF/web.xml @@ -6,6 +6,15 @@ xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd" metadata-complete="false"> + <servlet> + <servlet-name>SlowStart</servlet-name> + <servlet-class>SlowStart</servlet-class> + </servlet> + + <servlet-mapping> + <servlet-name>SlowStart</servlet-name> + <url-pattern>/SlowStart</url-pattern> + <load-on-startup>1</load-on-startup> + </servlet-mapping> </web-app> 6. Be ready to watch the haproxy.log on the head gear and ensure that, when the changes are pushed, no gear is started and then taken back down due to a 503 or 404. The gears should not be made available until they are done actually starting. This process will take ~30 seconds. 7. Git add, commit, and push the changes to the application. Watch the haproxy.log for the failures.
Created attachment 1097883 [details] Reproducer Class
There was an error in the steps provided. In step #5, the diff is incorrect. The url-pattern sepcified is '/SlowStart', but should just be '/'. Below is the correct diff: diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml index fa91269..84f53f9 100755 --- a/src/main/webapp/WEB-INF/web.xml +++ b/src/main/webapp/WEB-INF/web.xml @@ -6,6 +6,15 @@ xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd" metadata-complete="false"> + <servlet> + <servlet-name>SlowStart</servlet-name> + <servlet-class>SlowStart</servlet-class> + </servlet> + + <servlet-mapping> + <servlet-name>SlowStart</servlet-name> + <url-pattern>/</url-pattern> + <load-on-startup>1</load-on-startup> + </servlet-mapping> </web-app>
Test this bug by following the steps in Comment 14, the gears are back to UP finally. And the app is available after the git push. ==> app-root/logs/haproxy.log <== [WARNING] 327/020259 (11151) : Server express/local-gear is DOWN for maintenance. [WARNING] 327/020700 (11151) : Server express/local-gear is UP (leaving maintenance). [WARNING] 327/020700 (11151) : Server express/gear-yes-jbtest-2-yes is DOWN for maintenance. [WARNING] 327/020824 (11151) : Server express/gear-yes-jbtest-2-yes is UP (leaving maintenance).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2666.html