Bug 1270660
| Field | Value |
|---|---|
| Summary | HAProxy health check should be in sync with rolling updates in EWS |
| Product | OpenShift Container Platform |
| Component | ImageStreams |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Version | 2.2.0 |
| Reporter | Jaspreet Kaur <jkaur> |
| Assignee | Timothy Williams <tiwillia> |
| QA Contact | DeShuai Ma <dma> |
| CC | adellape, aos-bugs, bperkins, erich, gpei, jokerman, misalunk, mmccomas, nicholas_schuetz, pep, tiwillia |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openshift-origin-cartridge-jbossews-1.35.3.2-1.el6op |
| Doc Type | Enhancement |
| Type | Bug |
| Clones | 1285084 (view as bug list) |
| Bug Blocks | 1273542, 1285084 |
| Last Closed | 2015-12-17 17:11:00 UTC |

Doc Text:
Previously, the EWS cartridge started an application's gear and immediately marked the application as "up" in HAProxy. If the application took some time to deploy, this caused an outage window while traffic was routed to it; the situation was corrected either when HAProxy ran its next health check and disabled the application, or when the deployment finished. As a result, EWS cartridge deployments could be marked "up" and routable when they were not yet ready to service requests. This bug fix introduces the OPENSHIFT_JBOSSEWS_START_DELAY environment variable, which lets application owners delay the registration of their deployment with HAProxy. Set this variable to a delay (in seconds) to make deployments pause after a gear start, in a similar manner to the EAP cartridge. The difference between the EWS and EAP cartridges is that Tomcat implements the pause as a sleep (or hang), because it has no management interface (https://access.redhat.com/solutions/901043) to interact with, unlike EAP, where the cartridge can check that deployments have finished. Using OPENSHIFT_JBOSSEWS_START_DELAY can make application deployments take longer, but it avoids outages during new deployments.
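As a rough illustration of the behavior the Doc Text describes, the start-side delay could look like the sketch below. This is not the shipped cartridge code; the `start_delay` function name is invented here, and only the OPENSHIFT_JBOSSEWS_START_DELAY variable name comes from the bug. Because Tomcat exposes no management interface to poll, the cartridge can only sleep for a fixed number of seconds after the gear starts, before control returns and the gear is rotated back into HAProxy.

```shell
# Sketch only (not the actual cartridge source): pause after a gear start
# for OPENSHIFT_JBOSSEWS_START_DELAY seconds before HAProxy registration.
start_delay() {
    local delay="${OPENSHIFT_JBOSSEWS_START_DELAY:-0}"
    case "${delay}" in
        ''|*[!0-9]*) delay=0 ;;  # ignore unset or non-numeric values
    esac
    if [ "${delay}" -gt 0 ]; then
        echo "Delaying HAProxy registration for ${delay}s"
        sleep "${delay}"
    fi
}
```

With the variable unset or non-numeric the function is a no-op, which preserves the old (immediate-registration) behavior by default.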
Verify this bug with openshift-origin-cartridge-jbossews-1.35.3.2-1.el6op.noarch

Steps:
1. Create a scalable ews app:
rhc app create bin jbossews-2.0 -s --no-git
2. Scale up this app:
rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4
3. Configure this app for the binary deployment type:
rhc app configure -a bin --deployment-type binary
4. Run the binary deployment:

[root@broker ~]# rhc deploy new.tar.gz -a bin
Deployment of file '/root/new.tar.gz' in progress for application bin ...
Starting deploy for binary artifact
Stopping gear
Stopping jbossews cartridge
Creating new deployment directory
Preparing deployment
Preparing build for deployment
Deployment id is f9492a36
Distributing deployment
Distributing deployment to child gears
Distribution status: success
Activating deployment
HAProxy already running
CLIENT_RESULT: HAProxy instance is started
Starting jbossews cartridge
Found 127.3.192.1:8080 listening port
Activation status: success
Deployment status: success
Success

[root@broker ~]# rhc app show bin --gears
ID        State   Cartridges               Size  SSH URL
--------- ------- ------------------------ ----- ----------------
yes-bin-1 started haproxy-1.4 jbossews-2.0 small yes-bin-1.com.cn
yes-bin-2 started haproxy-1.4 jbossews-2.0 small yes-bin-2.com.cn

In app-root/logs/haproxy.log:
[WARNING] 322/032732 (32529) : Proxy stats stopped (FE: 1 conns, BE: 0 conns).
[WARNING] 322/032732 (32529) : Proxy express stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 322/034851 (1544) : Server express/local-gear is DOWN for maintenance.
[WARNING] 322/034911 (1544) : Server express/local-gear is UP (leaving maintenance).
[WARNING] 322/034911 (1544) : Server express/gear-yes-bin-2-yes is DOWN for maintenance.
[WARNING] 322/034913 (1544) : Server express/local-gear is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 322/034913 (1544) : proxy 'express' has no server available!
[WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is UP (leaving maintenance).
[WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 2ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 322/034929 (1544) : proxy 'express' has no server available!

The app couldn't be accessed:

[root@broker ~]# curl -I http://bin-yes.ose22-auto.com.cn/
HTTP/1.1 503 Service Unavailable
Date: Thu, 19 Nov 2015 08:54:28 GMT
Cache-Control: no-cache
Content-Type: text/html; charset=UTF-8
Connection: close

QA:
I apologize for the broken reproducer.
Please follow the below steps to verify this bug is resolved:
1. Create a jbossews-2.0 scalable application:
rhc app create jbtest jbossews-2.0 -s
2. Set a new environment variable on the application to extend the number of times jbossews will check whether a deployment has completed before starting:
rhc set-env OPENSHIFT_JBOSSEWS_START_TRIES=30 jbtest
3. Scale the cartridge up to at least two:
rhc cartridge scale jbossews-2.0 -a jbtest --min 2 --max 4
4. In the application's git repository, add the attached 'SlowStart.java' file to the './src/main/java/' directory
mv SlowStart.java jbtest/src/main/java/
5. Make the following changes to the src/main/webapp/WEB-INF/web.xml file:
diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml
index fa91269..84f53f9 100755
--- a/src/main/webapp/WEB-INF/web.xml
+++ b/src/main/webapp/WEB-INF/web.xml
@@ -6,6 +6,15 @@
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
metadata-complete="false">
+ <servlet>
+ <servlet-name>SlowStart</servlet-name>
+ <servlet-class>SlowStart</servlet-class>
+ <load-on-startup>1</load-on-startup>
+ </servlet>
+
+ <servlet-mapping>
+ <servlet-name>SlowStart</servlet-name>
+ <url-pattern>/SlowStart</url-pattern>
+ </servlet-mapping>
</web-app>
6. Be ready to watch the haproxy.log on the head gear and ensure that, when the changes are pushed, no gear is started and then taken back down due to a 503 or 404. The gears should not be made available until they are done actually starting. This process will take ~30 seconds.
7. Git add, commit, and push the changes to the application. Watch the haproxy.log for the failures.
Created attachment 1097883 [details]
Reproducer Class
There was an error in the steps provided. In step #5, the diff is incorrect: the url-pattern specified is '/SlowStart', but it should just be '/'. Below is the corrected diff:
diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml
index fa91269..84f53f9 100755
--- a/src/main/webapp/WEB-INF/web.xml
+++ b/src/main/webapp/WEB-INF/web.xml
@@ -6,6 +6,15 @@
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
metadata-complete="false">
+ <servlet>
+ <servlet-name>SlowStart</servlet-name>
+ <servlet-class>SlowStart</servlet-class>
+ <load-on-startup>1</load-on-startup>
+ </servlet>
+
+ <servlet-mapping>
+ <servlet-name>SlowStart</servlet-name>
+ <url-pattern>/</url-pattern>
+ </servlet-mapping>
</web-app>
Tested this bug by following the steps in Comment 14; the gears eventually come back UP, and the app is available after the git push.

==> app-root/logs/haproxy.log <==
[WARNING] 327/020259 (11151) : Server express/local-gear is DOWN for maintenance.
[WARNING] 327/020700 (11151) : Server express/local-gear is UP (leaving maintenance).
[WARNING] 327/020700 (11151) : Server express/gear-yes-jbtest-2-yes is DOWN for maintenance.
[WARNING] 327/020824 (11151) : Server express/gear-yes-jbtest-2-yes is UP (leaving maintenance).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2666.html
Description of problem:
OpenShift uses a "rolling" strategy when deploying new code on scalable applications: backend gears are updated in batches of 20% at a time. To update each batch, it does the following:
1. Rotate out the gears to be updated (i.e. remove them from HAProxy's configuration)
2. Activate the new code
3. Rotate the gears back in

In practice, a gear can be added back into rotation while it is still not ready to serve application content. This can happen with an application that takes some time to start up in the EWS cartridge. In step 2 above, the EWS cartridge of the gears being updated is restarted (using the new code). A restart is a stop plus a start, and the control code for the 'start' operation only makes sure that the servlet container starts up and begins listening on its port. This is the code that does this:

# Check if the server http port is up
function ishttpup() {
    count=0
    while [ ${count} -lt 64 ]
    do
        if echo 2> /dev/null > "/dev/tcp/${OPENSHIFT_JBOSSEWS_IP}/${OPENSHIFT_JBOSSEWS_HTTP_PORT}"; then
            echo "Found ${OPENSHIFT_JBOSSEWS_IP}:${OPENSHIFT_JBOSSEWS_HTTP_PORT} listening port"
            return 0
        fi
        ...

It is called from the start_app function (https://github.com/openshift/origin-server/blob/openshift-origin-release-4/cartridges/openshift-origin-cartridge-jbossews/bin/control#L98). As you can see, the check succeeds as soon as the (Tomcat) port is open; after that, the start operation is declared successful and we proceed to step 3 (rotate the gear back in). However, some applications take a while longer to be ready to serve content, which results in 503 responses.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. rhc app create bin jbossews-2.0 -s --no-git
2. rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4
3. rhc app configure -a bin --deployment-type binary
4. rhc deploy New.tar.gz -a bin

1) [WARNING] 282/031520 (19269) : Server express/local-gear is DOWN for maintenance.
2) [WARNING] 282/031539 (19269) : Server express/local-gear is UP (leaving maintenance).
3) [WARNING] 282/031540 (19269) : Server express/gear-replica-l91lwz3istg-2-replica is DOWN for maintenance.
4) [WARNING] 282/031550 (19269) : Server express/local-gear is DOWN, reason: Layer7 timeout, check duration: 10030ms. 0 active

Actual results:
The gear is marked up (rolled in) even though the application is not yet ready to serve requests, while the other up gears are taken down or rolled out.

Expected results:
Before rolling out the other gears, the update should check that the gear is ready to serve requests.

Additional info:
It would be helpful if the rolling-update logic and the HAProxy health check were in sync, or tied together, in EWS.
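The expected results above amount to probing actual HTTP readiness rather than only an open TCP port. A minimal sketch of such a probe is shown below. This is hypothetical, not the cartridge's code: the `ishttpready` name and the curl-based check are invented here, swapping the /dev/tcp port test in `ishttpup` for a poll on the HTTP root until it returns 200.

```shell
# Hypothetical readiness probe: instead of only testing that the TCP port is
# open (as ishttpup does), poll the HTTP root until it returns 200, so the
# gear would only be rotated back into HAProxy once it can serve requests.
ishttpready() {
    local host="$1" port="$2" tries="${3:-64}"
    local count=0 status
    while [ "${count}" -lt "${tries}" ]; do
        # -s: quiet, -o /dev/null: discard body, -w: print only the status code
        status=$(curl -s -o /dev/null -w '%{http_code}' "http://${host}:${port}/")
        if [ "${status}" = "200" ]; then
            echo "Found ${host}:${port} serving HTTP 200"
            return 0
        fi
        count=$((count + 1))
        sleep 1
    done
    return 1
}
```

A check like this would keep step 3 of the rolling update (rotate the gear back in) from running until the application, not just Tomcat's listener, is up, which is the same outcome the OPENSHIFT_JBOSSEWS_START_DELAY fix approximates with a fixed sleep.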