Bug 1270660

Summary: Haproxy health check should be in sync with rolling updates in EWS
Product: OpenShift Container Platform Reporter: Jaspreet Kaur <jkaur>
Component: ImageStreams    Assignee: Timothy Williams <tiwillia>
Status: CLOSED ERRATA QA Contact: DeShuai Ma <dma>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 2.2.0    CC: adellape, aos-bugs, bperkins, erich, gpei, jokerman, misalunk, mmccomas, nicholas_schuetz, pep, tiwillia
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-origin-cartridge-jbossews-1.35.3.2-1.el6op Doc Type: Enhancement
Doc Text:
Previously, the EWS cartridge started an application's gear and marked the application as "up" in HAProxy. If the application took some time to deploy, this caused an outage window as traffic was routed to the application. The situation was corrected either when HAProxy executed its next health check and disabled the application, or when the deployment finished. As a result, it was possible for EWS cartridge deployments to be marked "up" and routable when in fact they were not yet ready to service requests. This bug fix introduces the OPENSHIFT_JBOSSEWS_START_DELAY EWS environment variable to allow application owners to delay the registration of their deployments with HAProxy. Set this variable to a delay (in seconds), which causes deployments to halt after a gear start in a similar manner to the EAP cartridge. The difference between the EWS and EAP cartridges is that Tomcat implements a sleep (or hang) because it does not have a management interface (https://access.redhat.com/solutions/901043) to interact with, as EAP does, to check that deployments have finished. Using the OPENSHIFT_JBOSSEWS_START_DELAY variable can make application deployments take longer, but it can be used to avoid outages with new deployments.
Story Points: ---
Clone Of:
: 1285084 (view as bug list) Environment:
Last Closed: 2015-12-17 17:11:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1273542, 1285084    
Attachments:
Description Flags
Reproducer Class none

Description Jaspreet Kaur 2015-10-12 05:14:24 UTC
Description of problem:


 OpenShift uses a "rolling" strategy when deploying new code on scalable applications. This means that backend gears are updated in batches of 20% at a time. To update each of these batches we do:

  1.- rotate-out the gears to be updated (i.e. remove them from haproxy's configuration)
  2.- activate the new code
  3.- rotate-in the gears again
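The batching described above can be sketched in plain bash. This is a hypothetical illustration of the 20% rolling strategy, not the actual broker code; the gear names and the echoed rotate/activate steps are stand-ins for the real haproxy reconfiguration and activation calls:

```shell
# Hypothetical sketch of the rolling strategy: gears are updated in
# batches of 20% at a time (at least one gear per batch).
gears=(gear-1 gear-2 gear-3 gear-4 gear-5)
total=${#gears[@]}
batch_size=$(( (total + 4) / 5 ))   # ceil(20% of total), minimum 1
batch_count=0

for ((i = 0; i < total; i += batch_size)); do
    batch=("${gears[@]:i:batch_size}")
    echo "rotate-out: ${batch[*]}"   # 1. remove from haproxy's configuration
    echo "activate:   ${batch[*]}"   # 2. activate the new code
    echo "rotate-in:  ${batch[*]}"   # 3. add back into rotation
    batch_count=$((batch_count + 1))
done
```

With five gears this yields five batches of one gear each; the outage window described below opens between the "activate" and "rotate-in" steps of each batch.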

In practice, a gear can be added back into rotation while it is still not ready to serve application content.

This can happen when an application running on the EWS cartridge takes some time to start up.

In step 2 above, the EWS cartridge of the gears being updated is restarted (using the new code). A restart is a stop plus a start. The control code for the 'start' operation makes sure that the servlet container starts up and begins listening on its port. This is the code that does this:

  # Check if the server http port is up
  function ishttpup() {
      count=0
      while [ ${count} -lt 64 ]
      do
	  if echo 2> /dev/null > "/dev/tcp/${OPENSHIFT_JBOSSEWS_IP}/${OPENSHIFT_JBOSSEWS_HTTP_PORT}"; then
	      echo "Found ${OPENSHIFT_JBOSSEWS_IP}:${OPENSHIFT_JBOSSEWS_HTTP_PORT} listening port"
	      return 0
	  fi
      ...

called from the start_app function (https://github.com/openshift/origin-server/blob/openshift-origin-release-4/cartridges/openshift-origin-cartridge-jbossews/bin/control#L98)

As you can see, this check succeeds as soon as the (Tomcat) port is ready; after that, the start operation is declared successful and we proceed to step 3 (rotate the gear in again). However, some applications may take a while longer to be ready to serve content, which results in 503 responses.
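To make the gap concrete: checking that the port accepts connections is not the same as checking that the application answers requests. A hedged sketch of a stricter readiness check is below; the function name `isappready` is hypothetical and not part of the cartridge, but it follows the same polling style as the `ishttpup` function above, polling until the application actually returns HTTP 200 instead of stopping at the first open port:

```shell
# Hypothetical readiness check: unlike ishttpup, which only tests that
# the TCP port accepts connections, this polls until the application
# returns an HTTP 200 response.
function isappready() {
    local url=$1
    local tries=${2:-30}
    local count=0
    while [ "${count}" -lt "${tries}" ]; do
        # -w '%{http_code}' prints the response status; 000 on failure
        status=$(curl -s -o /dev/null -w '%{http_code}' "${url}")
        if [ "${status}" = "200" ]; then
            echo "Application at ${url} is ready"
            return 0
        fi
        count=$((count + 1))
        sleep 2
    done
    return 1
}
```

A check of this shape would keep the gear out of rotation until the deployed content is actually servable, which is what the rolling-update logic assumes when it rotates the gear back in.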


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. rhc app create bin jbossews-2.0 -s --no-git
2. rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4
3. rhc app configure -a bin --deployment-type binary
4. rhc deploy New.tar.gz -a bin


 1)[WARNING] 282/031520 (19269) : Server express/local-gear is DOWN for maintenance. 
2)[WARNING] 282/031539 (19269) : Server express/local-gear is UP (leaving maintenance). 
3)[WARNING] 282/031540 (19269) : Server express/gear-replica-l91lwz3istg-2-replica is DOWN for maintenance.
4)[WARNING] 282/031550 (19269) : Server express/local-gear is DOWN, reason: Layer7 timeout, check duration: 10030ms. 0 active 

Actual results:

The gear is marked up (rolled in) even though the application is not yet ready to serve requests, while the other gears that were up are taken down (rolled out).


Expected results:

Before rolling out the other gears, the deployment should check whether the gear is ready to serve requests.


Additional info:


It would be helpful if the rolling update logic and the HAProxy health check were kept in sync, or tied together, for EWS.

Comment 11 Gaoyun Pei 2015-11-19 09:03:20 UTC
Verify this bug with openshift-origin-cartridge-jbossews-1.35.3.2-1.el6op.noarch

Steps:
1. Create a scalable EWS app
rhc app create bin jbossews-2.0 -s --no-git

2. Scale-up this app
rhc cartridge scale jbossews-2.0 -a bin --min 2 --max 4

3. Configure this app as binary deployment type
rhc app configure -a bin --deployment-type binary

4. Run the binary deployment
[root@broker ~]# rhc deploy new.tar.gz -a bin
Deployment of file '/root/new.tar.gz' in progress for application bin ...
Starting deploy for binary artifact
Stopping gear
Stopping jbossews cartridge
Creating new deployment directory
Preparing deployment
Preparing build for deployment
Deployment id is f9492a36
Distributing deployment
Distributing deployment to child gears
Distribution status: success
Activating deployment
HAProxy already running
CLIENT_RESULT: HAProxy instance is started
Starting jbossews cartridge
Found 127.3.192.1:8080 listening port
Activation status: success
Deployment status: success
Success

[root@broker ~]# rhc app show bin --gears
ID        State   Cartridges               Size  SSH URL
--------- ------- ------------------------ ----- -----------------------------------------
yes-bin-1 started haproxy-1.4 jbossews-2.0 small yes-bin-1.com.cn
yes-bin-2 started haproxy-1.4 jbossews-2.0 small yes-bin-2.com.cn


In app-root/logs/haproxy.log:
[WARNING] 322/032732 (32529) : Proxy stats stopped (FE: 1 conns, BE: 0 conns).
[WARNING] 322/032732 (32529) : Proxy express stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 322/034851 (1544) : Server express/local-gear is DOWN for maintenance.
[WARNING] 322/034911 (1544) : Server express/local-gear is UP (leaving maintenance).
[WARNING] 322/034911 (1544) : Server express/gear-yes-bin-2-yes is DOWN for maintenance.
[WARNING] 322/034913 (1544) : Server express/local-gear is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 322/034913 (1544) : proxy 'express' has no server available!
[WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is UP (leaving maintenance).
[WARNING] 322/034929 (1544) : Server express/gear-yes-bin-2-yes is DOWN, reason: Layer7 wrong status, code: 404, info: "HTTP status check returned code <3C>404<3E>", check duration: 2ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 322/034929 (1544) : proxy 'express' has no server available!

The app couldn't be accessed.
[root@broker ~]# curl -I http://bin-yes.ose22-auto.com.cn/
HTTP/1.1 503 Service Unavailable
Date: Thu, 19 Nov 2015 08:54:28 GMT
Cache-Control: no-cache
Content-Type: text/html; charset=UTF-8
Connection: close

Comment 14 Timothy Williams 2015-11-23 22:34:52 UTC
QA:

I apologize for the broken reproducer.

Please follow the below steps to verify this bug is resolved:

1. Create a jbossews-2.0 scalable application:
     rhc app create jbtest jbossews-2.0 -s

2. Set a new environment variable on the application to extend the number of attempts jbossews makes while waiting for a deployment to complete before starting:

    rhc set-env OPENSHIFT_JBOSSEWS_START_TRIES=30 jbtest

3. Scale the cartridge up to at least two:
     rhc cartridge scale jbossews-2.0 -a jbtest --min 2 --max 4

4. In the application's git repository, add the attached 'SlowStart.java' file to the './src/main/java/' directory
     mv SlowStart.java jbtest/src/main/java/

5. Make the following changes to the src/main/webapp/WEB-INF/web.xml file:
diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml
index fa91269..84f53f9 100755
--- a/src/main/webapp/WEB-INF/web.xml
+++ b/src/main/webapp/WEB-INF/web.xml
@@ -6,6 +6,15 @@
          xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
          metadata-complete="false">
 
+         <servlet>
+             <servlet-name>SlowStart</servlet-name>
+             <servlet-class>SlowStart</servlet-class>
+         </servlet>
+ 
+         <servlet-mapping>
+             <servlet-name>SlowStart</servlet-name>
+             <url-pattern>/SlowStart</url-pattern>
+             <load-on-startup>1</load-on-startup>
+         </servlet-mapping>          
 </web-app>

6. Be ready to watch the haproxy.log on the head gear and ensure that, when the changes are pushed, no gear is started and then taken back down due to a 503 or 404. The gears should not be made available until they have actually finished starting. This process will take ~30 seconds.

7. Git add, commit, and push the changes to the application. Watch the haproxy.log for the failures.
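The log inspection in steps 6 and 7 can be scripted against the failure signature visible in Comment 11's haproxy.log (a gear going DOWN with a "Layer7 wrong status" or "Layer7 timeout" reason right after rotate-in). The helper below is a hypothetical sketch, not part of the cartridge; it assumes the log has already been copied from the head gear's app-root/logs/haproxy.log:

```shell
# Hypothetical verification helper: fail if the haproxy log shows a gear
# going DOWN with a Layer7 health-check failure (the premature rotate-in
# signature from Comment 11), pass otherwise (as in Comment 17's log).
check_log() {
    local logfile=$1
    if grep -qE 'Layer7 (wrong status|timeout)' "${logfile}"; then
        echo "FAIL: gear went DOWN after rotate-in (premature registration)"
        return 1
    fi
    echo "PASS: no premature rotate-in detected"
}
```

Run against the log captured during the git push; a clean run should show only the maintenance DOWN/UP pairs, as in Comment 17.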

Comment 15 Timothy Williams 2015-11-23 22:35:35 UTC
Created attachment 1097883 [details]
Reproducer Class

Comment 16 Timothy Williams 2015-11-23 22:37:13 UTC
There was an error in the steps provided. In step #5, the diff is incorrect. The url-pattern specified is '/SlowStart', but it should just be '/'. Below is the correct diff:

diff --git a/src/main/webapp/WEB-INF/web.xml b/src/main/webapp/WEB-INF/web.xml
index fa91269..84f53f9 100755
--- a/src/main/webapp/WEB-INF/web.xml
+++ b/src/main/webapp/WEB-INF/web.xml
@@ -6,6 +6,15 @@
          xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
          metadata-complete="false">
 
+         <servlet>
+             <servlet-name>SlowStart</servlet-name>
+             <servlet-class>SlowStart</servlet-class>
+         </servlet>
+ 
+         <servlet-mapping>
+             <servlet-name>SlowStart</servlet-name>
+             <url-pattern>/</url-pattern>
+             <load-on-startup>1</load-on-startup>
+         </servlet-mapping>          
 </web-app>

Comment 17 Gaoyun Pei 2015-11-24 07:11:46 UTC
Tested this bug by following the steps in Comment 14; the gears come back UP in the end, and the app is available after the git push.

==> app-root/logs/haproxy.log <==
[WARNING] 327/020259 (11151) : Server express/local-gear is DOWN for maintenance.
[WARNING] 327/020700 (11151) : Server express/local-gear is UP (leaving maintenance).
[WARNING] 327/020700 (11151) : Server express/gear-yes-jbtest-2-yes is DOWN for maintenance.
[WARNING] 327/020824 (11151) : Server express/gear-yes-jbtest-2-yes is UP (leaving maintenance).

Comment 19 errata-xmlrpc 2015-12-17 17:11:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2666.html