Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 899173 (JBEWS-240)

Summary: Failover triggers 503 Service Temporarily Unavailable
Product: [JBoss] JBoss Enterprise Web Server 1
Reporter: Radoslav Husar <rhusar>
Component: unspecified
Assignee: Permaine Cheung <pcheung>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Version: unspecified
CC: jfclere, misty, pcheung, pslavice, rajesh.rajasekaran
Target Milestone: ---
Target Release: EWS 1.0.2
Hardware: Unspecified
OS: Unspecified
URL: http://jira.jboss.org/jira/browse/JBEWS-240
Doc Type: Bug Fix
Environment: perf lab (and all other platforms as well)
Last Closed: 2011-05-17 11:25:14 UTC
Type: Bug
Attachments:
  error_log.zip
  patch.txt

Description Radoslav Husar 2010-02-01 14:01:37 UTC
Affects: Documentation (Ref Guide, User Guide, etc.)
Date of First Response: 2010-02-01 17:15:20
project_key: JBEWS

Simple scenario: mod_cluster and 4 JBoss nodes are deployed. When a JBoss server fails (JVM is forcibly killed), about 0.5% of sessions result in 503.

These edge scenarios are exactly why people deploy mod_cluster, so it should be our priority to provide 100% session correctness. We should target all of these QoS problems for the production release, since mod_jk handles these cases.

Same question as to JBPAPP-3463: is there a known workaround we could provide to customers?

The client receives the stock Apache error page:

<title>503 Service Temporarily Unavailable</title>
</head><body>
<h1>Service Temporarily Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<hr>
<address>Apache/2.2.10 (Red Hat) Server at perf08 Port 8888</address>

Test results:
http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/7/

perf01-perf04 = JBoss nodes
perf08 = mod_cluster + load driver node

Comment 1 Paul Ferraro 2010-02-01 22:15:20 UTC
If the JVM is forcibly killed, all requests accepted by that node *should* result in a 50x error.  It would be incorrect for these requests to failover, since it may result in duplicated processing.
In the case of a JVM crash, the best we can do is to provide failover for subsequent requests.

Comment 2 Radoslav Husar 2010-02-02 01:22:09 UTC
So a problem would emerge if (e.g.) all the transactions on the server committed successfully and the server crashed while/before sending the response to the client. To avoid duplicate processing we don't fail over. Valid point.

Practically speaking, aren't most users who get such an error simply going to refresh? (The RFC says this is a temporary problem and suggests the user retry after a while.) Couldn't it be that apps which are sensitive to double processing have this covered (i.e. the request already carries a 'transaction id', so the server would see it as completed if it tried to process it twice)?

Aren't most websites just going to want HA? Shouldn't there be an option to fail over anyway?


Comment 3 Jean-Frederic Clere 2010-02-08 08:41:59 UTC
503 (HTTP_SERVICE_UNAVAILABLE) causes a retry if maxAttempts is greater than one.
See http://www.jboss.org/mod_cluster/java/properties.html, maxAttempts (the default is one: no retry).
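For comparison, plain mod_proxy_balancer exposes an analogous knob as the `maxattempts` balancer parameter. The fragment below is illustrative only (balancer name, paths, and member hosts are made up); mod_cluster's own maxAttempts is documented on the page linked above:

```apache
# Illustrative mod_proxy_balancer analogue of maxAttempts:
# allow one failover attempt after the first failure.
ProxyPass /app balancer://mycluster/ maxattempts=2
<Proxy balancer://mycluster>
    BalancerMember ajp://perf01:8009
    BalancerMember ajp://perf02:8009
</Proxy>
```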

Comment 4 Radoslav Husar 2010-02-10 15:20:54 UTC
Thanks for the hint Jean-Frederic, but I am still not happy :-)

Tested with maxAttempts set to 2; several sessions still return 503 to the client. This should not be happening (see test run [1]).

My speculation is that the 503s were retried, but the 500s (from JBPAPP-3615) were not, and these now resulted in 503 on the 2nd attempt; see the code:

        else if (access_status == HTTP_INTERNAL_SERVER_ERROR) {
            /* Unrecoverable server error.
             * We can not failover to another worker.
             * Mark the worker as unusable if member of load balancer
             */
            if (balancer) {
                worker->s->status |= PROXY_WORKER_IN_ERROR;
                worker->s->error_time = apr_time_now();
            }
            break;
        }

Or the problem is in why the 500s were returned on the 1st attempt (and therefore were not retried now).

[1] http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/8/console-perf08/

Comment 5 Jean-Frederic Clere 2010-02-10 15:34:10 UTC
Actually mod_proxy (and mod_cluster is based on it) doesn't allow retrying on a status returned by the worker.
Note that access_status is not r->status.
Probably the 500 and 503 are in fact returned by the AS, so this is JBPAPP-3463 and not JBPAPP-3614; please check in the error_log or access_log.

Comment 6 Radoslav Husar 2010-02-10 15:37:45 UTC
The server was forcefully killed (-9) in this case, so it cannot have returned anything; that can't be the case. Maybe we are onto another problem, which is what I meant by JBPAPP-3615: it should return 503, not 500, in my opinion.

Comment 7 Jean-Frederic Clere 2010-02-10 16:00:01 UTC
"It cannot have returned anything": how can you be so sure that no response was truncated, or that a request wasn't processed with the server killed before sending the response?

Please provide an error_log (at debug level, ideally) and I am sure I will prove that those errors can't be retried (HTTP can't tell whether the request has been processed or not).

Comment 8 Radoslav Husar 2010-02-11 13:15:17 UTC
You are right, I cannot be sure there wasn't a partial response. Those are the problem here, right, even with maxAttempts. I am attaching about the first 25% of the log; it's very verbose :-) Can you please grep through it and find the evidence for that?

So is there another option to allow retrying even on these responses?

Comment 9 Radoslav Husar 2010-02-11 13:15:18 UTC
Attachment: Added: error_log.zip


Comment 10 Radoslav Husar 2010-02-11 13:23:15 UTC
Unfortunately the first part doesn't include such a request. Two of them (503s) happen at these times:

07:55:23:944 
07:58:20:494

Here are the full logs:

http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/error_log.zip

Comment 11 Jean-Frederic Clere 2010-02-11 14:07:05 UTC
According to the error_log, at least 14 requests shouldn't be retried, so you should have at least 14 errors.
How many errors do you get?

Comment 12 Radoslav Husar 2010-02-11 14:12:55 UTC
Strange. Only 4. The test was running from 07:48:47:246 to 08:01:18:625; maybe the 14 include the test tear-down?

http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/report/

(summing up the 'sampling errors' field:)
http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/report/failover.txt

Comment 13 Jean-Frederic Clere 2010-02-11 14:23:13 UTC
ok let me have a closer look...

Comment 14 Jean-Frederic Clere 2010-02-11 15:05:32 UTC
OK, I was wrong; in fact there should be errors...
It seems you are testing with 3 nodes and killing one, but for some reason the others can't accept the requests that are failed over.
Check that the nodes are configured correctly to support the increase in load.
I think that the:
Connection refused: proxy: AJP: attempt to connect to 10.16.88.182:8009 (10.16.88.182) ..
errors are due to maxThreads being too small in <Connector/>.

Comment 15 Radoslav Husar 2010-02-11 15:11:25 UTC
Nope, not the case; here is the config:

      <!-- A AJP 1.3 Connector on port 8009 -->
      <Connector protocol="AJP/1.3" port="8009" address="${cluster.route.name}"
         redirectPort="8443" maxThreads="10000"/>

Yes, 4 nodes, one is down for a minute then up again, then another fails, etc.


Comment 16 Jean-Frederic Clere 2010-02-11 17:36:41 UTC
I have found it... I think it is a bug in mod_proxy, not in mod_cluster.
There are several of these:
[Thu Feb 11 07:58:20 2010] [debug] proxy_util.c(1989): proxy: BALANCER: worker for (10.16.88.182) has been marked for retry
They come too early: at the same time the worker failed.

I have raised this on the httpd-dev list.

Comment 17 Radoslav Husar 2010-02-22 14:10:12 UTC
Hi Jean-Frederic,
did you get any reply on the dev list?

Comment 19 Anne-Louise Tangring 2011-01-27 19:48:08 UTC
This is a candidate, but not committed to EAP 5.1.1

Comment 20 Misty Stanley-Jones 2011-02-04 05:16:37 UTC
Release Notes Docs Status: Added: Needs More Info


Comment 21 Jean-Frederic Clere 2011-02-04 07:48:08 UTC
It is not fixed, as it requires further investigation and a fix in upstream httpd.
Note the fix I proposed worked in my tests, but this is a race condition that can't be fixed easily...

Comment 22 Misty Stanley-Jones 2011-02-24 22:16:15 UTC
Release Notes Docs Status: Removed: Needs More Info Added: Not Yet Documented


Comment 23 Misty Stanley-Jones 2011-02-25 00:52:20 UTC
Writer: Added: mistysj


Comment 25 Misty Stanley-Jones 2011-02-25 02:02:24 UTC
Release Notes Docs Status: Removed: Not Yet Documented Added: Needs More Info


Comment 26 Jean-Frederic Clere 2011-02-25 09:06:57 UTC
Cause: a race-condition bug in mod_proxy.
Consequence: a 503 could be returned instead of failing over to another node after a node is killed.
Fix: none.
Work-around: none.

Comment 27 Misty Stanley-Jones 2011-02-27 22:11:52 UTC
Release Notes Docs Status: Removed: Needs More Info Added: Documented as Known Issue
Release Notes Text: Added: Due to a race condition caused by a bug in mod_proxy, a <literal>503</literal> error may be returned when a node is killed, instead of failing over to another node. 


Comment 28 Jean-Frederic Clere 2011-04-14 11:46:20 UTC
Patch for httpd-2.2.17

Comment 29 Jean-Frederic Clere 2011-04-14 11:46:20 UTC
Attachment: Added: patch.txt


Comment 30 Permaine Cheung 2011-04-20 13:17:32 UTC
Patch incorporated in:
httpd22-2.2.17-11.ep5.el4
httpd-2.2.17-11.1.ep5.el5
httpd-2.2.17-11.2.ep5.el6
The patch is also applied in Hudson builds. The fix will be in EWS 1.0.2 CR2.

Comment 31 Rajesh Rajasekaran 2011-04-20 21:52:08 UTC
Please verify in the EWS 1.0.2 CR2 build.

Comment 32 Michal Karm Babacek 2011-05-04 08:13:48 UTC
Link: Added: This issue Cloned to JBPAPP-6451


Comment 33 Michal Karm Babacek 2011-05-17 11:25:14 UTC
https://issues.jboss.org/browse/JBPAPP-6451

Comment 34 Jiri Skrabal 2012-11-13 16:26:51 UTC
Release Notes Docs Status: Removed: Documented as Known Issue 
Writer: Removed: mistysj 
Release Notes Text: Removed: Due to a race condition caused by a bug in mod_proxy, a <literal>503</literal> error may be returned when a node is killed, instead of failing over to another node.